IT Guide to Big Data Infrastructure on ZeroStack · 2018-07-25

@ZeroStackInc [email protected] www.zerostack.com

IT Guide to Big Data Infrastructure on ZeroStack

White Paper

Copyright © 2018 ZeroStack, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws. ZeroStack is a registered trademark or trademark of ZeroStack, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies.


Introduction

The word “Big” in Big Data doesn’t even come close to capturing what is happening today in our industry and what is yet to come. The volume, velocity, and variety of data being generated have overwhelmed the capabilities of the infrastructure and analytics we have today.

We are now experiencing Moore’s law for data growth: data is doubling every 18 months.

This growth is being fueled by emerging use cases such as autonomous cars, the Internet of Things, cybersecurity data surveillance, mobile and real-time video processing, blockchains, Bitcoin mining, and more.

Data scientists often must combine data from various sources with different volume, variety, and velocity characteristics simultaneously, which in turn puts different demands on processing power, storage and network performance, latency, and so on. Here is a quick look at the different types of Big Data sources:

Unstructured data: This is the type of data generated by sources such as social media, log files, and sensor data. Such data has little inherent structure and hence is generally not amenable to traditional database analysis methods. A large variety of Big Data tools, techniques, and approaches have emerged in the last few years to ingest and analyze it, for example to extract customer sentiment from social media data. Newer approaches include natural language processing, news analytics, and unstructured text analysis.

IDC forecasts that by 2025 the global datasphere will grow to 163 zettabytes (a zettabyte is a trillion gigabytes). That is ten times the data generated in 2016.

Semi-structured data: Some unstructured data may in fact have some structure to it. Examples include email, call-center logs, and IoT data. Some in the industry have coined the term “semi-structured data” to describe these sources. Extracting useful insights from them may require a combination of traditional databases and newer Big Data tools.

Figure: Annual size of the global datasphere, 2010 to 2025, in zettabytes, broken out by data type (Embedded, Productivity data, Non-Entertainment image/video, Entertainment). Source: IDC/Seagate Data


Streaming data brings in the dimension of higher velocity and real-time processing constraints. Velocity varies widely depending on the type of application: IoT data tends to be small packets of data streamed regularly at low velocity, while 4K video streams stretch velocity to the highest end of the spectrum. Streaming systems must keep pace with these rates in order to produce useful results and value from their analysis.

The alluring promise of these new use cases—and associated emerging technologies and tools—is that they can generate useful insights faster so that companies can take actions to achieve better business outcomes, improve customer experience, and gain significant competitive advantage.

While data scientists are dealing with this complexity of how to derive value from diverse data sources, IT practitioners need to figure out the most efficient way to deal with the infrastructure requirements of big data projects. Traditional bare-metal infrastructure with its siloed management of servers, storage, and networks is not flexible enough to tackle the dynamic nature of the new Big Data workloads. This is where cloud-based systems shine. However, many challenges remain to be addressed in the areas of workload scaling, performance and latency, data migration, bandwidth limitations, and application architectures.

Companies experience many pain points when they try to deploy and run Big Data applications in their complex environments, whether on public or private cloud platforms, and there are some best practices they can use to address those pain points.

Long Commute from Storage to Compute

As data volumes grow from terabytes to petabytes and beyond, the time it takes to transport this data closer to compute and perform data processing and analytics grows longer and longer, impeding the agility of the organization.

Keeping massive data sets in a single provider’s cloud is not only a classic lock-in scenario; it is also the antithesis of other key emerging trends:

Edge Computing and Artificial Intelligence, especially for use cases such as IoT, 5G, image/speech recognition, and blockchain, where there is a need to place processing and data closer to each other and/or closer to the user or device.

Edge computing delivers faster data analytics results by keeping data close to processing, while simultaneously reducing the cost of transporting data to the cloud.

Artificial Intelligence systems become more effective the more data they are given. For example, in Deep Learning, the more cases (data) you give the system, the more it learns and the more accurate its results become.

Almost 70% of Fortune 1000 firms rate big data as important to their businesses; over 60% already have at least one big data project in place.

Public cloud vendors want to get your data into their cloud and go to extreme lengths to get it, and they bias data transfer fees against taking data out.


This is a case where you need massive parallel processing (e.g., using GPUs) of large data sets. Big Data analytics and AI can complement each other to improve speed of processing and produce more useful and relevant results.

To address the need to get data to where the compute is or vice versa, IT leaders should look for hyper-converged scale-out solutions that bring together compute, storage, and networking, thus reducing data I/O latency and improving data processing and analytics times. For even better performance, they should look for solutions that can bring the computing units (VMs or containers) as close to the physical storage as possible, without losing the manageability of the storage solution and while maintaining multi-tenancy across the cluster. For example, a Hadoop Data Node VM running on the same physical host and accessing local SSDs will experience the highest performance benefits and faster results overall without impacting other workloads running within other tenants.

IT leaders can take advantage of many emerging memory technologies such as persistent memory (a new memory technology between DRAM and flash that will be non-volatile, with low latency and higher capacity than DRAMs), NVMe, and faster flash drives. With prices falling rapidly, there seems little need for spinning disks for primary storage.

IT administrators should implement a central way to manage all the edge computing sites, with the ability to deploy and manage multiple data processing clusters within those sites. Access rights to each of these environments should be managed through strict BU-level and Project-level RBAC and security controls.

Distributed Teams, Local Performance Needs

For Data Science development and testing use cases, companies do not build a single huge data processing cluster in a centralized data center for all of their Big Data teams spread around the world. Building such a cluster in one location has DR implications, not to mention latency and country-specific data regulation challenges. Typically, companies want to build out separate local/edge clusters based on location, type of application, data locality requirements, and the need for separate development, test, and production environments.

IT administrators should implement a central way to manage diverse infrastructures in multiple sites, with the ability to deploy and manage multiple data processing clusters within those sites. Access rights to each of these environments should be managed through strict BU-level and project-level RBAC and security controls.

Stuck on Bare Metal and Its Silo Inefficiencies

Most companies still run Big Data workloads, particularly Hadoop-based workloads, on bare metal. This is obviously not as scalable, elastic, or flexible as a virtual or cloud platform. Traditional bare-metal environments are famous for creating silos in which various specialist teams (storage, networking, security) form fiefdoms around their respective functional areas. Silos impede velocity because they lead to complexity of operations, lack of consistency in the environment, and lack of automation. Automating across silos turns into an exercise of custom scripts and a lot of “glue and duct tape,” which makes maintenance and change management complex, slow, and error-prone.

Having a single pane of glass for management becomes crucial for operational efficiency, simplifying deployment, and upgrading these clusters.


A virtualized environment for Big Data allows data scientists to create their own Hadoop, Spark, or Cassandra clusters and evaluate their algorithms. These clusters need to be self-service, elastic, and high-performing. IT should be able to control the resources allocated to data scientists and teams using quotas and role-based access control.

Big Data Tools Explosion and Deployment Complexity

In the past decade, technologies such as Hadoop and MapReduce became common frameworks for speeding up the processing of large datasets by breaking them up into small fragments, running them in distributed farms of storage and processor clusters, and then collating the results back for consumption. Companies like Cloudera, Hortonworks, and others have addressed many of the challenges associated with scheduling, cluster management, resource and data sharing, and performance tuning of these tools. Typically, such deployments are optimized to run on bare metal or on virtualization platforms like VMware, and therefore they tend to remain in their own silo because of the complexity of deploying and operating these environments.

Modern big data use cases, however, need a whole bunch of other technologies and tools. You have Docker. You have Kubernetes. You have Spark. You have NoSQL Databases such as Cassandra and MongoDB. And when you get into machine learning you have TensorFlow, etc.

Deploying Hadoop, which is quite complex, is one thing, arguably made relatively easy by companies like Cloudera and Hortonworks. But if you then need to deploy Cassandra or MongoDB, you have to put in the effort to write Ansible, Puppet, or Chef scripts, and depending on the target platform (bare metal, VMware, Microsoft), you will need to maintain and run multiple versions of those scripts. You then have to figure out how to network the Hadoop cluster with the Cassandra cluster and, of course, inevitably deal with DNS services, load balancers, firewalls, etc. Add other Big Data tools to be deployed, managed, and integrated, and you will begin to appreciate the challenge.

A unifying platform with a curated application catalog greatly simplifies the IT burden of provisioning the underlying infrastructure resources; end users can simply deploy the tools they want and need with a single click and can use APIs to automate their deployment, provisioning, and configuration.

One Big Data Cluster Doesn’t Address All Needs

Organizations have diverse Big Data teams, production and R&D portfolios, and sometimes conflicting requirements for performance, data locality, cost, or specialized hardware resources. A single, standardized Big Data cluster is not going to meet all of those needs. Companies will need to deploy multiple, independent Big Data clusters, possibly with different underlying CPU, memory, and storage footprints. One cluster could be dedicated and fine-tuned for a Hadoop deployment with high local-storage IOPS requirements, another may run Spark jobs with more CPU- and memory-bound configurations, and others, such as machine learning, will need GPU infrastructure. Deploying and managing the complexity of multiple diverse clusters places a high operational overhead on the IT team, reducing its ability to respond quickly to Big Data user requests and making it difficult to manage costs and maintain operational efficiency.

Look for an orchestration platform that can deal with both bare metal and virtual environments. Select a unifying platform that can deploy multiple Big Data tools and platforms from a curated “application and big data catalog.”

To address this pain point, the IT team should again have a unified orchestration/management platform and be able to set up logical business units that can be assigned to different Big Data teams. This way, each team gets full self-service capability within quota limits imposed by the IT staff, and each team can automatically deploy its own Big Data tools with a few clicks, independently of other teams.
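The quota model described above can be sketched in a few lines. The following Python sketch is purely illustrative: the class names, resource dimensions, and units are hypothetical and do not reflect any specific ZeroStack API.

```python
from dataclasses import dataclass

@dataclass
class Quota:
    """Per-team resource ceilings set by the IT staff (illustrative units)."""
    vcpus: int
    ram_gb: int
    storage_tb: int

@dataclass
class Usage:
    """Resources a team currently consumes, or is requesting."""
    vcpus: int = 0
    ram_gb: int = 0
    storage_tb: int = 0

def can_deploy(quota: Quota, usage: Usage, request: Usage) -> bool:
    """Allow a self-service deployment only if current usage plus the
    new request stays within every quota dimension."""
    return (usage.vcpus + request.vcpus <= quota.vcpus
            and usage.ram_gb + request.ram_gb <= quota.ram_gb
            and usage.storage_tb + request.storage_tb <= quota.storage_tb)

# Example: a team already using 40 of its 64 vCPUs asks for a 32-vCPU cluster.
team_quota = Quota(vcpus=64, ram_gb=512, storage_tb=20)
team_usage = Usage(vcpus=40, ram_gb=300, storage_tb=10)
cluster_request = Usage(vcpus=32, ram_gb=128, storage_tb=4)
print(can_deploy(team_quota, team_usage, cluster_request))  # False: 72 vCPUs would exceed the 64-vCPU quota
```

The point of the check is that each team operates independently: a request that exceeds the team’s own quota is rejected without any effect on other teams’ capacity.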

Skyrocketing IT Operations Costs

Developing, deploying, and operating large-scale enterprise Big Data clusters can get complex, especially if it involves multiple sites, multiple teams, and diverse infrastructure, as we have seen in the previous pain points.

The operational overhead of these systems can be expensive and time-consuming. For example, IT operations teams still need to set up firewalls, load balancers, DNS services, and VPN services, to name a few. They still need to manage infrastructure operations such as physical host maintenance, disk additions/removals/replacements, and physical host additions/removals/replacements. They still need to do capacity planning, and they still need to monitor utilization, allocation, and performance of compute, storage, and networking.

IT teams should look for a solution that addresses this operational overhead through automation and through the use of modern SaaS-based management portals that help them optimize sizing, perform predictive capacity planning, and implement seamless failure management.

Consistent Policy-Driven Security and Customization Requirements

Enterprises have policies around using their specifically hardened and approved gold images of operating systems. The operating systems often need to have security configurations, databases, and other management tools installed before they can be used. Running these on a public cloud may not be allowed, or they may run very slowly.

The solution is to enable an on-premises data center image store where enterprises can create customized gold images. Using fine-grained RBAC, the IT team can share these images selectively with various development teams around the world, based on the local security, regulatory, and performance requirements. The local Big Data deployments are then carried out using these gold images to provide the underlying infrastructure to run containers.

DR Strategy for Edge Computing and Big Data Clusters

Any critical application and the data associated with it need to be protected from natural disasters, regardless of whether these apps are based on containers. None of the existing solutions provides an out-of-the-box disaster recovery feature for critical edge computing clusters or Big Data analytics applications, so customers are left to cobble together their own DR strategy.

Each team can automatically deploy its own Big Data applications with a few clicks, independently of other teams.


As part of a platform’s multi-site capabilities, IT teams should be able to perform remote data replication and disaster recovery between remote geographically-separated sites. This protects persistent data and databases used by these clusters.

The ZeroStack Solution for Solving Big Data Workload Challenges

The ZeroStack Cloud Platform provides a virtualized environment where data scientists can spin up multiple Big Data application clusters on demand and scale them as needed. These clusters can be geographically closer to the data scientists and other users as well as co-located closer to data sources, and they can be networked onto the same high-speed local area networks for faster data ingestion and processing. Multiple clusters can be managed and monitored from a single web-based interface for any-time, any-place, any-device access. This is particularly useful for edge computing use cases.

The platform also allows optimal utilization of resources and performance guarantees to run these applications. ZeroStack has unique local storage capabilities to avoid double replication of data both at the infrastructure and application level, specifically designed for Big Data application use cases. IT can allocate projects with specific quotas to one or more users to allow them to work independently.

ZeroStack Architecture Ideally Suited for Big Data Projects

The ZeroStack solution consists of two key components:

1. A cloud operating system called Z-COS, which is installed on on-premises industry-standard servers

2. A monitoring, management, and orchestration portal called Z-Brain, running on ZeroStack’s private cloud

ZeroStack Operating System

The ZeroStack operating system (Z-COS) is a stripped-down version of Linux that is optimized to run VMs. It consists of a built-in KVM hypervisor, drivers for supported storage and networking devices, some key OpenStack services, and ZeroStack code to create a cloud cluster. Z-COS can be installed on any industry-standard x86 server to create a hyper-converged scale-out system starting with a minimum of 4 servers. Capacity can be added on demand. Multiple clusters in different sites are managed through the centralized Z-Brain.

The architecture incorporates local storage for performance and distributed, replicated storage for high availability.

With this design, customers can deploy multiple Big Data clusters, each configured with a specific CPU/GPU, memory, and storage footprint, allowing data scientists to use the most appropriate infrastructure for their specific workloads. Furthermore, even within a cluster, users can apply workload placement policies and affinity/anti-affinity rules to direct their workloads to co-reside (or not) on specific hosts and to connect with specified storage pools, providing a level of control over performance and the underlying infrastructure resources.
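To illustrate the anti-affinity idea, here is a minimal Python sketch of the kind of check a placement engine might perform; the function and the Cassandra example are hypothetical illustrations, not ZeroStack’s actual implementation.

```python
def valid_placement(placement, anti_affinity_groups):
    """Check anti-affinity rules.

    placement: dict mapping VM name -> physical host name.
    anti_affinity_groups: list of sets; each set names VMs that must NOT
    share a physical host (e.g., replicas of the same data shard, so one
    host failure cannot take out multiple replicas at once).
    """
    for group in anti_affinity_groups:
        hosts = [placement[vm] for vm in group if vm in placement]
        if len(hosts) != len(set(hosts)):
            return False  # two VMs from the same group landed on one host
    return True

# Three (hypothetical) Cassandra data-node VMs that must land on distinct hosts.
rules = [{"cass-1", "cass-2", "cass-3"}]
good = {"cass-1": "host1", "cass-2": "host2", "cass-3": "host3"}
bad = {"cass-1": "host1", "cass-2": "host1", "cass-3": "host3"}
print(valid_placement(good, rules))  # True
print(valid_placement(bad, rules))   # False
```

An affinity rule is the mirror image: the VMs in a group would be required to share a host (for example, to keep a compute VM next to its local-storage data node).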

Clustering and Self-Healing

The real power of the architecture is in the distributed control plane, which automatically detects failures in a cluster of servers and heals itself; that is, it automatically brings up the required cloud services on other available servers if one of the servers in the cluster fails. The upshot is that the workload does not experience any disruption or downtime when there are failures in the underlying physical infrastructure. This clustering and self-healing design extends to the storage system, which is particularly important in Big Data scenarios where data integrity and availability are crucial for successful analytics outcomes.

ZeroStack automatically detects failures and heals itself without impacting customer workloads.


Figure 1: Disk layout and partitions on each server. A server with two SSDs and four HDDs (SSD1, SSD2, HDD1 through HDD4) carries the partitions /dev/sda1, /dev/sda2, /dev/sdb1, /dev/sdc1, /dev/sdd1, /dev/sde1, and /dev/sdf1.

Software-Defined Storage

Z-COS takes over all the local disks attached to a server. Typically it expects both SSDs and HDDs to be present. On the first disk, it creates an LVM partition of about 300 GB. This partition is used to install the host operating system, the ZeroStack software, and logs, and it is further divided into multiple logical volumes for the operating system, data, and logs. Z-COS creates two operating system volumes to allow for seamless upgrades from one version to another: having two volumes allows the software to always install the latest code in a separate partition and reboot into that partition after the upgrade. This allows for a non-disruptive upgrade and also an easy rollback procedure in case of any failure. The remaining disks are also formatted by Z-COS, and a single partition is created on each disk to be used either as a local disk or as part of a shared storage pool.
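The dual OS-volume upgrade scheme is a classic A/B pattern. The following Python sketch models it abstractly; the volume names and version strings are made up for illustration and are not ZeroStack code.

```python
class BootConfig:
    """A/B upgrade model: install new code on the inactive OS volume,
    switch the active volume, and keep the old one intact for rollback."""

    def __init__(self):
        self.volumes = {"os-a": "v1.0", "os-b": None}  # two OS logical volumes
        self.active = "os-a"                            # currently booted volume

    def inactive(self):
        return "os-b" if self.active == "os-a" else "os-a"

    def upgrade(self, version):
        target = self.inactive()
        self.volumes[target] = version  # install without touching the running OS
        self.active = target            # reboot into the freshly installed volume

    def rollback(self):
        self.active = self.inactive()   # the previous volume is still intact

cfg = BootConfig()
cfg.upgrade("v2.0")
print(cfg.active, cfg.volumes[cfg.active])  # os-b v2.0
cfg.rollback()
print(cfg.active, cfg.volumes[cfg.active])  # os-a v1.0
```

Because the old volume is never overwritten during an upgrade, rollback is just a matter of booting the previous volume again, which is what makes the procedure non-disruptive.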

Figure 1 shows the typical disk layout, once Z-COS is installed on a server with 2 SSDs and 4 HDDs.

Here /dev/sda1 has an LVM with different volumes for the OS, software, data, and logs. Once Z-COS is installed, during the cloud-create process each remaining disk is put into either a local storage pool or a clustered storage pool. A disk in the local pool provides local storage to VMs, and the data on that pool is not replicated. A disk in the clustered storage pool provides a pool with replicated data, so any virtual disk on that pool is protected against server and disk failures. Currently, two additional replicas are kept for each block in that pool, so the pool can tolerate two simultaneous disk failures. These replicas are also spread across servers, so the pool can tolerate server failures as well.
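Under the three-copy scheme described above (one original block plus two replicas), the capacity and failure-tolerance arithmetic works out as follows. This sketch assumes a hypothetical raw pool size and ignores real-world overheads such as metadata and rebalancing headroom.

```python
def pool_properties(raw_tb: float, replicas: int) -> dict:
    """For a pool keeping `replicas` total copies of each block,
    usable capacity shrinks by the replica count, and the pool
    survives (replicas - 1) simultaneous disk failures."""
    return {
        "usable_tb": raw_tb / replicas,
        "tolerated_disk_failures": replicas - 1,
    }

# Hypothetical 48 TB of raw capacity with the 3-copy scheme from the text.
shared = pool_properties(raw_tb=48.0, replicas=3)
print(shared)  # {'usable_tb': 16.0, 'tolerated_disk_failures': 2}
```

This is why the local (unreplicated) pools matter for Big Data workloads: an application such as Hadoop or Cassandra that already replicates at the application level would otherwise pay this capacity tax twice.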

Figure 2 below shows a possible configuration of disks in a 4-node cluster with 24 drives and four pools: Local SSD, Local HDD, Shared SSD, and Shared HDD.

Here, one SSD per host is part of the local pool, which is not replicated, and one SSD per host is part of the shared pool, creating a shared SSD pool with 4 drives. Similarly, one HDD per host is part of the local HDD pool and three are part of the shared HDD pool, for a total of 12 drives in the shared HDD pool.

Users can further reconfigure and drop the local pools if they want. Also, if there are no HDDs or SSDs in the servers, the Z-COS software will skip creating those pools. This allows complete flexibility in leveraging local disks to create different storage back-ends for various workloads. Local pools are useful either for test/dev workloads or for NoSQL stores that do their own replication at the application level; for example, Hadoop, Cassandra, and similar workloads do not need shared, replicated storage underneath them.
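The drive counts in the Figure 2 example can be verified with simple arithmetic; the per-host breakdown below matches the configuration described in the text (one local SSD, one shared SSD, one local HDD, and three shared HDDs on each of four hosts).

```python
HOSTS = 4
# Drives per host, by pool, as in the example 4-node / 24-drive cluster.
PER_HOST = {"local_ssd": 1, "shared_ssd": 1, "local_hdd": 1, "shared_hdd": 3}

pool_totals = {pool: count * HOSTS for pool, count in PER_HOST.items()}
total_drives = sum(pool_totals.values())

print(pool_totals)   # {'local_ssd': 4, 'shared_ssd': 4, 'local_hdd': 4, 'shared_hdd': 12}
print(total_drives)  # 24
```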


Figure 2: Configuration of disks in a 4-node cluster (Hosts 1 through 4), showing the shared SSD pool, the per-host local SSD (not replicated), the per-host local HDD (not replicated), and the shared HDD pool.

Z-Brain: Web-Managed Multi-Site, Multi-Cluster Portal

The on-premises servers can be on multiple sites or deployed as separate clusters, for example in different racks in the data center. These clouds can be configured, consumed, and managed from a single pane of glass using Z-Brain.

Z-Brain itself is a self-driving cloud that is built using scale-out Big Data principles. It consists of a large cluster of VMs running with local storage, creating a large data storage cluster to collect, store, and analyze telemetry information from the on-premises clouds. Z-Brain leverages Hadoop, Spark, Cassandra, Redis, and etcd to carry out time-series analytics and build predictive models that help users make intelligent decisions around capacity planning, troubleshooting, and placement of applications.

The management, upgrade, and availability of this platform are completely handled by the ZeroStack team.

Built-in Big Data App Store

ZeroStack supports several Big Data applications via its built-in App Store, which offers pre-built application templates that enable customers to deploy Big Data applications with ease. Example templates include the following:

• Big data applications such as Apache Hadoop, Cloudera Express, and Spark

• SQL and NoSQL databases such as Cassandra, Redis, and MongoDB


• Monitoring and data analysis tools such as ELK, Splunk

• Application servers such as Apache and Nginx

• Container tools such as Kubernetes and Docker

Users can “import” these templates into their ZeroStack private cloud with a few clicks and then deploy them. The templates have configuration options that allow for storage, networking, and compute optimization as needed for a given environment. Users can also create and upload their own custom Big Data application templates to the App Store.

Cloudera App deployment

Here is an example of a template in the App Store. First, import it into your business unit library:

Next, provide a few parameters for the deployment, such as flavor type, size and type of disks, MySQL password, and network information.


Within a few minutes, the Cloudera app will be deployed, and a link to Cloudera Manager is shown on the ZeroStack App page.

Log in to Cloudera Manager, commission the ZeroStack nodes, and deploy the required packages.


In less than 10 minutes, you have a Cloudera deployment running and ready to be used for real analytics work.


Summary of ZeroStack Benefits for Big Data Deployments

The following out-of-the-box capabilities solve many of the challenges outlined earlier in this paper.

Resource Sharing Using Virtualization

Eliminate physical silos and consolidate multiple Big Data applications on a single platform: 50 percent lower CapEx due to resource sharing and high utilization, and 90 percent lower OpEx with self-service.

Faster Time to Value

Provide self-service deployment of Big Data applications like Hadoop, Spark, and Cassandra to development and R&D teams, which can deploy applications within minutes.

Resource Management

Control consumption of resources using projects with quotas and policies governed by IT. Monitor over-commitment and add capacity using built-in capacity planning indicators. Get insights to improve efficiency and performance based on actual application stats and machine learning. Optimize capacity using long-term analytics.

Scale on Demand

Scale infrastructure to meet compute performance and data growth. Build one server at a time and grow based on actual usage.

Eliminate Operational Complexity

With cloud-based monitoring and analytics in the web-based Z-Brain, customers do not need any local infrastructure monitoring solution, and IT teams do not need special certifications or expertise to operate the infrastructure. This reduces total cost of ownership (TCO) by 50 percent while cutting operational complexity by 90 percent.

With the ZeroStack Cloud Platform, enterprises can deploy, operate, and manage Big Data projects with high performance and low overhead. An on-premises cloud is the key to centralized management, and ZeroStack is the premier on-premises cloud platform.