
Platfora Deployment Planning Guide

Version 5.3

Copyright Platfora 2016

Last Updated: 1:48 p.m. August 12, 2016

Contents

Document Conventions
Contact Platfora Support
Copyright Notices

Chapter 1: About Platfora Deployments
    Platfora Deployment Architectures
        On-Premise Hadoop Deployments
        Amazon AWS Cloud Deployments
        Google Cloud Platform Deployments
    Platfora Server Architecture
    FAQs—Platfora Deployments

Chapter 2: Supported Environments and Versions

Chapter 3: System Requirements (On-Premise)
    Platfora Server Requirements
    Hadoop Resource Requirements

Chapter 4: System Requirements (AWS Cloud)
    Platfora EC2 Instance Requirements
    Amazon EMR Instance Requirements
    AWS Security Settings for Platfora
        Amazon AWS Virtual Private Cloud (VPC)
        IAM User and IAM Roles for Platfora
        EC2 Security Group Settings

Chapter 5: System Requirements (GCP Cloud)
    Platfora Compute Engine Machine Requirements
    Google Dataproc Machine Requirements
    GCP Security Settings for Platfora

Chapter 6: Port Configuration Requirements
    Ports to Open on Platfora Nodes
    Ports to Open on Hadoop Nodes

Chapter 7: Browser Requirements

Appendix A: Hardware Specifications for Platfora Nodes

Appendix B: EC2 Considerations for Platfora Instances

Preface

This guide provides information about what you need to consider when deploying a new Platfora® cluster. This guide is intended for system and Hadoop administrators who are responsible for procuring and managing server resources. Knowledge of Linux system administration, network administration, and Hadoop administration is recommended.

Document Conventions

This documentation uses certain text conventions for language syntax and code examples.

| Convention | Usage | Example |
|---|---|---|
| $ | Command-line prompt. Precedes a command to be entered in a command-line terminal session. | $ ls |
| $ sudo | Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo). | $ sudo yum install open-jdk-1.7 |
| UPPERCASE | Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case). | SUM(page_views) |
| italics | Italics indicate a user-supplied argument or variable. | SUM(field_name) |
| [ ] (square brackets) | Square brackets denote optional syntax items. | CONCAT(string_expression[,...]) |
| ... (ellipsis) | An ellipsis denotes a syntax item that can be repeated any number of times. | CONCAT(string_expression[,...]) |


Contact Platfora Support

For technical support, you can send an email to:

[email protected]

Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and product tips.

http://support.platfora.com

To access the support portal, you must have a valid support agreement with Platfora. Please contact your Platfora sales representative for details about obtaining a valid support agreement or with questions about your account.

Copyright Notices

Copyright © 2012-16 Platfora Corporation. All rights reserved.

Platfora believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any Platfora software described in this publication requires an applicable software license. Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™, and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the property of their respective owners.

Embedded Software Copyrights and License Agreements

Platfora contains the following open source and third-party proprietary software subject to their respective copyrights and license agreements:

• Apache Hive PDK (http://www.apache.org/licenses)

• dom4j (http://dom4j.sourceforge.net/dom4j-1.6.1/license.html)

• freemarker (http://freemarker.org/docs/app_license.html)

• GeoNames (http://www.geonames.org/about.html)

• Google Maps API (http://www.google.com/help/terms_maps.html)

• Apache Jandex (http://www.apache.org/licenses/LICENSE-2.0)

• Apache POI (http://www.apache.org/licenses/LICENSE-2.0)

• javassist

• javax.servlet

• Mortbay Jetty 6.1.26 (http://www.eclipse.org/jetty/licenses.php)

• OWASP CSRFGuard 3 (https://www.owasp.org/index.php/Category:OWASP_CSRFGuard_Project)

• PostgreSQL JDBC 9.1-901 (http://jdbc.postgresql.org/about/license.html)

• Scala (http://www.scala-lang.org/node/146)

• sjsxp : 1.0.1

• Unboundid (https://www.unboundid.com/products/ldapsdk/docs/index.php)

• Tableau (http://www.tableau.com/legal)

• jBCrypt (http://www.mindrot.org/files/jBCrypt/LICENSE)

• SimpleSlider (https://github.com/ruyadorno/SimpleSlider#license)

Chapter 1: About Platfora Deployments

Platfora runs on dedicated servers in the same network as your Hadoop deployment, which can be in an on-premise data center or in the cloud. Platfora uses the data processing services of Hadoop to process and prepare data for analysis. Platfora uses the data storage services of Hadoop to access the raw data and to store the output of the optimized data it prepares. This section explains how Platfora is deployed and the basics of the Platfora/Hadoop server architecture.

Topics:

• Platfora Deployment Architectures

• Platfora Server Architecture

• FAQs—Platfora Deployments

Platfora Deployment Architectures

The Platfora software runs on a scale-out cluster of servers. These servers can be physical servers in an on-premise data center or virtual server instances in the cloud. Platfora uses native Hadoop protocols to connect to the distributed file system and data processing services of Hadoop. Platfora should be deployed on dedicated machines with low-latency connections to these Hadoop cluster services. This section explains how Platfora is deployed in your network environment, using either an on-premise, Google Dataproc cloud, or AWS cloud deployment of Hadoop.

On-Premise Hadoop Deployments

An on-premise Hadoop deployment means that you already have an existing Hadoop installation in your data center (either a physical data center or a virtual private cloud).

Platfora connects to the Hadoop cluster managed by your organization, and the majority of your organization's data is stored in the distributed file system of this primary Hadoop cluster.

For on-premise Hadoop deployments, the Platfora servers should be on their own dedicated hardware co-located in the same data center as your Hadoop cluster. A data center can be a physical location with actual hardware resources, or a virtual private cloud environment with virtual server instances (such as Rackspace or Amazon EC2). Platfora recommends putting the Platfora servers on a network with at least 1 Gbps connectivity to the Hadoop nodes.

Platfora users access the Platfora master node using an HTML5-compliant web browser. The Platfora master node accesses the HDFS NameNode and the MapReduce JobTracker or YARN ResourceManager using native Hadoop protocols. The Platfora worker nodes access the HDFS DataNodes directly. If using a firewall, Platfora recommends placing the Platfora servers on the same side of the firewall as your Hadoop cluster.
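The network paths described above can be checked before installation with a simple TCP probe. The following is a minimal sketch, not a Platfora tool: the host names are hypothetical, and the port numbers (8020 for the NameNode, 8032 for the ResourceManager) are common Hadoop defaults assumed here for illustration; use the values configured in your own cluster.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Hypothetical endpoints (assumptions): replace with your cluster's hosts and ports.
ENDPOINTS = {
    "HDFS NameNode": ("namenode.example.com", 8020),
    "YARN ResourceManager": ("resourcemanager.example.com", 8032),
}


def check_endpoints(endpoints):
    """Probe each named endpoint and report whether it is reachable."""
    return {name: can_connect(host, port) for name, (host, port) in endpoints.items()}
```

Running `check_endpoints(ENDPOINTS)` from the intended Platfora master host gives a quick indication of whether a firewall sits between Platfora and the Hadoop services.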

Platfora software can run on a wide variety of server configurations, from a single server to a cluster that scales across multiple servers. Because Platfora runs best with all of the active lenses readily available in RAM, Platfora recommends obtaining servers optimized for higher RAM capacity and a minimum of 8 CPUs.

Amazon AWS Cloud Deployments

An Amazon Web Services (AWS) cloud deployment means that you do not have a persistent Hadoop cluster. Instead, your organization uses Amazon S3 for raw data storage and Amazon EMR for on-demand Hadoop data processing.

In an Amazon AWS cloud deployment, the Platfora server instances are deployed on dedicated, high-memory EC2 instances. Your organization’s raw data is managed in Amazon's Simple Storage Service (S3). Platfora uses Amazon Elastic MapReduce (EMR) to run its data processing jobs (lens builds). The results of the lens build jobs are then written back to S3.

Google Cloud Platform Deployments

A Google Cloud Platform (GCP) cloud deployment means that you do not have a persistent Hadoop cluster. Instead, your organization uses Google Cloud Storage for raw data storage and Google Cloud Dataproc for on-demand Hadoop data processing.

In a Google Cloud Platform deployment, the Platfora server instances are deployed on dedicated, high-memory Google Compute Engine instances. Your organization’s raw data is managed in Google's Cloud Storage. Platfora uses Google Cloud Dataproc to run its data processing jobs (lens builds). The results of the lens build jobs are then written back to Google Cloud Storage.

Platfora Server Architecture

Platfora connects to an existing Hadoop implementation, and makes the raw data residing in Hadoop accessible to users. The Platfora server has a number of services that work together with Hadoop's services to access the raw data, prepare it for analysis, and present the results to users. This topic helps you understand the main components of the Platfora server architecture.

The Platfora Master Node

You can have a fully-functioning Platfora installation with just one node, the master node. The master node manages the following Platfora services:

• Metadata Catalog - Platfora's metadata catalog holds all of the information about the data managed by Platfora (the datasets, lenses, vizboards and so on). The metadata catalog is a relational database that runs on the Platfora master node, but is accessed by all nodes in the Platfora cluster.

• Lens Builder - The lens builder interfaces with the data processing services of Hadoop. It translates data requests from the Platfora application into a series of custom MapReduce jobs, which it then submits to the Hadoop JobTracker or ResourceManager for execution. After the requested data has been extracted and transformed in Hadoop, the job results are written back to the Hadoop file system in Platfora's proprietary file format called a lens.

• On-Disk Storage - Finished lenses are immediately copied from the Hadoop file system to on-disk storage of the Platfora nodes. The data of a lens is distributed across all of the available worker nodes in a Platfora cluster.

• In-Memory Query Engine - When users explore and analyze data in Platfora, they are actually generating queries that run against a lens. The result of a lens query is rendered as a visualization in Platfora. When users construct visualizations, they choose a lens to work with. Choosing a lens loads its data into Platfora's in-memory query engine. The in-memory query engine has two kinds of processes that work on a query:

    1. Query Coordinator - The query coordinator process runs on the master node only, and translates actions made in the Platfora application into queries. The coordinator sends the query to the workers for processing, then consolidates the partial results from each worker into a final result.

    2. Query Worker - The query worker process typically runs on the worker nodes, but the master may also serve as a worker in some cases. A query worker process works on its portion of lens data for a given query.

• Web Application Server - Platfora's user interface runs as a web application in your network. Users connect to Platfora using any HTML5-compliant browser. Through the browser, users interact with data in Hadoop as easily as browsing a web site.
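The coordinator/worker split in the in-memory query engine follows a scatter-gather pattern. The sketch below illustrates that pattern only; it is not Platfora's implementation. The function names, the row layout, and the `page_views` measure are assumptions made for the example, and threads stand in for separate worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def worker_query(partition, field):
    """Query worker: aggregate only this worker's portion of the lens data."""
    partial = Counter()
    for row in partition:
        partial[row[field]] += row["page_views"]
    return partial


def coordinator_query(partitions, field):
    """Query coordinator: send the query to each worker, then merge partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda part: worker_query(part, field), partitions)
        final = Counter()
        for partial in partials:
            final.update(partial)  # Counter.update adds counts per key
    return dict(final)


# Two hypothetical partitions of one lens, one per worker.
partitions = [
    [{"country": "US", "page_views": 10}, {"country": "DE", "page_views": 4}],
    [{"country": "US", "page_views": 5}],
]
result = coordinator_query(partitions, "country")
# result == {"US": 15, "DE": 4}
```

Because each worker only touches its own partition, adding workers reduces the amount of data any single process has to scan for a query.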

The Platfora Worker Nodes

The Platfora worker nodes are used to distribute lens storage capacity and query processing workload. As users work with more and bigger lenses in Platfora, more memory and processing power is needed to render visualizations quickly. Administrators can add additional worker nodes to scale up lens storage capacity and performance. By using the resources of multiple machines to store and process lens data, Platfora can handle true 'big data' query workloads.

FAQs—Platfora Deployments

Got questions about what you need to get Platfora up and running? Want to know how Platfora is deployed in your data center environment and how it works with Hadoop? This topic answers the most frequently asked questions (FAQs) about Platfora installation and deployment.

What do I need before I can install Platfora?

Before you can install Platfora, you will need:

• Hadoop: Platfora needs access to an installed and running Hadoop cluster, or to a Google Cloud Platform account with Google Cloud Storage and Google Cloud Dataproc enabled, or to an Amazon Web Services (AWS) account with Amazon S3 (Simple Storage Service) and EMR (Elastic MapReduce) enabled.

• Linux Server(s): You will need one or more dedicated servers running a supported Linux operating system on which to install Platfora. The Platfora server(s) should be in the same data center (or region) as your Hadoop distribution, but not on the same machines.

• Platfora Binaries: A Platfora customer support representative can give you the download link to the Platfora installation package for your chosen Hadoop distribution. Platfora provides both rpm and tar installer packages.

• Platfora License: A Platfora customer support representative must issue you a license file. Trial period licenses are available upon request for pilot installations.

• Platfora Installation Guide: You will need the Platfora installation guide that covers your specific Hadoop distribution. The setup steps vary slightly depending on the version of Hadoop you are using.

What are the high-level steps involved in installing Platfora?

Every Platfora installation involves these basic steps, although the details will vary slightly depending on the Hadoop distribution you are using:

• Configure Hadoop for Platfora Access: Make sure that the Platfora server(s) can access your Hadoop services over the network and that Platfora has write access to a designated directory in the Hadoop file system. Obtain the required connection details for your Hadoop services (Platfora connects to Hadoop during setup).

• Install Prerequisites on all Platfora Nodes: Make sure the Platfora servers have the required dependencies before installing Platfora. If using the rpm installer, Platfora provides a base package that includes the dependencies. If using the tar installer, you will need to install the dependent software manually.

• Install the Platfora Software on the Master: Install the Platfora binaries on the master node.

• Set Up the Platfora Master: Run the setup utility to configure the Platfora master server and connect it to your Hadoop services.

• Start Platfora: After setup completes, start the Platfora server. You should now have a fully-functioning single-node Platfora installation.

• Run Tests and Load the Tutorial Data: After setup completes, you may want to run some tests to make sure that Platfora is properly configured and can access your Hadoop cluster. One way to test everything is to load the tutorial data that comes with your Platfora installation. This will put some data in Hadoop and build a small lens to make sure everything is working.

• Add Platfora Worker Nodes: Once you have the Platfora master node up and running, you can use it to add Platfora worker nodes to the cluster. The master node is always used to install and manage the worker nodes.

Is there a trial version of Platfora?

Platfora does not currently have a trial version available for download. You can contact Platfora Customer Support to arrange for a pilot or trial installation.

Why would I need multiple Platfora nodes?

When users work with lens data in Platfora, that data is loaded into memory so that queries (vizzes) are fast and responsive. If there is more lens data than can fit into memory, then some queries may be slow or not be able to run at all. Adding more nodes to your Platfora cluster makes more disk, memory and CPU available to store and process lens data.

How many Platfora nodes would I need?

Platfora is intended for big data query workloads, and performs best when using the resources of multiple machines. Although you can have a fully-functioning Platfora installation with just one node, a multi-node installation is necessary for optimal performance and bigger lens sizes.

The ideal number of Platfora nodes really depends on a lot of factors: lens size, lens quantity, data variety, and number of concurrent users (to name a few). Your Platfora account representative will help you determine the number of nodes that best fits your unique data requirements. You can also scale up your Platfora cluster as your data and usage grows.
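For a first rough estimate, the memory constraint alone can be turned into a node count. The sketch below is a back-of-the-envelope calculation under stated assumptions, not Platfora's sizing method: it considers only actively used lens data and per-node memory, reserves 2 GB per node for normal operations (the upper end of the 1-2 GB figure given in the server requirements), and ignores concurrency, data variety, and replication. The function name and parameters are hypothetical.

```python
import math


def estimate_nodes(active_lens_gb, node_memory_gb, reserved_gb=2.0):
    """Estimate how many nodes are needed to hold active lens data in memory.

    Assumption: every node contributes (node_memory_gb - reserved_gb) of
    usable memory and lens data spreads evenly across nodes.
    """
    usable_gb = node_memory_gb - reserved_gb
    if usable_gb <= 0:
        raise ValueError("node memory must exceed the reserved amount")
    return max(1, math.ceil(active_lens_gb / usable_gb))


# 500 GB of active lens data on 256 GB nodes: 500 / 254 rounds up to 2 nodes.
nodes = estimate_nodes(500, 256)
```

Treat the result as a lower bound to bring into the sizing conversation with your Platfora account representative.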

How does Platfora interact with Hadoop?

Platfora uses the powerful distributed storage and processing features of Hadoop, but masks the complexity of working with HDFS and MapReduce by providing an easy-to-use web interface.

Platfora uses Hadoop to access the raw data stored in its distributed file system (DFS) and makes the data visible to Platfora users. It uses the data processing services of Hadoop (MapReduce) to pull requested data and prepare it for analysis. The result of these processing jobs is the Platfora lens. Platfora lenses are stored in the Hadoop distributed file system, as well as copied over to the Platfora servers.

Can Platfora connect to more than one source system?

When you install Platfora, you connect it to one Hadoop distribution. This is the primary source system that Platfora uses to access the source data and process its lens builds.

You can create data sources that point to external sources (such as a cloud storage service or a relational database). However, this external data must be pulled over to the primary Hadoop source system during lens build processing. To avoid moving large amounts of data over the network, Platfora recommends using external data sources for smaller, supplemental datasets only.

What does Platfora do to the data in Hadoop?

Platfora reads the raw data, but does not edit, update, or delete it in place. It makes a copy of the requested portion of the data when it builds a lens, and does its lens processing on the copied data. Your original data remains intact and unaltered.

How does Platfora keep my data secure?

Platfora's role-based security allows you to control who can authenticate to the Platfora application and what actions they can perform. You can maintain user credentials within the Platfora application, or configure Platfora to use an external LDAP directory service to authenticate users.

To authorize access to the raw data, you can either manage data access permissions within the Platfora application itself, or you can configure Platfora to use Kerberos authorization to check the HDFS file system permissions.

How does Platfora handle redundancy and high availability?

Platfora relies on Hadoop for redundancy and high availability of the raw data itself.

The Platfora worker nodes are fully redundant and highly available. The worker nodes process the lens queries submitted to the Platfora application. Lens data is distributed and replicated across all of the worker nodes in the Platfora cluster. Depending on the number of worker nodes you have, you can lose a node and still continue processing queries without interruption of service.

Redundancy for the Platfora master node involves taking routine backups of the metadata catalog database so you can restore the master node if needed.

Chapter 2: Supported Environments and Versions

This section lists the environments and versions that Platfora supports.

Hadoop and Hive Versions

This section lists the Hadoop distributions and versions that are compatible with the Platfora installation packages. If using Hive as a data source for Platfora, the version of Hive must be compatible with the version of Hadoop you are using.

| Hadoop Distro | Version | Hive Version | M/R Version | Platfora Package |
|---|---|---|---|---|
| Cloudera 5 | CDH 5.3.1+ | 0.13.1 | YARN | cdh52 |
| Cloudera 5 | CDH 5.4 | 1.1 | YARN | cdh54 |
| Hortonworks | HDP 2.2.x | 0.14.0 | YARN | hadoop_2_6_0_hive_0_14_0 |
| MapR | MapR 4.0.2 | 0.13.0 | YARN | mapr402 |
| MapR | MapR 4.1.0 | 0.13.0 | YARN | mapr41 |
| MapR | MapR 5.0.0 | 1.1 | YARN | mapr5 |
| MapR | MapR 5.1.0 | 1.1 | YARN | mapr51 |
| Pivotal Labs | PivotalHD 3.0 | 0.14.0 | YARN | hadoop_2_6_0_hive_0_14_0 |
| Amazon EMR (AMI 3.10.x) | Hadoop 2.4.0 | 0.13.1 | YARN | hadoop_2_4_0_hive_0_13_0 |
| Google Dataproc (1.0) | Hadoop 2.7.2 | 1.2.1 | YARN | hadoop_2_7_2_hive_1_2_1 |
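When scripting an installation, the table above can be captured as a lookup from distribution version to Platfora package name. This is a minimal sketch: the dictionary keys and the `package_for` helper are hypothetical conveniences, while the version strings and package names are copied from the table.

```python
# (distribution, version) -> Platfora package name, copied from the table above.
PLATFORA_PACKAGES = {
    ("CDH", "5.3.1+"): "cdh52",
    ("CDH", "5.4"): "cdh54",
    ("HDP", "2.2.x"): "hadoop_2_6_0_hive_0_14_0",
    ("MapR", "4.0.2"): "mapr402",
    ("MapR", "4.1.0"): "mapr41",
    ("MapR", "5.0.0"): "mapr5",
    ("MapR", "5.1.0"): "mapr51",
    ("PivotalHD", "3.0"): "hadoop_2_6_0_hive_0_14_0",
    ("Amazon EMR", "Hadoop 2.4.0"): "hadoop_2_4_0_hive_0_13_0",
    ("Google Dataproc", "Hadoop 2.7.2"): "hadoop_2_7_2_hive_1_2_1",
}


def package_for(distro, version):
    """Return the Platfora package name, or None if the pair is not listed."""
    return PLATFORA_PACKAGES.get((distro, version))
```

A `None` result means the combination does not appear in the table, which is a cue to confirm compatibility with Platfora Support before proceeding.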


Operating Systems

| Operating System | Supported Versions |
|---|---|
| Red Hat Enterprise Linux | 6.2, 6.3, 6.4, 6.5, and 6.6 |
| CentOS | 6.2, 6.3, 6.4, 6.5, and 6.6 |
| Scientific Linux | 6.2 |
| Amazon Linux AMI | AMI 2014.03 and AMI 2015.03 |
| Ubuntu | 12.04.1 LTS |
| Oracle Linux | 6.x |

Web Browsers

| Web Browser | Supported Versions |
|---|---|
| Chrome | Latest version (Evergreen) and three previous releases |
| Firefox | 25.0.x or higher |
| Safari | 6.1+ and 7.x |
| Internet Explorer with the Compatibility View feature disabled | IE 11 (Windows 7, Windows 8, Windows 10); IE 10 (Windows 7 and Windows 8) |

Platfora supports these web browsers on desktop machines only.

Platfora recommends using a screen resolution width of 1400 pixels or greater for viewing some pages in the Platfora web application.

Java

• java-1.7.0-openjdk (recommended)

• Java 1.7.0 Sun/Oracle

Python

• Python 2.6.8, 2.7.1, 2.7.3, 2.7.4, 2.7.5, 2.7.6, 2.7.7, 2.7.8 only

Postgres Database

• PostgreSQL 9.2.1-1, 9.2.1-1.28 (on Amazon AMI), 9.2.5, 9.2.7
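Because the supported Python releases are an explicit list rather than a range, a pre-install check has to compare against exact versions. The sketch below shows one way to do that; the helper name is hypothetical, and the version set is copied from the Python list above.

```python
import sys

# Exact Python releases listed as supported in this chapter.
SUPPORTED_PYTHON = {
    (2, 6, 8), (2, 7, 1), (2, 7, 3), (2, 7, 4),
    (2, 7, 5), (2, 7, 6), (2, 7, 7), (2, 7, 8),
}


def is_supported_python(version_info=None):
    """Return True if the (major, minor, micro) version is in the supported list.

    Defaults to the interpreter running this script.
    """
    version = tuple((version_info or sys.version_info)[:3])
    return version in SUPPORTED_PYTHON
```

Note that 2.7.2 is absent from the list, so `is_supported_python((2, 7, 2))` returns False even though neighboring releases are supported.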

Chapter 3: System Requirements (On-Premise)

The Platfora software runs on a scale-out cluster of servers. You can install Platfora on a single node to start, and then scale up storage and processing capacity by adding additional nodes. Platfora requires access to an existing, compatible Hadoop implementation in order to start. Users then access the Platfora application using a compatible web browser client. This section describes the system requirements for on-premise deployments of the Platfora servers, Hadoop source systems, network connectivity, and web browser clients.

Topics:

• Platfora Server Requirements

• Hadoop Resource Requirements

Platfora Server Requirements

Platfora recommends the following minimum system requirements for Platfora servers. For multi-node installations, the master server and all worker servers must be the same operating system (OS) and system configuration (same amount of memory, CPU, etc.).

| Requirement | Details |
|---|---|
| 64-bit Operating System or Amazon Machine Images (AMIs) | CentOS 6.2-6.5 (7.0 is not supported); RHEL 6.2-6.5 (7.0 is not supported); Scientific Linux 6.2; Amazon Linux AMI 2014.03+; Oracle Enterprise Linux 6.x; Ubuntu 12.04.1 LTS or higher; Security-Enhanced Linux 6.2 [1] |
| Software | Java 1.7; Python 2.6.8, 2.7.1, 2.7.3 through 2.7.6 (3.0 not supported); PostgreSQL 9.2.1-1, 9.2.5, 9.2.7 or 9.3 (master only); OpenSSL 1.0.1 or higher [2] |
| Unix Utilities | rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget |
| Memory | 64 GB minimum, 256 GB recommended. The server needs enough memory to accommodate actively used lens data. Additionally, it needs 1-2 GB reserved for normal operations and the lens query engine workspace. |
| CPU | 8 cores minimum, 16 recommended |
| Disk | All Platfora nodes (master or worker) require 300 MB for the Platfora installation. Every node requires high-speed local storage and a local disk cache configured as a single logical volume. Hardware RAID is recommended for the best performance. All nodes combined require appropriate free space for aggregated data structures (Platfora lenses). At a minimum, you will need twice as much disk space as system memory. The Platfora master node requires approximately 850 MB of additional space for the metadata catalog (dataset definitions, vizboard and visualization definitions, lens definitions, etc.). |
| Network | 1 Gbps reliable network connectivity between the Platfora master server and the query processing servers. 1 Gbps reliable network connectivity between the Platfora master server and the Hadoop NameNode and JobTracker/ResourceManager node. Network bandwidth should be comparable to the amount of memory on the Platfora master server. |

[1] If you wish to install Security-Enhanced Linux, refer to Platfora's Support site (http://goo.gl/UprGXr) for installation instructions.

[2] Only required if you want to enable SSL for secure communications between Platfora servers.
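The disk figures above combine into a simple minimum per node. The sketch below is an illustration under one assumption that the guide does not state explicitly: the "twice the system memory" rule is applied per node. The function and constant names are hypothetical.

```python
INSTALL_MB = 300   # Platfora installation, every node
CATALOG_MB = 850   # approximate metadata catalog space, master node only


def min_disk_gb(memory_gb, is_master=False):
    """Minimum local disk in GB for one node.

    Assumption: lens storage of twice the node's memory, plus the fixed
    installation overhead, plus catalog space on the master.
    """
    lens_gb = 2 * memory_gb
    overhead_mb = INSTALL_MB + (CATALOG_MB if is_master else 0)
    return lens_gb + overhead_mb / 1024


worker_gb = min_disk_gb(64)                   # 128 GB of lens space + 300 MB
master_gb = min_disk_gb(256, is_master=True)  # 512 GB of lens space + 1150 MB
```

The fixed overheads are negligible next to the lens storage, so in practice the memory size drives the disk requirement.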

Hadoop Resource Requirements

Platfora must be able to connect to an existing Hadoop installation. Platfora also requires permissions and resources in the Hadoop source system. This section describes the Hadoop resource requirements for Platfora.

Platfora uses the remote Distributed File System (DFS) of the Hadoop cluster for persistent storage and as the primary data source. Optionally, you can also configure Platfora to use a Hive metastore server as a data source.

Platfora uses the Hadoop MapReduce services to process data and build lenses. For larger lens builds to succeed, Platfora requires minimum resources on the Hadoop cluster for MapReduce tasks.

| Requirement | Details |
|---|---|
| DFS Disk Space | Platfora requires a designated persistent storage directory in the remote distributed file system (DFS) with appropriate free space for Platfora system files and data structures (lenses). The location is configurable. |
| DFS Permissions | The platfora system user needs read permissions to source data directories and files. The platfora system user needs write permissions to Platfora's persistent storage directory on DFS. |
| MapReduce Permissions | The platfora system user needs to be added to the submit-jobs and administer-jobs access control list (or added to a group that has these permissions). |
| DFS Resources | Minimum Open File Limit = 5000 |
| MapReduce Resources | Minimum Memory for Task Processes = 1 GB |
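The open file limit can be verified on a Hadoop node with Python's standard `resource` module. This sketch assumes the requirement refers to the per-process open file limit (what `ulimit -n` reports) for the user that runs the Hadoop daemons; the guide does not say so explicitly, and the function names are hypothetical. The `resource` module is available on Unix-like systems only.

```python
import resource

MIN_OPEN_FILES = 5000  # minimum open file limit from the table above


def meets_open_file_limit(soft_limit, minimum=MIN_OPEN_FILES):
    """Return True if the given soft limit satisfies the minimum."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited always satisfies the requirement
    return soft_limit >= minimum


def current_soft_limit():
    """Return the soft open-file limit of the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Run `meets_open_file_limit(current_soft_limit())` as the relevant service user, since limits are often configured per user.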

Chapter 4: System Requirements (AWS Cloud)

This section describes the system requirements for customers who plan to use Amazon Web Services (AWS) as their installation environment for Platfora, and Simple Storage Service (S3) and Elastic MapReduce (EMR) as their Hadoop distributed data storage and processing services.

Topics:

• Platfora EC2 Instance Requirements

• Amazon EMR Instance Requirements

• AWS Security Settings for Platfora

Platfora EC2 Instance Requirements

Platfora recommends the following system requirements for Amazon EC2 instances that will serve as Platfora server nodes. For multi-node installations, the master server instance and all worker server instances must be the same configuration (same EC2 instance type, storage configuration, network configuration, etc.).

| Requirement | Details |
|---|---|
| Amazon Machine Images (AMIs) | Amazon Linux AMI 2014.03.x or higher; Red Hat Enterprise Linux 6.2 - 6.5; Ubuntu Server 12.04.1 LTS or higher |
| EC2 Instance Type | Small to Medium Lens Sizes: c3.8xlarge. Medium to Large Lens Sizes, 10+ Platfora nodes: r3.8xlarge. Medium to Large Lens Sizes, 1-9 Platfora nodes: i2.8xlarge. |
| Root Device Volume (EBS) | Recommended Size = 1 TB; Type = General Purpose (SSD) |
| Additional EBS Volumes | Optional. Additional EBS volumes can be attached to an EC2 instance after launch time, and can be used to increase lens cache storage capacity if needed. EBS volumes are less expensive than Instance Store volumes, and the data is persistent between shutdowns. |
| Instance Store Volume (Ephemeral) | Optional. You may choose to add instance store volumes for the Platfora lens cache instead of using EBS volumes. This costs more, but offers slightly faster performance. Instance store volumes can only be attached to an EC2 instance at launch time, and the data is not saved when the instance shuts down. The size of an instance store volume depends on the instance type: c3.8xlarge: 2 x 320 GB SSD (640 GB); r3.8xlarge: 2 x 320 GB SSD (640 GB); i2.8xlarge: 8 x 800 GB SSD (6400 GB). |
| Enhanced Networking | yes (requires use of VPC instead of EC2-Classic) |
| EBS Optimized Instance | yes (the 8xlarge instance types are EBS optimized instances by default) |
| Availability Zone | yes (use same zone for all nodes in the Platfora cluster) |
| Placement Group | yes (use same placement group for all nodes in the Platfora cluster) |
| IAM User | yes (create a dedicated Platfora IAM User in your AWS account) |
| Other Required Software | Java 1.7; Python 2.7.8 through 2.7.9 (3.0 not supported); PostgreSQL 9.2.1-1.28 (AMZN), 9.2.5, 9.2.7 or 9.3 (master node only); OpenSSL 1.0.1 or higher |
| Required Unix Utilities | rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget |
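The instance type guidance in the table reduces to a small decision rule. The sketch below encodes it for use in provisioning scripts; the function name and the two lens-size class labels are hypothetical, since the guide does not define numeric thresholds for "small", "medium", or "large" lenses. Instance store capacities are copied from the table.

```python
# Total instance store capacity in GB per instance type, from the table above.
INSTANCE_STORE_GB = {
    "c3.8xlarge": 2 * 320,
    "r3.8xlarge": 2 * 320,
    "i2.8xlarge": 8 * 800,
}


def recommend_instance_type(lens_class, node_count):
    """Map a lens size class and cluster size to the recommended EC2 type.

    lens_class is "small_to_medium" or "medium_to_large" (labels assumed
    for this example).
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if lens_class == "small_to_medium":
        return "c3.8xlarge"
    if lens_class == "medium_to_large":
        return "r3.8xlarge" if node_count >= 10 else "i2.8xlarge"
    raise ValueError("unknown lens class: %r" % (lens_class,))
```

For example, a four-node cluster serving medium to large lenses maps to i2.8xlarge, whose 6400 GB of instance store gives each node far more local lens cache than the other two types.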

Amazon EMR Instance RequirementsPlatfora launches an Elastic MapReduce (EMR) cluster when it builds a lens. This section describes therecommended requirements for the EMR instances that are launched by Platfora.

Amazon EMR is Hadoop as a web service. Platfora uses the EMR Hadoop cluster to process its lensbuilds. Since the EMR Hadoop cluster is only instantiated as needed, the source data does not residein the Hadoop Distributed File System (HDFS) of the EMR Hadoop cluster. The source data is insteadstored on Amazon S3. Data is copied from S3 to EMR for data processing, then the results are writtenback to S3 when the job completes.



At the start of a lens build job, the raw source data is copied from S3 to the local HDFS file system on the EMR nodes. The EMR instances must have enough local instance storage to support the input source dataset and the temporary workspace for intermediate lens build job results. Also consider that the local HDFS of the EMR cluster replicates the data to ensure redundancy and high availability during lens build processing.

Platfora recommends the i2.4xlarge instance type for EMR data nodes and the m3.xlarge for the EMR name node. The i2.4xlarge offers a great balance between total local disk space, CPU power, and per-node memory size.

Hadoop Version: 2.4.0

AMI Version: Amazon EMR 3 (AMI 3.10)

EMR NameNode Instance Type: m3.xlarge

EMR DataNode Instance Type: i2.4xlarge

Number of EMR DataNodes: The number of nodes you need to complete a lens build depends on the following factors:

• The size of the raw dataset in S3 that is the input to the lens build.

• The replication factor of HDFS. EMR clusters of 1-4 nodes have a replication factor of 1, 5-9 nodes have a replication factor of 2, and 10 or more nodes have a replication factor of 3.

• Temporary work space for intermediate lens build results, about 20-30% of total disk space.
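The sizing factors above can be combined into a rough estimate. The following Python sketch is illustrative only and is not a Platfora tool: the function names are hypothetical, the replication thresholds are the ones quoted in this section, and the per-node disk figure in the example (3,200 GB, i.e. 4 x 800 GB SSD on an i2.4xlarge) is an assumption you should replace with the local storage of your chosen instance type.

```python
def hdfs_replication_factor(node_count):
    """Replication factor by cluster size, using the thresholds quoted in this section."""
    if node_count <= 4:
        return 1
    if node_count <= 9:
        return 2
    return 3


def estimate_data_nodes(dataset_gb, disk_per_node_gb, temp_fraction=0.3, max_nodes=1000):
    """Return the smallest node count whose usable disk holds the replicated input.

    temp_fraction is the share of total disk reserved for intermediate
    lens build results (20-30% according to this section).
    """
    for nodes in range(1, max_nodes + 1):
        usable_gb = nodes * disk_per_node_gb * (1.0 - temp_fraction)
        needed_gb = dataset_gb * hdfs_replication_factor(nodes)
        if usable_gb >= needed_gb:
            return nodes
    raise ValueError("dataset does not fit within max_nodes")


# Example: 5 TB of raw S3 input, assumed 3,200 GB of local disk per data node.
nodes_needed = estimate_data_nodes(dataset_gb=5000, disk_per_node_gb=3200)
```

Because the replication factor rises with cluster size, the estimate can jump when the node count crosses a threshold, so treat the result as a starting point for testing rather than a guarantee.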

AWS Security Settings for Platfora

Amazon Web Services (AWS) has a number of security features that you can use to protect your AWS account and cloud server instances. This section contains security setting recommendations if you plan to use Amazon Elastic MapReduce (EMR) as the Hadoop implementation for your Platfora cluster.

Amazon AWS Virtual Private Cloud (VPC)

To use Amazon EMR for Hadoop data processing, Platfora must be able to launch an EMR cluster in a public subnet. Administrators do this by provisioning an Amazon VPC with a public subnet, and then specifying the subnet identifier in Platfora. Platfora must create the EMR cluster on an Internet-facing subnet to allow the AWS EMR Provisioning Service to reach the EMR cluster.

Additionally, you must ensure the Platfora server can communicate with the Amazon EMR cluster. If the Platfora server is on the same subnet as the Amazon EMR cluster, this happens automatically. If the Platfora server and the EMR cluster are on different VPC subnets, then a route between the subnets needs to be added to the route table(s) so that communication can occur between the two subnets. Also, if the VPC uses Access Control Lists (ACLs), then those ACLs must be modified to allow traffic from Platfora to Hadoop.

The subnet identifier cannot exceed 255 characters in length.

After the Amazon VPC has been provisioned, specify its subnet identifier in the platfora.emr.subnet.id Platfora configuration property.

For more information on setting up and using an Amazon VPC with Amazon EMR, see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-vpc-subnet.html.

IAM User and IAM Roles for Platfora

AWS Identity and Access Management (IAM) allows you to create users, groups, and roles to control access to AWS services and resources. Platfora recommends creating an IAM User account and two IAM Roles specifically for use by Platfora.

Platfora uses a combination of an IAM User and IAM Roles to communicate with Amazon AWS and to create an EMR cluster. An Amazon AWS administrator needs to create a platfora IAM User and two IAM Roles specifically for use by Platfora. Then a Platfora system administrator needs to enter some information about that user and those roles in Platfora.

The Platfora server uses security credentials of the platfora IAM User to request Amazon AWS to create an Amazon EMR cluster. Once that request is approved, the platfora IAM User then passes an IAM Role to actually launch an EMR cluster, and then uses another IAM Role to start EC2 instances in the EMR cluster. You must specify these roles in Platfora.

For more details on creating the user and roles, see Create IAM User for Platfora and Create IAM Roles for Platfora.

Create IAM User for Platfora

The Amazon AWS administrator can create a new platfora user in the IAM Management Console of your AWS account. After creating the user, download the AWS credentials for this user. The Platfora system administrator needs the Access Key Id and Secret Access Key to initialize Platfora for use with Amazon EMR.

The security policy for the platfora IAM User must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "iam:ListRoles",
        "iam:PassRole",
        "elasticmapreduce:*",
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*"
      ]
    }
  ]
}

Under Permissions for this user, attach a security policy that contains the permissions listed above. These permissions allow the platfora IAM User to pass an IAM Role to launch the EMR cluster, start an EMR cluster, and access S3 for source data during data ingest.
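If you maintain several environments, it can help to generate the sample user policy from your own bucket names instead of editing the JSON by hand. The following Python sketch is one possible approach and is not part of Platfora: the function name and the example bucket names are hypothetical, and the statements simply mirror the sample policy above. Serializing with the standard json module also guarantees the output is syntactically valid JSON.

```python
import json


def build_platfora_user_policy(work_bucket, source_buckets):
    """Return the sample platfora IAM User policy as a dict for the given buckets.

    work_bucket is the bucket defined in core-site.xml; source_buckets are
    the data source buckets Platfora reads from.
    """
    source_buckets = list(source_buckets)
    bucket_arns = ["arn:aws:s3:::" + name for name in [work_bucket] + source_buckets]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["iam:ListRoles", "iam:PassRole", "elasticmapreduce:*",
                           "s3:GetBucketLocation", "s3:ListAllMyBuckets"],
                "Resource": "*",
            },
            {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": bucket_arns},
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:Get*", "s3:DeleteObject"],
                "Resource": ["arn:aws:s3:::" + work_bucket + "/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:Get*"],
                "Resource": ["arn:aws:s3:::" + name + "/*" for name in source_buckets],
            },
        ],
    }


# Hypothetical bucket names, for illustration only.
policy_json = json.dumps(
    build_platfora_user_policy("example-platfora-work", ["example-source-1"]),
    indent=2,
)
```

The resulting policy_json string can be pasted into the policy editor of the IAM Management Console. Narrow the data source resources to specific paths if your security policy requires it.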

Create IAM Roles for Platfora

Amazon requires all AWS users to use IAM Roles to launch EMR clusters. One IAM Role is used to start the Amazon EMR service, and the other role is used by the EC2 instances in the EMR cluster. Amazon AWS offers some default IAM Roles for these services. However, Platfora recommends creating custom IAM Roles specifically for use by Platfora instead.


The Amazon AWS administrator can create the IAM Roles in the IAM Management Console of your AWS account. Create a role for each of the following EMR cluster services, and specify them in Platfora using the specified configuration properties:

• Amazon EMR service (service role). In Amazon AWS, create an IAM Role and attach a security policy that contains at a minimum the permissions specified below. Enter this IAM Role name in the platfora.emr.service.role Platfora configuration property. The custom role you define corresponds to the default IAM Role Amazon offers called EMR_DefaultRole.

• EC2 instances (instance profile) in the Amazon EMR cluster. In Amazon AWS, create an IAM Role and attach a security policy that contains at a minimum the permissions specified below. Enter this IAM Role name in the platfora.emr.jobflow.role Platfora configuration property. The custom role you define corresponds to the default IAM Role Amazon offers called EMR_EC2_DefaultRole.

The security policy for the Amazon EMR service (service role) IAM Role must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:DeleteTags",
        "ec2:Describe*",
        "ec2:ModifyImageAttribute",
        "ec2:ModifyInstanceAttribute",
        "ec2:RequestSpotInstances",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "iam:PassRole",
        "iam:ListRolePolicies",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListInstanceProfiles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
    }
  ]
}

The security policy for the EC2 instances (instance profile) IAM Role must have (at a minimum) the permissions listed in the following sample policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSteps",
        "s3:ListAllMyBuckets"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*",
        "arn:aws:s3:::*elasticmapreduce/*"
      ]
    }
  ]
}

Verify that the permissions for and access to Amazon resources (especially S3) for the EC2 instances role are the same or greater than the permissions and access assigned to the platfora IAM User. For example, if the platfora IAM User can access an Amazon S3 bucket, but the EC2 instances role cannot, then lens builds that rely on that S3 bucket will fail.
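A coarse way to catch the mismatch described above is to compare the S3 resources named in the two policy documents. The Python sketch below is an illustration under stated assumptions, not a Platfora or AWS tool: the function names are hypothetical, it only compares Resource entries of Allow statements that contain an S3 action, and it ignores Deny statements, conditions, and per-action differences. Use the AWS policy simulator for an authoritative answer.

```python
from fnmatch import fnmatchcase


def s3_resources(policy):
    """Collect Resource entries from Allow statements that include an S3 action."""
    found = set()
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if not isinstance(actions, list):
            actions = [actions]
        if statement.get("Effect") != "Allow":
            continue
        if not any(action.startswith("s3:") for action in actions):
            continue
        resources = statement["Resource"]
        found.update(resources if isinstance(resources, list) else [resources])
    return found


def uncovered_s3_resources(user_policy, role_policy):
    """Return user-policy S3 resources that no role-policy resource pattern matches."""
    role_patterns = s3_resources(role_policy)
    return sorted(
        resource
        for resource in s3_resources(user_policy)
        if not any(fnmatchcase(resource, pattern) for pattern in role_patterns)
    )
```

An empty result means every S3 resource in the user policy is also named (or matched by a wildcard) in the EC2 instances role policy. Any entries returned are buckets or paths worth reviewing before running a lens build.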

For more information on using IAM Roles for EMR, see http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html.

EC2 Security Group Settings

EC2 security groups allow you to specify firewall rules for your Amazon Elastic Compute Cloud (EC2) server instances.

EC2 security group rules are independent of, and in addition to, the software firewall provided by the instance's operating system. Security groups must be defined before you create an EC2 instance.

The security group configured for the Platfora server instance must permit connections from your user network to the Platfora web application server port (8001 by default). You also may want to open the EMR Hadoop ResourceManager and JobHistory web ports so that you can monitor and troubleshoot YARN jobs executed by Platfora.

An example security group configuration for a Platfora server instance would look something like the following:
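As an illustration, the inbound rules described above can be written as data. The Python sketch below builds a list in the shape that the AWS SDK for Python (boto3) accepts as the IpPermissions argument of authorize_security_group_ingress. The function name, the CIDR block, and the extra port numbers are assumptions for the example: 8001 is the Platfora default from this section, while the ResourceManager and JobHistory web ports vary by Hadoop and EMR release, so confirm them for your cluster.

```python
def platfora_ingress_rules(user_cidr, extra_ports=None):
    """Build inbound TCP rules for a Platfora server security group.

    user_cidr is the CIDR block of your user network. extra_ports maps
    optional monitoring ports to a short description.
    """
    ports = {8001: "Platfora web application"}
    ports.update(extra_ports or {})
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": user_cidr, "Description": description}],
        }
        for port, description in ports.items()
    ]


# Hypothetical values: a documentation-only CIDR block and commonly used
# Hadoop web UI ports. Replace both with values for your environment.
rules = platfora_ingress_rules(
    "203.0.113.0/24",
    {8088: "YARN ResourceManager web UI", 19888: "YARN JobHistory web UI"},
)
```

Restricting each rule to the user network CIDR, rather than 0.0.0.0/0, keeps the web application and the monitoring pages off the public Internet.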

Chapter 5: System Requirements (GCP Cloud)

This section describes the system requirements for customers who plan to use Google Cloud Platform (GCP) as their installation environment for Platfora, and Google Cloud Storage (GCS) and Cloud Dataproc as their Hadoop distributed data storage and processing services.

Topics:

• Platfora Compute Engine Machine Requirements

• Google Dataproc Machine Requirements

• GCP Security Settings for Platfora

Platfora Compute Engine Machine Requirements

Platfora recommends the following system requirements for Google Compute Engine machines that will serve as Platfora server nodes. For multi-node installations, the master machine and all worker machines must be the same configuration (same Compute Engine machine type, storage configuration, network configuration, etc.).

Machine Boot Disk Operating System:

• Debian GNU/Linux 8 (jessie)

• Debian GNU/Linux 7 (wheezy)

• CentOS 6

• Ubuntu 14.04 LTS

• Red Hat Enterprise Linux 6

Compute Engine Machine Type:

• Small to Medium Lens Sizes: Custom: 32 vCPUs and 64 GB of Memory (RAM)

• Medium to Large Lens Sizes, 1+ Platfora nodes: n1-highmem-32

Boot Disk Drive: Recommended Size = 1 TB, Type = SSD Persistent Disk

Additional Disks: Optional. Additional disks can be attached to a Compute Engine machine after launch time, and can be used to increase lens cache storage capacity if needed. Standard Persistent Disks are less expensive than SSD Persistent Disks, and the data is persistent between shutdowns.

Zone: yes (use the same zone for all nodes in the Platfora cluster)

Google Service Account: yes (create a dedicated Service Account for Platfora in your Google Cloud Platform account)

Other Required Software:

• Java 1.7

• Python 2.7.8 through 2.7.9 (3.0 not supported)

• (master node only) PostgreSQL 9.2.5, 9.2.7, or 9.3

• OpenSSL 1.0.1 or higher [4]

Required Unix Utilities: rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

Google Dataproc Machine Requirements

Platfora launches a Google Cloud Dataproc cluster when it builds a lens. This section describes the recommended requirements for the Dataproc machines that are launched by Platfora.

Google Cloud Dataproc is Hadoop as a web service. Platfora uses the Dataproc Hadoop cluster to process its lens builds. Since the Dataproc Hadoop cluster is only instantiated as needed, the source data does not reside in the Hadoop Distributed File System (HDFS) of the Dataproc Hadoop cluster. The source data is instead stored on Google Cloud Storage (GCS). Data is copied from GCS to Dataproc for data processing, then the results are written back to GCS when the job completes.

At the start of a lens build job, the raw source data is copied from GCS to the local HDFS file system on the Dataproc nodes. The Dataproc machines must have enough local machine storage to support the input source dataset and the temporary workspace for intermediate lens build job results. Also consider that the local HDFS of the Dataproc cluster replicates the data to ensure redundancy and high availability during lens build processing.

Platfora recommends the n1-highmem-16 machine type for Dataproc data nodes and the n1-standard-4 for the Dataproc name node. The n1-highmem-16 machine type offers a great balance between total local disk space, CPU power, and per-node memory size.

Hadoop Version: 2.7.2

Dataproc Software Version: Dataproc 1.0

Dataproc NameNode Machine Type: n1-standard-4

Dataproc DataNode Machine Type: n1-highmem-16

Number of Dataproc DataNodes: The number of nodes you need to complete a lens build depends on the following factors:

• The size of the raw dataset in GCS that is the input to the lens build.

• The replication factor of HDFS. Dataproc clusters of 1-4 nodes have a replication factor of 1, 5-9 nodes have a replication factor of 2, and 10 or more nodes have a replication factor of 3.

• Temporary work space for intermediate lens build results, about 20-30% of total disk space.

The number of worker nodes in a Dataproc cluster must be two or higher.

GCP Security Settings for Platfora

Google Cloud Platform has a number of security features that you can use to protect your Google Cloud Platform account and cloud server machines. This section contains security setting recommendations if you plan to use Google Cloud Dataproc as the Hadoop implementation for your Platfora cluster.

Google Cloud Service Account for Platfora

A service account is a special Google account that can be used by applications to access Google services programmatically. To use any of the Google services (Dataproc, Storage, or BigQuery), you must create a Google service account in your Google Cloud Platform account that is used by Platfora. You will specify this service account for the Compute Engine machines used for the Platfora cluster. Platfora uses the service account when it accesses other Google services.

At a minimum, the service account must meet the following requirements:

• Read access for every Google Cloud Storage bucket that Platfora needs to access.

• Write access to the Google Cloud Storage bucket where Platfora writes lens build files.

Additionally, Google Cloud Platform creates all Dataproc clusters in the default service account. If you use Dataproc as your Hadoop environment, the default service account must have Edit permission to the Google Project. (This is required for Google Cloud Dataproc. Contact Google Support for any questions about this requirement.)

Make sure that no Google Cloud Storage bucket access control lists (ACLs) prevent the Platfora service account from accessing the Storage bucket folders it needs.

For more information on Google service accounts, see https://cloud.google.com/iam/docs/service-accounts.

Google Cloud Subnetwork for Platfora

Google Cloud Platform allows you to define a network in which all machine instances are located. You can segment the IP addresses in a GCP network into subnets, which GCP calls subnetworks.

To use any of the Google services (Dataproc, Storage, or BigQuery), you must create a Google Cloud Platform subnetwork and use that subnetwork name when configuring Platfora. You must ensure the following are true:

• All nodes of the Platfora cluster are in the same subnetwork.

• The Dataproc cluster is configured to launch in the same subnetwork as the Platfora cluster (platfora.gcp.dataproc.subnet.name configuration property).

• The firewall rules in the subnetwork allow each node of the Platfora cluster to communicate with the other Platfora nodes and the nodes in the Dataproc cluster.

For more information on Google networks, see https://cloud.google.com/compute/docs/networking#before-you-begin.


Chapter 6: Port Configuration Requirements

You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster communications. You also must open ports within your Hadoop cluster to allow access from Platfora. This section lists the default ports required.

Topics:

• Ports to Open on Platfora Nodes

• Ports to Open on Hadoop Nodes

Ports to Open on Platfora Nodes

Your Platfora master node must allow HTTP connections from your user network. All nodes must allow connections from the other Platfora nodes in a multi-node cluster.

On Amazon EC2 instances, you must configure the port firewall rules on the Platfora server instances in addition to the EC2 Security Group Settings.

Platfora Service | Default Port | Allow connections from
Master Web Services Port (HTTP) | 8001 | External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS) | 8443 | External user network, Platfora worker servers, localhost
Master Server Management Port | 8002 | Platfora worker servers, localhost
Worker Server Management Port | 8002 | Platfora master server, other Platfora worker servers, localhost
Master Data Port | 8003 | Platfora worker servers, localhost
Spark UI | 4040 | External user network (optional for troubleshooting Spark jobs)
Worker Data Port | 8003 | Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port | 5432 | Platfora worker servers, localhost
Spark Ephemeral Port Range | Depends on the OS. For CentOS and Ubuntu, it is 32768 to 61000. | All nodes in the Hadoop cluster, Dataproc cluster, or EMR cluster
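After opening the ports, you may want to confirm from a client machine or a worker node that they are reachable. The following Python sketch uses only the standard library socket module to attempt a TCP connection to each port. The function names and the host name in the example are hypothetical, and a successful connection only shows that something is listening on the port, not that the Platfora service behind it is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports, timeout=2.0):
    """Map each port number to whether it accepted a TCP connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}


# Example call (hypothetical host name), using default ports from the table above:
# check_ports("platfora-master.example.com", [8001, 8443, 8002, 8003, 5432])
```

Run the check from each network that needs access (user network, worker nodes, master node), because a port can be open to one source and blocked for another.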

Ports to Open on Hadoop Nodes

Platfora must be able to access certain services of your Hadoop cluster. This section lists the Hadoop services Platfora needs to access and the default ports for those services.

Note that this only applies to on-premise Hadoop deployments or to self-managed Hadoop deployments in a virtual private cloud, not to Google Cloud Dataproc or Amazon Elastic MapReduce (EMR).

Default Ports by Hadoop Distro

Hadoop Service | CDH, HDP, Pivotal | MapR | Allow connections from
HDFS NameNode | 8020 | N/A | Platfora master and worker servers
HDFS DataNodes | 50010 | N/A | Platfora master and worker servers
MapRFS CLDB | N/A | 7222 | Platfora master and worker servers
MapRFS DataNodes | N/A | 5660 | Platfora master and worker servers
YARN ResourceManager | 8032 | 8032 | Platfora master server
YARN ResourceManager Web UI | 8088 | 8088 | External user network (optional for troubleshooting)
YARN Job History Server | 10020 | 10020 | Platfora master server
YARN Job History Server Web UI | 19888 | 19888 | External user network (optional for troubleshooting)
YARN Application Master | Depends on mapred-site.xml [5] | Depends on mapred-site.xml [6] | Platfora master server
HiveServer Thrift Port | 9083 | 9083 | Platfora master server
Hive Metastore DB Port [7] | Depends on the database used [8] | Depends on the database used [9] |
Spark Server | ephemeral port range | ephemeral port range |

To limit the ephemeral port range, see your Linux operating system documentation about changing the net.ipv4.ip_local_port_range OS setting.
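Before writing firewall rules for the ephemeral range, it helps to confirm what range a node actually uses. On Linux, the net.ipv4.ip_local_port_range setting is exposed as the file /proc/sys/net/ipv4/ip_local_port_range, which contains the low and high port separated by whitespace. The Python sketch below reads and validates that value. The function names are hypothetical, and the sketch only reads the setting; it does not change it.

```python
def parse_port_range(text):
    """Parse 'low high' (whitespace separated) into a validated (low, high) tuple."""
    parts = text.split()
    if len(parts) != 2:
        raise ValueError("expected two port numbers, got: %r" % text)
    low, high = int(parts[0]), int(parts[1])
    if not 0 < low <= high <= 65535:
        raise ValueError("invalid port range: %d-%d" % (low, high))
    return low, high


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the ephemeral port range currently configured on this Linux host."""
    with open(path) as handle:
        return parse_port_range(handle.read())


# The CentOS and Ubuntu default quoted in this chapter parses as follows.
default_range = parse_port_range("32768\t61000\n")
```

If the range on your nodes differs from the default quoted in this chapter, open the range reported by the node rather than the documented default.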

[5] See yarn.app.mapreduce.am.job.client.port-range property in mapred-site.xml
[6] See yarn.app.mapreduce.am.job.client.port-range property in mapred-site.xml
[7] If connecting to Hive directly using JDBC
[8] For example, MySQL is 3306, and Postgres is 7432.
[9] For example, MySQL is 3306, and Postgres is 7432.

Chapter 7: Browser Requirements

Users can connect to the Platfora web application using the latest HTML5-compliant web browsers. Platfora supports the following web browser versions:

Web Browsers

Web Browser | Supported Versions
Chrome | Latest version (Evergreen) and three previous releases
Firefox | 25.0.x or higher
Safari | 6.1+ and 7.x
Internet Explorer with the Compatibility View feature disabled | IE 11 (Windows 7, Windows 8, Windows 10); IE 10 (Windows 7 and Windows 8)

Platfora supports these web browsers on desktop machines only.

Platfora recommends using a screen resolution width of 1400 pixels or greater for viewing some pages in the Platfora web application.

Appendix A: Hardware Specifications for Platfora Nodes

This section shows some example hardware configurations that have worked well in other Platfora deployments.

To achieve the best performance and lowest operating cost, Platfora recommends that all servers in the Platfora cluster have the same configuration. At a minimum, all servers in the Platfora cluster should have an identical RAM capacity and the same number of CPU cores.

Platfora software can be deployed on either rack or blade servers. Typical Platfora server configurations have specifications similar to:

Rack Server Specs | Blade Server Specs
CPU: 2x E5-2440 2.40GHz 6-cores | CPU: 2x E5-2470 2.30GHz 8-cores
RAM: 12x 16GB RAM (192GB total) | RAM: 12x 16GB RAM (192GB total)
Disk: 8x 300GB 10K SAS 2.5” HDDs | Disk: 2x 900GB 10K SATA 2.5” HDDs

Network: 1x Gbps NIC

Appendix B: EC2 Considerations for Platfora Instances

This section explains what to consider when using Amazon Elastic Compute Cloud (EC2) instances to deploy a production Platfora cluster.

EC2 Storage Considerations

When you launch an Amazon EC2 instance, you have several choices with regards to the storage that you can attach to the instance. There are two main types of storage available: Elastic Block Store (EBS) and Instance Store (Ephemeral). The type and capacity of storage available depends on the instance type you choose.

• The Root Device Volume - All instances have a root device volume, which is backed by either EBS or Instance storage. Platfora recommends EBS-backed instance types; they launch faster and use persistent storage.

Root device volumes for Platfora nodes should always be increased to the maximum size (1 TB). This ensures adequate space for the Platfora installation and logs. When using the Platfora recommended 8xlarge instance types, general purpose (SSD) EBS volumes also guarantee 3,000 IOPS.

• EBS Volumes - Amazon EBS volumes are highly available and reliable storage volumes that can be attached to any running instance that is in the same Availability Zone. Amazon EBS volumes that are attached to an Amazon EC2 instance are exposed as storage volumes that persist independently from the life of the instance. Also with Amazon EBS, you only pay for what you use, making it a cost-effective choice.

Platfora recommends General Purpose (SSD) EBS volumes. For maximum performance, you can choose Provisioned IOPS EBS volumes instead.

If you choose an instance type that is not EBS optimized by default, make sure to choose EBS Optimized Instance at launch time. This ensures that the instance has a dedicated connection to the EBS volume, which reduces overall latency and maximizes throughput. The Platfora recommended 8xlarge instance types are already EBS optimized instances.

• Instance Store Volumes - Ephemeral storage is ideal for temporary storage of information that changes frequently, such as caches, or for data that is replicated across multiple instances. Instances that use EBS for the root device do not, by default, have instance store volumes available at boot time. Also, you can't attach instance store volumes after you've launched an instance. Therefore, if you want your Amazon EBS-backed instance to use instance store volumes, you must specify them when you first launch your instance.

The choice to add instance store volumes to Platfora nodes depends on price, performance, and persistence of the data. Ephemeral storage allows data to be read faster from disk, but is also more expensive. Also, the data stored on these volumes is not persistent: it is lost if the instance is shut down or terminated.

If you do decide to use ephemeral drives for the Platfora cache directories, use RAID 0 (Stripe). This ensures Platfora has access to the maximum possible disk space and also yields the highest performance. Remember, ephemeral drives are temporary storage, so there is no need to use RAID 1. When the instance is stopped, the data is not saved.

In Platfora, the PLATFORA_DATA/dfscache and PLATFORA_DATA/fsCache directories can be mapped to instance store volumes (if you decide to use them). These are the only directories of a Platfora installation that should use ephemeral storage. Lens data is backed up in S3, so the loss of any cached data is temporary.

EC2 Network Considerations

• Placement Groups - All Platfora server instances should be launched within the same Amazon EC2 Placement Group. A placement group is a logical grouping of instances within a single Availability Zone. Using placement groups gives applications low-latency, 10 Gbps network connectivity. Placement groups are recommended for applications that benefit from low network latency, high network throughput, or both. See the Amazon EC2 Documentation on Placement Groups (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html).

• Enhanced Networking - To enable enhanced networking, you must launch each instance in the same Amazon EC2 virtual private cloud (VPC). You can't enable enhanced networking if the instance is in EC2-Classic. For more information, see the Amazon VPC User Guide (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/) and the Amazon EC2 Documentation on Enhanced Networking (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking.html).
