
Running Applications on BlueData EPIC

VERSION 2.1


Notice

BlueData Software, Inc. believes that the information in this publication is accurate as of its publication date. However, the information is subject to change without notice. THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” BLUEDATA SOFTWARE, INC. MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, or distribution of any BlueData software described in this publication requires an applicable software license.

For the most up-to-date regulatory document for your product line, please refer to your specific agreements or contact BlueData Technical Support at [email protected].

The information in this document is subject to change. This manual is believed to be complete and accurate at the time of publication and no responsibility is assumed for any errors that may appear. In no event shall BlueData Software, Inc. be liable for incidental or consequential damages in connection with or arising from the use of this manual and its accompanying related materials.

Copyrights and Trademarks

Published February, 2016. Printed in the United States of America.

Copyright 2016 by BlueData Software, Inc. All rights reserved. This book or parts thereof may not be reproduced in any form without the written permission of the publishers.

EPIC, EPIC Lite, and BlueData are trademarks of BlueData Software, Inc. All other trademarks are the property of their respective owners.

Contact Information

BlueData Software, Inc.
3979 Freedom Circle, Suite 850
Santa Clara, California 95054
Email: [email protected]
Website: www.bluedata.com

Table of Contents

1 - Preface .................................................. 1
  1.1 - About This Manual .................................... 3
    1.1.1 - Formatting Conventions ........................... 3
    1.1.2 - Organization ..................................... 4
  1.2 - Additional Information ............................... 5
    1.2.1 - Related Documentation ............................ 5
    1.2.2 - Contact Information .............................. 5
    1.2.3 - Support .......................................... 5
    1.2.4 - End User License Agreement ....................... 5
2 - Setting up EPIC .......................................... 7
  2.1 - Downloading the Samples .............................. 9
  2.2 - Creating the Persistent Clusters .................... 10
    2.2.1 - Hadoop Cluster .................................. 10
    2.2.2 - Spark Cluster ................................... 11
  2.3 - Creating Directories and Uploading Data ............. 13
3 - Running Applications .................................... 15
  3.1 - Hadoop Custom Jar ................................... 17
  3.2 - Hadoop Streaming .................................... 20
  3.3 - Pig Script .......................................... 22
  3.4 - Hive Script ......................................... 24
  3.5 - Impala Script ....................................... 26
  3.6 - Spark - Scala Jar ................................... 29
    3.6.1 - About the Spark Context ......................... 29
    3.6.2 - Running a Spark Scala Jar Job ................... 29
  3.7 - Spark - Java Jar .................................... 31
  3.8 - Spark - Python Script ............................... 33
  3.9 - Spark - Zeppelin .................................... 35
  3.10 - SparkR Jobs ........................................ 37
  3.11 - Spark Streaming Jobs ............................... 38
  3.12 - The <Job> Output Popup ............................. 39


1 - Preface


Welcome! This manual introduces you to using EPIC or EPIC Lite by BlueData Software, Inc. by guiding you through some examples of:

• Downloading the sample scripts and data file from BlueData Software, Inc.
• Uploading the sample data that will be used to run each of the different job types available in EPIC and EPIC Lite
• Creating two clusters (one each for Hadoop and Spark)
• Running one each of the various job types supported by EPIC and EPIC Lite on the persistent clusters you created
• Viewing the output.

Most of these jobs will count how many times each word in the sample text file appears in that file. For instance, if the word “EPIC” appears 50 times, you will see a listing that looks something like “EPIC 50.” Other scripts will count the number of lines of text that include the letters “a” and “b” and display a count for each letter. The Impala script will create data tables, populate those tables, execute queries on those tables, and then return the results.

This Guide contains the information you need to set up EPIC or EPIC Lite for the examples contained herein, but does not describe the interface in any detail. Please see the About EPIC Guide for detailed information about EPIC, including:

• Key features and benefits
• Hadoop and application support
• Definitions
• Architecture
• Storage
• Users
• System requirements

Note: The instructions in this Guide apply to version 2.1 of both EPIC (full) and EPIC Lite (evaluation).


1.1 - About This Manual

This section describes the formatting conventions and information contained in this manual.

1.1.1 - Formatting Conventions

This manual uses several formatting conventions to present information of special importance.

Lists of items, points to consider, or procedures that do not need to be performed in a specific order appear in bullet format:

• Item 1

• Item 2

Procedures that must be followed in a specific order appear in numbered steps:

1. Perform this step first.

2. Perform this step second.

Specific keyboard keys are depicted in square brackets and are capitalized, for example: [ESC]. If more than one key should be pressed simultaneously, the notation will appear as [KEY1]+[KEY2], for example [ALT]+[F4].

Interface elements such as document titles, fields, windows, tabs, buttons, commands, options, and icons appear in bold text.

Specific commands appear in standard Courier font. Sequences of commands appear in the order in which you should execute them and include horizontal or vertical spaces between commands. The following additional formatting also applies when discussing command-line commands:

Plain-text responses from the system appear in bold Courier font.

This manual also contains important safety information and instructions in specially formatted callouts with accompanying graphic symbols. The Note and Caution icons are blue in the main chapters and gray in the appendices. These callouts and their symbols appear as follows throughout the manual:

CAUTION: CAUTIONS ALERT YOU TO THE POSSIBILITY OF A SERIOUS ERROR, DATA LOSS, OR OTHER ADVERSE CONDITION.

Note: Notes provide helpful information.


1.1.2 - Organization

This manual contains the following chapters:

• 1 - Preface: Describes how this manual is formatted and organized.
• 2 - Setting up EPIC: Guides you through setting up the EPIC environment to run the sample applications that are available for download from BlueData Software, Inc.
• 3 - Running Applications: Provides step-by-step tutorial examples of running applications within EPIC.


1.2 - Additional Information

This section lists related documentation and provides information on contacting BlueData Software, Inc.

1.2.1 - Related Documentation

Please refer to the following documents for additional information:

• About EPIC Guide: This guide explains the EPIC architecture, features, and benefits. It also contains the End User License Agreement.
• EPIC Lite Installation Guide: This guide helps you install EPIC Lite, the free demonstration version of EPIC, on a single host.
• EPIC Installation Guide: This guide contains instructions for installing the full version of EPIC on your network.
• User/Administrator Guide: This guide describes the EPIC interface for Site Administrator, Tenant Administrator, and Member users.
• Deployment Guide: Certain platforms have additional requirements and/or procedures for installing and running EPIC.
• App Store Image Authoring Guide: Describes how Site Administrators can author new images and make them available in their local instance of the EPIC App Store.

1.2.2 - Contact Information

You may contact BlueData Software, Inc. at the following address:

BlueData Software, Inc.
3979 Freedom Circle, Suite 850
Santa Clara, California 95054
Email: [email protected]
Website: www.bluedata.com

1.2.3 - Support

Please see the Installation Guide for information on obtaining technical support from BlueData Software, Inc.

1.2.4 - End User License Agreement

Your use of EPIC is subject to the terms and conditions described in the End User License Agreement (EULA).


2 - Setting up EPIC


This chapter shows you how to set up the EPIC environment to run the sample applications that will be presented in “Running Applications” on page 15.

Setting up the EPIC environment consists of the following steps:

1. Download the sample script and sample data files from http://support.bluedata.com to your local machine. See “Downloading the Samples” on page 9.
2. Log into EPIC and create the two persistent clusters (one each for Hadoop and Spark) that you will use when running the examples in “Running Applications” on page 15.
3. Create directories within the tenant DataTap to store your job input data, and then upload the sample text file to the input directory. See “Creating Directories and Uploading Data” on page 13.
4. Run the jobs described in “Running Applications” on page 15.

Note: These instructions assume that you have just installed EPIC on your system; however, these instructions and samples will also work with your existing tenants, clusters, file systems, etc.


2.1 - Downloading the Samples

To download the sample script and data files:

1. Visit the Support page at http://support.bluedata.com.

2. Download the file appsamples.zip to your local device.

3. Extract the .zip file.

The contents of the appsamples.zip file are as follows:

- The sample_text.txt file is a plain text file that you will use as input data for all of the jobs you will be running, except Impala.
- The hadoop-custom-jar directory contains the cdh-examples executable .jar file that you will use for the Hadoop Custom Jar job described in “Hadoop Custom Jar” on page 17.
- The hadoop-streaming directory contains the mapper.py and reducer.py scripts that you will use for the Hadoop Streaming job described in “Hadoop Streaming” on page 20.
- The pig-script directory contains the wordcount.pig script that you will use for the Pig Script job described in “Pig Script” on page 22.
- The hive-script directory contains the wordcount.hql script that you will use for the Hive Script job described in “Hive Script” on page 24.
- The impala-script directory contains the impala.script file that you will use for the Impala Script job described in “Impala Script” on page 26.
- The spark/spark-scala and spark/spark-java directories contain the .jar files required to run the sample Spark Scala and Spark Java applications. See “Spark - Scala Jar” on page 29 and “Spark - Java Jar” on page 31.
- The spark/spark-python directory contains the script to run the sample Spark Python application. See “Spark - Python Script” on page 33.
- The spark/spark-zeppelin directory contains a sample file for use with the Spark Zeppelin application. See “Spark - Zeppelin” on page 35.

Note: Do not rename or move the contents of directories that have multiple files, as each of them is required for the job to run successfully. You will specify only the executable file in each such directory, and EPIC will handle the rest.


2.2 - Creating the Persistent Clusters

You will need to log in to EPIC and create two persistent clusters in order to run the jobs described in “Running Applications” on page 15.

• If you are logging in as the default Site Administrator user, click the User button in the Toolbar and select any tenant in the list that has a green or yellow icon next to it. (You cannot create jobs in the Site Admin tenant.)
• If you are logging in as a Tenant Administrator or Tenant Member, you may click the User button in the Toolbar to switch to any tenant you have access to. You will not be able to see or access the Site Admin tenant.

2.2.1 - Hadoop Cluster

To create the Hadoop cluster:

1. Ensure that the CDH 5.4.3 with Cloudera Manager image is installed on the EPIC platform using the App Store. See the User/Admin Guide for instructions on adding App Store images.
2. Click Clusters in the Main Menu to open the Cluster Management screen.
3. At the top of the Cluster Management screen, click the blue Create button to open the Create New Cluster screen.
4. Enter the following information:
   - Provide a descriptive name for the cluster in the Cluster Name field. This manual will use the name SampleHadoopCluster.
   - Select Hadoop using the Select Cluster Type pull-down menu, if it is not selected already.
   - At the bottom of the screen, check the Pig, Hive, Oozie, Impala & Hue checkbox.
   - Leave the rest of the fields as-is; you do not need to modify them to run the sample applications.
5. Review your selections.


6. Click the blue Submit button at the bottom of the screen to finish creating the cluster and return to the Cluster Management screen. You will see the newly created cluster with a blue Starting bar in the Status column. The bar will turn green and say Ready once the cluster is ready for use.

2.2.2 - Spark Cluster

To create the Spark cluster:

1. Ensure that the Spark 1.5.2 image is installed on the EPIC platform using the App Store. See the User/Admin Guide for instructions on adding App Store images.
2. Click Clusters in the Main Menu to open the Cluster Management screen.
3. At the top of the Cluster Management screen, click the blue Create button to open the Create New Cluster screen.
4. Enter the following information:
   - Provide a descriptive name for the cluster in the Cluster Name field. This manual will use the name SampleSparkCluster.
   - Select Spark using the Select Cluster Type pull-down menu, if it is not selected already.
   - Leave the rest of the fields as-is; you do not need to modify them to run the sample applications.
5. Review your selections.


6. Click the blue Submit button at the bottom of the screen to finish creating the cluster and return to the Cluster Management screen. You will see the newly created cluster with a blue Starting bar in the Status column. The bar will turn green and say Ready once the cluster is ready for use.

The SPARK_HOME/bin directory contains all of the binaries required to run Spark jobs (e.g., sparkR, spark-shell, spark-submit, etc.). The location of the SPARK_HOME directory depends on the version of Spark being used:

• Spark 1.3: /usr/lib/spark/$spark_version (only available if the EPIC platform was upgraded from 2.0 to 2.1)
• Spark 1.4: /usr/lib/spark/spark-1.4.0-bin-hadoop2.4
• Spark 1.5: /usr/lib/spark/spark-1.5.2-bin-hadoop2.4


2.3 - Creating Directories and Uploading Data

After creating the persistent clusters, the next step is to create directories to hold your sample data (input) that will be used for running the sample scripts in “Running Applications” on page 15. You will do this using the tenant DataTap.

The following example assumes that you are using the default TenantStorage DataTap that was automatically created by EPIC when you created the tenant; however, these steps will work for any DataTap that uses local system storage.

To do this:

1. Click DataTaps in the Main Menu to open the DataTaps screen.
2. In the table on the screen, click the name of the DataTap that you are going to use in the Name column to open the <DataTap> Browser screen, where <DataTap> is the name of the DataTap you are using. In this example, this will be the TenantStorage Browser screen.
3. At the top of the screen, click the blue Create directory button (plus sign) to open the Create new directory under screen. This example assumes that you are creating directories under the root directory of the DataTap; however, you may create this directory anywhere you like.
4. Enter a name for the new directory (such as DataInput) and then click OK. This will be the directory that holds the data being processed by jobs.
5. Click the data input directory you created in Steps 3 and 4 to select it.

Note: If you wish to use a DataTap that points to remote storage, then the EPIC interface will not provide operations to create directories or upload/download files for that storage; in this case, you must perform these operations outside of EPIC.

Note: If the sample file sample_text.txt is placed in a DataTap other than the default TenantStorage DataTap, then you will need to edit the Pig script described in “Pig Script” on page 22.


6. Click the gray File Upload button (up arrow) at the top of the screen to open a standard File Upload popup.
7. Navigate to the directory containing the sample_text.txt file and upload it. The Upload Status popup appears with a progress bar showing the upload progress. This bar turns green and the word Completed appears when the upload is complete.
8. Click OK to close the popup. The <DataTap> Browser screen appears, showing the sample_text.txt file in your data input directory.

Note: Do not create a results (output) directory; this will occur when creating the jobs. Hadoop jobs will usually return an error if the output directory exists before the job is run.


3 - Running Applications


This chapter shows you how to run one of each of the job types that EPIC supports. You will be using the persistent clusters that you created in “Creating the Persistent Clusters” on page 10.

The jobs you will be running are:

• Hadoop Custom Jar: See “Hadoop Custom Jar” on page 17.

• Hadoop Streaming: See “Hadoop Streaming” on page 20.

• Pig Script: See “Pig Script” on page 22.

• Hive Script: See “Hive Script” on page 24.

• Impala Script: See “Impala Script” on page 26.

• Spark - Scala Jar: See “Spark - Scala Jar” on page 29.

• Spark - Java Jar: See “Spark - Java Jar” on page 31.

• Spark - Python Script: See “Spark - Python Script” on page 33.

• Spark - Zeppelin: See “Spark - Zeppelin” on page 35.

Note: These examples assume that you created clusters and tenants with the same names as those provided in “Setting up EPIC” on page 7; however, they will also work if you set up differently-named clusters and directories, so long as you replace the example names shown with the actual names you created.


3.1 - Hadoop Custom Jar

This script will count the number of times each word appears in the sample text file. To create a new Hadoop Custom Jar job:

1. Click Jobs in the Main Menu to open the Job Management screen.
2. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
3. Enter a descriptive name for the job (such as HadoopCustomJar) in the Job Name field.
4. Use the Job Type pull-down menu to select Hadoop Custom Jar, if it is not selected already.
5. Click the Choose button in the Jar File field to open a standard File Upload popup.
6. Navigate to the hadoop-custom-jar folder (see “Downloading the Samples” on page 9) and then select the file cdh-examples.jar to upload.
7. Enter wordcount in the App Name field. This is case sensitive; do not enter WordCount, WORDCOUNT, or any other variation.
8. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleHadoopCluster) is selected in the pull-down menu.
9. Click the Insert button to the right of the Edit Arguments field to open the DataTap Browser popup.
10. Navigate to the data input folder that you created in “Creating Directories and Uploading Data” on page 13 and then double-click the sample_text.txt entry.

Note: By default, the Create New Job screen appears with the Transient radio button selected under Cluster Type, which displays some different fields than those described here. You will change this setting to Persistent as you create the job.


The Edit Arguments field will display dtap://TenantStorage/DataInput/sample_text.txt (or equivalent, if you are using a tenant and/or directories with different names).

11. Click the X at the top right of the DataTap Browser window to close it.
12. In the Edit Arguments field, click just after the text that appeared in Step 10 and then type a space followed by dtap://TenantStorage/DataOutput_1.
13. Review your entries.

14. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column. The bar will turn green and say Completed once the job is completed.

Note: You may manually type these entries into the Edit Arguments field along with any other argument(s) that you may need to run a job, but this is not necessary to run this sample job.


At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

15. In the Main Menu, click DataTaps to return to the DataTaps screen.
16. In the table on the screen, click the name of the DataTap that you used in “Creating Directories and Uploading Data” on page 13 in the Name column to open the <DataTap> Browser screen, where <DataTap> is the name of the DataTap you are using. In this example, this will be the TenantStorage Browser screen.
17. The <DataTap> Browser screen appears, with the DataOutput_1 folder now showing. Clicking the + sign to the left of this folder expands the folder listing. Your job results are in the part-r-00000 file.
18. Select the part-r-00000 file by clicking it, and then click the purple Download File button (down arrow) to open a standard File Download popup. Save the file to your local device.
19. Open the downloaded file in a text editor to see a complete list of the words in the sample_text.txt file and how many times each word appears therein.

This concludes the Hadoop Custom Jar example.


3.2 - Hadoop Streaming

This script will count the number of times each word appears in the sample text file. To create a new Hadoop Streaming job:

1. Click Jobs in the Main Menu to open the Job Management screen.
2. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
3. Enter a descriptive name for the job (such as HadoopStreaming) in the Job Name field.
4. Use the Job Type pull-down menu to select Hadoop Streaming, if it is not selected already.
5. Click the Choose button in the Mapper Script field to open a standard File Upload popup.
6. Navigate to the hadoop-streaming folder (see “Downloading the Samples” on page 9) and then select the file mapper.py to upload.
7. Click the Choose button in the Reducer Script field to open a standard File Upload popup.
8. Navigate to the hadoop-streaming folder (see “Downloading the Samples” on page 9) and then select the file reducer.py to upload.
9. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleHadoopCluster) is selected in the pull-down menu.
10. Click the Insert button to the right of the Input Path field to open the DataTap Browser popup.
11. Navigate to the data input folder that you created in “Creating Directories and Uploading Data” on page 13 and then double-click the sample_text.txt entry.
12. In the Output Path field, type dtap://TenantStorage/DataOutput_2.
13. Review your entries.


14. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column. The bar will turn green and say Completed once the job is completed.

At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

15. In the Main Menu, click DataTaps to return to the DataTaps screen.
16. In the table on the screen, click the name of the DataTap that you used in “Creating Directories and Uploading Data” on page 13 in the Name column to open the <DataTap> Browser screen, where <DataTap> is the name of the DataTap you are using. In this example, this will be the TenantStorage Browser screen.
17. The <DataTap> Browser screen appears, with the DataOutput_2 folder now showing. Clicking the + sign to the left of this folder expands the folder listing. Your job results are in the part-00000 file.
18. Select the part-00000 file by clicking it, and then click the purple Download File button (down arrow) to open a standard File Download popup. Save the file to your local device.
19. Open the downloaded file in a text editor to see a complete list of the words in the sample_text.txt file and how many times each word appears therein. This will look identical to the results created in “Hadoop Custom Jar” on page 17.

This concludes the Hadoop Streaming example.
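For reference, the following is a minimal sketch of a Hadoop Streaming mapper/reducer pair of the kind used in this example. It is illustrative only; the actual mapper.py and reducer.py shipped in appsamples.zip may differ in detail. Hadoop Streaming pipes the input file to the mapper on stdin, sorts the mapper output by key, and pipes the sorted stream to the reducer.

mapper.py:

#!/usr/bin/env python
# Emit "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))

reducer.py:

#!/usr/bin/env python
# Input arrives sorted by word, so all counts for a word are adjacent.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))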


3.3 - Pig Script

This script will count the number of times each word appears in the sample text file. You may navigate to the pig-script folder and view the wordcount.pig script in a text editor. In this script, the line beginning with A = loads the sample text file, and the next three lines count how many times each word appears in the sample. To create a new Pig Script job:

1. On your local device, navigate to the pig-script folder (see “Downloading the Samples” on page 9) and then open the file wordcount.pig in a text editor.
2. Click Jobs in the Main Menu to open the Job Management screen.
3. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
4. Enter a descriptive name for the job (such as PigScriptJob) in the Job Name field.
5. Use the Job Type pull-down menu to select Pig Script, if it is not selected already.
6. Click the Choose button in the Script Path field to open a standard File Upload popup.
7. Navigate to the pig-script folder (see “Downloading the Samples” on page 9) and then select the file wordcount.pig to upload.
8. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleHadoopCluster) is selected in the pull-down menu.
9. Review your entries.

Note: If the sample file sample_text.txt is placed in a DataTap other than the default TenantStorage DataTap, then you will need to edit the Pig script being used for this example.


10. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column. The bar will turn green and say Completed once the job is completed.

At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

11. In the Main Menu, click DataTaps to return to the DataTaps screen.
12. In the table on the screen, click the name of the DataTap that you used in “Creating Directories and Uploading Data” on page 13 in the Name column to open the <DataTap> Browser screen, where <DataTap> is the name of the DataTap you are using. In this example, this will be the TenantStorage Browser screen.
13. The <DataTap> Browser screen appears, with the DataOutput_3 folder now showing. Clicking the + sign to the left of this folder expands the folder listing. Your job results are in the part-r-00000 file.
14. Select the part-r-00000 file by clicking it, and then click the purple Download File button (down arrow) to open a standard File Download popup. Save the file to your local device.
15. Open the downloaded file in a text editor to see a complete list of the words in the sample_text.txt file and how many times each word appears therein.

This concludes the Pig Script example.


3.4 - Hive Script

This script will count the number of times each word appears in the sample text file. You may navigate to the hive-script folder and view the wordcount.hql script in a text editor. This script creates a table that consists of lines of strings (words, in this case), loads the sample text file, counts the number of times each word appears in the sample, and then outputs the results to the specified folder (/user/hive/DataOutput). To create a new Hive Script job:

1. On your local device, navigate to the hive-script folder (see “Downloading the Samples” on page 9) and then open the file wordcount.hql in a text editor.
2. Click Jobs in the Main Menu to open the Job Management screen.
3. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
4. Enter a descriptive name for the job (such as HiveScriptJob) in the Job Name field.
5. Use the Job Type pull-down menu to select Hive Script, if it is not selected already.
6. Click the Choose button in the Script Path field to open a standard File Upload popup.
7. Navigate to the hive-script folder (see “Downloading the Samples” on page 9) and then select the file wordcount.hql to upload.
8. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleHadoopCluster) is selected in the pull-down menu.
9. Review your entries.


10. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column. The bar will turn green and say Completed once the job is completed.

At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

11. In the Main Menu, click Clusters to return to the Cluster Management screen.
12. In the Cluster Details column of the table on the screen, click the Cluster FS link that corresponds to the SampleHadoopCluster cluster to open the <Cluster> Cluster FS Browser screen, where <Cluster> is the name of the persistent cluster you are using. In this example, this will be the SampleHadoopCluster Cluster FS Browser screen.
13. In the folder listing, click the + sign next to the / folder and then click the hive sub-folder, followed by the DataOutput sub-folder.
14. Download the file 000000_0 to your local machine.
15. Open the downloaded file in a text editor to see a complete list of the words in the sample_text.txt file and how many times each word appears therein.

This concludes the Hive Script example.


3.5 - Impala Script

The sample Impala script does not count words, and you need not specify any input or output directories, as these are already coded into the script. Impala essentially functions as a database query engine; it creates tables, populates them with data, and then runs queries against that data to produce the desired results. To create a new Impala Script job:

1. Click Jobs in the Main Menu to open the Job Management screen.
2. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
3. Enter a descriptive name for the job (such as ImpalaScriptJob) in the Job Name field.
4. Use the Job Type pull-down menu to select Impala Script, if it is not selected already.
5. Click the Choose button in the Script Path field to open a standard File Upload popup.
6. Navigate to the impala-script folder (see “Downloading the Samples” on page 9) and then select the file impala.script to upload.
7. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleHadoopCluster) is selected in the pull-down menu.
8. Review your entries.

9. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column.


The bar will turn green and say Completed once the job is completed.

At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

10. Click the orange Job Output button (down arrow) in the Action column. The output appears in a new browser tab. Scroll to the bottom to see the finished results.

This concludes the Impala Script example; however, if you are interested, you may see the tables that the script created and queried by doing the following:

1. Click Clusters to open the Cluster Management screen, and then click the name of the cluster that you just used to run the Impala Script job (such as SampleHadoopCluster). The <Cluster> screen appears, where <Cluster> is the name of the cluster.
2. In the Process List column, click Hue Console to open the Hue Login screen in a new browser tab.
3. When logging in to Hue for the first time on a virtual cluster, you will be prompted to create a new user name/password that you will use for subsequent access. Please refer to your Hue documentation for additional information.
4. In the blue toolbar at the top of the page, click the Metastore Manager button.


5. The Database default screen appears. The tables created and used by the Impala script are in the Table Name section in the lower right section of the screen. You may open these tables for viewing within Hue. Please refer to your Hue documentation for instructions if needed.


3.6 - Spark - Scala Jar

This section describes how Spark Scala Jar jobs create a Spark context and then guides you through running a sample job. In this example, the script counts the number of lines in the sample text file that contain the letters “a” and “b”.

3.6.1 - About the Spark Context

Spark applications typically need to construct a SparkContext as a first step. The recommended way of constructing this in a Spark Scala Jar application running on EPIC is as follows:

val conf = new SparkConf().setAppName(<appName>)
val sc = new SparkContext(conf)

In the above example, <appName> is the name of your application that will show up in the Spark master UI. EPIC automatically specifies the values for the Spark master host, Spark home, and the required .jar files from the cluster environment.
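The same pattern applies to Spark applications written in other languages. For example, a minimal sketch of the equivalent setup in a Python Spark application might look like this (assuming EPIC supplies the master URL and Spark home for Python jobs as it does for Scala jobs):

from pyspark import SparkConf, SparkContext

# Set only the application name; the master URL and Spark home
# are assumed to come from the cluster environment.
conf = SparkConf().setAppName("SimpleApp")
sc = SparkContext(conf=conf)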

3.6.2 - Running a Spark Scala Jar Job

To create a new Spark Scala Jar job:

1. Click Jobs in the Main Menu to open the Job Management screen.
2. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
3. Enter a descriptive name for the job (such as SparkScalaJar) in the Job Name field.
4. Use the Job Type pull-down menu to select Spark - Scala Jar, if it is not selected already.
5. Click the Choose button in the Script Path field to open a standard File Upload popup.
6. Navigate to the spark/spark-scala folder (see “Downloading the Samples” on page 9) and then upload the .jar file.
7. In the App Name field, enter SimpleApp (case sensitive).
8. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleSparkCluster) is selected in the pull-down menu.
9. Click the Insert button to the right of the Edit Arguments field to open the DataTap Browser popup.
10. Navigate to the data input folder that you created in “Creating Directories and Uploading Data” on page 13 and then double-click the sample_text.txt entry.
11. In the Edit Arguments field, type a space followed by dtap://TenantStorage/DataOutput_4 after the input path inserted in Step 10.


12. Review your entries.

13. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column. The bar will turn green and say Completed once the job is completed.

At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

14. Click the orange Job Output button (down arrow) in the Action column. The output appears in a new browser tab. Scroll to the bottom to see the finished results, which will be the number of lines with the letters “a” and “b” in them.

This concludes the Spark Scala Jar example.

Note: You can run Spark ML jobs just like any other Spark Scala jobs, via either the EPIC interface as described above or the command line.


3.7 - Spark - Java Jar

This sample script counts the number of lines in the sample text file that contain the letters “a” and “b”. To create a new Spark Java Jar job:

1. Click Jobs in the Main Menu to open the Job Management screen.
2. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
3. Enter a descriptive name for the job (such as SparkJavaJar) in the Job Name field.
4. Use the Job Type pull-down menu to select Spark - Java Jar, if it is not selected already.
5. Click the Choose button in the Script Path field to open a standard File Upload popup.
6. Navigate to the spark/spark-java folder (see “Downloading the Samples” on page 9) and then upload the .jar file.
7. In the App Name field, enter SimpleApp (case sensitive).
8. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleSparkCluster) is selected in the pull-down menu.
9. Click the Insert button to the right of the Edit Arguments field to open the DataTap Browser popup.
10. Navigate to the data input folder that you created in “Creating Directories and Uploading Data” on page 13 and then double-click the sample_text.txt entry.
11. In the Edit Arguments field, type a space followed by dtap://TenantStorage/DataOutput_5 after the input path inserted in Step 10.
12. Review your entries.


13. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column. The bar will turn green and say Completed once the job is completed.

At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

14. Click the orange Job Output button (down arrow) in the Action column. The output appears in a new browser tab. Scroll to the bottom to see the finished results, which will be the number of lines with the letters “a” and “b” in them.

This concludes the Spark Java Jar example.


3.8 - Spark - Python Script

In this example, the script counts the number of lines in the sample text file that contain the letters “a” and “b”. To create a new Spark Python Script job:

1. Click Jobs in the Main Menu to open the Job Management screen.
2. At the top of the Job Management screen, click the blue Create button to open the Create New Job screen.
3. Enter a descriptive name for the job (such as SparkPythonScript) in the Job Name field.
4. Use the Job Type pull-down menu to select Spark - Python Script, if it is not selected already.
5. Click the Choose button in the Script File field to open a standard File Upload popup.
6. Navigate to the spark/spark-python folder (see “Downloading the Samples” on page 9) and then upload the Python script.
7. Select the Persistent radio button next to Cluster Type and make sure that the correct cluster (SampleSparkCluster) is selected in the pull-down menu.
8. Click the Insert button to the right of the Input Path field to open the DataTap Browser popup.
9. Navigate to the data input folder that you created in “Creating Directories and Uploading Data” on page 13 and then double-click the sample_text.txt entry.
10. In the Edit Arguments field, type dtap://TenantStorage/DataOutput_6.
11. Review your entries.


12. Click the blue Submit button to run the job and return to the Job Management screen. You will see the newly created job with a blue Running bar in the Job Status column. The bar will turn green and say Completed once the job is completed.

At any time during or after the job run, you may click the name of the job to open the <Job> Output popup, which displays the current output for the selected job. See “The <Job> Output Popup” on page 39.

13. Click the orange Job Output button (down arrow) in the Action column. The output appears in a new browser tab. Scroll to the bottom to see the finished results, which will be the number of lines with the letters “a” and “b” in them.

This concludes the Spark Python Script example.
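For reference, a script of the kind this example runs might look like the following sketch, modeled on the standard Spark “SimpleApp” quick-start example rather than the exact script shipped in appsamples.zip. It assumes the input and output locations arrive as command-line arguments, matching the dtap:// paths configured above.

import sys
from pyspark import SparkConf, SparkContext

# Input and output locations are passed as arguments (dtap:// URIs in EPIC).
input_path, output_path = sys.argv[1], sys.argv[2]

conf = SparkConf().setAppName("SimpleApp")
sc = SparkContext(conf=conf)

lines = sc.textFile(input_path).cache()
num_a = lines.filter(lambda line: 'a' in line).count()
num_b = lines.filter(lambda line: 'b' in line).count()

# Report the counts; they also appear in the <Job> Output popup.
result = "Lines with a: %i, lines with b: %i" % (num_a, num_b)
print(result)
sc.parallelize([result]).saveAsTextFile(output_path)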


3.9 - Spark - Zeppelin

The Apache Zeppelin interpreter concept allows any language or data-processing backend to plug into Zeppelin. Zeppelin currently supports many interpreters such as Scala and Python (with Apache Spark), SparkSQL, Hive, Markdown, and Shell. EPIC supports Zeppelin with Spark versions 1.4 and higher, and installing Spark automatically installs Zeppelin as well. This example demonstrates using the Zeppelin editor to run a job.

1. Click Clusters to open the Cluster Management screen, and then click the name of the Spark cluster (such as SampleSparkCluster). The <Cluster> screen appears, where <Cluster> is the name of the cluster.
2. In the Process List column, click ZeppelinNotebook. The Welcome to Zeppelin! page appears in a new browser tab/window.
3. Click the Zeppelin Tutorial link. The Zeppelin Tutorial page appears.
4. If you see an Interpreter Binding section with a list of interpreters (highlighted in blue) and a Save button, then click Save to accept the defaults.
5. Each section on the page is called a paragraph. You can run a paragraph by clicking the Run button (arrow) that appears in each paragraph. Click the Run button in the Prepare Data paragraph, followed by the Run button in the Load Data Into Table paragraph. The notation FINISHED next to the Run button changes to PENDING and then RUNNING while the paragraph runs, and then changes back to FINISHED once the run has completed.

6. In EPIC, click DataTaps in the Main Menu to open the DataTaps screen.

Note: If you are unable to click the Run button, reload the page by clicking the browser Refresh button.


7. In the table on the screen, click the name of the DataTap that you used in “Creating Directories and Uploading Data” on page 13 in the Name column to open the <DataTap> Browser screen, where <DataTap> is the name of the DataTap you are using. In this example, this will be the TenantStorage Browser screen.
8. Create a folder called data and then create a sub-folder called page_views.

9. Click the gray File Upload button (up arrow) at the top of the screen to open a standard File Upload popup.
10. Navigate to the spark-zeppelin folder and upload the page_views.csv file. The Upload Status popup appears with a progress bar showing the upload progress. This bar turns green and the word Completed appears when the upload is complete.
11. Click OK to close the popup.

12. Return to the Zeppelin Tutorial page and scroll to the bottom, where you will see a blank paragraph with the notation READY.
13. Type or paste the following text into the blank paragraph and then click the Run button:

%hive
DROP TABLE IF EXISTS page_views

14. Scroll down to the new blank paragraph at the bottom of the page, type or paste the following text into that paragraph, and then click the Run button:

%hive
CREATE EXTERNAL TABLE page_views (viewTime INT, userid BIGINT, page_url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/page_views/'

This process creates a table. You may now run any query on that table.


3.10 - SparkR Jobs

To run a SparkR job:

1. Establish an SSH connection to the Spark cluster by opening an SSH client, pointing it to the IP address of the Spark Master node, and then using your EPIC credentials to log in. If needed, you can open the <Cluster> screen for the Spark cluster you want to access, click SparkMaster in the Process List column to open the Spark Master page, and then copy the URL from the browser address bar.
2. Run the sparkR --master <spark master url> command, where <spark master url> is the URL of the Spark Master node.
3. If you see permissions warnings, then grant the requested permissions to the log directories mentioned in the warnings. A SparkR prompt appears.

4. You may now run the following commands:

sqlContext <- sparkRSQL.init(sc)

# Create the DataFrame
df <- createDataFrame(sqlContext, faithful)

# Get basic information about the DataFrame df
df
## DataFrame[eruptions:double, waiting:double]

# Select only the "eruptions" column
head(select(df, df$eruptions))
##   eruptions
## 1     3.600
## 2     1.800
## 3     3.333

# You can also pass in column names as strings
head(select(df, "eruptions"))

# Filter the DataFrame to only retain rows with wait times shorter than 50 mins
head(filter(df, df$waiting < 50))
##   eruptions waiting
## 1     1.750      47
## 2     1.750      47
## 3     1.867      48

# Display the content of the DataFrame to stdout
head(df)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74

Note: If you use SSH with a keypair to connect to a virtual node, then the username will be bluedata and not root.


3.11 - Spark Streaming Jobs

This sample Spark Streaming job performs a word count from a network stream. To run this job:

1. Open a terminal and start a netcat server at port 9999:

nc -lk 9999

2. Open another terminal and then run the following command:

cd $SPARK_HOME
sudo bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999

3. Return to the terminal you opened in Step 1 and then begin typing random words separated by spaces. As you type, you should see the second terminal giving you a continuously running word count of the stream you are typing.

4. If you are seeing too many extraneous messages in your console window, you can change your log4j.properties under $SPARK_HOME/conf by removing INFO from the root logger line, changing:

log4j.rootLogger=INFO,stdout,stderr,file

to:

log4j.rootLogger=stdout,stderr,file

5. Continue typing words in the first terminal window; the word count will continue updating in the second terminal window.
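The bundled NetworkWordCount example is implemented in Scala, but the same job can be expressed in a few lines of PySpark. The following is a minimal sketch (assuming Spark 1.4 or later with the streaming module available); submit it with spark-submit while the netcat server from Step 1 is running:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Process the socket text stream in one-second micro-batches.
sc = SparkContext(appName="PyNetworkWordCount")
ssc = StreamingContext(sc, 1)

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print a running word count for each batch

ssc.start()
ssc.awaitTermination()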


3.12 - The <Job> Output Popup

In the Job Management screen, clicking the name of a running or completed job in the Job Name column opens the <Job> Output popup, where <Job> is the name of the selected job. This popup refreshes every 15 seconds while the job is running and displays the output of that job.

When you have finished viewing the job output, click the Hide button to close the popup and return to the Job Management screen.


Running Applications on EPIC, version 2.1 (02/2016)

This book or parts thereof may not be reproduced in any form without the written permission of the publishers. Printed in the United States of America. Copyright 2016 by BlueData Software, Inc. All rights reserved.

Contact Information:
BlueData Software, Inc.
3979 Freedom Circle, Suite 850
Santa Clara, California 95054
Email: [email protected]
Website: www.bluedata.com