


New Data Lake – Work with BDCS-CE (Notebooks,

Object Storage/HDFS, Spark, and Spark SQL)

Before You Begin

Purpose

In this tutorial, you learn how to get started with your new Big Data Cloud Service - Compute Edition (BDCS-CE) instance. You will first learn how to work with the Notebook, and then use the Notebook to work with Object Storage and HDFS, and with Spark and Spark SQL.

There are five sections in this tutorial:

Importing Notes

Tutorial 1 – Notebook Basics

Tutorial 2 – Setting up your BDCSCE Environment

Tutorial 3 – Working with the Object Storage and HDFS

Tutorial 4 – Working with Spark and Spark SQL

Time to Complete

60 minutes

Background

Notebooks are used to explore and visualize data in an iterative fashion. Oracle Big Data Cloud Service -

Compute Edition uses Apache Zeppelin as its notebook interface and coding environment. Information

about Zeppelin can be found here: https://zeppelin.apache.org/ . To see examples of notes created and

shared by other Zeppelin users, see https://www.zeppelinhub.com/viewer .

What Do You Need?

Before starting this tutorial, you should have:

A running BDCS-CE cluster

BDCS-CE account credentials or Big Data Cluster Console direct URL (for example:

https://xxx.xxx.xxx.xxx:1080/)

BDCS-CE cluster login credentials

The Object Store credentials you specified when you created the BDCS-CE instance


The note files downloaded from here for importing. Unzip the downloaded file into a folder on your computer and make a note of the folder path. After the download finishes, you should have the following note files:

Tutorial 1 Notebook Basics 1721.json

Tutorial 2 Setting up your BDCSCE Environment 1721.json

Tutorial 3 Working with the Object Store and HDFS 1721.json

Tutorial 4 Introduction to Spark and Spark SQL 1721.json

Context

This tutorial is part of the New Data Lake series of the Oracle Big Data Journey. The sequence to follow is:

Module 1: New Data Lake Overview

Module 2: Sign up for an Oracle Cloud Trial, Create Object Storage Instance, and Create Big Data

Cloud Service - Compute Edition (BDCS-CE) Instance

Module 3: Work with BDCS-CE (Notebooks, Object Storage/HDFS, Spark, and Spark SQL)

Module 4: Create Event Hub Cloud Service (OEHCS) Instance

Module 5: Work with OEHCS and Spark Streaming

Importing Notes

Log in to the Oracle Cloud My Services web page and navigate to the My Services page for your BDCS-CE cluster.


On the Services page, click the Manage this Service icon for the cluster and then click Big Data Cluster Console.

If a window titled Authentication Required appears, enter your BDCS-CE cluster administrator user name and password and click Log In.


In the Big Data Cloud - Compute Edition Console, click Notebook.

The Notebook page is displayed, listing any notes created for this notebook.

In the Big Data Cloud - Compute Edition Console Notebook page, click Import Note.


The Import Note window is displayed. Click Browse.

Browse for or specify the file you want to import; the file must have a .json extension. You should have previously downloaded the .json files in a Notes.zip file. Select Tutorial 1 Notebook Basics 1721.json and click OK.

The note is imported and listed in the list of notes.

Import the rest of the .json files you downloaded in the same way.

Click the link for the note, which is named …-Tutorial 1 Notebook Basics, to start the first tutorial.


Tutorial 1 – Notebook Basics

The paragraphs of the note are displayed.

Walk through the paragraphs one by one, reading their content as you get to them. There is much useful information in the paragraphs that is not reproduced in these instructions.

Interpreters enable Zeppelin users to mix and match various language and data-processing backends in a single platform. For example, to use Python code in Zeppelin, you can use the %pyspark interpreter. If you are not familiar with Zeppelin, click the links to learn about it.


Read through the content in this paragraph as it introduces our first interpreter: Markdown.

Also explore the actions you can perform using the note and paragraph icons. You can:

Show and hide the code editor

Show and hide results

Clear results

Export the note

Use keyboard shortcuts

Bind interpreters and change the default interpreter

Click the Show editor icon at the top right corner of the paragraph.

Markdown is a plain text formatting syntax. It can be converted to HTML.

Read through the content in the code editor. Click the links in the paragraph to learn more about

Markdown Interpreter.


Add a new line, new item, at the bottom of the list, then click the Run icon at the top right corner of the paragraph.

The new item is displayed in the list.

Here is a quick exercise before you continue.

Add your name in the paragraph and run it.


Here are two paragraphs about the shell interpreter (%sh). Read through the quick introduction in the first one.

Read through the shell script in this paragraph.

Among other things, the script demonstrates a useful, repeatable trick for seeing the full output when running shell commands. Note that there are two yum command lines; 2>&1 is appended to the second one so that standard error is captured along with standard output.

Click the Run icon and check the result.

From the output, you can see that the yum command fails because you are not root, and that the standard error is visible for the second yum command because of the 2>&1 redirection.
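The same redirection trick can be demonstrated with any failing command, assuming only standard shell utilities (the note uses yum, which needs the BDCS-CE server):

```shell
# A failing command with and without the 2>&1 trick. Command substitution
# captures only stdout, so without the redirection the error text is lost.
out1=$(ls /no/such/dir 2>/dev/null || true)   # stderr discarded
out2=$(ls /no/such/dir 2>&1 || true)          # 2>&1 merges stderr into stdout
echo "captured without 2>&1: [$out1]"
echo "captured with 2>&1:    [$out2]"
```

The first capture is empty; the second contains the "No such file or directory" error message.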


At the end of Tutorial 1, we introduce some tips, such as how to print notebooks. Follow the steps described in the paragraphs to learn more about these tips.

After you complete the tutorial, click Notebook on the top to go back to the notebook list page.

Then click the note named …-Tutorial 2 Setting up your BDCSCE Environment to start the second tutorial.


Tutorial 2 – Setting up your BDCSCE Environment

Read through the introduction in this paragraph.

You will learn how to set up the BDCSCE environment.

Note that you need to run the paragraphs one at a time, in order from top to bottom, because some steps are performed manually.

First, follow the steps in the Enabling SSH Network Access paragraph.

Then, follow the steps in the following paragraphs:


Connecting to your BDCS-CE Zeppelin Server via SSH

Setting up the zeppelin user with sudo access

Configuring yum and pip

Read through the command script in the Commands to setup yum and pip paragraph.

Yum provides automatic updates and package and dependency management on RPM-based Linux distributions. (From Wikipedia)

Pip is a package management system used to install and manage software packages written in Python. (From Wikipedia)

Because you have now set up the zeppelin user with sudo access, you can use sudo in the scripts.

Click Run. The blue bar in the middle is the progress bar for the running paragraph.

Read through the command script in the Installing the swift object store command line utility paragraph.

Oracle Object Storage supports the industry-standard OpenStack Object Storage API, known as "swift". BDCS-CE includes swift drivers that work with Spark and Hadoop, so you can use those tools with Object Store data.


Click Run.

Click Notebook on the top to go back to the notebook list page and proceed to the next tutorials.

Tutorial 3 – Working with the Object Storage and HDFS

Read through the quick introduction of Tutorial 3 – Working with the Object Store and HDFS. Ensure that you have run Tutorial 2 first, as this tutorial requires it.

Data can be placed in Object Storage, in HDFS, and on the local file system, so in this tutorial you will experiment with loading and unloading data between any two of them.


Read through the content in the paragraph to learn more about Oracle Storage Cloud Object Store.

Here is a brief picture of why Object Storage is important.

With Object Storage, we can detach compute from storage, allowing the two environments to grow independently. This gives us a great way to scale compute and storage separately from each other. We might have massive data sets and only need to work on some of them some of the time; Object Storage lets us keep far more data than we need to hold in the BDCS-CE environment itself.

We can maintain a core, distribution-based environment while being able to use the latest and greatest Hadoop projects on demand. We can have different compute clusters all working against the same object store. We might have data in the object store that is used by three or four groups in the organization; the groups can have their own independent BDCS-CE environments yet work on the same data.

We can also persist all the data in a low-cost, globally distributed store, which speeds processes up while making the data more durable.

Configuring your Object Store Credentials

This paragraph sets up a shell script (swift_env.sh) that stores your Storage Cloud credentials to simplify running the swift command later. Fill in your object store credentials before you run the paragraph. Note that your credentials might be different from those shown in the screenshot.

Click Run to test the Storage Cloud connectivity by listing the Object Store containers with the swift list command.
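A minimal sketch of what swift_env.sh might look like is below. The ST_AUTH, ST_USER, and ST_KEY variable names are the environment variables the swift CLI reads; the values are placeholders, not the tutorial's actual credentials — substitute your own Storage Cloud endpoint, identity domain, user, and password.

```shell
# Work in a scratch directory so nothing is overwritten.
cd "$(mktemp -d)"

# Create the credentials script; every value below is a placeholder.
cat > swift_env.sh <<'EOF'
export ST_AUTH=https://storage.example.oraclecloud.com/auth/v1.0
export ST_USER=Storage-mydomain:my.user@example.com
export ST_KEY=my-password
EOF

# Source it so the variables are available to the swift command.
. ./swift_env.sh
echo "authenticating as: $ST_USER"
# With real credentials in place you would now run: swift list
```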


Download some sample data to experiment with. In this example, you are going to download the text of a handful of United States Presidential Inauguration speeches. Click the link to the Yale Law School Avalon Project website to see the data source if you like.

Read through the command script in the code editor. You will install the lynx browser to help with the downloading, then download five speeches from the Yale Law School Avalon Project website.

Click Run.


The five speeches are downloaded and listed in the result. These files are downloaded to the local Linux file system on the BDCS-CE server.

Enter the name of your Object Storage container and click Run to upload the five speeches to Object Storage. If you forget the container name, you can run the How to List Containers in the Object Store paragraph.

The swift upload command is used here to copy a file from the local Linux file system of the BDCS-CE server to Object Storage.


You can list the containers in your Object Store and select one of them for the five speeches.

You can list the speeches in the container after uploading. Make sure you see the presidential speeches in

the container.

Read through the script in the code editor to learn how to download files from the Object Store to the Linux file system. The swift download command is used here. Enter the name of the container and click Run.


Read through the script in the code editor to learn how to upload and download files between the Zeppelin server's Linux file system and BDCS-CE's HDFS file system. The hadoop fs -put and -get commands and their parameters are used here. Click Run.


Read through the script in the code editor to learn how to copy files from the Object Store to BDCS-CE's HDFS file system.

Click Run.

Click Notebook on the top to go back to the notebook list page and proceed to the next tutorials.

Tutorial 4 – Working with Spark and Spark SQL

Read through the quick introduction of Tutorial 4 – Working with Spark and Spark SQL. Ensure that you have run the previous two tutorials first, as this tutorial depends on them.

This BDCS-CE version supplies Zeppelin interpreters for Spark (Scala), Spark (Python), and Spark SQL. This tutorial gives you examples using all of them.

Check out the links to get a basic knowledge of Spark and Spark SQL if you need to.


In Example 1, you run Scala Spark code to manipulate data from HDFS. It defines a Spark RDD (Resilient Distributed Dataset) against a text file (pres1861_lincoln.txt) stored in HDFS, then runs a few actions against the RDD, such as counting the number of lines, displaying the first line, and counting the number of lines matching a given term, Constitution.

From the output, you can see that the text file has 364 lines, 24 lines contain the word Constitution, and the first line is First Inaugural Address of Abraham Lincoln.
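The RDD actions in Example 1 can be sketched in plain Python. The real note uses sc.textFile against the HDFS copy of the speech; the inline sample lines here are illustrative stand-ins, so the counts differ from the tutorial's output:

```python
# Plain-Python analogue of the Example 1 RDD actions; in the note these are
# sc.textFile(...).count(), .first(), and .filter(...).count() on HDFS data.
sample = """First Inaugural Address of Abraham Lincoln
Fellow-Citizens of the United States:
...in compliance with a custom as old as the Government itself...
...the oath prescribed by the Constitution of the United States..."""

lines = sample.splitlines()
print(len(lines))                                     # line count
print(lines[0])                                       # first line
print(sum("Constitution" in line for line in lines))  # lines matching a term
```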


Python is an important language for data scientists because of its easy-to-understand syntax and rich

ecosystem.

The second example implements the same logic as the first example, just using Python instead of Scala.

Note that the %pyspark interpreter is used here.


The third example is a slight variation of the first example. The only difference is that it uses data (pres1933_fdr1.txt) on the Zeppelin server's Linux file system, not in HDFS.

Here is the output of running Example 3.

The fourth example is another variation of the first example. The difference is that it works with data stored in Object Storage, not in HDFS.


Run Example 4. It works with data named pres1981_reagon1.txt stored in the Object Store.

The fifth example runs the classic word count algorithm. You can choose whether to operate on HDFS, Object Storage, or Linux file system data by commenting out the appropriate code. The result of the word count is an RDD named wordCounts, which will also be used in the following example.

Click Run.


From the output, you can see the array of (word, count) pairs.
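The word-count pipeline in the fifth example — flatMap the lines into words, map each word to a (word, 1) pair, then reduceByKey to sum the pairs — can be mimicked in plain Python with collections.Counter (the sample lines are illustrative):

```python
from collections import Counter

# Plain-Python mimic of the RDD pipeline:
#   flatMap(line.split) -> map((word, 1)) -> reduceByKey(add)
lines = ["we the people", "the people of the united states"]
words = [word for line in lines for word in line.split()]  # flatMap
word_counts = Counter(words)                               # map + reduceByKey
print(word_counts["the"])     # 3
print(word_counts["people"])  # 2
```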

The sixth example continues from the previous one. Specifically, it takes the wordCounts RDD and converts it to a Spark DataFrame, then registers the new DataFrame as a temporary Spark SQL table named wordcounts. Click the link in the paragraph if you want to review the Spark SQL programming guide and learn more about the features of Spark SQL.

The seventh example continues from the previous example. Specifically, it provides two samples of

running Spark SQL against the wordcounts table.
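Spark SQL itself is not needed to see what queries like these do. The same kind of SQL, sketched here against an in-memory SQLite table standing in for the wordcounts temp table (the rows and the cnt column name are illustrative, not the tutorial's actual data), behaves the same way:

```python
import sqlite3

# A stand-in for the wordcounts temp table, with illustrative rows; in the
# note the table is built from the wordCounts RDD instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordcounts (word TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO wordcounts VALUES (?, ?)",
                 [("the", 42), ("constitution", 24), ("union", 19)])

# Sample query: the two most frequent words.
top = conn.execute("SELECT word, cnt FROM wordcounts "
                   "ORDER BY cnt DESC LIMIT 2").fetchall()
print(top)  # [('the', 42), ('constitution', 24)]
```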


It also demonstrates some of the features of Zeppelin's Spark SQL interpreter and its display and visualization capabilities.

Click the Table icon in the toolbar to see the data in table format.

Click the Bar Chart icon in the toolbar to see the data in bar chart format.

Click the settings link alongside the toolbar if you want to adjust advanced settings.


The eighth example continues from the previous example. Specifically, it builds an RDD against all of the speeches we stored in Object Storage, then performs a word count against the RDD and converts the result into a DataFrame, which we register as a Spark SQL temporary table named filewordsraw.

Click Run.

The chart shows the trend in the number of words per speech.

Note that we use the explode function in the SQL statement to create a new row for each element in the words array.
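What explode does can be illustrated in plain Python: each element of the array column becomes its own row. The file names below are from the tutorial; the word lists are illustrative:

```python
# Rows of (file, words-array), as they look before explode.
rows = [("pres1861_lincoln.txt", ["first", "inaugural", "address"]),
        ("pres1933_fdr1.txt", ["president", "hoover"])]

# explode: one output row per element of the words array.
exploded = [(fname, word) for fname, words in rows for word in words]
for row in exploded:
    print(row)
```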


The chart shows the trend in the number of occurrences of various words (america, constitution, nation, rights, freedom, and people) per speech.

The final example continues from the previous example. Specifically, it defines a new DataFrame based on one of the sample SQL statements and writes that DataFrame back to the Object Store (as a JSON file named pres_wordcount.json). Then it reads the JSON data back from the Object Store into a new DataFrame. The repartition(1) call ensures that we write a single output file, which makes sense since we know the output is small.

From the output, you can see the array of (file name, word count) pairs.
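The round trip in the final example can be sketched with a local JSON-lines file: Spark's json writer emits one JSON object per line, and repartition(1) yields a single part file. The word counts below are illustrative, not the tutorial's actual numbers:

```python
import json
import os
import tempfile

# Records standing in for the (file, word count) DataFrame rows.
records = [{"file": "pres1861_lincoln.txt", "words": 3637},
           {"file": "pres1933_fdr1.txt", "words": 1883}]

path = os.path.join(tempfile.mkdtemp(), "pres_wordcount.json")

# Write: one JSON object per line, as Spark's json writer does.
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the data back into a new list of records.
with open(path) as f:
    read_back = [json.loads(line) for line in f]

print(read_back == records)  # True
```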


Optional: list the contents of your Object Store container to see the structure of the saved DataFrame. Note that your container name might be different from the one in the screenshot. There is only one output file, named “part-r-0000-…”.

Optional - Explore the Spark UI

When you use Spark (via Scala, Python, and/or SQL), you start a session with the Spark server. In many situations, it can be helpful to view the Spark UI for your session. BDCS-CE provides easy access to the Spark UI for your Zeppelin session.

To view it, follow the steps in the paragraph.

Want to Learn More?

Working with Notebook

Running a Batch Spark Job in a Big Data Cloud Service - Compute Edition Cluster

Get Started with Oracle Big Data Cloud Service - Compute Edition


Get Started with Oracle Storage Cloud Service