

Elastic BigData

Quick Start

Issue 02

Date 2015-11-04

HUAWEI TECHNOLOGIES CO., LTD.


Copyright © Huawei Technologies Co., Ltd. 2015. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Huawei Technologies Co., Ltd.

Address: Huawei Industrial Base

Bantian, Longgang
Shenzhen 518129
People's Republic of China

Website: http://www.huawei.com

Email: [email protected]

Issue 02 (2015-11-04) Huawei Proprietary and Confidential
Copyright © Huawei Technologies Co., Ltd.


Contents

1 Introduction to Elastic BigData

2 Introduction to Operation Process

3 Introduction to Components
3.1 Hadoop
3.2 Spark

4 Quick Start
4.1 Creating a Cluster
4.2 Adding a Job
4.3 Managing Jobs
4.4 Creating and Deleting a Table


1 Introduction to Elastic BigData

Overview

With the rapid development of the Internet era, enterprise data grows every day. Enterprises urgently need to extract value from this complex, chaotic data and use it to innovate and seize business opportunities. Conventional data processing is clearly inadequate in the big data era.

Huawei Elastic BigData builds a reliable, secure, and easy-to-use operation and maintenance (O&M) platform and provides storage and analysis capabilities for massive data, helping enterprises meet their data storage and processing demands. Users can independently apply for and use the hosted Hadoop and Spark services to quickly create Hadoop and Spark clusters, which provide storage and computing capabilities for massive data with low real-time processing requirements. After data storage and computing are complete, the cluster service can be terminated, and no further fees are charged. You can also choose to run clusters permanently.

Application Scenario

Based on the open-source Hadoop software and the Spark in-memory computing engine, Elastic BigData provides an enhanced, unified platform for storing, querying, and analyzing enterprise data elastically, and helps enterprises quickly establish an information system for processing massive data. This platform features:

- Swift integration and management of big data sets of various types.
- Advanced analysis of native information.
- Visualization of available data for special analysis.
- Optimization and scheduling of workloads.

Core Values

Core values of Elastic BigData lie in the following aspects:

- Massive: Supports structured and unstructured data processing at the PB level or above, outperforming traditional databases.

- Economic


Allows on-demand use of the Elastic BigData service so that computing resources can be released after jobs are completed, maximizing resource utilization.

- Reliability: Ensures reliable distributed data processing by using the backup and recovery mechanisms of HDFS and the task monitoring of MapReduce and Spark.

- Efficiency: The Spark in-memory computing engine, efficient HDFS data interaction, MapReduce, and local data processing lay the foundation for efficient massive data processing.


2 Introduction to Operation Process

Elastic BigData is a basic service of the Huawei enterprise cloud. It is easy to use and provides a user-friendly user interface (UI). By using computers connected in a cluster, you can run various tasks and process or store PB-level data. A typical procedure for using Elastic BigData is as follows:

1. Create a cluster. To use Elastic BigData, you must have purchased cluster resources; the number of clusters is subject to the virtual machine (VM) quantity. You can then create clusters and set basic cluster information. You can also submit a job at the same time you create a cluster.

2. Add a job. After a cluster is created, you can analyze and process data by adding jobs. Note that Elastic BigData provides a platform for executing user-developed programs. You can submit, execute, and monitor such programs by using Elastic BigData.

3. Execute the job. After a job is added, it is in the Running state by default.

4. View the execution result. You can view the execution results of jobs in Job Management.

NOTE

Operations such as creating a cluster or job are recorded in logs, which help locate faults if any problem occurs during Elastic BigData usage.


3 Introduction to Components

3.1 Hadoop

3.2 Spark


3.1 Hadoop

Currently, the Hadoop components are mandatory, including HDFS, MapReduce, and YARN.

Hadoop is open-source software, and Elastic BigData provides enhanced functions based on it. The core of Hadoop's design is the Hadoop Distributed File System (HDFS) and the MapReduce components.

- HDFS: HDFS supports high-throughput data access and applies to the processing of large data sets. In HDFS, a file is divided into one or more data blocks that are stored on DataNodes. NameNodes store and manage all metadata in HDFS.
  - NameNode: manages the namespace, directory structure, and metadata of file systems, and records the mapping between data blocks and the files to which they belong. In HA mode, a NameNode and a secondary NameNode are provided.
  - DataNode: stores the data blocks of each file and periodically reports stored data blocks to the NameNode.

- MapReduce: consists of Map and Reduce. Map divides one task into multiple tasks, and Reduce summarizes the processing results of these tasks and produces the final analysis result.

- YARN: Hadoop YARN is the resource management system of Hadoop 2.0. It is a general resource module that manages and schedules resources for applications. YARN can be used not only in the MapReduce framework but also in other frameworks. The YARN model consists of ResourceManager, ApplicationMaster, and NodeManager.
  - ResourceManager (RM): the resource manager of a cluster, which schedules resources for applications based on their requirements. ResourceManager provides a plug-in for scheduling policies, which allocates cluster resources to multiple queues and applications. The scheduling plug-in schedules resources based on the current capacity and the fair scheduling model.
  - ApplicationMaster: the App Mstr shown in the figure. ApplicationMaster schedules and coordinates resources among applications. It uses resources obtained from ResourceManager and works with NodeManager to execute and monitor tasks.
  - NodeManager (NM): runs the containers in which applications execute. NodeManager also monitors the resource usage (including CPU, memory, hard disk, and network resources) of applications and reports the monitoring results to ResourceManager.
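The division of work between Map and Reduce described above can be illustrated with a minimal word-count sketch in plain Python. This is a conceptual analogue, not the Hadoop API; the function names are illustrative:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: split each input line into (word, 1) intermediate pairs.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Reduce: summarize the intermediate pairs into a final count per word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

lines = ["error disk full", "error network down", "info ok"]
result = reduce_phase(map_phase(lines))
print(result["error"])  # all Map outputs with the same key are summarized by Reduce
```

In real Hadoop, the framework additionally partitions the intermediate pairs by key across many machines between the two phases; this sketch only shows the logical contract of the two functions.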

3.2 Spark

Function

Spark is a distributed, memory-based computing framework. In iterative computation scenarios, data is kept in memory between processing steps, which avoids the repeated intermediate writes of the MapReduce computing framework. Spark uses the Hadoop Distributed File System (HDFS), enabling users to quickly switch from MapReduce to Spark, and provides performance 10 to 100 times higher than that of MapReduce. As a computing engine, Spark also supports stream processing in small batches, offline batch processing, Structured Query Language (SQL) queries, and data mining. This avoids loading the same data into different systems and saves storage and performance costs.


Features of Spark are as follows:

- Supports distributed storage and computation.
- Supports iterative computation.
- Supports online Spark SQL.
- Is compatible with the file read and write modes of the Hadoop system.
- Supports fault tolerance during computation.
- Supports multiple programming languages for application development (including Scala, Java, and Python).
- Supports linear scaling of computation capability.

Architecture

Figure 3-1 describes the Spark architecture, and Table 3-1 lists the Spark modules.

Figure 3-1 Spark architecture

Table 3-1 Spark modules

- Cluster Manager: The manager of a cluster. Spark supports multiple cluster managers, including Mesos, YARN, and the standalone cluster manager delivered with Spark.
- Driver Program: The driver process of a running Spark application and the main process of the application. It parses the application, generates stages, and schedules tasks to Executors.
- Executor: The process in which an application's tasks are executed. A cluster contains multiple Executors. Each Executor receives Launch Task commands from the Driver and can execute one or more tasks.
- Master Node: The master node of a cluster. It receives jobs submitted by the client, manages Workers, and commands Workers to start the Driver and Executors.


- Worker Node: Runs the Spark application in the cluster. A Worker Node also manages the resources of its node, reports heartbeats to the Master periodically, and receives commands from the Master.
- SparkContext: The entry point of a Spark application for users and the main controller of the application. Based on the user's service logic, it builds directed acyclic graphs (DAGs), makes distribution plans, divides stages, and generates and schedules tasks.
- Task: The unit of computation that carries the service logic; it is the smallest unit of work that can be executed on the Spark platform. An application can be divided into multiple tasks based on the execution plan and the amount of computation.
- Cache: The distributed cache. Each task can store its result in the Cache for subsequent tasks to read.

Basic Concepts

The following concepts help you learn Spark quickly and focus on actual services during Spark development.

- RDD
  A Resilient Distributed Dataset (RDD) is a core concept of Spark. It is a read-only, partitionable distributed dataset. Part or all of the dataset can be cached in memory and reused between computations.
  RDD generation:
  - An RDD can be generated from the Hadoop file system or other storage systems compatible with Hadoop, such as the Hadoop Distributed File System (HDFS).
  - A new RDD can be transformed from a parent RDD.
  - An RDD can be converted from a collection.
  RDD storage:
  - Users can select different storage levels, such as DISK_ONLY and MEMORY_AND_DISK, to store an RDD for reuse.
  - By default, an RDD is stored in memory. When memory is insufficient, the RDD overflows to disk.
- RDD Dependency
  The logical relationship between a parent RDD and a child RDD. RDD dependencies include narrow dependencies and wide dependencies.


Figure 3-2 RDD dependency

- Narrow dependency: Each partition of the parent RDD is used by at most one partition of the child RDD. This covers two situations: partitions of one parent RDD correspond to partitions of one child RDD (situation 1), and partitions of two parent RDDs correspond to partitions of one child RDD (situation 2). In the preceding figure, map/filter and union belong to situation 1, and join with co-partitioned inputs belongs to situation 2.

- Wide dependency: Partitions of the child RDD depend on all partitions of the parent RDD because of shuffle operations. In the preceding figure, groupByKey and join with inputs that are not co-partitioned belong to this type.

Narrow dependency facilitates optimization. Logically, each RDD operator is a fork/join process: fork the computation to each partition, join the results after the computation, and then perform the next operator's fork/join. Directly translating the RDD graph into a physical implementation this way is costly for two reasons: every RDD (even intermediate results) must be materialized to memory or storage, which takes time and space; and partitions can be joined only when the computation of all partitions is complete (if the computation of one partition is slow, the entire join is slowed down). If the partitions of the child RDD narrowly depend on the partitions of the parent RDD, the two fork/join processes can be combined to optimize the entire process. If consecutive operators in a sequence are all related by narrow dependencies, multiple fork/join processes can be combined, reducing waiting time and improving performance. This is called pipeline optimization in Spark.
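The pipeline optimization described above can be imitated in plain Python (this is a conceptual sketch, not Spark code): instead of materializing the intermediate result of each operator for every partition, the narrow-dependency operators are composed and run per partition in one pass:

```python
# Two narrow-dependency operators: a map followed by a filter.
double = lambda x: x * 2
keep_big = lambda x: x > 4

partitions = [[1, 2, 3], [4, 5, 6]]  # a dataset split into two partitions

# Without pipelining: materialize the whole intermediate result first,
# then run the second operator over it (two fork/join passes).
intermediate = [[double(x) for x in p] for p in partitions]
unpipelined = [[x for x in p if keep_big(x)] for p in intermediate]

# With pipelining: compose the two operators and make a single pass over
# each partition, never storing the intermediate lists.
pipelined = [[y for y in (double(x) for x in p) if keep_big(y)]
             for p in partitions]

print(pipelined)  # identical result to the unpipelined version
```

Both versions produce the same result; the pipelined form simply avoids holding the intermediate data, which is what Spark's combined fork/join achieves for chains of narrow dependencies.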

- Transformation and Action (RDD Operations)
  Operations on an RDD are either transformations (the return value is an RDD) or actions (the return value is not an RDD). Figure 3-3 shows the process. Transformations are lazy: the transformation from one RDD to another is not executed immediately. Spark only records the transformation and does not execute it until an action is invoked. The real computation starts only with an action, which returns results or writes the RDD data into the storage system. Actions are the driving force for Spark to start computation.


Figure 3-3 RDD operation

Figure 3-4 shows an example. The data and operation model of an RDD are quite different from those of Scala.

a. The textFile operator reads log files from HDFS and returns file (the original RDD).
b. The filter operator filters rows containing ERROR and assigns them to errors (a new RDD). The filter operator is a transformation.
c. The cache operator caches errors for future use.
d. The count operator returns the number of rows in errors. The count operator is an action.

Figure 3-4 Scala example
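The lazy-then-eager behavior of this example can be mimicked with plain Python generators (a conceptual analogue, not the Spark API): building the filter chain reads nothing, and only the count-like action triggers the work:

```python
read_log = []  # records which lines have actually been read

def text_file(lines):
    # Analogue of textFile: lazily yield lines of the "file".
    for line in lines:
        read_log.append(line)
        yield line

lines = ["ERROR disk full", "INFO ok", "ERROR timeout"]

# Transformation (lazy): building the chain reads nothing yet.
errors = (line for line in text_file(lines) if "ERROR" in line)
assert read_log == []

# Action (eager): counting consumes the chain and triggers the computation.
count = sum(1 for _ in errors)
print(count)  # number of ERROR rows
```

Spark's cache operator has no direct analogue here; in Spark it marks `errors` so that a second action reuses the materialized data instead of re-reading the file.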


Transformations include the following types:

- The RDD elements are regarded as simple elements:
  - The input and output have a one-to-one relationship, and the partition structure of the result RDD remains unchanged, for example, map.
  - The input and output have a one-to-many relationship, and the partition structure of the result RDD remains unchanged, for example, flatMap (one element becomes a sequence containing multiple elements and is then flattened into multiple elements).
  - The input and output have a one-to-one relationship, but the partition structure of the result RDD changes, for example, union (two RDDs are merged into one, and the number of partitions becomes the sum of the partitions of the two RDDs) and coalesce (the number of partitions is reduced).
  - Operators that select some elements from the input, such as filter, distinct (duplicate elements are removed), subtract (elements that exist only in this RDD are retained), and sample (samples are taken).
- The RDD elements are regarded as key-value pairs:
  - Perform one-to-one calculation on a single RDD, such as mapValues (the partitioning of the source RDD is retained, which differs from map).
  - Sort a single RDD, such as sort and partitionBy (consistent partitioning, which is important for local optimization).
  - Restructure and reduce a single RDD by key, such as groupByKey and reduceByKey.
  - Join and restructure two RDDs by key, such as join and cogroup.

NOTE

The latter three operations involve sorting and are called shuffle operations.

Actions include the following types:

- Generate scalar values, such as count (returns the number of elements in the RDD), reduce and fold/aggregate (return a scalar computed over the elements), and take (returns the first N elements).
- Generate a Scala collection, such as collect (imports all elements of the RDD into a Scala collection) and lookup (looks up all values corresponding to a key).
- Write data to storage, such as saveAsTextFile (which corresponds to the preceding textFile).
- Checkpoints, such as checkpoint. When the lineage is very long (which occurs frequently in graph computation), re-executing the whole sequence after a fault takes a long time. In this case, checkpoint writes the current data to stable storage as a check point.

- Shuffle
  Shuffle is a specific phase in the Spark framework. If the output of Map is to be used by Reduce, the output must be hashed by key and distributed to each Reducer. This process is called Shuffle. Shuffle involves disk reads and writes as well as network transmission, so Shuffle performance directly affects the operating efficiency of the entire program. The following figure describes the overall process of the Shuffle algorithm.
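The hash-and-distribute step of Shuffle can be sketched in plain Python (illustrative only, not Spark internals): each map-side record is routed to a reducer by hashing its key, so all records with the same key land on the same reducer:

```python
from collections import defaultdict

NUM_REDUCERS = 3

def shuffle(map_output):
    # Route each (key, value) pair to a reducer bucket by hashing the key.
    buckets = defaultdict(list)
    for key, value in map_output:
        buckets[hash(key) % NUM_REDUCERS].append((key, value))
    return buckets

pairs = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 3)]
buckets = shuffle(pairs)

# All pairs sharing a key are guaranteed to sit in a single bucket.
for bucket in buckets.values():
    print(bucket)
```

In a real cluster each bucket is written to disk and sent over the network to its reducer, which is why this phase dominates the cost of wide-dependency operations.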


Figure 3-5 Algorithm process

- Spark Application Structure
  The Spark application structure includes the SparkContext initialization and the main program.
  - SparkContext initialization: constructs the operating environment of the Spark application by constructing a SparkContext object, for example:
    new SparkContext(master, appName, [SparkHome], [jars])
    Parameter description:
    - master: the link string. The link modes include local, YARN-cluster, and YARN-client.
    - appName: the application name.
    - SparkHome: the directory where Spark is installed in the cluster.
    - jars: the code and dependency packages of the application.
  - Main program: processes data.


4 Quick Start

4.1 Creating a Cluster

4.2 Adding a Job

4.3 Managing Jobs

4.4 Creating and Deleting a Table


4.1 Creating a Cluster

To use Elastic BigData, you must have purchased cluster resources. The following describes how to create a cluster using Elastic BigData.

Operation Procedure

Step 1 Log in to Elastic BigData and enter the Cluster List page.

Step 2 Click Apply For Cluster and open the Cluster Configurations page.

Step 3 Set the basic cluster information. Table 4-1 describes the configuration parameters.

Table 4-1 Cluster configuration information

Parameter Description

on-demand: Charged by the actual duration of use, in hours; the fee is deducted per hour. If the user account has an insufficient balance during fee deduction, a message is sent notifying the user to pay a renewal fee, and the cluster resources are frozen. After the renewal fee is paid, the cluster resources are unfrozen. If no renewal fee is paid, the cluster resources are deleted after the freeze period. Currently, only on-demand charging is supported.
Charging details for Elastic BigData are as follows:
- Charging factors: fee rate, number of nodes, and duration. The fee rate varies depending on node specifications.
- Charging formula (on-demand): Fee = (Fee rate per hour x Number of master nodes x Duration in hours) + (Fee rate per hour x Number of core nodes x Duration in hours)
NOTE: The preceding formula only calculates the fee for purchasing cluster resources. Bandwidth or traffic fees for big data are charged additionally.
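The charging formula translates directly into code. A small sketch follows; the rates in the example are made-up placeholders, not Huawei's actual prices, and separate master/core rates are used because the fee rate varies by node specification:

```python
def cluster_fee(master_rate, num_master, core_rate, num_core, hours):
    # Fee = (master fee rate x number of master nodes x duration in hours)
    #     + (core fee rate x number of core nodes x duration in hours)
    return master_rate * num_master * hours + core_rate * num_core * hours

# Hypothetical example: masters at 3.0 per hour, cores at 1.5 per hour,
# 2 master nodes, 4 core nodes, running for 10 hours.
print(cluster_fee(3.0, 2, 1.5, 4, 10))
```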

Basic Information

Cluster name: The cluster name, which is globally unique. A cluster name consists of 1 to 64 characters, including letters, digits, and the special characters - and _. The cluster name can start with a letter, digit, or special character, but cannot start with a space or consist only of spaces.

Region: The work zone of the cluster. Currently, only southchina, eastchina, and northchina are available.

VPC: Select the VPC in which to create the cluster; you can enter the VPC to view its name and ID. A virtual private cloud (VPC) is a secure, isolated, logical network environment.


Subnet: Select the subnet in which to create the cluster; you can enter the VPC to view the subnet's name and ID. A subnet provides dedicated network resources that are isolated from other networks, improving network security.

Cluster node

Type: Elastic BigData provides two types of nodes:
- Master: A master node manages the cluster, assigns MapReduce executable files to core nodes, traces the execution status of each job, and monitors DataNode running status.
- Core: A core node processes data and stores the processed data in HDFS.

Instance Specification: The instance specification of a node. Huawei provides optimal specifications based on optimization experience. The instance specification is determined by the CPU, memory, and disks used. Currently, three specifications are supported:
- s1.2xlarge.linux.bigdata: CPU: 8-core; Memory: 32 GB; Disk: 40 GB
- s1.4xlarge.linux.bigdata: CPU: 16-core; Memory: 64 GB; Disk: 40 GB
- s1.8xlarge.linux.bigdata: CPU: 32-core; Memory: 128 GB; Disk: 40 GB
NOTE: A more advanced instance specification provides better data processing capability and accordingly incurs a higher cluster cost.

Quantity: The number of master and core nodes.
NOTE: A value that is too small may cause slow cluster running, while a value that is too large may incur unnecessary cost. Set the value properly based on the data to be processed.


Storage space: The default volume size of a cluster is 40 GB. Users can purchase volumes to enlarge the storage capacity when creating a cluster. The usage scenarios are as follows:
- Storing data in the OBS system, separate from the system used for computing. The computing capability and cost of the cluster are relatively low, and the cluster can be deleted at any time. This scenario is recommended when computing is performed infrequently.
- Storing and computing data together in the HDFS system. The computing capability and cost of the cluster are relatively high, and the cluster exists for a longer time. This scenario is used when computing is performed frequently.
The supported volume types are SATA and SSD. The volume size ranges from 100 GB to 1000 GB in increments of 10 GB. For example, the smallest volume size larger than 100 GB is 110 GB.

Component Configurations

Component Name: The component name. Currently, the Hadoop (mandatory) and Spark components are supported.

Version: The component version. Currently, Hadoop 2.7.0 and Spark 1.3.0 are supported.

Description: The component description.

Add Job

You can submit a job at the same time you create a cluster. Only one job can be added, and its status is Running after the cluster is created.

Add Job: Adds a job. For details, see "Adding a Job" in this Quick Start.

Name: The name of the job.

Type: The type of the job.

Parameters: Key parameters for executing the application.

Operation:
- Edit: modify the configuration of a job.
- Delete: delete a job.

Step 4 Select, based on actual service needs, whether the cluster is automatically stopped when the job is completed.

Step 5 Confirm cluster configuration information and click Apply Now.

Step 6 Click Submit.

After clusters are created, you can monitor and manage clusters.

----End
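Two of the constraints in Table 4-1, the cluster-name format and the volume size, can be checked with a short sketch. This encodes one reading of the rules as stated (letters, digits, hyphen, and underscore; 1 to 64 characters; volume from 100 GB to 1000 GB in 10 GB steps) and is not an official validator:

```python
import re

# One reading of the naming rule: 1-64 characters drawn from
# letters, digits, hyphen (-), and underscore (_).
CLUSTER_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")

def is_valid_cluster_name(name):
    return CLUSTER_NAME.fullmatch(name) is not None

def is_valid_volume_size(gb):
    # Valid sizes run from 100 GB to 1000 GB in 10 GB increments.
    return 100 <= gb <= 1000 and gb % 10 == 0

print(is_valid_cluster_name("my-cluster_01"))  # True
print(is_valid_cluster_name(" my cluster"))    # False: spaces are not allowed
print(is_valid_volume_size(110))               # True: smallest size above 100 GB
print(is_valid_volume_size(105))               # False: not on a 10 GB step
```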


4.2 Adding a Job

A job is an executable program provided by Elastic BigData for processing and analyzing user data. You can submit developed programs to Elastic BigData and execute them. The following describes how to add a new job.

Operation Procedure

Step 1 Log in to Elastic BigData and enter the Job Management page.

Step 2 Click Add Job and open the Add Job page.

Step 3 Set the job configuration information. Table 4-2 describes the parameters.

Table 4-2 Job configuration information

Parameter Description

Job Type: The job type:
- MapReduce
- Spark Jar
- Spark Script
NOTE: To add a Spark job, you must select the Spark component when creating the cluster.

Job name: The job name, which consists of 1 to 64 characters, including letters, digits, and special characters (excluding ;|&>',<$). The job name can start with a letter, digit, or special character, but cannot start with a space or a Chinese character, or consist only of spaces.
NOTE: Identical job names are allowed but not recommended.

Path of jar file to be executed: Address of the .jar package of the program for executing jobs, which must start with / or s3a://. A maximum of 255 characters are allowed, but special characters such as ;|&>',<$ are not allowed. The address cannot be empty or consist only of spaces.

Parameters of jar file to be executed: Key parameters for executing jobs, which are assigned by an internal function. Elastic BigData is only responsible for passing the parameters in. A maximum of 255 characters are allowed, but special characters such as ;|&>',<$ are not allowed. This parameter can be empty.

Input data path: Address for inputting data, which must start with / or s3a://. A correct OBS path is required. A maximum of 255 characters are allowed, but special characters such as ;|&>',<$ are not allowed. This parameter can be empty.


Output data path: Address for outputting data, which must start with / or s3a://. A correct OBS path is required. If no such path is provided, an OBS path will be automatically created. A maximum of 255 characters are allowed, but special characters such as ;|&>',<$ are not allowed. This parameter can be empty.

Log path: Address for storing job logs that record the job running status. This address must start with / or s3a://. A correct OBS path is required. A maximum of 255 characters are allowed, but special characters such as ;|&>',<$ are not allowed. This parameter can be empty.
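The path rules above (must start with / or s3a://, at most 255 characters, no special characters such as ;|&>',<$, not empty or all spaces) can be checked on the client side before a job is submitted. The following is a minimal sketch in Python; the function name and the exact rule set are illustrative, and the console performs its own authoritative validation:

```python
# Illustrative sketch of the path constraints from Table 4-2.
# The forbidden character set mirrors the one listed in the table.
FORBIDDEN = set(";|&>',<$")

def is_valid_path(path):
    """Return True if the path looks acceptable per the documented rules:
    starts with '/' or 's3a://', at most 255 characters, contains no
    forbidden special characters, and is not empty or only spaces."""
    if not path or path.isspace():
        return False
    if len(path) > 255:
        return False
    if any(ch in FORBIDDEN for ch in path):
        return False
    return path.startswith("/") or path.startswith("s3a://")
```

For example, `is_valid_path("s3a://mybucket/input/")` and `is_valid_path("/user/guest/input/")` pass, while a relative path or one containing `;` is rejected.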

Step 4 Confirm job configuration information and click OK.

After jobs are added, you can manage them.

----End

4.3 Managing Jobs

Added jobs can be executed according to a job plan. All jobs are displayed on the Job Management page. As an administrator, you can manage jobs in an Elastic BigData cluster.

Operation Procedure

- View details of a job.

On the Job Management page, click View Details under Operation to view details of a job.

- Stop a job.

On the Job Management page, click Stop under Operation to stop a running job. After a job is stopped, its status changes to Stopped. In this case, the job cannot be executed again.

- Clone a job.

On the Job Management page, click Clone under Operation to copy and add a job.

- Delete a job.

On the Job Management page, click Delete under Operation to delete a job.

NOTE

Deleted jobs cannot be restored. Therefore, exercise caution when deleting a job.

4.4 Creating and Deleting a Table

Spark provides Spark SQL, an SQL-like language for processing structured data. This section describes how to use Spark SQL to create, query, and delete a table.


Background

Users can submit Spark SQL statements online and obtain results. For example, if a user creates a table, inserts row data into the table, and saves it, the table data is saved in the Hadoop or Spark cluster.

Operation Procedure

Step 1 Log in to the Elastic BigData cluster. The Job Management page is displayed.

Step 2 Select Spark SQL. The Spark SQL job page is displayed.

Step 3 Enter the Spark SQL statement for creating a table.

The syntax is:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...)
    [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [ROW FORMAT row_format]
  [STORED AS file_format]
  [LOCATION hdfs_path];

For example:

Method 1: Create an external table named src_data and read the data line by line. The data is stored in the /user/guest/input directory.

create external table src_data(line string) row format delimited fields terminated by '\n' stored as textfile location '/user/guest/input/';

Method 2: Create a table named src_data1 and load data into it.

create table src_data1 (eid int, name String, salary String, destination String) row format delimited fields terminated by ',';

load data inpath '/tttt/test.txt' into table src_data1;

NOTICE: Data from the Object Storage Service cannot be loaded into tables created using Method 2.
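For the load data statement in Method 2 to work, the input file must use the same field delimiter declared in the CREATE TABLE statement (a comma here). The following Python sketch produces a sample comma-delimited file matching the src_data1 schema; the row values are made up purely for illustration:

```python
# Hypothetical sample data matching the src_data1 schema from Method 2:
# (eid int, name String, salary String, destination String),
# fields terminated by ','.
rows = [
    (1201, "Gopal", "45000", "Technical manager"),
    (1202, "Manisha", "45000", "Proof reader"),
]

# Write one comma-delimited line per row, as expected by the table's
# ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' declaration.
with open("test.txt", "w") as f:
    for row in rows:
        f.write(",".join(str(field) for field in row) + "\n")
```

The resulting file can then be placed at the path referenced by the load data inpath statement.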

Step 4 Enter the Spark SQL statement for querying a table.

The syntax is:

DESC table_name;

For example:

desc src_data;

Step 5 Enter the Spark SQL statement for deleting a table.

The syntax is:

DROP TABLE [IF EXISTS] table_name;

For example:


drop table src_data;

Step 6 Click Check to verify that the statement is correct.

Step 7 Click Submit.

After the Spark SQL statement is submitted, you can check the result in Execution Result.

----End
