
MapReduce Service

User Guide

Issue 01

Date 2018-09-06

Page 2: User Guide · MapReduce Service

Contents

1 Overview......................................................................................................................................... 11.1 Introduction.................................................................................................................................................................... 11.2 Application Scenarios.....................................................................................................................................................11.3 List of MRS Component Versions..................................................................................................................................21.4 Functions........................................................................................................................................................................ 21.4.1 Cluster Management Function.....................................................................................................................................31.4.2 Hadoop.........................................................................................................................................................................41.4.3 Spark............................................................................................................................................................................51.4.4 Spark SQL................................................................................................................................................................... 51.4.5 HBase...........................................................................................................................................................................61.4.6 Hive............................................................................................................................................................................. 61.4.7 Hue...............................................................................................................................................................................81.4.8 Kerberos Authentication..............................................................................................................................................81.4.9 Kafka........................................................................................................................................................................... 91.4.10 Storm......................................................................................................................................................................... 91.4.11 CarbonData.............................................................................................................................................................. 101.4.12 Flume....................................................................................................................................................................... 111.4.13 Loader...................................................................................................................................................................... 111.5 Relationships with Other Services................................................................................................................................121.6 Required Permission for Using MRS........................................................................................................................... 
121.7 Limitations....................................................................................................................................................................13

2 MRS Quick Start..........................................................................................................................152.1 Introduction to the Operation Process.......................................................................................................................... 152.2 Quick Start....................................................................................................................................................................162.2.1 Creating a Cluster...................................................................................................................................................... 162.2.2 Managing Files.......................................................................................................................................................... 172.2.3 Creating a Job............................................................................................................................................................ 19

3 Security.......................................................................................................................................... 233.1 Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled..........................................23

4 Cluster Operation Guide............................................................................................................244.1 Overview...................................................................................................................................................................... 244.2 Cluster List................................................................................................................................................................... 25

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) ii

Page 3: User Guide · MapReduce Service

4.3 Creating a Cluster......................................................................................................................................................... 294.4 Creating the Smallest Cluster....................................................................................................................................... 414.5 Creating a Cluster (History Versions)...........................................................................................................................434.6 Managing Active Clusters............................................................................................................................................ 774.6.1 Viewing Basic Information About an Active Cluster................................................................................................774.6.2 Viewing Patch Information About an Active Cluster................................................................................................814.6.3 Accessing the Cluster Management Page..................................................................................................................824.6.4 Expanding a Cluster...................................................................................................................................................824.6.5 Shrinking a Cluster.................................................................................................................................................... 854.6.6 Performing Auto Scaling for a Cluster...................................................................................................................... 874.6.7 Terminating a Cluster................................................................................................................................................ 914.6.8 Deleting a Failed Task............................................................................................................................................... 924.6.9 Managing Jobs in an Active Cluster..........................................................................................................................924.6.10 Managing Data Files................................................................................................................................................924.6.11 Viewing the Alarm List............................................................................................................................................954.6.12 Configuring Message Notification.......................................................................................................................... 964.6.13 O&M Authorization................................................................................................................................................ 984.6.14 Sharing Logs............................................................................................................................................................984.7 Managing Historical Clusters....................................................................................................................................... 994.7.1 Viewing Basic Information About a Historical Cluster.............................................................................................994.7.2 Viewing Job Configurations in a Historical Cluster................................................................................................ 
1034.8 Managing Jobs............................................................................................................................................................1044.8.1 Introduction to Jobs................................................................................................................................................. 1044.8.2 Adding a Jar or Script Job....................................................................................................................................... 1064.8.3 Submitting a Spark SQL Statement.........................................................................................................................1094.8.4 Viewing Job Configurations and Logs.....................................................................................................................1114.8.5 Stopping Jobs........................................................................................................................................................... 1114.8.6 Replicating Jobs.......................................................................................................................................................1124.8.7 Deleting Jobs............................................................................................................................................................1144.9 Querying Operation Logs........................................................................................................................................... 1154.10 Managing Cluster Tags............................................................................................................................................. 1164.11 Bootstrap Actions......................................................................................................................................................1184.11.1 Introduction to Bootstrap Actions..........................................................................................................................1184.11.2 Preparing the Bootstrap Action Script................................................................................................................... 1194.11.3 Adding a Bootstrap Action.................................................................................................................................... 1194.11.4 View Execution Records........................................................................................................................................1214.11.5 Sample Scripts....................................................................................................................................................... 122

5 Remote Operation Guide.........................................................................................................1255.1 Overview.................................................................................................................................................................... 1255.2 Logging In to a Master Node......................................................................................................................................1265.2.1 Logging In to an ECS Using VNC.......................................................................................................................... 1265.2.2 Logging In to a Linux ECS Using a Key Pair (SSH).............................................................................................. 128

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) iii

Page 4: User Guide · MapReduce Service

5.2.3 Logging In to a Linux ECS Using a Password (SSH).............................................................................................1285.3 Viewing Active and Standby Nodes........................................................................................................................... 1285.4 Client Management.....................................................................................................................................................1295.4.1 Updating the Client..................................................................................................................................................1295.4.2 Using the Client on a Cluster Node......................................................................................................................... 1325.4.3 Using the Client on Another Node of a VPC.......................................................................................................... 134

6 MRS Manager Operation Guide............................................................................................ 1376.1 MRS Manager Introduction........................................................................................................................................1376.2 Accessing MRS Manager........................................................................................................................................... 1406.3 Accessing MRS Manager Supporting Kerberos Authentication................................................................................1416.4 Viewing Running Tasks in a Cluster.......................................................................................................................... 1426.5 Monitoring Management............................................................................................................................................ 1436.5.1 Viewing the System Overview................................................................................................................................ 1436.5.2 Configuring a Monitoring History Report...............................................................................................................1446.5.3 Managing Service and Host Monitoring................................................................................................................. 1456.5.4 Managing Resource Distribution.............................................................................................................................1496.5.5 Configuring Monitoring Metric Dumping...............................................................................................................1506.6 Alarm Management.................................................................................................................................................... 1526.6.1 Viewing and Manually Clearing an Alarm..............................................................................................................1526.6.2 Configuring an Alarm Threshold............................................................................................................................ 1536.6.3 Configuring Syslog Northbound Interface.............................................................................................................. 1546.6.4 Configuring SNMP Northbound Interface.............................................................................................................. 1576.7 Alarm Reference.........................................................................................................................................................1596.7.1 ALM-12001 Audit Log Dump Failure.................................................................................................................... 1596.7.2 ALM-12002 HA Resource Is Abnormal................................................................................................................. 1616.7.3 ALM-12004 OLdap Resource Is Abnormal............................................................................................................1636.7.4 ALM-12005 OKerberos Resource Is Abnormal..................................................................................................... 1646.7.5 ALM-12006 Node Fault.......................................................................................................................................... 
1666.7.6 ALM-12007 Process Fault.......................................................................................................................................1676.7.7 ALM-12010 Manager Heartbeat Interruption Between the Active and Standby Nodes.........................................1696.7.8 ALM-12011 Manager Data Synchronization Exception Between the Active and Standby Nodes........................ 1706.7.9 ALM-12012 NTP Service Is Abnormal.................................................................................................................. 1726.7.10 ALM-12016 CPU Usage Exceeds the Threshold..................................................................................................1746.7.11 ALM-12017 Insufficient Disk Capacity................................................................................................................ 1766.7.12 ALM-12018 Memory Usage Exceeds the Threshold............................................................................................1786.7.13 ALM-12027 Host PID Usage Exceeds the Threshold...........................................................................................1796.7.14 ALM-12028 Number of Processes in the D State on the Host Exceeds the Threshold........................................ 1816.7.15 ALM-12031 User omm or Password Is About to Expire......................................................................................1836.7.16 ALM-12032 User ommdba or Password Is About to Expire................................................................................ 1846.7.17 ALM-12033 Slow Disk Fault................................................................................................................................ 1866.7.18 ALM-12034 Periodic Backup Failure................................................................................................................... 1876.7.19 ALM-12035 Unknown Data Status After Recovery Task Failure........................................................................ 188

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) iv

Page 5: User Guide · MapReduce Service

6.7.20 ALM-12037 NTP Server Is Abnormal..................................................................................................................1896.7.21 ALM-12038 Monitoring Indicator Dump Failure................................................................................................. 1916.7.22 ALM-12039 GaussDB Data Is Not Synchronized................................................................................................ 1936.7.23 ALM-12040 Insufficient System Entropy............................................................................................................. 1956.7.24 ALM-13000 ZooKeeper Service Unavailable.......................................................................................................1976.7.25 ALM-13001 Available ZooKeeper Connections Are Insufficient........................................................................ 1996.7.26 ALM-13002 ZooKeeper Heap Memory or Direct Memory Usage Exceeds the Threshold................................. 2026.7.27 ALM-14000 HDFS Service Unavailable.............................................................................................................. 2036.7.28 ALM-14001 HDFS Disk Usage Exceeds the Threshold.......................................................................................2056.7.29 ALM-14002 DataNode Disk Usage Exceeds the Threshold.................................................................................2076.7.30 ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold....................................................................2086.7.31 ALM-14004 Number of Damaged HDFS Blocks Exceeds the Threshold............................................................2106.7.32 ALM-14006 Number of HDFS Files Exceeds the Threshold............................................................................... 2116.7.33 ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold............................................................. 2126.7.34 ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold............................................................... 2146.7.35 ALM-14009 Number of Dead DataNodes Exceeds the Threshold....................................................................... 2156.7.36 ALM-14010 NameService Service Is Abnormal.................................................................................................. 2176.7.37 ALM-14011 HDFS DataNode Data Directory Is Not Configured Properly......................................................... 2206.7.38 ALM-14012 HDFS JournalNode Data Is Not Synchronized................................................................................2236.7.39 ALM-16000 Percentage of Sessions Connected to the HiveServer to Maximum Number Allowed Exceeds theThreshold.......................................................................................................................................................................... 
2246.7.40 ALM-16001 Hive Warehouse Space Usage Exceeds the Threshold.....................................................................2266.7.41 ALM-16002 Successful Hive SQL Operations Are Lower than the Threshold....................................................2286.7.42 ALM-16004 Hive Service Unavailable.................................................................................................................2306.7.43 ALM-18000 Yarn Service Unavailable................................................................................................................. 2336.7.44 ALM-18002 NodeManager Heartbeat Lost...........................................................................................................2356.7.45 ALM-18003 NodeManager Unhealthy................................................................................................................. 2366.7.46 ALM-18006 MapReduce Job Execution Timeout.................................................................................................2376.7.47 ALM-19000 HBase Service Unavailable.............................................................................................................. 2396.7.48 ALM-19006 HBase Replication Synchronization Failed......................................................................................2406.7.49 ALM-25000 LdapServer Service Unavailable...................................................................................................... 2436.7.50 ALM-25004 Abnormal LdapServer Data Synchronization.................................................................................. 2456.7.51 ALM-25500 KrbServer Service Unavailable........................................................................................................ 2476.7.52 ALM-27001 DBService Unavailable.................................................................................................................... 2496.7.53 ALM-27003 DBService Heartbeat Interruption Between the Active and Standby Nodes................................... 2516.7.54 ALM-27004 Data Inconsistency Between Active and Standby DBServices........................................................2526.7.55 ALM-28001 Spark Service Unavailable............................................................................................................... 2556.7.56 ALM-26051 Storm Service Unavailable...............................................................................................................2566.7.57 ALM-26052 Number of Available Supervisors in Storm Is Lower Than the Threshold......................................2586.7.58 ALM-26053 Slot Usage of Storm Exceeds the Threshold.................................................................................... 2606.7.59 ALM-26054 Heap Memory Usage of Storm Nimbus Exceeds the Threshold......................................................2616.7.60 ALM-38000 Kafka Service Unavailable...............................................................................................................2636.7.61 ALM-38001 Insufficient Kafka Disk Space..........................................................................................................265

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) v

Page 6: User Guide · MapReduce Service

6.7.62 ALM-38002 Heap Memory Usage of Kafka Exceeds the Threshold................................................................... 2676.7.63 ALM-24000 Flume Service Unavailable.............................................................................................................. 2696.7.64 ALM-24001 Flume Agent Is Abnormal................................................................................................................2706.7.65 ALM-24003 Flume Client Connection Failure..................................................................................................... 2726.7.66 ALM-24004 Flume Fails to Read Data................................................................................................................. 2746.7.67 ALM-24005 Data Transmission by Flume Is Abnormal.......................................................................................2766.7.68 ALM-12041 Permission of Key Files Is Abnormal.............................................................................................. 2786.7.69 ALM-12042 Key File Configurations Are Abnormal...........................................................................................2796.7.70 ALM-23001 Loader Service Unavailable............................................................................................................. 2816.7.71 ALM-12357 Failed to Export Audit Logs to the OBS.......................................................................................... 2846.7.72 ALM-12014 Partition Lost.................................................................................................................................... 2866.7.73 ALM-12015 Partition Filesystem Readonly..........................................................................................................2876.7.74 ALM-12043 DNS Resolution Duration Exceeds the Threshold........................................................................... 2886.7.75 ALM-12045 Network Read Packet Dropped Rate Exceeds the Threshold.......................................................... 2916.7.76 ALM-12046 Network Write Packet Dropped Rate Exceeds the Threshold..........................................................2966.7.77 ALM-12047 Network Read Packet Error Rate Exceeds the Threshold................................................................ 2976.7.78 ALM-12048 Network Write Packet Error Rate Exceeds the Threshold............................................................... 2996.7.79 ALM-12049 Network Read Throughput Rate Exceeds the Threshold................................................................. 3016.7.80 ALM-12050 Network Write Throughput Rate Exceeds the Threshold................................................................ 3036.7.81 ALM-12051 Disk Inode Usage Exceeds the Threshold........................................................................................3056.7.82 ALM-12052 TCP Temporary Port Usage Exceeds the Threshold........................................................................ 3076.7.83 ALM-12053 File Handle Usage Exceeds the Threshold.......................................................................................3096.7.84 ALM-12054 The Certificate File Is Invalid...........................................................................................................3116.7.85 ALM-12055 The Certificate File Is About to Expire............................................................................................ 
3136.7.86 ALM-18008 Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold....................................... 3166.7.87 ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold.............................3186.7.88 ALM-20002 Hue Service Unavailable.................................................................................................................. 3196.7.89 ALM-43001 Spark Service Unavailable............................................................................................................... 3226.7.90 ALM-43006 Heap Memory Usage of the JobHistory Process Exceeds the Threshold........................................ 3236.7.91 ALM-43007 Non-Heap Memory Usage of the JobHistory Process Exceeds the Threshold................................ 3256.7.92 ALM-43008 Direct Memory Usage of the JobHistory Process Exceeds the Threshold.......................................3266.7.93 ALM-43009 JobHistory GC Time Exceeds the Threshold................................................................................... 3286.7.94 ALM-43010 Heap Memory Usage of the JDBCServer Process Exceeds the Threshold......................................3296.7.95 ALM-43011 Non-Heap Memory Usage of the JDBCServer Process Exceeds the Threshold..............................3316.7.96 ALM-43012 Direct Memory Usage of the JDBCServer Process Exceeds the Threshold.................................... 3326.7.97 ALM-43013 JDBCServer GC Time Exceeds the Threshold.................................................................................3346.8 Object Management....................................................................................................................................................3356.8.1 Introduction............................................................................................................................................................. 3356.8.2 Querying Configurations......................................................................................................................................... 3366.8.3 Managing Services.................................................................................................................................................. 3376.8.4 Configuring Service Parameters..............................................................................................................................3376.8.5 Configuring Customized Service Parameters..........................................................................................................339

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) vi

Page 7: User Guide · MapReduce Service

6.8.6 Synchronizing Service Configurations....................................................................................................................3406.8.7 Managing Role Instances.........................................................................................................................................3416.8.8 Configuring Role Instance Parameters.................................................................................................................... 3416.8.9 Synchronizing Role Instance Configuration............................................................................................................3426.8.10 Decommissioning and Recommissioning Role Instances..................................................................................... 3436.8.11 Managing a Host....................................................................................................................................................3446.8.12 Isolating a Host......................................................................................................................................................3446.8.13 Canceling Isolation of a Host................................................................................................................................ 3456.8.14 Starting and Stopping a Cluster............................................................................................................................. 3456.8.15 Synchronizing Cluster Configurations.................................................................................................................. 3466.8.16 Exporting Configuration Data of a Cluster............................................................................................................3466.9 Log Management........................................................................................................................................................3476.9.1 Viewing and Exporting Audit Logs.........................................................................................................................3476.9.2 Exporting Services Logs..........................................................................................................................................3486.9.3 Configuring Audit Log Export Parameters............................................................................................................. 3496.10 Health Check Management...................................................................................................................................... 3506.10.1 Performing a Health Check................................................................................................................................... 3516.10.2 Viewing and Exporting a Check Report................................................................................................................ 
3526.10.3 Configuring the Number of Health Check Reports to Be Reserved......................................................................3526.10.4 Managing Health Check Reports...........................................................................................................................3536.10.5 DBService Health Check.......................................................................................................................................3536.10.6 Flume Health Check.............................................................................................................................................. 3546.10.7 HBase Health Check..............................................................................................................................................3546.10.8 Host Health Check.................................................................................................................................................3556.10.9 HDFS Health Check.............................................................................................................................................. 3626.10.10 Hive Health Check...............................................................................................................................................3626.10.11 Kafka Health Check.............................................................................................................................................3636.10.12 KrbServer Health Check......................................................................................................................................3646.10.13 LdapServer Health Check....................................................................................................................................3656.10.14 Loader Health Check........................................................................................................................................... 3666.10.15 MapReduce Health Check................................................................................................................................... 3676.10.16 OMS Health Check..............................................................................................................................................3676.10.17 Spark Health Check............................................................................................................................................. 3726.10.18 Storm Health Check.............................................................................................................................................3726.10.19 Yarn Health Check...............................................................................................................................................3736.10.20 ZooKeeper Health Check.................................................................................................................................... 3736.11 Static Service Pool Management.............................................................................................................................. 
3746.11.1 Viewing the Status of a Static Service Pool...........................................................................................................3746.11.2 Configuring a Static Service Pool..........................................................................................................................3756.12 Tenant Management..................................................................................................................................................3786.12.1 Introduction........................................................................................................................................................... 3786.12.2 Creating a Tenant...................................................................................................................................................379

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) vii

Page 8: User Guide · MapReduce Service

6.12.3 Creating a Sub-tenant............................................................................................................................................ 3826.12.4 Deleting a Tenant...................................................................................................................................................3846.12.5 Managing a Tenant Directory................................................................................................................................ 3856.12.6 Recovering Tenant Data........................................................................................................................................ 3866.12.7 Creating a Resource Pool...................................................................................................................................... 3876.12.8 Modifying a Resource Pool................................................................................................................................... 3886.12.9 Deleting a Resource Pool...................................................................................................................................... 3886.12.10 Configuring a Queue........................................................................................................................................... 3896.12.11 Configuring the Queue Capacity Policy of a Resource Pool...............................................................................3906.12.12 Clearing the Configuration of a Queue................................................................................................................3916.13 Backup and Restoration............................................................................................................................................3916.13.1 Introduction........................................................................................................................................................... 3916.13.2 Backing Up Metadata............................................................................................................................................ 3946.13.3 Recovering Metadata.............................................................................................................................................3966.13.4 Modifying a Backup Task......................................................................................................................................3996.13.5 Viewing Backup and Recovery Tasks................................................................................................................... 4006.14 Security Management............................................................................................................................................... 4016.14.1 List of Default Users..............................................................................................................................................4016.14.2 Changing the Password for an OS User................................................................................................................ 4056.14.3 Changing the Password for User admin................................................................................................................ 
4066.14.4 Changing the Password for the Kerberos Administrator.......................................................................................4076.14.5 Changing the Password for the OMS Kerberos Administrator............................................................................. 4086.14.6 Changing the Password for the LDAP Administrator and the LDAP User (including OMS LDAP).................. 4096.14.7 Changing the Password for a Component Running User...................................................................................... 4106.14.8 Changing the Password for the OMS Database Administrator............................................................................. 4116.14.9 Changing the Password for the Data Access User of the OMS Database............................................................. 4126.14.10 Changing the Password for a Component Database User................................................................................... 4136.14.11 Replacing HA Certificates................................................................................................................................... 4146.14.12 Updating the Key of a Cluster............................................................................................................................. 4156.15 Patch Operation Guide..............................................................................................................................................4166.15.1 Patch Operation Guide for Versions Earlier than MRS 1.7.0................................................................................4166.15.2 Patch Operation Guide for MRS 1.7.0 or Later.....................................................................................................4186.16 Restoring Patches for the Isolated Hosts.................................................................................................................. 419

7 Management of Clusters with Kerberos Authentication Enabled.................................. 4207.1 Users and Permissions of Clusters with Kerberos Authentication Enabled...............................................................4207.2 Default Users of Clusters with Kerberos Authentication Enabled............................................................................. 4247.3 Creating a Role........................................................................................................................................................... 4347.4 Creating a User Group................................................................................................................................................4407.5 Creating a User........................................................................................................................................................... 4417.6 Modifying User Information...................................................................................................................................... 4427.7 Locking a User............................................................................................................................................................4437.8 Unlocking a User........................................................................................................................................................ 443

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) viii

Page 9: User Guide · MapReduce Service

7.9 Deleting a User.......................................................................................................................................................... 4447.10 Changing the Password of an Operation User..........................................................................................................4447.11 Initializing the Password of a System User.............................................................................................................. 4457.12 Downloading a User Authentication File................................................................................................................. 4477.13 Modifying a Password Policy...................................................................................................................................4487.14 Configuring Cross-Cluster Mutual Trust Relationships........................................................................................... 4497.15 Configuring Users to Access Resources of a Trusted Cluster.................................................................................. 453

8 Using MRS..................................................................................................................................4558.1 Accessing the UI of the Open Source Component..................................................................................................... 4558.1.1 List of Open Source Component Ports.................................................................................................................... 4558.1.2 Overview................................................................................................................................................................. 4708.1.3 Creating an SSH Channel to Connect an MRS Cluster and Configuring the Browser........................................... 4738.2 Using Hadoop from Scratch....................................................................................................................................... 4768.3 Using Spark from Scratch...........................................................................................................................................4808.4 Using Spark SQL from Scratch.................................................................................................................................. 4848.5 Using HBase from Scratch......................................................................................................................................... 4868.6 Using HBase............................................................................................................................................................... 4898.6.1 Configuring the HBase Replication Function......................................................................................................... 4898.6.2 Enabling the Cross-Cluster Copy Function............................................................................................................. 4988.6.3 Using the ReplicationSyncUp Tool......................................................................................................................... 4998.6.4 Using HIndex...........................................................................................................................................................5018.6.4.1 Introduction to HIndex......................................................................................................................................... 5018.6.4.2 Loading Index Data in Batches............................................................................................................................ 5078.6.4.3 Using an Index Generation Tool...........................................................................................................................5108.6.4.4 Migrating Index Data............................................................................................................................................5138.7 Using Hue................................................................................................................................................................... 
5158.7.1 Accessing the Hue WebUI.......................................................................................................................................5158.7.2 Using HiveQL Editor on the Hue WebUI................................................................................................................5168.7.3 Using the Metadata Browser on the Hue WebUI.................................................................................................... 5188.7.4 Using File Browser on the Hue WebUI...................................................................................................................5218.7.5 Using Job Browser on the Hue WebUI....................................................................................................................5248.8 Using Kafka................................................................................................................................................................5258.8.1 Managing Kafka Topics...........................................................................................................................................5258.8.2 Querying Kafka Topics............................................................................................................................................5268.8.3 Managing Kafka User Permission...........................................................................................................................5268.8.4 Managing Messages in Kafka Topics...................................................................................................................... 5288.9 Using Storm................................................................................................................................................................5298.9.1 Submitting Storm Topologies on the Client............................................................................................................ 5298.9.2 Accessing the Storm WebUI....................................................................................................................................5318.9.3 Managing Storm Topologies....................................................................................................................................5318.9.4 Querying Storm Topology Logs.............................................................................................................................. 5328.10 Using CarbonData.................................................................................................................................................... 533

MapReduce ServiceUser Guide Contents

Issue 01 (2018-09-06) ix

Page 10: User Guide · MapReduce Service

8.10.1 Getting Started with CarbonData.......................................................................................................................... 5338.10.2 About CarbonData Table....................................................................................................................................... 5358.10.3 Creating a CarbonData Table.................................................................................................................................5368.10.4 Deleting a CarbonData Table.................................................................................................................................5388.11 Using Flume..............................................................................................................................................................5388.11.1 Introduction............................................................................................................................................................5398.11.2 Installing the Flume Client.................................................................................................................................... 5408.11.3 Viewing Flume Client Logs...................................................................................................................................5428.11.4 Stopping or Uninstalling the Flume Client............................................................................................................ 5438.11.5 Using the Encryption Tool of the Flume Client.....................................................................................................5448.11.6 Flume Configuration Parameter Description.........................................................................................................5458.11.7 Example: Using Flume to Collect Logs and Import Them to Kafka.....................................................................5618.11.8 Example: Using Flume to Collect Logs and Import Them to OBS.......................................................................5648.11.9 Example: Using Flume to Read OBS Files and Upload Them to HDFS.............................................................. 5668.12 Using Loader............................................................................................................................................................ 5688.12.1 Introduction........................................................................................................................................................... 5688.12.2 Loader Link Configuration....................................................................................................................................5698.12.3 Managing Loader Links.........................................................................................................................................5728.12.4 Source Link Configurations of Loader Jobs..........................................................................................................5738.12.5 Destination Link Configurations of Loader Jobs.................................................................................................. 
5778.12.6 Managing Loader Jobs...........................................................................................................................................5808.12.7 Preparing a Driver for MySQL Database Link......................................................................................................5838.12.8 Example: Using Loader to Import Data from OBS to HDFS................................................................................583

9 MRS Patch Description
9.1 MRS 1.5.1.4 Patch Description
9.2 MRS 1.7.1.2 Patch Description

A ECS Specifications Used by MRS

B Change History



1 Overview

1.1 Introduction

MapReduce Service (MRS) is a data processing and analysis service that is based on a cloud computing platform. It is stable, reliable, scalable, and easy to manage. You can use MRS immediately after applying for it.

MRS builds a reliable, secure, and easy-to-use operation and maintenance (O&M) platform. MRS is capable of processing and analyzing large volumes of data to meet your data storage and processing requirements. You can independently apply for, use, and host the Hadoop, Spark, HBase, and Hive components to quickly create clusters that store and compute, in batches, large volumes of data with low real-time requirements. After data storage and computing tasks are complete, the clusters can be terminated, and no further fees are charged.

1.2 Application Scenarios

MRS can be applied in various industries for the processing, analysis, and storage of massive data.

l Analysis and processing of mass data
  Usage: analysis and processing of massive data sets, online and offline analysis, and business intelligence
  Characteristics: processing of massive data sets, heavy computing workloads, long-term analysis, and data analysis and processing on a large number of computers
  Application scenarios: log analysis, online and offline analysis, simulation calculations in scientific research, biometric analysis, and spatial-temporal data analysis

l Storage of mass data
  Usage: storage and retrieval of massive data sets, and data warehousing
  Characteristics: storage, retrieval, backup, and disaster recovery of massive data sets with zero data loss
  Application scenarios: log storage, file storage, simulation data storage in scientific research, biological characteristic information storage, genetic engineering data storage, and spatial-temporal data storage


l Streaming processing of mass data
  Usage: real-time analysis of mass data, continuous computing, and offline and online message consumption
  Characteristics: massive amounts of data, high throughput, high reliability, flexible scalability, and a distributed real-time computing framework
  Application scenarios: streaming data collection, active tracking on websites, data monitoring, distributed ETL, and risk control

1.3 List of MRS Component Versions

Table 1-1 lists component versions for each MRS cluster version.

Table 1-1 MRS component versions

Component   MRS 1.5.1   MRS 1.6.3   MRS 1.7.1   MRS 1.7.2
Hadoop      2.7.2       2.7.2       2.8.3       2.8.3
Spark       2.1.0       2.1.0       2.2.1       2.2.1
HBase       1.0.2       1.3.1       1.3.1       1.3.1
Hive        1.2.1       1.2.1       1.2.1       1.2.1
Hue         3.11.0      3.11.0      3.11.0      3.11.0
Loader      2.0.0       2.0.0       2.0.0       2.0.0
Kafka       0.10.0.0    0.10.0.0    0.10.2.0    0.10.2.0
Storm       1.0.2       1.0.2       1.0.2       1.0.2
Flume       1.6.0       1.6.0       1.6.0       1.6.0

1.4 Functions

MRS, capable of processing and storing massive sets of data, supports the following features:

l Enhanced open-source Hadoop software

l Spark in-memory computing engine

l HBase distributed storage database

l Hive data warehouse

l Hue web framework (After Kerberos Authentication is set to Enable, the Hue component can be selected.)

It also supports cluster management. To meet service requirements, you should specify the node quantity and data disk space when applying for MRS. Then you need only focus on data analysis.


1.4.1 Cluster Management Function

This section describes the web interface functions of MRS clusters.

MRS provides a web interface, the functions of which are described as follows:

l Creating a cluster:
  Users can create a cluster on MRS. Currently, On-demand and Yearly/Monthly modes are supported. In On-demand mode, nodes are charged by the actual duration of use, with a billing cycle of one hour. In Yearly/Monthly mode, you can pay for nodes by year or month; the minimum cluster duration is one month and the maximum is one year. When fees are being deducted, if a user account has insufficient funds, a message is sent notifying the user to pay a renewal fee. Cluster resources are frozen until the renewal fee has been paid. If no renewal fee is paid, cluster resources are deleted once the freeze period expires. The application scenarios of a cluster are as follows:
  – Data storage and computing are performed separately. Data is stored in the Object Storage Service (OBS), which features low cost and unlimited storage capacity, and clusters can be terminated at any time. The computing performance is determined by the OBS access performance and is lower than that of the Hadoop Distributed File System (HDFS). OBS is recommended when data computing is infrequent.
  – Data storage and computing are performed together. Data is stored in HDFS, which features high cost, high computing performance, and limited storage capacity. Before terminating clusters, you must export and store the data. HDFS is recommended when data computing is frequent.

l Adding Task nodes:
  Task nodes process data but do not store cluster data. When the cluster's data volume does not change much but its service processing capabilities need to be significantly and temporarily improved, add Task nodes to address the following situations:
  – The volume of temporary services is increased, for example, report processing at the end of the year.
  – Long-term tasks must be completed in a short time, for example, some urgent analysis tasks.
l Expanding clusters:
  To expand a cluster and handle peak service loads, add Core nodes or Task nodes.
l Shrinking clusters:
  Reduce the number of Core nodes or Task nodes to shrink the cluster so that MRS delivers better storage and computing capabilities at lower O&M costs based on service requirements.

l Auto Scaling:
  Automatically adjust computing resources based on service requirements and preset policies, so that the number of Task nodes scales out and in with service load changes, ensuring stable service running.

l Managing clusters:
  After completing data processing and analysis, you can manage and terminate clusters.
  – Querying alarms:


    If either the system or a cluster is faulty, Elastic BigData will collect fault information and report it to the network management system. Maintenance personnel will then be able to locate the faults.

  – Querying logs:
    To help locate faults in faulty clusters, operation information is recorded.

  – Managing files:
    MRS supports importing data from the OBS system to HDFS and also exporting data that has already been processed and analyzed. You can store data in HDFS.

l Adding a job:
  A job is an executable program provided by MRS to process and analyze user data. Currently, MRS supports MapReduce jobs, Spark jobs, and Hive jobs, and allows users to submit Spark SQL statements online to query and analyze data.

l Adding tags:
  Tags are cluster identifiers. Adding tags to clusters can help you identify and manage your cluster resources. You can add a maximum of 10 tags to a cluster when creating the cluster, or add them on the details page of the created cluster.

l Adding bootstrap actions:
  Bootstrap actions let you run your own scripts on a specified cluster node before or after big data components are started. You can run bootstrap actions to install third-party software, modify the cluster running environment, and perform other customizations. If you choose to run bootstrap actions when expanding a cluster, the bootstrap actions are run on the newly added nodes in the same way.

l Managing jobs:
  Jobs can be managed, stopped, or deleted. You can also view details of completed jobs along with their detailed configurations. Spark SQL jobs, however, cannot be stopped.

l Providing management interfaces:
  MRS Manager functions as a unified management platform for MRS clusters.
  – Cluster monitoring enables you to quickly see the health status of hosts and services.
  – Graphical indicator monitoring and customization enable you to quickly obtain key information about the system.
  – Service property configurations can meet service performance requirements.
  – Cluster, service, and role instance operations enable you to start or stop services and clusters in one-click mode.

1.4.2 Hadoop

MRS deploys and hosts Apache Hadoop clusters in the cloud to provide services featuring high availability and enhanced reliability for big data processing and analysis.

MRS uses the FusionInsight HD commercial release. Hadoop is a distributed system architecture that consists of HDFS, MapReduce, and Yarn. The functions of each component are as follows:
l HDFS:
  HDFS provides high-throughput data access and is applicable to the processing of large data sets. MRS cluster data is stored in HDFS.


l MapReduce:
  As a programming model that simplifies parallel computing, MapReduce gets its name from two key operations: Map and Reduce. Map divides one task into multiple tasks, and Reduce summarizes their processing results and produces the final analysis result. MRS clusters allow users to submit self-developed MapReduce programs, execute the programs, and obtain the results.

l Yarn:
  As the resource management system of Hadoop, Yarn manages and schedules resources for applications. MRS uses Yarn to schedule and manage cluster resources.

For details about Hadoop architecture and principles, see http://hadoop.apache.org/docs/stable/index.html.

1.4.3 Spark

Spark is a distributed and parallel data processing framework. MRS deploys and hosts Apache Spark clusters in the cloud.

MRS uses the FusionInsight Spark commercial release, which has been officially certified by Databricks.

Spark is a fault-tolerant, memory-based distributed computing framework, which ensures that data can be quickly restored and recalculated. It is more efficient than MapReduce for iterative data computing.

In the Hadoop ecosystem, Spark and Hadoop are seamlessly interconnected. By using HDFS for data storage and Yarn for resource management and scheduling, users can switch from MapReduce to Spark quickly.

Spark applies to the following scenarios:
l Data processing and ETL (extract, transform, and load)
l Machine learning
l Interactive analysis
l Iterative computing and data reuse: users benefit more from Spark when they perform operations frequently and the volume of the required data is large.
l On-demand capacity expansion, owing to Spark's ease of use and low cost in the cloud.

For details about Spark architecture and principles, see http://spark.apache.org/docs/2.1.0/quick-start.html.

1.4.4 Spark SQL

Spark SQL is an important component of Apache Spark and subsumes Shark. It helps engineers unfamiliar with MapReduce get started quickly. Users can enter SQL statements directly to analyze, process, and query data.

Spark SQL has the following highlights:
l Is compatible with most Hive syntax, which enables seamless switchovers.
l Is compatible with standard SQL syntax.
l Resolves data skew problems.
  Spark SQL can join and convert skewed data. It evenly distributes data that does not contain skewed keys to different tasks for processing. For data that contains skewed keys, Spark SQL broadcasts the smaller data set and uses a map-side join to evenly distribute the data to different tasks for processing. This fully utilizes CPU resources and improves performance.
l Optimizes small files.
  Spark SQL employs the coalesce operator to process small files and combines partitions generated by small files in tables. This reduces the number of hash buckets during a shuffle operation and improves performance.

For details about Spark SQL architecture and principles, see http://spark.apache.org/docs/2.1.0/programming-guide.html.

1.4.5 HBase

HBase is a column-oriented distributed cloud storage system. It features enhanced reliability, excellent performance, and elastic scalability.

It is applicable to distributed computing and the storage of massive data. With HBase, users can filter and analyze data with ease and get responses in milliseconds, thereby rapidly mining data.

HBase applies to the following scenarios:
l Massive data storage
  Users can use HBase to build a storage system capable of storing terabytes or petabytes of data. HBase also provides dynamic scaling capabilities so that users can adjust cluster resources to meet specific performance or capacity requirements.
l Real-time query
  The columnar and key-value storage models suit the ad-hoc querying of enterprise user details. Low-latency point queries based on the primary key reduce the response latency to seconds or milliseconds, facilitating real-time data analysis.

HBase has the following highlights:

l Provides automatic Region recovery from exceptions, ensuring reliable data access.
l Enables data imported to the active cluster using BulkLoad to be automatically synchronized to the disaster recovery backup cluster. HBase also enhances the Replication feature, for example, supporting table structure synchronization, data synchronization between tables with system permissions, and the cluster read-only function.
l Improves performance of the BulkLoad feature, accelerating data import.

For details about HBase architecture and principles, see http://hbase.apache.org/book.html.

1.4.6 Hive

Hive is a data warehouse framework built on Hadoop. It stores structured data using the Hive query language (HiveQL), a language similar to SQL.

Hive converts HiveQL statements to MapReduce or HDFS tasks to query and analyze massive data stored in Hadoop clusters. The console provides an interface for entering Hive Script and supports the online submission of HiveQL statements.

Hive supports the HDFS Colocation, column encryption, HBase deletion, row delimiter, and CSV SerDe functions, as detailed below.


HDFS Colocation

HDFS Colocation is the data location control function provided by HDFS. The HDFS Colocation interface stores associated data, or data on which associated operations are performed, on the same storage node.

Hive supports the HDFS Colocation function. When Hive tables are created and locator information is set for the table files, the data files of related tables are stored on the same storage node. This ensures convenient and efficient data computing among associated tables.
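For illustration only, a sketch of how locator information might be set at table creation; the table definition and the groupId/locatorId property names are assumptions drawn from common FusionInsight Hive usage, so verify them against your cluster's Hive documentation:

CREATE TABLE tab1 (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  TBLPROPERTIES ('groupId'='group1', 'locatorId'='locator1');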

Column Encryption

Hive supports encryption of one or more columns. The columns to be encrypted and the encryption algorithm can be specified when a Hive table is created. When data is inserted into the table using the insert statement, the related columns are encrypted.

The Hive column encryption mechanism supports two encryption algorithms, which can be selected during table creation to meet site requirements:

l AES (the encryption class is org.apache.hadoop.hive.serde2.AESRewriter)

l SMS4 (the encryption class is org.apache.hadoop.hive.serde2.SMS4Rewriter)
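For illustration, a hedged sketch of a table definition that encrypts one column with AES; the table and column names are hypothetical, and the SERDEPROPERTIES keys are assumptions consistent with the encryption classes above, so verify them against your cluster's Hive documentation:

CREATE TABLE user_info (id INT, name STRING, phone STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES (
    'column.encode.columns' = 'phone',
    'column.encode.classname' = 'org.apache.hadoop.hive.serde2.AESRewriter')
  STORED AS TEXTFILE;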

HBase Deletion

Due to the limitations of the underlying storage systems, Hive does not support deleting a single row of table data. However, in Hive on HBase scenarios, MRS Hive supports deleting a single row of HBase table data. Using a specific syntax, Hive can delete one or more rows of data from an HBase table.
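As a hedged sketch only (the table name and condition are hypothetical, and this remove table syntax is an MRS-specific extension, so verify it against your cluster's Hive documentation):

remove table hbase_table1 where id = '1';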

Row Delimiter

In most cases, a carriage return character is used as the row delimiter in Hive tables stored in text files; that is, the carriage return character is used as the terminator of a row during searches. However, some data files are delimited by special characters rather than a carriage return character.

MRS Hive allows users to use different characters or character combinations to delimit rows of Hive text data. When creating a table, set inputformat to SpecifiedDelimiterInputFormat, and set the following parameter before each search:

set hive.textinput.record.delimiter='';

The table data is then queried using the specified delimiter.
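A hedged, end-to-end sketch follows; the table name, columns, and delimiter value are hypothetical, and the fully qualified input format class is an assumption based on common MRS Hive usage:

CREATE TABLE log_text (time_stamp STRING, num STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.contrib.fileformat.SpecifiedDelimiterInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

set hive.textinput.record.delimiter='!@!';
SELECT * FROM log_text;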

CSV SerDe

Comma-separated values (CSV) is a common text file format. A CSV file stores table data (numbers and text) as plain text and uses a comma (,) as the field delimiter.

CSV files are universal. Many applications, such as Windows office suites and conventional databases, allow users to view and edit CSV files.

MRS Hive supports CSV files. Users can import CSV files into Hive tables, or export Hive table data as CSV files for use in other applications.
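For example, a hedged sketch using the open-source OpenCSVSerde bundled with Hive (the table and column names are hypothetical):

CREATE TABLE csv_demo (id STRING, name STRING, city STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  WITH SERDEPROPERTIES ('separatorChar' = ',', 'quoteChar' = '"')
  STORED AS TEXTFILE;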


1.4.7 Hue

Hue is a web application developed based on the open-source Django Python web framework. It provides graphical user interfaces (GUIs) for users to configure, use, and view MRS clusters. Hue supports HDFS, Hive, MapReduce, and ZooKeeper in MRS clusters, including the following application scenarios:
l HDFS: You can create, view, modify, upload, and download files, as well as create directories and modify directory permissions.
l Hive: You can edit and execute HiveQL and add, delete, modify, and query databases, tables, and views through the MetaStore.
l MapReduce: You can check MapReduce tasks that are running or have finished in the clusters, including their status, start and end times, and run logs.
l ZooKeeper: You can check ZooKeeper status in the clusters.
l Sqoop: You can create, configure, run, and check Sqoop jobs.

For details about Hue, visit http://gethue.com/.

1.4.8 Kerberos Authentication

Overview

To ensure data security for users, MRS clusters provide user identity verification and user authentication functions. To enable all verification and authentication functions, you must enable Kerberos authentication when creating the cluster.

Identity Verification

The user identity verification function verifies the identity of a user when the user performs O&M operations or accesses service data in a cluster.

When a user performs certain operations in an MRS cluster on MRS Manager, for example, restarting services or synchronizing cluster configurations, the user must enter the password of the current account on MRS Manager.

Authentication

Users with different identities may have different permissions to access and use cluster resources. To ensure data security, users must be authenticated after identity verification.

Identity Verification

Clusters that support Kerberos authentication use the Kerberos protocol for identity verification. The Kerberos protocol supports mutual verification between clients and servers, which eliminates the risks incurred by sending user credentials over the network for simulated verification. In MRS clusters, KrbServer provides the Kerberos authentication function.

Kerberos User Object

In the Kerberos protocol, each user object is a principal. A complete principal consists of two parts: a username and a domain name. In O&M or application development scenarios, the user identity must be verified before a client connects to a server. Users for O&M and service operations in MRS clusters are classified into Human-machine and Machine-machine users. The password of a Human-machine user is manually configured, while the password of a Machine-machine user is randomly generated by the system.


Kerberos Authentication

Kerberos supports two authentication modes: password and keytab. The default verification validity period is 24 hours.

l Password verification: User identity is verified by entering the correct password. This mode mainly applies to O&M scenarios where Human-machine users are used. The command is kinit <username>.
l Keytab verification: Keytab files contain users' security information. During keytab verification, the system automatically uses the encrypted credential information for verification, so users do not need to enter a password. This mode mainly applies to component application development scenarios where Machine-machine users are used. Keytab verification can also be configured using the kinit command.
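For example (the user names and keytab path here are placeholders rather than values defined by MRS): a Human-machine user might run kinit admin and enter the account password when prompted, while an application using a Machine-machine user might run kinit -kt /opt/client/user.keytab developuser, where the -kt option points to the keytab file downloaded for that user.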

Authentication

After identity verification, the MRS system also authenticates users to ensure that they have limited or full permissions on cluster resources. If a user does not have permission to access cluster resources, the system administrator must grant the required permission to the user; otherwise, the user cannot access the resources.

1.4.9 Kafka

MRS deploys and hosts Kafka clusters in the cloud based on the open-source Apache Kafka. Kafka is a distributed, partitioned, and replicated message publishing and subscription system. It provides features similar to Java Message Service (JMS) and has the following enhancements:

l Message persistence
  Messages are persisted in the storage space of clusters and can be used by batch consumption and real-time application programs. Data persistence prevents data loss.
l High throughput
  High throughput is provided for message publishing and subscription.
l Reliability
  Message processing methods such as At-Least-Once, At-Most-Once, and Exactly-Once are provided.
l Distribution
  A distributed system is easy to expand. When new Core nodes are added for capacity expansion, the MRS cluster detects the nodes on which Kafka is installed and adds them to the cluster without interrupting services.

Kafka applies to online and offline message consumption. It is ideal for network service data collection scenarios, such as conventional data collection, active tracking on websites, data monitoring, and log collection.

For details about Kafka architecture and principles, see https://kafka.apache.org/0100/documentation.html.

1.4.10 Storm

MRS deploys and hosts Storm clusters in the cloud based on the open-source Apache Storm. Storm is a distributed, reliable, and fault-tolerant computing system that processes large volumes of streaming data in real time. It is applicable to real-time analysis, continuous computing, and distributed extract, transform, and load (ETL). It has the following features:

l Distributed real-time computing
  In a Storm cluster, each node runs multiple work processes, each work process creates multiple threads, each thread executes multiple tasks, and each task processes data concurrently.
l Fault tolerance
  During message processing, if a node or a process is faulty, the message processing unit can be redeployed.
l Reliable messages
  The At-Least-Once, At-Most-Once, and Exactly-Once data processing methods are supported.
l Flexible topology definition and deployment
  The Flux framework is used to define and deploy service topologies. If the service DAG is changed, users only need to modify the YAML domain-specific language (DSL) definition, without recompiling or repackaging service code.
l Integration with external components
  Storm supports integration with external components such as Kafka, HDFS, and HBase, which facilitates the implementation of services that involve multiple data sources.

For details about Storm architecture and principles, see http://storm.apache.org/.

1.4.11 CarbonData

CarbonData is a new Apache Hadoop file format. It adopts advanced column-oriented storage, index, compression, and encoding technologies and stores data in HDFS to improve computing efficiency. It helps accelerate PB-level data queries and is applicable to faster interactive queries. CarbonData is also a high-performance analysis engine that integrates data sources with Spark. Users can execute Spark SQL statements to query and analyze the data.

CarbonData has the following features:

l SQL
  CarbonData is compatible with Spark SQL and supports SQL query operations performed through Spark SQL.
l Simple definition of table data sets
  CarbonData supports defining and creating data sets by using user-friendly Data Definition Language (DDL) statements. CarbonData DDL is flexible, easy to use, and able to define complex tables.
l Convenient data management
  CarbonData provides various data management functions for data loading and maintenance. It can load historical data and incrementally load new data. Loaded data can be deleted according to its loading time, and specific data loading operations can be canceled.
l Quick query response
  CarbonData features high-performance queries. It uses dedicated data formats and applies multiple index technologies, global dictionary encoding, and multiple push-down optimizations. The query speed can be about 10 times that of Spark SQL.


l Efficient data compression
  CarbonData compresses data by combining lightweight and heavyweight compression algorithms. This saves 60% to 80% of data storage space and reduces hardware storage costs.
l Table pre-aggregation
  CarbonData 1.3.1 supports the pre-aggregation feature, which accelerates group-by statistics and detailed-data queries without requiring any SQL statement modifications. In this way, one copy of data serves multiple application scenarios.
l Real-time storage and query
  You can use Structured Streaming to import data to CarbonData tables in real time and immediately query the data.
l Partition table creation
  CarbonData 1.3.1 enables you to create partitioned tables. You can use any column to create partitions to accelerate queries.
l Table permission control
  CarbonData 1.3.1 supports table permission control. You need permissions to operate databases and tables.
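As a hedged illustration of the Spark SQL interface to CarbonData (the table, columns, and HDFS path below are hypothetical; the STORED BY clause follows the CarbonData 1.3.x DDL):

CREATE TABLE sales_info (id INT, name STRING, city STRING, sales INT)
  STORED BY 'carbondata';

LOAD DATA INPATH 'hdfs://hacluster/tmp/sales.csv' INTO TABLE sales_info;

SELECT city, sum(sales) FROM sales_info GROUP BY city;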

For details about CarbonData architecture and principles, see http://carbondata.apache.org/.

1.4.12 Flume

Flume is a distributed and highly available system for massive log aggregation. Users can customize data transmitters in Flume to collect data. Flume can also perform simple processing on the data it receives.

Flume provides the following features:

l Collects and aggregates event stream data in a distributed approach.
l Collects log data.
l Supports dynamic configuration updates.
l Provides the context-based routing function.
l Supports load balancing and failover.
l Provides comprehensive scalability.

For details about the Flume architecture and principles, see https://flume.apache.org/releases/1.6.0.html.

1.4.13 Loader

Loader is a data migration component developed based on Apache Sqoop. It accelerates and simplifies data migration between Hadoop and structured, semi-structured, and unstructured data sources. Loader can both import data into and export data out of MRS clusters.

Loader provides the following features:

l Uses a highly available service architecture.
l Supports data migration using a client.
l Manages data migration jobs.
l Supports data processing during migration.


l Runs migration jobs using MapReduce components.

For details about the Loader architecture and principles, see http://sqoop.apache.org/docs/1.99.7/index.html.

1.5 Relationships with Other Services

This section describes the relationships between MRS and other services.

l Virtual Private Cloud (VPC)

MRS clusters are created in the subnets of a VPC. VPCs provide secure, isolated, logical network environments for MRS clusters.

l Object Storage Service (OBS)

OBS stores the following user data:

– MRS job input data, such as user programs and data files

– MRS job output data, such as result files and log files of jobs

In MRS clusters, the HDFS, Hive, MapReduce, Yarn, Spark, Flume, and Loader modules can import data from or export data to OBS.

l Elastic Cloud Server (ECS)

Each node in an MRS cluster is an ECS.

l Identity and Access Management (IAM)

IAM provides authentication for MRS.

l Cloud Trace Service (CTS)

CTS provides operation records, including requests for operating MRS resources and the request results. With CTS, you can record operations associated with MRS for later query, audit, and backtracking.

Table 1-2 MRS operations that can be recorded by CTS

Operation             Resource Type   Trace Name
Creating a cluster    cluster         createCluster
Deleting a cluster    cluster         deleteCluster
Expanding a cluster   cluster         scaleOutCluster
Shrinking a cluster   cluster         scaleInCluster

After you enable CTS, the system starts recording operations on cloud resources. You can view operation records of the last 7 days on the CTS management console. For details, see Querying Real-Time Traces.

1.6 Required Permission for Using MRS

This section describes the permissions required for using MRS.


Permission

By default, the system provides user management permission and resource management permission. The user management permission is used to manage users, user groups, and their permissions. The resource management permission is used to manage user operations on cloud service resources.

See Table 1-3 for MRS permissions.

Table 1-3 Permission list

Permission: MRS operation permission
Description: Users with this permission have full operation rights on MRS resources.
Setting: Two methods are available:
l Add the Tenant Administrator permission to the user group.
l Add the MRS Administrator, Server Administrator, and Tenant Guest permissions to the user group.

Permission: MRS query permission
Description: Users with this permission can:
l View overview information about MRS.
l Query MRS operation logs.
l Query MRS cluster lists, including existing clusters, historical clusters, and task lists.
l View cluster basic information and patch information.
l View job lists and job details.
l Query the HDFS file list and file operation records.
l Query the alarm list.
l Access the MRS Manager portal.
Setting: Add the MRS Administrator or Tenant Guest permission to the user group.
NOTE
For MRS, the Tenant Guest permission and the MRS Administrator permission are the same. The difference is that Tenant Guest also has rights to query other cloud services, while MRS Administrator has rights to query only MRS resources.

1.7 Limitations

Before using MRS, ensure that you have read and understood the following limitations.

l MRS clusters must be created in VPC subnets.


l You are advised to use any of the following browsers to access MRS:
  – Google Chrome 36.0 or later
  – Internet Explorer 9.0 or later
  If you use Internet Explorer 9.0, you may fail to log in to the MRS management console because user Administrator is disabled by default in some Windows systems, such as Windows 7 Ultimate. Internet Explorer automatically selects a system user for installation, and as a result cannot access the management console. You are advised to reinstall Internet Explorer 9.0 or later as the administrator (recommended) or alternatively run Internet Explorer 9.0 as the administrator.

l When you create an MRS cluster, you can select Auto Create from the Security Group drop-down list to create a security group, or select an existing security group. After the MRS cluster is created, do not delete or modify the security group in use. Otherwise, a cluster exception may occur.

l To prevent illegal access, assign access permissions for the security groups used by MRS only where necessary.

l Do not perform the following operations, because they will cause cluster exceptions:
  – Deleting or modifying the default security group that is created when you create an MRS cluster.
  – Powering off, restarting, or deleting cluster nodes displayed in ECS, changing or reinstalling their OS, or modifying their specifications while you are using MRS.
  – Deleting the processes, installed applications, or files that already exist on a cluster node.
  – Deleting MRS cluster nodes. Deleted nodes will still be charged.

l If a cluster exception occurs when no incorrect operations have been performed, contact technical support engineers. The technical support engineers will ask you for your password and then perform troubleshooting.

l Keep the initial password for logging in to the Master node safe, because MRS will not save it. Use a complex password to avoid malicious attacks.

l MRS clusters are still charged during exceptions. Contact technical support engineers to handle cluster exceptions.
l Plan the disks of cluster nodes based on service requirements. If you want to store a large volume of service data, add EVS disks or storage space to prevent insufficient storage space from affecting node running.
l The cluster nodes store only users' service data. Non-service data can be stored in OBS or on other ECS nodes.
l The cluster nodes run only MRS cluster programs. Other client applications or user service programs must be deployed on separate ECS nodes.


2 MRS Quick Start

2.1 Introduction to the Operation Process

MRS is easy to use and provides a user-friendly user interface (UI). By using computers connected in a cluster, you can run various tasks and process or store petabytes of data.

When Kerberos authentication is disabled, a typical procedure for using MRS is as follows:

1. Prepare data.
   Upload the local programs and data files to be computed to Object Storage Service (OBS).

2. Create a cluster.
   Create a cluster before you use MRS. The cluster quantity is subject to the Elastic Cloud Server (ECS) quantity. Configure basic cluster information to complete cluster creation. You can submit a job at the same time as you create a cluster.

NOTE

When you create a cluster, only one new job can be added. If you need to add more jobs, perform Step 4.

3. Import data.
   After an MRS cluster is successfully created, use the cluster's import function to import OBS data to HDFS. An MRS cluster can process both OBS data and HDFS data.

4. Add a job.
   After a cluster is created, you can analyze and process data by adding jobs. MRS provides a platform for executing programs developed by users; you can submit, execute, and monitor such programs using MRS. After a job is added, the job is in the Running state by default.

5. View the execution result.
   Job execution takes a while. After job running is complete, go to the Job Management page and refresh the job list to view the execution results on the Job tab page.
   You cannot re-execute a successful or failed job, but you can add or copy the job. After setting the job parameters, you can submit the job again.

6. Terminate a cluster.


   If you want to terminate a cluster after jobs are complete, click Terminate in Cluster. The cluster status changes from Running to Terminating. After the cluster is terminated, its status changes to Terminated and the cluster is displayed in Cluster History. No further fees will be charged.

2.2 Quick Start

2.2.1 Creating a Cluster

To use MRS, you must purchase cluster resources first. This section describes how to create a cluster using MRS.

Procedure

Step 1 Log in to the MRS management console.

Step 2 Click Purchase Cluster and open the Purchase Cluster page.

NOTE

Note your quota usage when you create a cluster. If the resource quotas are insufficient, apply for new quotas as prompted before creating the cluster.

The following is a cluster configuration example:

l Billing Mode: Use the default value Yearly/Monthly or select On-demand.
l Current Region: Use the default value, for example, CN North-Beijing1.
l AZ: Use the default value.
l Cluster Name: This parameter can be set to the default system name. For ease of identification, it is recommended that the cluster name include, for example, an employee ID, an abbreviation of the user's name, or a date, such as mrs_20160907.
l Cluster Version: Use the latest version, which is the default.
l Kerberos Authentication: The default value is Enable.
l Cluster Type: Use the default value Analysis Cluster or select Streaming Cluster.
l Component: For an analysis cluster, select components such as Spark, HBase, and Hive. For a streaming cluster, select components such as Kafka and Storm.
l VPC: Use the default value. If no virtual private cloud (VPC) exists, click View VPC to open the VPC console and create a VPC.
l Subnet: Use the default value. If no subnet is created in the VPC, click Create Subnet to create a subnet in the corresponding VPC.
l Security Group: Select Auto Create.
l Cluster HA: Cluster HA is enabled by default.
l Instance Specifications: Select General Computing S3 -> 4 vCPU, 8 GB (s3.xlarge.2) for both the Master and Core nodes.
l Quantity: The number of Master nodes is fixed at 2. Set the number of Core nodes to 3.
l Storage Type: Select Common I/O.
l Storage Space (GB): Use the default value.


l Data Disks: Use the default value.
l Login Mode: Select the mode used for logging in to the ECS node.
  – Password: Set the password used for logging in to the ECS node.
  – Key Pair: Select a key pair, for example SSHkey-bba1.pem, from the drop-down list. If you have obtained the private key file, select I acknowledge that I have obtained private key file SSHkey-bba1.pem and that without this file I will not be able to log in to my ECS. If no key pair is created, click View Key Pair and create or import a key pair, and then obtain the private key file.
l Logging: Select Disable. The default value is Enable.
l OBS Bucket: Select "I confirm that OBS bucket s3a://xxxxxx will be created and only used to collect logs that record MRS cluster creation failures".
l Advanced Settings: Select Do not configure.

NOTE

MRS streaming clusters do not support Job Management or File Management. If the cluster type is Streaming Cluster, the Create Job area is not displayed on the cluster creation page.

Step 3 If you select the Yearly/Monthly billing mode, click Buy Now to create the cluster. If you select the On-demand billing mode, click Create Now to create the cluster.

Step 4 Confirm the cluster specifications and select I have read and agree to the MapReduce Service Agreement. If you select the Yearly/Monthly billing mode, click Submit Order. If you select the On-demand billing mode, click Submit Application to submit the cluster creation task.

Step 5 Click Back to Cluster List to view the cluster status.

Cluster creation takes a while. The initial state of the created cluster is Starting. After the cluster is created successfully, the status is updated to Running. Please be patient.

----End

2.2.2 Managing Files

You can create directories, delete directories, and import, export, or delete files on the File Management page in an analysis cluster with Kerberos authentication disabled.

Background

Data to be processed by MRS is stored in either OBS or HDFS. OBS provides you with massive, highly reliable, and secure data storage capabilities at a low cost. You can view, manage, and use data through OBS Console or OBS Browser.

Importing Data

MRS supports data import from the OBS system to HDFS. This function is recommended if the data size is small, because the upload speed decreases as the file size increases.

Both files and folders containing files can be imported. The operations are as follows:

1. Log in to the MRS management console.
2. Choose Cluster > Active Cluster, select a cluster, and click its name to switch to the cluster information page.


3. Click File Management and go to the File Management tab page.
4. Select HDFS File List.
5. Click the data storage directory, for example, bd_app1.
   bd_app1 is just an example; the storage directory can be any directory on the page. You can create a directory by clicking Create Folder.

6. Click Import Data to configure the paths for HDFS and OBS.

NOTE

When configuring the OBS or HDFS path, click Browse, select the file path, and click OK.

  – The path for OBS:
    n Must start with s3a://. s3a:// is used by default.
    n Files and programs encrypted by the KMS cannot be imported.
    n Empty folders cannot be imported.
    n Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
    n Directories and file names cannot start or end with spaces, but can have spaces between other characters.
    n The full OBS path contains a maximum of 1023 characters.
  – The path for HDFS:
    n Starts with /user by default.
    n Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
    n Directories and file names cannot start or end with spaces, but can have spaces between other characters.
    n The full HDFS path contains a maximum of 1023 characters.
    n The parent HDFS directory shown in HDFS File List is displayed in the text box for the HDFS path by default when data is imported.
7. Click OK.

   View the upload progress in File Operation Record. The data import operation is run as a Distcp job by MRS. You can check whether the Distcp job is successfully executed in Job Management > Job.

Exporting Data

After data is processed and analyzed, you can either store the data in HDFS or export it to the OBS system.

Both files and folders containing files can be exported. The operations are as follows:

1. Log in to the MRS management console.
2. Choose Cluster > Active Cluster, select a cluster, and click its name to switch to the cluster information page.
3. Click File Management and go to the File Management tab page.
4. Select HDFS File List.


5. Click the data storage directory, for example, bd_app1.
6. Click Export Data and configure the paths for HDFS and OBS.

NOTE

When configuring the OBS or HDFS path, click Browse, select the file path, and click OK.

  – The path for OBS:
    n Must start with s3a://. s3a:// is used by default.
    n Empty folders cannot be exported.
    n Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
    n Directories and file names cannot start or end with spaces, but can have spaces between other characters.
    n The full OBS path contains a maximum of 1023 characters.
  – The path for HDFS:
    n Starts with /user by default.
    n Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
    n Directories and file names cannot start or end with spaces, but can have spaces between other characters.
    n The full HDFS path contains a maximum of 1023 characters.
    n The parent HDFS directory shown in HDFS File List is displayed in the text box for the HDFS path by default when data is exported.

NOTE

Ensure that the exported folder is not empty. If an empty folder is exported to the OBS system, the folder is exported as a file. After the folder is exported, its name is changed, for example, from test to test-$folder$, and its type is file.

7. Click OK.
   View the upload progress in File Operation Record. The data export operation is run as a Distcp job by MRS. You can check whether the Distcp job is successfully executed in Job Management > Job.

2.2.3 Creating a Job

You can submit developed programs to MRS, execute them, and obtain the execution results on the Job Management page in an analysis cluster with Kerberos authentication disabled.

Prerequisites

Before creating jobs, upload the local data to OBS for computing and analysis. MRS allows data to be exported from OBS to HDFS for computing and analysis. After the analysis and computing are complete, you can either store the data in HDFS or export it to OBS. HDFS and OBS can store compressed data in the bz2 or gz format.

Procedure

Step 1 Log in to the MRS management console.


Step 2 Choose Cluster > Active Cluster, select a cluster, and click its name to switch to the cluster information page.

Step 3 Click Job Management and go to the Job Management tab page.

Step 4 On the Job tab page, click Create and go to the Create Job page.

Table 2-1 describes job configuration information.

Table 2-1 Job configuration information

Parameter Description

Type: Job type.
Possible types include:
l MapReduce
l Spark
l Spark Script
l Hive Script
NOTE
To add jobs of the Spark and Hive types, you need to select the Spark and Hive components when creating the cluster, and the cluster must be in the running state. Spark Script jobs support Spark SQL only, while Spark jobs support both Spark Core and Spark SQL.

Name: Job name.
This parameter consists of 1 to 64 characters, including letters, digits, hyphens (-), or underscores (_). It cannot be null.
NOTE
Identical job names are allowed but not recommended.

Program Path: Address of the JAR file of the program for executing the job.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
This parameter cannot be null and must meet the following requirements:
l A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. The address cannot be empty or full of spaces.
l The path varies depending on the file system:
  – OBS: The path must start with s3a://, for example, s3a://wordcount/program/hadoop-mapreduce-examples-2.7.x.jar.
  – HDFS: The path starts with /user by default.
l Spark Script files must end with .sql, while MapReduce and Spark files must end with .jar. The .sql and .jar extensions are case-insensitive.



Parameters: Key parameters for executing the job.
This parameter is assigned by an internal function; MRS is only responsible for passing the parameter through. Separate multiple parameters with spaces.
Format: package name.class name
A maximum of 2047 characters are allowed, but special characters (;|&>',<$) are not allowed. This parameter can be empty.
NOTE
When you enter parameters containing sensitive information, for example, a login password, you can add an at sign (@) before the parameters to encrypt the parameter values and prevent the persistence of sensitive information in plaintext. When you view job information on the MRS management console, the sensitive information is displayed as asterisks (*).
Example: username=admin @password=admin_123

Import From: Address for inputting data.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
l OBS: The path must start with s3a://.
l HDFS: The path starts with /user by default.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

Export To: Address for outputting data.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
l OBS: The path must start with s3a://.
l HDFS: The path starts with /user by default.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

Log Path: Address for storing job logs that record the job running status.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
l OBS: The path must start with s3a://.
l HDFS: The path starts with /user by default.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.


NOTE

l The OBS path supports s3a://. s3a:// is used by default.

l Files and programs encrypted by the KMS cannot be imported if the OBS path is used.

l The full path of HDFS and OBS contains a maximum of 1023 characters.

Step 5 Confirm job configuration information and click OK.

After jobs are added, you can manage them.

NOTE

By default, each cluster supports a maximum of 10 running jobs.

----End


3 Security

3.1 Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled

The Hadoop community version provides two authentication modes: Kerberos authentication (security mode) and Simple authentication (non-security mode). When creating a cluster, you can choose to enable or disable Kerberos authentication.

Clusters in security mode use the Kerberos protocol for security authentication.

In non-security mode, MRS cluster components use a native open-source authentication mechanism, which is typically Simple authentication. With Simple authentication, authentication is automatically performed by the client user (for example, user root) by default when a client connects to a server; the authentication is imperceptible to the administrator or service user. In addition, when being executed, a client may even pretend to be any user (including a superuser) by injecting UserGroupInformation. Cluster resource management and data control APIs are not authenticated on the server side and are therefore easily exploited and attacked by hackers.

Therefore, in non-security mode, network access permissions must be strictly controlled to ensure cluster security. You are advised to perform the following operations:

l Deploy service applications on ECSs in the same VPC and subnet, and avoid accessing MRS clusters through an external network.

l Configure security group rules to strictly control the access scope. Do not configure access rules that allow Any or 0.0.0.0 for the inbound direction of MRS cluster ports.

l If you want to access the native pages of the components in the cluster from an external network, follow the instructions in Creating an SSH Channel to Connect an MRS Cluster and Configuring the Browser.


4 Cluster Operation Guide

4.1 Overview

You can view the overall cluster status on the Dashboard > Overview page, and obtain relevant MRS documents by clicking the document name under Helpful Links.

MRS helps manage and analyze massive data. MRS is easy to use and allows you to create a cluster in about 20 minutes. You can add MapReduce, Spark, and Hive jobs to clusters to process and analyze user data. Additionally, processed data can be encrypted using Secure Sockets Layer (SSL) and transmitted to OBS, ensuring data security and integrity.

Cluster Status

Table 4-1 describes the possible statuses of each cluster on the MRS management console.

Table 4-1 Cluster status

Status Description

Starting: A cluster is being created.
Running: A cluster has been created successfully and all components in the cluster are running properly.
Expanding: Core or Task nodes are being added to a cluster.
Shrinking: The Shrinking state is displayed while a node is being deleted during any of the following operations: shutting down the node, deleting the node, changing the OS of the node, reinstalling the OS of the node, or modifying the specifications of the node.
Abnormal: Some components in a cluster are abnormal, so the cluster is abnormal.


Terminating: A cluster charged in On-demand mode is being deleted.
NOTE
Clusters charged in Yearly/Monthly mode cannot be terminated.
Frozen: The account balance is insufficient to continue paying for the cluster.
NOTE
A cluster in the Frozen state is unavailable and all ECSs in the cluster are shut down. After being unfrozen, the cluster returns to the Running state. If no renewal fee is paid, the cluster will be deleted after a specified period (called the freeze period) and the cluster status will change to Terminated.
Failed: Cluster creation, termination, or capacity expansion failed.
Terminated: A cluster has been terminated.

Job Status

Table 4-2 describes the statuses of jobs that you can add after logging in to the MRS management console.

Table 4-2 Job status

Status Description

Running: A job is being executed.
Completed: Job execution is complete and successful.
Terminated: A job was stopped during execution.
Abnormal: An error occurred during job execution or job execution failed.

4.2 Cluster List

The cluster list contains all clusters in MRS. You can view clusters in various states. If a large number of clusters are involved, navigate through multiple pages to view all of them.

MRS, as a platform for managing and analyzing massive data, provides PB-level data processing capability. Multiple clusters can be created; the cluster quantity is subject to the ECS quantity. Cluster nodes can be billed in Yearly/Monthly or On-demand mode; in On-demand mode, the charging unit is one hour.

Active Cluster

By default, clusters are listed in chronological order in the cluster list, with the most recent cluster displayed at the top. Table 4-3 describes the parameters of the cluster list.


l Active Cluster: contains all clusters except those in the Failed or Terminated state.
l Failed Task: contains only the tasks in the Failed state. Task failures include:
  – Cluster creation failure
  – Cluster termination failure
  – Cluster capacity expansion failure
  – Cluster capacity reduction failure

Table 4-3 Parameters in the active cluster list

Parameter Description

Name: Cluster name, which is set when a cluster is created.
ID: Unique identifier of a cluster, which is automatically assigned when a cluster is created.
Nodes: Number of nodes that can be deployed in a cluster. This parameter is set when a cluster is created.
NOTE
A small value may cause slow cluster running, while a large value may cause unnecessary cost. Set a proper value based on the data to be processed.
Status: Status and operation progress description of a cluster.
The cluster creation progress includes:
l Verifying cluster parameters
l Applying for cluster resources
l Creating VM
l Initializing VM
l Installing MRS Manager
l Deploying cluster
l Cluster installation failed
The cluster expansion progress includes:
l Preparing for cluster expansion
l Creating VM
l Initializing VM
l Adding node to the cluster
l Cluster expansion failed
The cluster shrink progress includes:
l Preparing for cluster shrink
l Decommissioning instance
l Deleting VM
l Deleting node from the cluster
l Cluster shrink failed
Causes of cluster installation, expansion, and shrink failures are also displayed. For details, see Table 4-12.


Parameter Description

Created Time when MRS starts charging MRS clusters of the customer.

Billing Mode Method by which the price for cluster usage is charged. Currently, thecommercial version of MRS is charged based on ECSs in a cluster. Bydefault, only cluster nodes can be purchased on demand during MRSpurchasing. Cluster node usage is charged based on time, with thedefault unit being per hour.

AZ An availability zone of the working zone in the cluster, which is setwhen a cluster is created.

Operation Terminate: If you want to terminate a cluster after jobs are complete, click Terminate. The cluster status changes from Running to Terminating. After the cluster is terminated, its status changes to Terminated and the cluster is displayed in Cluster History. If the MRS cluster fails to be deployed, it is automatically terminated and no fee is charged.
This parameter is displayed in Active Cluster only.
NOTE
If a cluster is terminated before data processing and analysis are complete, data loss may occur. Exercise caution when terminating a cluster.

Remote Login: Use the password set during cluster creation to log in to the ECS. Currently, only the Master node of the cluster supports remote login.

Table 4-4 Button description

Button Description

In the drop-down list, select a state to filter clusters under Active Cluster:
- All (Num): displays all existing clusters.
- Starting (Num): displays existing clusters in the Starting state.
- Running (Num): displays existing clusters in the Running state.
- Expanding (Num): displays existing clusters in the Expanding state.
- Shrinking (Num): displays existing clusters in the Shrinking state.
- Abnormal (Num): displays existing clusters in the Abnormal state.
- Terminating (Num): displays existing clusters in the Terminating state.
- Frozen (Num): displays existing clusters in the Frozen state.

Click the failed task icon to open the page for managing failed tasks. Num indicates the number of tasks in the Failed state.


Enter a cluster name in the search bar and click the search icon to search for a cluster.

Click the refresh icon to manually refresh the cluster list.

Cluster History

Only clusters in the Failed or Terminated state are displayed on the Cluster History page, and only clusters terminated within the last six months are shown. To view clusters terminated more than six months ago, contact technical support engineers.

Table 4-5 Parameters in the historical cluster list

Parameter Description

Name Cluster name, which is set when a cluster is created.

Nodes Number of nodes deployed in a cluster. This parameter is set when a cluster is created.
NOTE
A small value may cause slow cluster running, while a large value may cause unnecessary cost. Set a proper value based on the data to be processed.

Status Status of a cluster

Created Time when MRS starts charging MRS clusters of the customer.

Terminated Termination time of a cluster, that is, the time when charging for the cluster stops. This parameter is displayed in Cluster History only.

Billing Mode Method by which the price for cluster usage is charged. Currently, the commercial version of MRS is charged based on the ECSs in a cluster. By default, only cluster nodes can be purchased on demand during MRS purchasing. Cluster node usage is charged based on time, with the default unit being one hour.

AZ Availability zone in which the cluster works, which is set when the cluster is created.

Table 4-6 Button description

Button Description

Enter a cluster name in the search bar and click the search icon to search for a cluster.

Click the refresh icon to manually refresh the cluster list.


4.3 Creating a Cluster

To use MRS, you must purchase cluster resources first. This section describes how to create a cluster on MRS.

Background

Currently, the commercial version of MRS is charged based on the ECSs in a cluster. By default, cluster nodes can be purchased in Yearly/Monthly mode or On-demand mode.

- Yearly/Monthly: The duration ranges from one month to one year. The customer must pay in full when purchasing a cluster. The longer the subscription, the larger the discount.

- On-demand: Cluster node usage is charged based on time, with the default unit being one hour.

NOTE

- The fee covers only the clusters themselves. Costs for data storage, bandwidth, and traffic on MRS are excluded.

- Charging stops only when MRS clusters are terminated.

- During fee deduction, if the account balance is insufficient, a message is sent notifying the user to pay the renewal fee, and the corresponding cluster resources are frozen and cannot be used. If no renewal fee is paid, the cluster resources are deleted after the freeze period.

- Yearly/Monthly clusters cannot be restored after being deleted, and you will not receive a refund. Exercise caution when deleting a yearly/monthly cluster.

- You can continue to use a yearly/monthly cluster after it becomes overdue, but its on-demand services will be unavailable; that is, you cannot submit jobs via the OBS system.

Creating an MRS 1.7.2 Cluster

NOTE

If you want to create a cluster of an earlier MRS version, follow the instructions in Creating a Cluster (History Versions).

Step 1 Log in to the MRS management console.

Step 2 Click Purchase Cluster to open the Purchase Cluster page.

NOTE

Note the usage of quotas when you create a cluster. If the resource quotas are insufficient, apply for new quotas based on the prompted information and then create the cluster.

Step 3 Table 4-7, Table 4-8, Table 4-9, Table 4-10, and Table 4-11 describe the basic configuration, node configuration, login, log management, and advanced settings (including job configuration) for a cluster, respectively.


Table 4-7 Basic cluster configuration information

Parameter Description

Billing Mode MRS provides two billing modes:
- On-demand
- Yearly/Monthly

Current Region To change the region, click the current region in the upper left corner and select another one.

AZ An availability zone (AZ) is a physical area that uses independent power and network resources. Applications in different AZs are interconnected through internal networks but are physically isolated, which improves application availability. You are advised to create clusters in different AZs.
MRS allows an AZ to be selected randomly, which prevents excessive VMs from being created in the specified default AZ and avoids uneven resource occupation among AZs. MRS also tries to create all VMs of a tenant in one AZ.
If your VMs must be located in different AZs, specify the AZs when creating the VMs. In a multi-user, multi-AZ scenario, each user tries to obtain a default AZ that is different from other users' default AZs.
Select an AZ of the region for the cluster. Currently, only the CN North-Beijing1, CN East-Shanghai2, and CN South-Guangzhou regions are supported.
AZs are associated with each region as follows:
- CN North-Beijing1: AZ1 and AZ2
- CN East-Shanghai2: AZ1, AZ2, and AZ3
- CN South-Guangzhou: AZ1, AZ2, and AZ3

Cluster Name Cluster name, which must be globally unique.
A cluster name can contain only 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_).
The default name is mrs_xxxx, where xxxx is a random combination of four letters and digits.

Cluster Version Currently, MRS 1.5.1, MRS 1.6.3, MRS 1.7.1, and MRS 1.7.2 are supported. The latest version of MRS is used by default.


Kerberos Authentication

Indicates whether to enable Kerberos authentication when logging in to MRS Manager. Possible values are as follows:

- If Kerberos authentication is disabled, you can use all functions of an MRS cluster. You are advised to disable Kerberos authentication in single-user scenarios. For clusters with Kerberos authentication disabled, you can directly access the MRS cluster management page and components without security authentication. In this case, you can follow the instructions in Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled to perform security configuration.

- If Kerberos authentication is enabled, common users cannot use the file management and job management functions of an MRS cluster and cannot view cluster resource usage or the job records for Hadoop and Spark. To use more cluster functions, the users must contact the MRS Manager administrator to obtain more permissions. You are advised to enable Kerberos authentication in multi-user scenarios.

You can click the toggle to enable or disable Kerberos authentication. After creating an MRS cluster with Kerberos authentication enabled, users can manage the running cluster on MRS Manager. The users must prepare a working environment on the public cloud platform for accessing MRS Manager. For details, see Accessing MRS Manager Supporting Kerberos Authentication.

NOTE

The Kerberos Authentication, Username, Password, and Confirm Password parameters are displayed only after the user obtains the permission to use MRS in security mode.

Username Indicates the username of the MRS Manager administrator. admin is used by default.
This parameter needs to be configured only when Kerberos Authentication is enabled.


Password Indicates the password of the MRS Manager administrator. The password:
- Must contain 8 to 32 characters.
- Must contain at least three of the following character types:
  - Lowercase letters
  - Uppercase letters
  - Digits
  - Special characters: `~!@#$%^&*()-_=+\|[{}];:'",<.>/?
  - Spaces
- Must be different from the username.
- Must be different from the username written in reverse order.
Password strength: the color bar in red, orange, and green indicates a weak, medium, and strong password, respectively.
This parameter needs to be configured only when Kerberos Authentication is enabled. (For an illustrative pre-check of these rules, see the sketch after this table.)

Confirm Password Enter the password again.
This parameter needs to be configured only when Kerberos Authentication is enabled.

Cluster Type MRS 1.3.0 or later provides two types of clusters:
- Analysis cluster: used for offline data analysis and provides Hadoop components.
- Streaming cluster: used for streaming tasks and provides stream processing components.
NOTE
MRS streaming clusters do not support Job Management or File Management. If the cluster type is Streaming Cluster, the Create Job area is not displayed on the cluster creation page.


Component
- MRS 1.7.0 supports the following components:
  Components of an analysis cluster:
  - Hadoop 2.8.3: distributed system architecture
  - Spark 2.2.1: in-memory distributed computing framework
  - HBase 1.3.1: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  - Loader 2.0.0: a tool based on open-source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system
  - Flume 1.6.0: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
- MRS 1.5.0 or MRS 1.5.1 supports the following components:
  Components of an analysis cluster:
  - Hadoop 2.7.2: distributed system architecture
  - Spark 2.1.0: in-memory distributed computing framework
  - HBase 1.0.2: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  - Loader 2.0.0: a tool based on open-source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system
  - Flume 1.6.0: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
- MRS 1.3.0 supports the following components:
  Components of an analysis cluster:
  - Hadoop 2.7.2: distributed system architecture
  - Spark 1.5.1: in-memory distributed computing framework
  - HBase 1.0.2: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  NOTE
  When Kerberos Authentication is enabled, the Hue component can be selected, but the Create Job area is not displayed, indicating that jobs cannot be created.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system

VPC A VPC is a secure, isolated, logical network environment.
Select the VPC for which you want to create a cluster and click View VPC to view its name and ID. If no VPC is available, create one.

Subnet A subnet provides dedicated network resources that are isolated from other networks, improving network security.
Select the subnet in which you want to create the cluster to enter the VPC and view the name and ID of the subnet. If no subnet has been created in the VPC, click Create Subnet to create one.
WARNING
Do not associate the subnet with a network ACL.

Security Group A security group is a set of ECS access rules. It provides access policies for ECSs that have the same security protection requirements and are mutually trusted in a VPC.
When you create an MRS cluster, you can select Auto Create from the Security Group drop-down list to create a security group, or select an existing security group.


Cluster HA Cluster HA specifies whether to enable high availability for a cluster. This parameter is enabled by default.
If you enable this option, the management processes of all components are deployed on both Master nodes to achieve hot standby and prevent single points of failure, improving reliability. If you disable this option, the management processes are deployed on only one Master node; if a process of a component becomes abnormal, the component fails to provide services.
- Disabled: There is only one Master node. The number of Core nodes is three by default, but you can decrease it to 1.
- Enabled: There are two Master nodes. The number of Core nodes is three by default, but you can decrease it to 1.
You can click the toggle to enable or disable high availability.
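The password rules above are enforced by the console; for pre-checking them in your own tooling, the following Python sketch mirrors the rules stated in Table 4-7. It is illustrative only and is not part of any MRS interface.

```python
SPECIALS = "`~!@#$%^&*()-_=+\\|[{}];:'\",<.>/?"

def validate_manager_password(password: str, username: str = "admin") -> bool:
    """Check an MRS Manager password against the rules listed above."""
    if not 8 <= len(password) <= 32:
        return False
    # At least three of: lowercase, uppercase, digits, special characters, spaces.
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(c in SPECIALS for c in password),
        " " in password,
    ]
    if sum(classes) < 3:
        return False
    # Must differ from the username and the reversed username.
    return password not in (username, username[::-1])

print(validate_manager_password("Example pass 1"))  # True
print(validate_manager_password("nimda"))           # False: too short
```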

Table 4-8 Cluster node information

Parameter Description

Type MRS provides three types of nodes:
- Master: A Master node in an MRS cluster manages the cluster, assigns cluster executable files to Core nodes, traces the execution status of each job, and monitors the DataNode running status.
- Core: A Core node in a cluster processes data and stores processed data in HDFS.
- Task: A Task node in a cluster is used for computing and does not store persistent data. Yarn and Storm are mainly installed on Task nodes. Task nodes are optional, and the number of Task nodes can be zero. (Task nodes are supported by MRS 1.6.0 or later.)
  When the cluster size does not change much but the cluster's service processing capability needs to be remarkably and temporarily improved, add Task nodes to address the following situations:
  - The volume of temporary services increases, for example, report processing at the end of the year.
  - Long-term tasks must be completed in a short time, for example, urgent analysis tasks.


(Optional) Add Task Node
Click Add Task Node to configure information about Task nodes. Currently, when you create a cluster in the On-demand billing mode, you can click the toggle next to Disabled in the Task row; on the Auto Scaling page that is displayed, enable auto scaling. For details, see Performing Auto Scaling for a Cluster.

Instance Specifications
Instance specifications of a node. MRS supports host specifications determined by CPU, memory, and disk space.
- In the CN North-Beijing1, CN East-Shanghai2, and CN South-Guangzhou regions, MRS 1.7.2 supports the instance specifications detailed in ECS Specifications Used by MRS.
NOTE
- More advanced instance specifications provide better data processing but at a higher cluster cost.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, or d2.8xlarge.8, Data Disk is not displayed, because data disks are configured by default for these specifications. Other specifications do not have data disks; add data disks manually if they are required.
- If you select HDDs for Core nodes, there is no charging information for data disks. The fees are charged with the ECSs.
- If you select HDDs for Core nodes, the system disks (40 GB) of Master and Core nodes, as well as the data disks (200 GB) of Master nodes, are SATA disks.
- If you select non-HDD disks for Core nodes, the disk types of Master and Core nodes are determined by Data Disk.
- If Sold Out appears next to an instance specification, nodes of this specification cannot be purchased; purchase nodes of other specifications instead.

Quantity Number of Master, Core, and Task nodes.
For Master nodes:
- If Cluster HA is enabled, the number of Master nodes is fixed at 2.
- If Cluster HA is disabled, the number of Master nodes is fixed at 1.
The minimum number of Core nodes is 1, and the total number of Core and Task nodes cannot exceed 500. (For an illustrative pre-check of these rules, see the sketch after this table.)
NOTE
- If more than 500 Core and Task nodes are required, contact technical support engineers or invoke a background interface to modify the database.
- A small number of nodes may cause clusters to run slowly, while a large number of nodes may be unnecessarily costly. Set an appropriate value based on the data to be processed.

Storage Type Disk storage type. The following disk types are supported:
- SATA: Common I/O
- SAS: High I/O
- SSD: Ultra-high I/O


Storage Space (GB)
Disk space of MRS. You can add disks to increase storage capacity when creating a cluster. There are two configurations for storage and computing:
- Data storage and computing are separated. Data is stored in OBS, which features low cost and unlimited storage capacity, and the clusters can be terminated at any time. Computing performance is determined by OBS access performance and is lower than that of HDFS. This configuration is recommended if data computing is infrequent.
- Data storage and computing are not separated. Data is stored in HDFS, which features high cost, high computing performance, and limited storage capacity. Before terminating a cluster, you must export and store the data. This configuration is recommended if data computing is frequent.
Disk sizes range from 100 GB to 32000 GB, in increments of 10 GB, for example, 100 GB, 110 GB.
NOTE
- The Master node provides extra data disk storage space for MRS Manager. The disk type must be the same as the data disk type of Core nodes. The default disk space is 200 GB and cannot be changed.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, d2.8xlarge.8, or d1.8xlarge, Data Disk is not displayed. This applies to MRS 1.6.0 or earlier.

Data Disks Number of data disks on Master, Core, and Task nodes:
- Master: currently fixed at 1
- Core: 1 to 10
- Task: 0 to 10

Operation Task nodes can be configured only after you click Add Task Node. If you do not need a Task node, click Delete in the Task node row.
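The node-quantity and disk-size rules in this table lend themselves to a simple pre-check. The following Python sketch is illustrative only; it restates the constraints above and is not an MRS interface.

```python
def validate_topology(ha_enabled: bool, masters: int, cores: int,
                      tasks: int, data_disk_gb: int) -> list:
    """Return a list of violations of the node and disk rules above."""
    errors = []
    expected_masters = 2 if ha_enabled else 1
    if masters != expected_masters:
        errors.append("Master nodes must be %d when Cluster HA is %s"
                      % (expected_masters,
                         "enabled" if ha_enabled else "disabled"))
    if cores < 1:
        errors.append("at least 1 Core node is required")
    if cores + tasks > 500:
        errors.append("Core + Task nodes cannot exceed 500")
    # Disk sizes range from 100 GB to 32000 GB in 10 GB increments.
    if not (100 <= data_disk_gb <= 32000 and data_disk_gb % 10 == 0):
        errors.append("disk size must be 100-32000 GB in 10 GB steps")
    return errors

# A valid HA layout: 2 Masters, 3 Cores, no Task nodes, 100 GB disks.
print(validate_topology(True, 2, 3, 0, 100))  # [] means no violations
```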


Table 4-9 Login information

Parameter Description

Login Mode
- Password: You can log in to ECS nodes using a password. A password must meet the following requirements:
  1. Must be 8 to 26 characters long.
  2. Must contain at least 3 of the following character types: uppercase letters, lowercase letters, digits, and special characters (!@$%^-_=+[{}]:\,./?), but must not contain spaces.
  3. Cannot be the username or the username spelled backwards.
- Key Pair: Keys are used to log in to Master1 of the cluster. A key pair, also called an SSH key, consists of a public key and a private key. You can create an SSH key and download the private key for authenticating remote login. For security, a private key can be downloaded only once; keep it secure.
  Select the key pair, for example SSHkey-bba1.pem, from the drop-down list. If you have obtained the private key file, select I acknowledge that I have obtained private key file SSHkey-bba1.pem and that without this file I will not be able to log in to my ECS. If no key pair has been created, click View Key Pair to create or import one, then obtain the private key file.
  Configure an SSH key using either of the following methods:
  1. Create an SSH key. After you create an SSH key, a public key and a private key are generated. The public key is stored in the system, and the private key is stored locally. When you log in to an ECS, the public and private keys are used for authentication.
  2. Import an SSH key. If you already have the public and private keys, import the public key into the system. When you log in to an ECS, the public and private keys are used for authentication.
(For an illustrative key-based login, see the sketch after this table.)
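As an illustration of key-pair login to the Master1 node, the following Python sketch uses the third-party paramiko library. The IP address, login user (root is assumed here), and key file name are placeholders, not values supplied by this guide.

```python
import paramiko

# Placeholder values; substitute the Master1 address of your cluster,
# the actual login user, and the private key downloaded at creation time.
MASTER1_IP = "192.0.2.10"
KEY_FILE = "SSHkey-bba1.pem"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname=MASTER1_IP, username="root", key_filename=KEY_FILE)

# Run a trivial command to confirm the session works.
_, stdout, _ = client.exec_command("hostname")
print(stdout.read().decode().strip())
client.close()
```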

Table 4-10 Log management information

Parameter Description

Logging Indicates whether the log collection function is enabled.
- Enabled
- Disabled
You can click the toggle to enable or disable the log collection function.


OBS Bucket Indicates the log save path, for example, s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina.
Select I confirm that OBS bucket s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina will be created and used to collect MRS system logs only, and I will be charged for this service.
If an MRS cluster that supports logging fails to be created, you can use OBS to download the related logs for troubleshooting. Procedure:
1. Log in to the OBS management console.
2. Select the mrs-log-<tenant_id>-<region_id> bucket from the bucket list and go to the /<cluster_id>/install_log folder to download the YYYYMMDDHHMMSS.tar.gz log, for example, /mrs-log-a3859af76b874760969cd24f2640bbb4-northchina/65d0a20f-bcb7-4da3-81d3-71fef12d993d/20170818091516.tar.gz.
(For an illustrative path-building helper, see the sketch after this table.)
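A minimal sketch of the log-path layout described above, using the tenant ID, cluster ID, and timestamp from the example; the helper name is illustrative only.

```python
def install_log_path(tenant_id, region_id, cluster_id, timestamp):
    """Assemble the OBS path of an MRS install-log archive (illustrative)."""
    bucket = "mrs-log-%s-%s" % (tenant_id, region_id)
    return "/%s/%s/install_log/%s.tar.gz" % (bucket, cluster_id, timestamp)

# Values taken from the example in this table.
print(install_log_path("a3859af76b874760969cd24f2640bbb4", "northchina",
                       "65d0a20f-bcb7-4da3-81d3-71fef12d993d",
                       "20170818091516"))
```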

Table 4-11 Advanced settings

Parameter Description

Set Now After you click Set Now, the page for adding a job, a tag, or a bootstrap action is displayed.
- For details about how to add a job, see Managing Jobs.
- For details about how to add a tag, see Managing Cluster Tags.
- For details about how to add a bootstrap action, see Bootstrap Actions.

Configure now You can set parameters later.

Create Now You can click Create Now to access the job creation page. Then click Create to add job configuration information.

Create Later You can add job configuration information later.

Create Job You can click Create to submit a job at the same time as you create a cluster. Only one job can be added, and its status is Running after the cluster is created. For details, see Adding a Jar or Script Job.
You can add jobs only when Kerberos Authentication is disabled.

Name Name of a job

Type Type of a job

Parameter Key parameters for executing an application


Operation
- Edit: modifies job configurations.
- Delete: deletes a job.

Step 4 When creating a yearly/monthly cluster, click Buy Now. When creating an on-demand cluster, click Create Now.

Step 5 Confirm the cluster specifications. If you selected the Yearly/Monthly billing mode, click Submit Order. If you selected the On-demand billing mode, click Submit Application to submit the cluster creation task.

Step 6 Click Back to Cluster List to view the cluster status.

For details about the cluster status during cluster creation, see the description of the Status parameter in Table 4-3.

Cluster creation takes some time. The initial status of the cluster is Starting. After the cluster is created successfully, the cluster status becomes Running.

Users can create a maximum of 10 clusters at a time and manage a maximum of 100 clusters on the MRS management console.

NOTE

The name of a new cluster can be the same as that of a failed or terminated cluster.

----End
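Cluster creation can also be scripted. The sketch below is a hypothetical outline in Python: the endpoint URL, header, and body field names are assumptions modeled on the parameters in Tables 4-7 to 4-11, not a verified MRS API; consult the MRS API Reference for the real interface.

```python
import time
import requests

# All values below are placeholders for illustration.
ENDPOINT = "https://mrs.example.com/v1.1/<project_id>/clusters"
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}

body = {
    "cluster_name": "mrs_demo01",    # 1-64 chars: letters, digits, -, _
    "cluster_version": "MRS 1.7.2",
    "billing_mode": "on-demand",     # assumed field name
    "master_node_num": 2,            # 2 when Cluster HA is enabled
    "core_node_num": 3,
    "components": ["Hadoop", "Spark", "HBase", "Hive"],
}

resp = requests.post(ENDPOINT, json=body, headers=HEADERS)
cluster_id = resp.json().get("cluster_id")

# Poll until the cluster leaves the Starting state (see Table 4-3).
while True:
    state = requests.get("%s/%s" % (ENDPOINT, cluster_id),
                         headers=HEADERS).json().get("cluster_state")
    if state != "Starting":
        print("cluster state:", state)  # Running on success
        break
    time.sleep(30)
```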

Failed to Create a Cluster

If the cluster fails to be created, the failed task is automatically moved to the Manage Failed Task page. Click the icon shown in Figure 4-1 to access the Manage Failed Task page, and move the cursor over the icon in the Task Status column shown in Figure 4-2 to view the failure causes. For details about how to delete a failed task, see Deleting a Failed Task.

Figure 4-1 Managing failed tasks


Figure 4-2 Causes

Table 4-12 describes the error codes related to cluster creation failures; an illustrative handling sketch follows the table.

Table 4-12 Error codes

Error Code Message

MRS.101 Insufficient quota to meet your request. Contact customer service to increase the quota.

MRS.102 The token cannot be null or invalid. Try again later or contact customer service.

MRS.103 Invalid request. Try again later or contact customer service.

MRS.104 Insufficient resources. Try again later or contact customer service.

MRS.105 Insufficient IP addresses in the existing subnet. Try again later or contact customer service.

MRS.201 Failed due to an ECS error. Try again later or contact customer service.

MRS.202 Failed due to an IAM error. Try again later or contact customer service.

MRS.203 Failed due to a VPC error. Try again later or contact customer service.

MRS.300 MRS system error. Try again later or contact customer service.
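Automation that reacts to these codes can branch on the code families; the grouping below is inferred from the messages in Table 4-12, and the helper is an illustrative sketch, not an MRS interface.

```python
# MRS.1xx: request/resource issues; MRS.2xx: dependent-service failures
# (ECS/IAM/VPC); MRS.3xx: MRS system errors. Grouping inferred from Table 4-12.
RETRYABLE = {"MRS.104", "MRS.105", "MRS.201", "MRS.202", "MRS.203", "MRS.300"}

def next_step(code):
    """Suggest an action for a cluster-creation error code (illustrative)."""
    if code == "MRS.101":
        return "contact customer service to increase the quota"
    if code in RETRYABLE:
        return "retry later; contact customer service if the error persists"
    return "check the request (token and parameters) and try again"

print(next_step("MRS.104"))  # retry later; ...
```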

4.4 Creating the Smallest Cluster

MRS 1.6.2 and later versions allow you to create the smallest cluster, with only one Master node and one Core node. This helps reduce costs in lightweight scenarios, for example, development and commissioning of enterprise big data services.

Perform the following steps to create the smallest cluster.

Step 1 Log in to the MRS management console.

Step 2 Click Create Cluster in the upper-right corner. The Create Cluster page is displayed.


Figure 4-3 shows cluster node configurations.

Figure 4-3 Cluster node configurations

The detailed cluster configurations are as follows:

- Billing Mode: Select On-demand.
- Region: Use the default value, for example, CN North-Beijing1.
- AZ: Select AZ1 or AZ2.
- Cluster Name: You can use the default name. However, you are advised to include a project name abbreviation or a date for easy memorization and distinction, for example, mrs_20180321.
- Cluster Version: Use the default value, MRS 1.7.2.
- Kerberos Authentication: Enabled by default.
- Cluster Type: Use the default value Analysis cluster, or select Streaming cluster.
- Component: Select Spark, HBase, Hive, and other components for an analysis cluster. Select Kafka, Storm, and other components for a streaming cluster.
- VPC: Use the default value. If there is no available VPC, click View VPC to access the Virtual Private Cloud console and create one.
- Subnet: Use the default value.
- Security Group: Select Auto Create.
- Cluster HA: Click the toggle to disable cluster HA.
- Instance Specifications: Select 4 vCPUs 16 GB (c3.xlarge.4) under General Computing-plus c3 for both Master and Core nodes.
- Quantity: The number of Master nodes is fixed at 1. Set the number of Core nodes to 1.
- Storage Type: Select Common I/O for both Master and Core nodes.
- Storage Space (GB): Set Storage Space to 100 GB for both Master and Core nodes.
- Data Disk: Use the default value: one data disk each for the Master node and the Core node.
- Add Task Node: Do not add a Task node.
- Login Mode: Select a mode to log in to an ECS node.
  - Password: Set a password for logging in to an ECS node.
  - Key Pair: Select a key pair from the drop-down list and select "I acknowledge that I have obtained private key file SSHkey-bba1.pem and that without this file I will not be able to log in to my ECS." If you have never created a key pair, click View Key Pair to create or import one, and then obtain a private key file.
- Logging: Enabled by default. Use the default value.


- OBS Bucket: Select "I confirm that OBS bucket s3a://xxxxxx will be created and only used to collect logs that record MRS cluster creation failures."

- Advanced Settings: Select Configure now.

NOTE

MRS streaming clusters do not support the job management and file management functions. When you create a streaming cluster, the Add Job area is not displayed on the page.

Step 3 After parameter configuration is complete, click Create now in the lower right corner.

Step 4 After confirming cluster details, click Submit Application to submit a cluster creation task.

Step 5 Click Back to Cluster List to view the cluster status.

Cluster creation takes some time. The initial status of the cluster is Starting. After the cluster is created successfully, the cluster status becomes Running.

----End
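To keep the choices above in one place, here is an illustrative Python summary of the smallest-cluster configuration; the keys are descriptive labels, not MRS API field names.

```python
# Descriptive summary of the smallest-cluster settings in this section.
SMALLEST_CLUSTER = {
    "billing_mode": "On-demand",
    "cluster_version": "MRS 1.7.2",
    "cluster_ha": False,             # single Master node
    "master_nodes": 1,
    "core_nodes": 1,
    "task_nodes": 0,
    "instance_spec": "c3.xlarge.4",  # 4 vCPUs, 16 GB
    "storage_type": "Common I/O",
    "storage_space_gb": 100,
}

total = SMALLEST_CLUSTER["master_nodes"] + SMALLEST_CLUSTER["core_nodes"]
print("smallest cluster uses %d nodes" % total)  # 2
```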

4.5 Creating a Cluster (History Versions)

When clusters are created, their parameters and configurations vary depending on the cluster version.

- If the cluster version is MRS 1.7.1, follow the instructions in Creating an MRS 1.7.1 Cluster.

- If the cluster version is MRS 1.6.3, follow the instructions in Creating an MRS 1.6.3 Cluster.

- If the cluster version is MRS 1.5.1, follow the instructions in Creating an MRS 1.5.1 Cluster.

NOTE

More nodes in a cluster require larger disk capacity on the Master nodes. To ensure stable cluster running, set the disk capacity of the Master node to over 600 GB if the number of nodes reaches 300, and increase it to over 1 TB if the number of nodes reaches 500. (A sizing sketch follows this note.)
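A minimal sketch of this sizing rule, assuming only the two breakpoints stated in the note; the 200 GB fallback is the default Master data disk size from Table 4-8.

```python
def recommended_master_disk_gb(total_nodes):
    """Suggest Master-node disk capacity per the note above (illustrative)."""
    if total_nodes >= 500:
        return 1024  # over 1 TB
    if total_nodes >= 300:
        return 600   # over 600 GB
    return 200       # default Master data disk size (Table 4-8)

print(recommended_master_disk_gb(350))  # 600
```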

Creating an MRS 1.7.1 Cluster

Step 1 Log in to the MRS management console.

Step 2 Click Purchase Cluster to open the Purchase Cluster page.

NOTE

Note the usage of quotas when you create a cluster. If the resource quotas are insufficient, apply for new quotas based on the prompted information and then create the cluster.

Step 3 Table 4-13, Table 4-14, Table 4-15, Table 4-16, and Table 4-17 describe the basic configuration, node configuration, login, log management, and advanced settings (including job configuration) for a cluster, respectively.


Table 4-13 Basic cluster configuration information

Parameter Description

Billing Mode MRS provides two billing modes:
- On-demand
- Yearly/Monthly

Current Region To change the region, click the current region in the upper left corner and select another one.

AZ An availability zone (AZ) is a physical area that uses independent power and network resources. Applications in different AZs are interconnected through internal networks but are physically isolated, which improves application availability. You are advised to create clusters in different AZs.
MRS allows an AZ to be selected randomly, which prevents excessive VMs from being created in the specified default AZ and avoids uneven resource occupation among AZs. MRS also tries to create all VMs of a tenant in one AZ.
If your VMs must be located in different AZs, specify the AZs when creating the VMs. In a multi-user, multi-AZ scenario, each user tries to obtain a default AZ that is different from other users' default AZs.
Select an AZ of the region for the cluster. Currently, only the CN North-Beijing1, CN East-Shanghai2, and CN South-Guangzhou regions are supported.
AZs are associated with each region as follows:
- CN North-Beijing1: AZ1 and AZ2
- CN East-Shanghai2: AZ1, AZ2, and AZ3
- CN South-Guangzhou: AZ1, AZ2, and AZ3

Cluster Name Cluster name, which must be globally unique.
A cluster name can contain only 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_).
The default name is mrs_xxxx, where xxxx is a random combination of four letters and digits.

Cluster Version Currently, MRS 1.5.1, MRS 1.6.3, MRS 1.7.1, and MRS 1.7.2 are supported. The latest version of MRS is used by default.


Kerberos Authentication

Indicates whether to enable Kerberos authentication when logging in to MRS Manager. Possible values are as follows:

- If Kerberos authentication is disabled, you can use all functions of an MRS cluster. You are advised to disable Kerberos authentication in single-user scenarios. For clusters with Kerberos authentication disabled, you can directly access the MRS cluster management page and components without security authentication. In this case, you can follow the instructions in Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled to perform security configuration.

- If Kerberos authentication is enabled, common users cannot use the file management and job management functions of an MRS cluster and cannot view cluster resource usage or the job records for Hadoop and Spark. To use more cluster functions, the users must contact the MRS Manager administrator to obtain more permissions. You are advised to enable Kerberos authentication in multi-user scenarios.

You can click the toggle to enable or disable Kerberos authentication. After creating an MRS cluster with Kerberos authentication enabled, users can manage the running cluster on MRS Manager. The users must prepare a working environment on the public cloud platform for accessing MRS Manager. For details, see Accessing MRS Manager Supporting Kerberos Authentication.

NOTE

The Kerberos Authentication, Username, Password, and Confirm Password parameters are displayed only after the user obtains the permission to use MRS in security mode.

Username Indicates the username of the MRS Manager administrator. admin is used by default.
This parameter needs to be configured only when Kerberos Authentication is enabled.


Password Indicates the password of the MRS Manager administrator. The password:
- Must contain 8 to 32 characters.
- Must contain at least three of the following character types:
  - Lowercase letters
  - Uppercase letters
  - Digits
  - Special characters: `~!@#$%^&*()-_=+\|[{}];:'",<.>/?
  - Spaces
- Must be different from the username.
- Must be different from the username written in reverse order.
Password strength: the color bar in red, orange, and green indicates a weak, medium, and strong password, respectively.
This parameter needs to be configured only when Kerberos Authentication is enabled.

Confirm Password Enter the password again.
This parameter needs to be configured only when Kerberos Authentication is enabled.

Cluster Type MRS 1.3.0 or later provides two types of clusters:
- Analysis cluster: used for offline data analysis and provides Hadoop components.
- Streaming cluster: used for streaming tasks and provides stream processing components.
NOTE
MRS streaming clusters do not support Job Management or File Management. If the cluster type is Streaming Cluster, the Create Job area is not displayed on the cluster creation page.


Component
- MRS 1.7.0 supports the following components:
  Components of an analysis cluster:
  - Hadoop 2.8.3: distributed system architecture
  - Spark 2.2.1: in-memory distributed computing framework
  - HBase 1.3.1: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  - Loader 2.0.0: a tool based on open-source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system
  - Flume 1.6.0: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
- MRS 1.5.0 or MRS 1.5.1 supports the following components:
  Components of an analysis cluster:
  - Hadoop 2.7.2: distributed system architecture
  - Spark 2.1.0: in-memory distributed computing framework
  - HBase 1.0.2: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  - Loader 2.0.0: a tool based on open-source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system
  - Flume 1.6.0: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
- MRS 1.3.0 supports the following components:
  Components of an analysis cluster:
  - Hadoop 2.7.2: distributed system architecture
  - Spark 1.5.1: in-memory distributed computing framework
  - HBase 1.0.2: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  NOTE
  When Kerberos Authentication is enabled, the Hue component can be selected, but the Create Job area is not displayed, indicating that jobs cannot be created.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system

VPC A VPC is a secure, isolated, logical network environment.
Select the VPC for which you want to create a cluster and click View VPC to view its name and ID. If no VPC is available, create one.

Subnet A subnet provides dedicated network resources that are isolated from other networks, improving network security.
Select the subnet in which you want to create the cluster to enter the VPC and view the name and ID of the subnet. If no subnet has been created in the VPC, click Create Subnet to create one.
WARNING
Do not associate the subnet with a network ACL.

Security Group A security group is a set of ECS access rules. It provides access policies for ECSs that have the same security protection requirements and are mutually trusted in a VPC.
When you create an MRS cluster, you can select Auto Create from the Security Group drop-down list to create a security group, or select an existing security group.


Cluster HA Cluster HA specifies whether to enable high availability for a cluster. This parameter is enabled by default.
If you enable this option, the management processes of all components are deployed on both Master nodes to achieve hot standby and prevent single points of failure, improving reliability. If you disable this option, the management processes are deployed on only one Master node; if a process of a component becomes abnormal, the component fails to provide services.
- Disabled: There is only one Master node. The number of Core nodes is three by default, but you can decrease it to 1.
- Enabled: There are two Master nodes. The number of Core nodes is three by default, but you can decrease it to 1.
You can click the toggle to enable or disable high availability.

Table 4-14 Cluster node information

Parameter Description

Type MRS provides three types of nodes:
- Master: A Master node in an MRS cluster manages the cluster, assigns cluster executable files to Core nodes, traces the execution status of each job, and monitors the DataNode running status.
- Core: A Core node in a cluster processes data and stores processed data in HDFS.
- Task: A Task node in a cluster is used for computing and does not store persistent data. Yarn and Storm are mainly installed on Task nodes. Task nodes are optional, and the number of Task nodes can be zero. (Task nodes are supported by MRS 1.6.0 or later.)
  When the cluster size does not change much but the cluster's service processing capability needs to be remarkably and temporarily improved, add Task nodes to address the following situations:
  - The volume of temporary services increases, for example, report processing at the end of the year.
  - Long-term tasks must be completed in a short time, for example, urgent analysis tasks.


(Optional) Add Task Node
Click Add Task Node to configure information about Task nodes. Currently, when you create a cluster in the On-demand billing mode, you can click the toggle next to Disabled in the Task row; on the Auto Scaling page that is displayed, enable auto scaling. For details, see Performing Auto Scaling for a Cluster.

Instance Specifications
Instance specifications of a node. MRS supports host specifications determined by CPU, memory, and disk space.
- In the CN North-Beijing1, CN East-Shanghai2, and CN South-Guangzhou regions, MRS 1.7.1 supports the instance specifications detailed in ECS Specifications Used by MRS.
NOTE
- More advanced instance specifications provide better data processing but at a higher cluster cost.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, or d2.8xlarge.8, Data Disk is not displayed, because data disks are configured by default for these specifications. Other specifications do not have data disks; add data disks manually if they are required.
- If you select HDDs for Core nodes, there is no charging information for data disks. The fees are charged with the ECSs.
- If you select HDDs for Core nodes, the system disks (40 GB) of Master and Core nodes, as well as the data disks (200 GB) of Master nodes, are SATA disks.
- If you select non-HDD disks for Core nodes, the disk types of Master and Core nodes are determined by Data Disk.
- If Sold Out appears next to an instance specification, nodes of this specification cannot be purchased; purchase nodes of other specifications instead.

Quantity Number of Master, Core, and Task nodes.
For Master nodes:
- If Cluster HA is enabled, the number of Master nodes is fixed at 2.
- If Cluster HA is disabled, the number of Master nodes is fixed at 1.
The minimum number of Core nodes is 1, and the total number of Core and Task nodes cannot exceed 500.
NOTE
- If more than 500 Core and Task nodes are required, contact technical support engineers or invoke a background interface to modify the database.
- A small number of nodes may cause clusters to run slowly, while a large number of nodes may be unnecessarily costly. Set an appropriate value based on the data to be processed.

Storage Type Disk storage type. The following disk types are supported:
- SATA: Common I/O
- SAS: High I/O
- SSD: Ultra-high I/O


Storage Space (GB)
Disk space of MRS. You can add disks to increase storage capacity when creating a cluster. There are two configurations for storage and computing:
- Data storage and computing are separated. Data is stored in OBS, which features low cost and unlimited storage capacity, and the clusters can be terminated at any time. Computing performance is determined by OBS access performance and is lower than that of HDFS. This configuration is recommended if data computing is infrequent.
- Data storage and computing are not separated. Data is stored in HDFS, which features high cost, high computing performance, and limited storage capacity. Before terminating a cluster, you must export and store the data. This configuration is recommended if data computing is frequent.
Disk sizes range from 100 GB to 32000 GB, in increments of 10 GB, for example, 100 GB, 110 GB.
NOTE
- The Master node provides extra data disk storage space for MRS Manager. The disk type must be the same as the data disk type of Core nodes. The default disk space is 200 GB and cannot be changed.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, d2.8xlarge.8, or d1.8xlarge, Data Disk is not displayed. This applies to MRS 1.6.0 or earlier.

Data Disks Number of data disks on Master, Core, and Task nodes:
- Master: currently fixed at 1
- Core: 1 to 10
- Task: 0 to 10

Operation Task nodes can be configured only after you click Add Task Node. If you do not need a Task node, click Delete in the Task node row.


Table 4-15 Login information

Parameter Description

Login Mode
- Password: You can log in to ECS nodes using a password. A password must meet the following requirements:
  1. Must be 8 to 26 characters long.
  2. Must contain at least 3 of the following character types: uppercase letters, lowercase letters, digits, and special characters (!@$%^-_=+[{}]:\,./?), but must not contain spaces.
  3. Cannot be the username or the username spelled backwards.
- Key Pair: Keys are used to log in to Master1 of the cluster. A key pair, also called an SSH key, consists of a public key and a private key. You can create an SSH key and download the private key for authenticating remote login. For security, a private key can be downloaded only once; keep it secure.
  Select the key pair, for example SSHkey-bba1.pem, from the drop-down list. If you have obtained the private key file, select I acknowledge that I have obtained private key file SSHkey-bba1.pem and that without this file I will not be able to log in to my ECS. If no key pair has been created, click View Key Pair to create or import one, then obtain the private key file.
  Configure an SSH key using either of the following methods:
  1. Create an SSH key. After you create an SSH key, a public key and a private key are generated. The public key is stored in the system, and the private key is stored locally. When you log in to an ECS, the public and private keys are used for authentication.
  2. Import an SSH key. If you already have the public and private keys, import the public key into the system. When you log in to an ECS, the public and private keys are used for authentication.

Table 4-16 Log management information

Parameter Description

Logging Indicates whether the log collection function is enabled.
- Enabled
- Disabled
You can click the toggle to enable or disable the log collection function.


OBS Bucket Indicates the log save path, for example, s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina.
Select I confirm that OBS bucket s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina will be created and used to collect MRS system logs only, and I will be charged for this service.
If an MRS cluster that supports logging fails to be created, you can use OBS to download the related logs for troubleshooting. Procedure:
1. Log in to the OBS management console.
2. Select the mrs-log-<tenant_id>-<region_id> bucket from the bucket list and go to the /<cluster_id>/install_log folder to download the YYYYMMDDHHMMSS.tar.gz log, for example, /mrs-log-a3859af76b874760969cd24f2640bbb4-northchina/65d0a20f-bcb7-4da3-81d3-71fef12d993d/20170818091516.tar.gz.

Table 4-17 Advanced settings

Parameter Description

Set Now After you click Set Now, the page for adding a job, a tag, or a bootstrap action is displayed.
- For details about how to add a job, see Managing Jobs.
- For details about how to add a tag, see Managing Cluster Tags.
- For details about how to add a bootstrap action, see Bootstrap Actions.

Configure now You can set parameters later.

Create Now You can click Create Now to access the job creation page. Then click Create to add job configuration information.

Create Later You can add job configuration information later.

Create Job You can click Create to submit a job at the same time as you create a cluster. Only one job can be added, and its status is Running after the cluster is created. For details, see Adding a Jar or Script Job.
You can add jobs only when Kerberos Authentication is disabled.

Name Name of a job

Type Type of a job

Parameter Key parameters for executing an application


Operation
- Edit: modifies job configurations.
- Delete: deletes a job.

Step 4 When creating a yearly/monthly cluster, click Buy Now. When creating an on-demand cluster, click Create Now.

Step 5 Confirm the cluster specifications. If you selected the Yearly/Monthly billing mode, click Submit Order. If you selected the On-demand billing mode, click Submit Application to submit the cluster creation task.

Step 6 Click Back to Cluster List to view the cluster status.

For details about the cluster status during cluster creation, see the description of the Status parameter in Table 4-3.

Cluster creation takes some time. The initial status of the cluster is Starting. After the cluster is created successfully, the cluster status becomes Running.

Users can create a maximum of 10 clusters at a time and manage a maximum of 100 clusters on the MRS management console.

NOTE

The name of a new cluster can be the same as that of a failed or terminated cluster.

----End

Creating an MRS 1.6.3 Cluster

Step 1 Log in to the MRS management console.

Step 2 Click Purchase Cluster to open the Purchase Cluster page.

NOTE

Note the usage of quotas when you create a cluster. If the resource quotas are insufficient, apply for new quotas based on the prompted information and then create the cluster.

Step 3 Configure basic cluster information according to the following tables.

Table 4-18 Basic cluster configuration information

Parameter Description

Billing Mode MRS provides two billing modes:
- On-demand
- Yearly/Monthly

Current Region To change the region, click the current region in the upper left corner and select another one.


AZ An availability zone (AZ) is a physical area that uses independent power and network resources. Applications in different AZs are interconnected through internal networks but are physically isolated, which improves application availability. You are advised to create clusters in different AZs.
MRS allows an AZ to be selected randomly, which prevents excessive VMs from being created in the specified default AZ and avoids uneven resource occupation among AZs. MRS also tries to create all VMs of a tenant in one AZ.
If your VMs must be located in different AZs, specify the AZs when creating the VMs. In a multi-user, multi-AZ scenario, each user tries to obtain a default AZ that is different from other users' default AZs.
Select an AZ of the region for the cluster. Currently, only the CN East-Shanghai2 and CN North-Beijing1 regions are supported.
AZs are associated with each region as follows:
- CN North-Beijing1: AZ1 and AZ2
- CN East-Shanghai2: AZ1 and AZ2

Cluster Name Cluster name, which must be globally unique.
A cluster name can contain only 1 to 64 characters, including letters, digits, hyphens (-), and underscores (_).
The default name is mrs_xxxx, where xxxx is a random combination of four letters and digits.

Cluster Version Currently, MRS 1.5.1, MRS 1.6.3, MRS 1.7.1, and MRS 1.7.2 are supported. The latest version of MRS is used by default.


Kerberos Authentication

Indicates whether to enable Kerberos authentication when logging in to MRS Manager. Possible values are as follows:

- If Kerberos authentication is disabled, you can use all functions of an MRS cluster. You are advised to disable Kerberos authentication in single-user scenarios. For clusters with Kerberos authentication disabled, you can directly access the MRS cluster management page and components without security authentication. In this case, you can follow the instructions in Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled to perform security configuration.

- If Kerberos authentication is enabled, common users cannot use the file management and job management functions of an MRS cluster and cannot view cluster resource usage or the job records for Hadoop and Spark. To use more cluster functions, the users must contact the MRS Manager administrator to obtain more permissions. You are advised to enable Kerberos authentication in multi-user scenarios.

You can click the toggle to enable or disable Kerberos authentication. After creating an MRS cluster with Kerberos authentication enabled, users can manage the running cluster on MRS Manager. The users must prepare a working environment on the public cloud platform for accessing MRS Manager. For details, see Accessing MRS Manager Supporting Kerberos Authentication.

NOTE

The Kerberos Authentication, Username, Password, and Confirm Password parameters are displayed only after the user obtains the permission to use MRS in security mode.

Username Indicates the username of the MRS Manager administrator. admin is used by default.
This parameter needs to be configured only when Kerberos Authentication is enabled.


Password Indicates the password of the MRS Manager administrator. The password:
- Must contain 8 to 32 characters.
- Must contain at least three of the following character types:
  - Lowercase letters
  - Uppercase letters
  - Digits
  - Special characters: `~!@#$%^&*()-_=+\|[{}];:'",<.>/?
  - Spaces
- Must be different from the username.
- Must be different from the username written in reverse order.
Password strength: the color bar in red, orange, and green indicates a weak, medium, and strong password, respectively.
This parameter needs to be configured only when Kerberos Authentication is enabled.

Confirm Password Enter the password again.
This parameter needs to be configured only when Kerberos Authentication is enabled.

Cluster Type MRS 1.3.0 or later provides two types of clusters:
- Analysis cluster: used for offline data analysis and provides Hadoop components.
- Streaming cluster: used for streaming tasks and provides stream processing components.
NOTE
MRS streaming clusters do not support Job Management or File Management. If the cluster type is Streaming Cluster, the Create Job area is not displayed on the cluster creation page.


Component:
- MRS 1.6.3 supports the following components:
  Components of an analysis cluster:
  - Hadoop 2.7.2: distributed system architecture
  - Spark 2.1.0: in-memory distributed computing framework
  - HBase 1.3.1: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  - Loader 2.0.0: a tool based on open-source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system
  - Flume 1.6.0: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
- MRS 1.5.0 and MRS 1.5.1 support the following components:
  Components of an analysis cluster:
  - Hadoop 2.7.2: distributed system architecture
  - Spark 2.1.0: in-memory distributed computing framework
  - HBase 1.0.2: distributed column store database
  - Hive 1.2.1: data warehouse framework built on Hadoop
  - Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
  - Loader 2.0.0: a tool based on open-source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
  Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
  Components of a streaming cluster:
  - Kafka 0.10.0.0: distributed message subscription system
  - Storm 1.0.2: distributed real-time computing system
  - Flume 1.6.0: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data

VPC: A VPC is a secure, isolated, logical network environment. Select the VPC in which you want to create the cluster and click View VPC to view its name and ID. If no VPC is available, create one.


Subnet: A subnet provides dedicated network resources that are isolated from other networks, improving network security. Select the subnet in which you want to create the cluster to enter the VPC and view the subnet's name and ID. If no subnet has been created in the VPC, click Create Subnet to create one.
WARNING
Do not associate the subnet with a network ACL.

Security Group: A security group is a set of ECS access rules. It provides access policies for ECSs that have the same security protection requirements and are mutually trusted in a VPC. When you create an MRS cluster, you can select Auto Create from the Security Group drop-down list to create a security group, or select an existing one.

Cluster HA: Specifies whether to enable high availability for the cluster. This option is enabled by default.
If you enable this option, the management processes of all components are deployed on both Master nodes to achieve hot standby and prevent single points of failure, improving reliability. If you disable it, these processes are deployed on only one Master node, so if a process of a component becomes abnormal, the component will fail to provide services.
- Disabled: There is only one Master node. The number of Core nodes is three by default but can be decreased to 1.
- Enabled: There are two Master nodes. The number of Core nodes is three by default but can be decreased to 1.
You can click the toggle to enable or disable high availability.
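The Cluster Name and Password rules in the table above are straightforward to pre-check on the client side. The following minimal Python sketch is an illustration only (it is not part of any MRS SDK); it simply encodes the documented constraints so a request can be validated before it is submitted:

```python
import re
import string

# Rules copied from the table above.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9_-]{1,64}$")
SPECIALS = "`~!@#$%^&*()-_=+\\|[{}];:'\",<.>/?"

def is_valid_cluster_name(name: str) -> bool:
    """1 to 64 characters: letters, digits, hyphens, and underscores only.
    Global uniqueness can only be verified by the service itself."""
    return bool(NAME_PATTERN.match(name))

def is_valid_manager_password(password: str, username: str) -> bool:
    """8 to 32 characters, at least three character types, and not the
    username or the username spelled backwards."""
    if not 8 <= len(password) <= 32:
        return False
    if password in (username, username[::-1]):
        return False
    types = [
        any(c in string.ascii_lowercase for c in password),
        any(c in string.ascii_uppercase for c in password),
        any(c in string.digits for c in password),
        any(c in SPECIALS for c in password),
        " " in password,
    ]
    return sum(types) >= 3

print(is_valid_cluster_name("mrs_demo1"))                # True
print(is_valid_manager_password("Admin@2018", "admin"))  # True
```

Global uniqueness of the cluster name and the final password check are still enforced by the service itself; a local check like this only catches obvious mistakes early.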


Table 4-19 Cluster node information

Parameter Description

Type: MRS provides three types of nodes:
- Master: A Master node in an MRS cluster manages the cluster, assigns cluster executable files to Core nodes, traces the execution status of each job, and monitors the DataNode running status.
- Core: A Core node in a cluster processes data and stores processed data in HDFS.
- Task: A Task node in a cluster is used for computing and does not store persistent data. Yarn and Storm are mainly installed on Task nodes. Task nodes are optional, and the number of Task nodes can be zero. (Task nodes are supported by MRS 1.6.0 and later.)
  When cluster storage requirements change little but the cluster's service processing capability needs to be significantly and temporarily improved, add Task nodes to address the following situations:
  - The volume of temporary services increases, for example, report processing at the end of the year.
  - Long-term tasks must be completed in a short time, for example, urgent analysis tasks.

Instance Specifications: Instance specifications of Master and Core nodes. MRS supports host specifications determined by CPU, memory, and disk space.
In the CN North-Beijing1 and CN East-Shanghai2 regions, MRS 1.6.3 supports the instance specifications detailed in ECS Specifications Used by MRS.
NOTE
- More advanced instance specifications provide better data processing, but at a higher cluster cost.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, or d2.8xlarge.8, Data Disk is not displayed, because data disks are configured by default for these specifications. Other specifications do not include data disks; users must manually add data disks if they are required.
- If you select HDDs for Core nodes, there is no separate charging information for data disks; the fees are charged with the ECSs.
- If you select HDDs for Core nodes, the system disks (40 GB) of Master and Core nodes, as well as the data disks (200 GB) of Master nodes, are SATA disks.
- If you select non-HDD disks for Core nodes, the disk types of Master and Core nodes are determined by Data Disk.
- If Sold Out appears next to an instance specification, nodes of that specification cannot be purchased; you can only purchase nodes of other specifications.


Data Disks: Number of data disks on Master and Core nodes.
- Master: currently fixed at 2
- Core: 3 to 100
NOTE
- If more than 500 Core and Task nodes are required, contact technical support engineers or invoke a background interface to modify the database.
- Too few nodes may cause clusters to run slowly, while too many nodes may be unnecessarily costly. Set an appropriate value based on the data to be processed.

Storage Space: Data disk space of Core nodes. Users can add disks to increase storage capacity when creating a cluster. There are two configurations for storage and computing:
- Data storage and computing are separated: Data is stored in OBS, which features low cost and unlimited storage capacity, and clusters can be terminated at any time because the data remains in OBS. Computing performance is determined by OBS access performance and is lower than that of HDFS. This configuration is recommended if data computing is infrequent.
- Data storage and computing are not separated: Data is stored in HDFS, which features higher cost, high computing performance, and limited storage capacity. Before terminating a cluster, you must export and store the data. This configuration is recommended if data computing is frequent.
Currently, SATA, SAS, and SSD disks are supported:
- SATA: common I/O
- SAS: high I/O
- SSD: ultra-high I/O
Value range: 100 GB to 32,000 GB
NOTE
- The Master node automatically increases data disk storage space for MRS Manager. The disk type is the same as the data disk type of Core nodes. The default disk space is 200 GB and cannot be changed.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, d2.8xlarge.8, or d1.8xlarge, the Storage Space parameter is not displayed.


Table 4-20 Login information

Parameter Description

Login Mode:
- Password: You can log in to ECS nodes using a password. A password must meet the following requirements:
  1. Must be 8 to 26 characters long.
  2. Must contain at least three of the following character types: uppercase letters, lowercase letters, digits, and special characters (!@$%^-_=+[{}]:\,./?), but must not contain spaces.
  3. Cannot be the username or the username spelled backwards.
- Key Pair: Keys are used to log in to Master1 of the cluster. A key pair, also called an SSH key, consists of a public key and a private key. You can create an SSH key and download the private key for authenticating remote login. For security, a private key can be downloaded only once; keep it secure.
  Select the key pair, for example, SSHkey-bba1.pem, from the drop-down list. If you have obtained the private key file, select I acknowledge that I have obtained private key file SSHkey-bba1.pem and that without this file I will not be able to log in to my ECS. If no key pair has been created, click View Key Pair to create or import one, and then obtain the private key file.
  Configure an SSH key using either of the following methods:
  1. Create an SSH key: After you create an SSH key, a public key and a private key are generated. The public key is stored in the system, and the private key is stored locally. When you log in to an ECS, the public and private keys are used for authentication.
  2. Import an SSH key: If you already have public and private keys, import the public key into the system. When you log in to an ECS, the public and private keys are used for authentication.
(A key-pair login sketch follows this table.)
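For key-pair login, any standard SSH client works once you have the private key file. The sketch below uses the third-party paramiko library to log in to the Master1 node; the IP address, login user (root here), and key file name are assumptions to be replaced with your own values:

```python
import paramiko

# Assumed values: replace with your Master1 IP address and key file.
MASTER1_IP = "192.168.0.10"
KEY_FILE = "SSHkey-bba1.pem"

client = paramiko.SSHClient()
# Accept the host key on first connection; verify host keys properly
# in anything beyond a throwaway test.
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname=MASTER1_IP, username="root", key_filename=KEY_FILE)

# Run a trivial command to confirm that key-based login works.
_, stdout, _ = client.exec_command("hostname")
print(stdout.read().decode().strip())
client.close()
```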

Table 4-21 Log management information

Parameter Description

Logging: Indicates whether the tenant has enabled the log collection function. You can click the toggle to enable or disable log collection.


OBS Bucket: Indicates the log save path, for example, s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina.
Select I confirm that OBS bucket s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina will be created and used to collect MRS system logs only, and I will be charged for this service.
If an MRS cluster that supports logging fails to be created, you can use OBS to download the related logs for troubleshooting:
1. Log in to the OBS management console.
2. Select the mrs-log-<tenant_id>-<region_id> bucket from the bucket list and go to the /<cluster_id>/install_log folder to download the YYYYMMDDHHMMSS.tar.gz log, for example, /mrs-log-a3859af76b874760969cd24f2640bbb4-northchina/65d0a20f-bcb7-4da3-81d3-71fef12d993d/20170818091516.tar.gz.
(A programmatic download sketch follows this table.)
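Because the log path uses the s3a scheme, the log bucket can also be read with an S3-compatible client instead of the console. The following sketch uses boto3 against an assumed S3-compatible OBS endpoint; the endpoint URL, bucket name, and object key are illustrative and must be replaced with your own values:

```python
import boto3

# Assumed values: an S3-compatible OBS endpoint and credentials already
# configured for your account; bucket and key follow the example above.
s3 = boto3.client("s3", endpoint_url="https://obs.example-region.example.com")

bucket = "mrs-log-a3859af76b874760969cd24f2640bbb4-northchina"
key = "65d0a20f-bcb7-4da3-81d3-71fef12d993d/20170818091516.tar.gz"

# Download the install-log archive for offline troubleshooting.
s3.download_file(bucket, key, "install_log.tar.gz")
```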

Table 4-22 Purchase amount configuration

Parameter Description

Validity Duration: Validity duration of a cluster purchased in Yearly/Monthly billing mode. The duration ranges from one month to one year.

Table 4-23 Advanced settings

Parameter Description

Set Now: After you click Set Now, the page for adding a job, a tag, or a bootstrap action is displayed.
- For details about how to add a job, see Managing Jobs.
- For details about how to add a tag, see Managing Cluster Tags.
- For details about how to add a bootstrap action, see Bootstrap Actions.

Configure Later: You can set these parameters later.

Create Now: You can click Create Now to access the job creation page, and then click Create to enter the job configuration information.

Create Later: You can add the job configuration information later.


Create Job: You can click Create to submit a job at the same time as you create a cluster. Only one job can be added, and its status is Running after the cluster is created. For details, see Adding a Jar or Script Job. You can add jobs only when Kerberos Authentication is set to Disable.

Name: Name of a job

Type: Type of a job

Parameter: Key parameters for executing an application

Operation:
- Edit: modifies job configurations.
- Delete: deletes a job.

Step 4 When creating a yearly/monthly cluster, click Buy Now. When creating an on-demand cluster, click Create Now.

Step 5 Confirm the cluster specifications. If you select the Yearly/Monthly billing mode, click Submit Order. If you select the On-demand billing mode, click Submit Application to submit the cluster creation task.

Step 6 Click Back to Cluster List to view the cluster status.

For details about cluster status during cluster creation, see the Status parameter description in Table 4-3.

Cluster creation takes some time. The initial status of the cluster is Starting. After the cluster is created successfully, the cluster status becomes Running.

Users can create a maximum of 10 clusters at a time and manage a maximum of 100 clusters on the MRS management console.

NOTE

The name of a new cluster can be the same as that of a failed or terminated cluster.

----End

Creating an MRS 1.5.1 Cluster

Step 1 Log in to the MRS management console.

Step 2 Click Purchase Cluster and open the Purchase Cluster page.

NOTE

Note the usage of quotas when you create a cluster. If the resource quotas are insufficient, apply for new quotas based on the prompted information and create new clusters.

Step 3 Configure basic cluster information according to the following tables.


Table 4-24 Basic cluster configuration information

Parameter Description

Billing Mode: MRS provides two billing modes:
- On-demand
- Yearly/Monthly

Current Region: To change the region, click the region name in the upper left corner and select one.

AZ: An availability zone (AZ) is a physical area with independent power and network resources. Applications in different AZs are interconnected through internal networks but are physically isolated, which improves application availability. You are advised to create clusters in different AZs.
MRS randomly selects an AZ to prevent too many VMs from being created in the specified default AZ and to avoid uneven resource usage across AZs. MRS also creates all of a tenant's VMs in one AZ whenever possible. If your VMs must be located in different AZs, specify the AZs when creating the VMs. In a multi-user, multi-AZ scenario, each user is assigned a default AZ different from other users' default AZs.
Select an AZ of the region where the cluster resides. Currently, only the CN North-Beijing1 and CN East-Shanghai2 regions are supported. AZs are associated with each region as follows:
- CN North-Beijing1: AZ1 and AZ2
- CN East-Shanghai2: AZ1 and AZ2

Cluster Name: Cluster name, which must be globally unique. A cluster name can contain 1 to 64 characters, including only letters, digits, hyphens (-), and underscores (_). The default name is mrs_xxxx, where xxxx is a random combination of four letters and digits.

Cluster Version: Currently, MRS 1.5.1, MRS 1.6.3, MRS 1.7.1, and MRS 1.7.2 are supported. The latest version of MRS is used by default.


Kerberos Authentication: Indicates whether to enable Kerberos authentication when logging in to MRS Manager. Possible values are as follows:
- If Kerberos authentication is disabled, you can use all functions of an MRS cluster. You are advised to disable Kerberos authentication in single-user scenarios. For clusters with Kerberos authentication disabled, you can directly access the MRS cluster management page and components without security authentication.
- If Kerberos authentication is enabled, common users cannot use the file management and job management functions of an MRS cluster and cannot view cluster resource usage or the job records of Hadoop and Spark. To use more cluster functions, users must contact the MRS Manager administrator to assign more permissions. You are advised to enable Kerberos authentication in multi-user scenarios.
You can click the toggle to enable or disable Kerberos authentication.
After creating MRS clusters with Kerberos authentication enabled, users can manage running clusters on MRS Manager. To access MRS Manager, they must first prepare a working environment on the public cloud platform. For details, see Accessing MRS Manager Supporting Kerberos Authentication.
NOTE
The Kerberos Authentication, Username, Password, and Confirm Password parameters are displayed only after the user obtains the permission to use MRS in security mode.

Username: Indicates the username of the MRS Manager administrator. admin is used by default. This parameter needs to be configured only when Kerberos Authentication is set to Enable.


Password: Indicates the password of the MRS Manager administrator. The password:
- Must contain 8 to 32 characters.
- Must contain at least three of the following character types:
  - Lowercase letters
  - Uppercase letters
  - Digits
  - Special characters: `~!@#$%^&*()-_=+\|[{}];:'",<.>/?
  - Spaces
- Must be different from the username.
- Must be different from the username spelled backwards.
Password strength: the color bar in red, orange, and green indicates a weak, medium, and strong password, respectively.
This parameter needs to be configured only when Kerberos Authentication is set to Enable.

Confirm Password: Enter the password again. This parameter needs to be configured only when Kerberos Authentication is set to Enable.

Cluster Type: MRS 1.3.0 and later versions provide two types of clusters:
- Analysis cluster: used for offline data analysis; provides Hadoop components.
- Streaming cluster: used for streaming tasks; provides stream processing components.
NOTE
MRS streaming clusters do not support Job Management or File Management. If the cluster type is Streaming Cluster, the Create Job area is not displayed on the cluster creation page.


Component: MRS 1.5.1 supports the following components:
Components of an analysis cluster:
- Hadoop 2.7.2: distributed system architecture
- Spark 2.1.0: in-memory distributed computing framework
- HBase 1.0.2: distributed column store database
- Hive 1.2.1: data warehouse framework built on Hadoop
- Hue 3.11.0: provides the Hadoop UI capability, which enables users to analyze and process Hadoop cluster data in browsers
- Loader 2.0.0: a tool based on open-source Sqoop 1.99.7, designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
Hadoop is mandatory, and Spark and Hive must be used together. Select components based on your services.
Components of a streaming cluster:
- Kafka 0.10.0.0: distributed message subscription system
- Storm 1.0.2: distributed real-time computing system
- Flume 1.6.0: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data

VPC: A VPC is a secure, isolated, logical network environment. Select the VPC in which you want to create the cluster and click View VPC to view its name and ID. If no VPC is available, create one.

Subnet: A subnet provides dedicated network resources that are isolated from other networks, improving network security. Select the subnet in which you want to create the cluster to enter the VPC and view the subnet's name and ID. If no subnet has been created in the VPC, click Create Subnet to create one.
WARNING
Do not associate the subnet with a network ACL.

Security Group: A security group is a set of ECS access rules. It provides access policies for ECSs that have the same security protection requirements and are mutually trusted in a VPC. When you create an MRS cluster, you can select Auto Create from the Security Group drop-down list to create a security group, or select an existing one.


Cluster HA: Specifies whether to enable high availability for the cluster. This option is enabled by default.
If you enable this option, the management processes of all components are deployed on both Master nodes to achieve hot standby and prevent single points of failure, improving reliability. If you disable it, these processes are deployed on only one Master node, so if a process of a component becomes abnormal, the component will fail to provide services.
- Disabled: There is only one Master node. The number of Core nodes is three by default but can be decreased to 1.
- Enabled: There are two Master nodes. The number of Core nodes is three by default but can be decreased to 1.
You can click the toggle to enable or disable high availability.

Table 4-25 Cluster node information

Parameter Description

Type: MRS provides three types of nodes:
- Master: A Master node in an MRS cluster manages the cluster, assigns cluster executable files to Core nodes, traces the execution status of each job, and monitors the DataNode running status.
- Core: A Core node in a cluster processes data and stores processed data in HDFS.
- Task: A Task node in a cluster is used for computing and does not store persistent data. Yarn and Storm are mainly installed on Task nodes. Task nodes are optional, and the number of Task nodes can be zero. (Task nodes are supported by MRS 1.6.0 and later.)
  When cluster storage requirements change little but the cluster's service processing capability needs to be significantly and temporarily improved, add Task nodes to address the following situations:
  - The volume of temporary services increases, for example, report processing at the end of the year.
  - Long-term tasks must be completed in a short time, for example, urgent analysis tasks.


Instance Specifications: Instance specifications of Master, Core, and Task nodes. MRS supports host specifications determined by CPU, memory, and disk space.
- Task nodes support s3.xlarge.2, s3.xlarge.4, s3.2xlarge.2, s3.4xlarge.2, and s3.4xlarge.4.
- In the CN East-Shanghai2 and CN North-Beijing1 regions, MRS 1.5.1 supports the instance specifications detailed in ECS Specifications Used by MRS.
NOTE
- More advanced instance specifications provide better data processing, but at a higher cluster cost.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, or d2.8xlarge.8, Data Disk is not displayed, because data disks are configured by default for these specifications. Other specifications do not include data disks; users must manually add data disks if they are required.
- If you select HDDs for Core nodes, there is no separate charging information for data disks; the fees are charged with the ECSs.
- If you select HDDs for Core nodes, the system disks (40 GB) of Master and Core nodes, as well as the data disks (200 GB) of Master nodes, are SATA disks.
- If you select non-HDD disks for Core nodes, the disk types of Master and Core nodes are determined by Data Disk.
- If Sold Out appears next to an instance specification, nodes of that specification cannot be purchased; you can only purchase nodes of other specifications.

Data Disks: Number of data disks on Master and Core nodes.
- Master: currently fixed at 2
- Core: 3 to 100
NOTE
- If more than 500 Core and Task nodes are required, contact technical support engineers or invoke a background interface to modify the database.
- Too few nodes may cause clusters to run slowly, while too many nodes may be unnecessarily costly. Set an appropriate value based on the data to be processed.


Storage Space: Data disk space of Core nodes. Users can add disks to increase storage capacity when creating a cluster. There are two configurations for storage and computing:
- Data storage and computing are separated: Data is stored in OBS, which features low cost and unlimited storage capacity, and clusters can be terminated at any time because the data remains in OBS. Computing performance is determined by OBS access performance and is lower than that of HDFS. This configuration is recommended if data computing is infrequent.
- Data storage and computing are not separated: Data is stored in HDFS, which features higher cost, high computing performance, and limited storage capacity. Before terminating a cluster, you must export and store the data. This configuration is recommended if data computing is frequent.
Currently, SATA, SAS, and SSD disks are supported:
- SATA: common I/O
- SAS: high I/O
- SSD: ultra-high I/O
Value range: 100 GB to 32,000 GB
NOTE
- The Master node automatically increases data disk storage space for MRS Manager. The disk type is the same as the data disk type of Core nodes. The default disk space is 200 GB and cannot be changed.
- If the specifications of Core nodes are d1.4xlarge, d2.4xlarge.8, d2.8xlarge.8, or d1.8xlarge, the Storage Space parameter is not displayed.


Table 4-26 Login information

Parameter Description

Login Mode:
- Password: You can log in to ECS nodes using a password. A password must meet the following requirements:
  1. Must be 8 to 26 characters long.
  2. Must contain at least three of the following character types: uppercase letters, lowercase letters, digits, and special characters (!@$%^-_=+[{}]:\,./?), but must not contain spaces.
  3. Cannot be the username or the username spelled backwards.
- Key Pair: Keys are used to log in to Master1 of the cluster. A key pair, also called an SSH key, consists of a public key and a private key. You can create an SSH key and download the private key for authenticating remote login. For security, a private key can be downloaded only once; keep it secure.
  Select the key pair, for example, SSHkey-bba1.pem, from the drop-down list. If you have obtained the private key file, select I acknowledge that I have obtained private key file SSHkey-bba1.pem and that without this file I will not be able to log in to my ECS. If no key pair has been created, click View Key Pair to create or import one, and then obtain the private key file.
  Configure an SSH key using either of the following methods:
  1. Create an SSH key: After you create an SSH key, a public key and a private key are generated. The public key is stored in the system, and the private key is stored locally. When you log in to an ECS, the public and private keys are used for authentication.
  2. Import an SSH key: If you already have public and private keys, import the public key into the system. When you log in to an ECS, the public and private keys are used for authentication.

Table 4-27 Log management information

Parameter Description

Logging: Indicates whether the tenant has enabled the log collection function. You can click the toggle to enable or disable log collection.


OBS Bucket: Indicates the log save path, for example, s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina.
Select I confirm that OBS bucket s3a://mrs-log-a3859af76b874760969cd24f2640bbb4-northchina will be created and used to collect MRS system logs only, and I will be charged for this service.
If an MRS cluster that supports logging fails to be created, you can use OBS to download the related logs for troubleshooting:
1. Log in to the OBS management console.
2. Select the mrs-log-<tenant_id>-<region_id> bucket from the bucket list and go to the /<cluster_id>/install_log folder to download the YYYYMMDDHHMMSS.tar.gz log, for example, /mrs-log-a3859af76b874760969cd24f2640bbb4-northchina/65d0a20f-bcb7-4da3-81d3-71fef12d993d/20170818091516.tar.gz.

Table 4-28 Purchase amount configuration

Parameter Description

Validity Duration: Validity duration of a cluster purchased in Yearly/Monthly billing mode. The duration ranges from one month to one year.

Table 4-29 Advanced settings

Parameter Description

Set Now: After you click Set Now, the page for adding a job, a tag, or a bootstrap action is displayed.
- For details about how to add a job, see Managing Jobs.
- For details about how to add a tag, see Managing Cluster Tags.
- For details about how to add a bootstrap action, see Bootstrap Actions.

Configure Later: You can set these parameters later.

Create Now: You can click Create Now to access the job creation page, and then click Create to enter the job configuration information.

Create Later: You can add the job configuration information later.


Create Job: You can click Create to submit a job at the same time as you create a cluster. Only one job can be added, and its status is Running after the cluster is created. For details, see Adding a Jar or Script Job. You can add jobs only when Kerberos Authentication is set to Disable.

Name: Name of a job

Type: Type of a job

Parameter: Key parameters for executing an application

Operation:
- Edit: modifies job configurations.
- Delete: deletes a job.

Step 4 When creating a yearly/monthly cluster, click Buy Now. When creating an on-demand cluster, click Create Now.

Step 5 Confirm the cluster specifications. If you select the Yearly/Monthly billing mode, click Submit Order. If you select the On-demand billing mode, click Submit Application to submit the cluster creation task.

Step 6 Click Back to Cluster List to view the cluster status.

For details about cluster status during cluster creation, see the Status parameter description in Table 4-3.

Cluster creation takes some time. The initial status of the cluster is Starting. After the cluster is created successfully, the cluster status becomes Running.

Users can create a maximum of 10 clusters at a time and manage a maximum of 100 clusters on the MRS management console.

NOTE

The name of a new cluster can be the same as that of a failed or terminated cluster.

----End


4.6 Managing Active Clusters
After an MRS cluster is created, you can view basic information and patch information about the cluster, records of completed jobs, and the cluster management page.

4.6.1 Viewing Basic Information About an Active Cluster
After clusters are created, you can monitor and manage them. Choose Cluster > Active Cluster, select a cluster, and click its name to switch to the cluster information page. Click Cluster Details to view information about the cluster, such as its configurations, deployed nodes, and other basic information.

Table 4-33, Table 4-34, Table 4-35, Table 4-36, Table 4-37, and Table 4-38 describe the cluster configuration and node information.

Table 4-33 Cluster configuration information

Parameter Description

Cluster Name: Cluster name. This parameter is set when a cluster is created.

Cluster Status: Status of a cluster


Cluster Manager: Click View to open the Cluster Manager page.

Cluster Details: Cluster details include basic information, node information, and component information. You can click Cluster Details again to hide the information.

Table 4-34 Basic information

Parameter Description

Cluster Version: MRS version. Currently, MRS 1.5.1, MRS 1.6.3, MRS 1.7.1, and MRS 1.7.2 are supported. This parameter is set when a cluster is created.

Cluster Type: MRS 1.3.0 and later versions provide two types of clusters:
- Analysis cluster: used for offline data analysis; provides Hadoop components.
- Streaming cluster: used for streaming tasks; provides stream processing components.
This parameter is set when a cluster is created.

Cluster ID: Unique identifier of a cluster. This parameter is automatically assigned when a cluster is created.

Created: Time when the cluster is created, which is when MRS starts charging the customer for the cluster.

Billing Mode: Billing mode of a cluster. Currently, Yearly/Monthly and On-demand are supported.

AZ: AZ of the region where the cluster resides. This parameter is set when a cluster is created. AZs are associated with each region as follows:
- CN North-Beijing1: AZ1 and AZ2
- CN East-Shanghai2: AZ1, AZ2, and AZ3
- CN South-Guangzhou: AZ1, AZ2, and AZ3

VPC: VPC information. This parameter is set when a cluster is created. A VPC is a secure, isolated, logical network environment.

Subnet: Subnet information. This parameter is set when a cluster is created. A subnet provides dedicated network resources that are isolated from other networks, improving network security.


Cluster Manager IP Address: Floating IP address for accessing MRS Manager. This parameter is displayed only after Kerberos authentication is enabled.

Key Pair: Key pair name. This parameter is set when a cluster is created and is not displayed if you set Login Mode to Password.

Table 4-35 Node information

Parameter Description

Master Node: Information about the Master node. Format: [ECS type-instance specification | node quantity]

Core Node: Information about a Core node. Format: [ECS type-instance specification | node quantity]

Task Node: Information about a Task node. Format: [ECS type-instance specification | node quantity]

Active Master Node IP Address: IP address of the active Master node in a cluster, which is also the IP address of the active management node of MRS Manager.

Task Node Auto Scaling: Auto scaling can automatically adjust computing resources based on customers' service requirements and preset policies. With auto scaling, the number of Task node instances increases or decreases as the service load rises and falls, ensuring smooth running of services.

Table 4-36 Component information

Parameter Description

Hadoop Version: Hadoop version

Spark Version: Spark version. Only a Spark cluster displays this version. Because Spark and Hive must be used together, the Spark and Hive versions are displayed at the same time.

HBase Version: HBase version. Only an HBase cluster displays this version.

Hive Version: Hive version. Only a Hive cluster displays this version.


Hue Version: Hue version. For MRS 1.3.0, this parameter is displayed only after Kerberos authentication is enabled. For MRS 1.5.0 and later versions, this parameter is displayed regardless of whether Kerberos authentication is enabled.

Loader Version: Loader version. This parameter is displayed when the MRS version is 1.5.0 or later.

Kafka Version: Kafka version. Only a Kafka cluster displays this version.

Storm Version: Storm version. Only a Storm cluster displays this version.

Flume Version: Flume version. This parameter is displayed when the MRS version is 1.5.0 or later.

Kerberos Authentication: Indicates whether Kerberos authentication is enabled when logging in to MRS Manager.

Logging: Indicates whether the tenant has enabled the log collection function.

Table 4-37 Node description

Parameter Description

Resize Cluster: For details about adding or deleting Core/Task nodes, see Expanding a Cluster or Shrinking a Cluster. This applies to MRS 1.6.0 and later.
Resize Cluster is unavailable, and capacity expansion or reduction is not allowed, in any of the following situations:
- The cluster is not in the running state.
- The number of Core or Task nodes exceeds the maximum value (500).
- The cluster billing mode is not on-demand.

Add Node: For details about adding a Core node to a cluster, see Expanding a Cluster. This applies to MRS 1.6.0 and earlier.
Add Node is unavailable, and capacity expansion is not allowed, in any of the following situations:
- The cluster is not in the running state.
- The number of Core nodes exceeds the maximum value (100).
- The cluster billing mode is not on-demand.

Name: Name of a cluster node


Status: Status of a cluster node

Type: Node type:
- Master: A Master node in an MRS cluster manages the cluster, assigns MapReduce executable files to Core nodes, traces the execution status of each job, and monitors the DataNode running status.
- Core: A Core node in a cluster processes data and stores processed data in HDFS.
- Task: A Task node in a cluster is used for computing and does not store persistent data. Task nodes are optional, and the number of Task nodes can be zero. (Task nodes are supported by MRS 1.6.0 and later.)

IP Address: IP address of a cluster node

Specifications: Instance specifications of a node, determined by the CPU, memory, and disks used.
NOTE
More advanced instance specifications allow better data processing, although they have a higher cluster cost.

Default Security Group: Security group names for Master and Core/Task nodes, which are automatically assigned when a cluster is created. This is the default security group. Do not modify or delete it, because doing so will cause a cluster exception.

Table 4-38 Button description

Button Description

Refresh: Click this button to manually refresh the node information.

4.6.2 Viewing Patch Information About an Active Cluster
You can view patch information about cluster components. If a cluster component, such as Hadoop or Spark, is abnormal, download the patch, choose Cluster > Active Cluster, select the cluster, and click its name to switch to the cluster information page. Then upgrade the component to resolve the problem.

Patch Information is displayed on the basic information page only when patch information exists in the database. Patch information contains the following parameters:

- Patch Name: patch name set when the patch is uploaded to OBS

- Patch Path: location where the patch is stored in OBS


- Patch Description: patch description

4.6.3 Accessing the Cluster Management Page
After Kerberos authentication is disabled, you can choose Cluster > Active Cluster, select a cluster and click its name to switch to the cluster information page, and then click View to go to the cluster management page. On this page you can view and handle alarms, modify cluster configurations, and upgrade cluster patches.

You can enter the cluster management page only for clusters in the Running, Expanding, or Shrinking state. For details about how to use the cluster management page, see the MRS Manager Operation Guide.

4.6.4 Expanding a Cluster
The storage and computing capabilities of MRS can be improved by simply adding Core or Task nodes instead of modifying the system architecture, which reduces O&M costs. Core nodes can process and store data; you can add Core nodes to increase the number of nodes handling peak load. Task nodes are used for computing and do not store persistent data.

Background

An MRS cluster supports a maximum of 502 nodes. A cluster has one or two Master nodes by default and must have at least one Core node. A maximum of 500 Core and Task nodes are supported by default.

Core and Task nodes can be added but Master nodes cannot. The maximum number of nodes that can be added equals 500 minus the number of existing Core and Task nodes; for example, if there are 3 existing Core nodes, a maximum of 497 nodes can be added. If a cluster fails to be expanded, you can perform capacity expansion for the cluster again. (A small calculation sketch follows.)
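As a quick illustration of the headroom rule described above (plain arithmetic, not an MRS API), the following snippet computes how many nodes can still be added:

```python
MAX_RESIZABLE_NODES = 500  # default ceiling for Core plus Task nodes

def max_nodes_addable(core_nodes: int, task_nodes: int) -> int:
    """Nodes that can still be added under the default 500-node limit."""
    return MAX_RESIZABLE_NODES - (core_nodes + task_nodes)

print(max_nodes_addable(3, 0))  # 497, matching the example above
```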

If no node is added during cluster creation, you can specify the number of nodes to be added during capacity expansion; however, you cannot specify which nodes are added.

Cluster capacity expansion operations vary according to the cluster version you select.

Expanding a Cluster Charged in On-demand Mode

If the cluster version is MRS 1.6.0 or later, perform the following operations:

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Resize Cluster and go to the Resize Cluster page.

The expansion operation can be performed only on running clusters.

Step 4 Set Node Type to Core Node or Task Node, configure the Nodes After Resize parameter, enable Run Bootstrap Action if required, and click OK.


NOTE

- If there is no Task Node in the Node Type drop-down list, follow the instructions in Related Operations to add the Task node type.
- If you enable Run Bootstrap Action, the bootstrap action script you added during cluster creation will be executed on all added nodes. Only MRS 1.7.1 and later support bootstrap actions.

Step 5 In the Expand Node window, click OK.

Step 6 In the Information dialog box, click OK.

Cluster expansion is explained as follows:
- During expansion: The cluster status is Expanding. Submitted jobs will be executed, and you can submit new jobs. You are not allowed to continue to expand, restart, modify, or terminate the cluster.
- Successful expansion: The cluster status is Running. The resources used on both the old and new nodes are charged.
- Failed expansion: The cluster status is Running. You are allowed to execute jobs and expand the cluster again.

After the cluster expansion is successful, you can view node information on the cluster information page.

----End

If the cluster version is MRS 1.5.1, perform the following operations:

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Add Node.

Only the capacity of running clusters can be expanded and only Core nodes can be added.

Step 4 Set Number of Nodes and click OK.

Step 5 In the Information dialog box, click OK.

Cluster expansion is explained as follows:
- During expansion: The cluster status is Expanding. Submitted jobs will be executed, and you can submit new jobs. You are not allowed to continue to expand, restart, modify, or terminate the cluster.
- Successful expansion: The cluster status is Running. The resources used on both the old and new nodes are charged.
- Failed expansion: The cluster status is Running. You are allowed to execute jobs and expand the cluster again.


After the cluster capacity expansion is successful, you can view node information on the cluster information page.

----End

Expanding a Cluster Charged in Yearly/Monthly Mode
If the cluster version is MRS 1.6.0 or later, perform the following operations:

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Resize Cluster and go to the Resize Cluster page.

The expansion operation can be performed only on running clusters.

Step 4 Set Node Type to Core Node or Task Node and configure the Nodes After Resize parameter. The cluster expiration time and the price you pay for adding nodes will be displayed.

If there is no Task Node in the Node Type drop-down list, follow the instructions in Related Operations to add the Task node type.

- Click Submit Order. On the displayed Purchase MapReduce Service page, click Pay.
- Click Confirm order, not payment. On the basic information page of the cluster, choose Fees > My Orders and click Pay.

Step 5 After the order is successfully paid, return to the MRS management console and view the cluster status.

Cluster expansion is explained as follows:
- During expansion: The cluster status is Expanding. Submitted jobs will be executed, and you can submit new jobs. You are not allowed to continue to expand, restart, modify, or terminate the cluster.
- Successful expansion: The cluster status is Running. The resources used on both the old and new nodes are charged.
- Failed expansion: The cluster status is Running. You are allowed to execute jobs and expand the cluster again.

After the cluster expansion is successful, you can view node information on the cluster information page.

----End

If the cluster version is MRS 1.5.1, perform the following operations:

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Add Node.

Only the capacity of running clusters can be expanded and only Core nodes can be added.


Step 4 Set Number of Nodes and click OK.

Step 5 In the Information dialog box, click OK.

Cluster expansion is explained as follows:
- During expansion: The cluster status is Expanding. Submitted jobs will be executed, and you can submit new jobs. You are not allowed to continue to expand, restart, modify, or terminate the cluster.
- Successful expansion: The cluster status is Running. The resources used on both the old and new nodes are charged.
- Failed expansion: The cluster status is Running. You are allowed to execute jobs and expand the cluster again.

After the cluster capacity expansion is successful, you can view node information on the cluster information page.

----End

Related Operations

Perform the following operations to add Task nodes:

1. On the cluster information page, click the add button behind Task to add Task nodes.

2. On the Add Task Node page, configure Instance Specifications and Nodes. In addition, if Add Data Disk is enabled, configure the storage type, size, and number of data disks, and click OK.

3. In the Information dialog box, click OK.

4.6.5 Shrinking a Cluster
You can reduce the number of Core or Task nodes based on service requirements to shrink a cluster, so that MRS provides storage and computing capabilities matched to your needs at lower O&M costs.

NOTE

You can shrink only clusters charged in On-demand mode; clusters charged in Yearly/Monthly mode cannot be shrunk.

Background

An MRS cluster supports a maximum of 502 nodes. A cluster has one or two Master nodes by default and must have at least one Core node. A maximum of 500 Core and Task nodes are supported by default. If more than 500 Core and Task nodes are required, contact technical support engineers or invoke a background interface to modify the database.


Core and Task nodes can be reduced but Master nodes cannot. After you adjust the number of nodes when shrinking a cluster, the system automatically selects the nodes to delete. At least one Core node must remain; the number of Task nodes can be reduced to zero.

Node selection policy (a sketch of these preferences follows this list):
- Service components such as ZooKeeper, DBService, KrbServer, and LdapServer are fundamental to stable cluster running, so their nodes cannot be deleted.
- Core nodes are used to store cluster service data. When shrinking a cluster, data on the nodes to be deleted must be fully migrated to other nodes. Therefore, perform follow-up operations, such as making nodes exit MRS Manager and deleting ECSs, only after all services have been decommissioned. When selecting nodes, the system prefers healthy nodes that store a small volume of data and whose instances can be decommissioned, to avoid node decommission failures. For example, if DataNodes are installed on Core nodes in an analysis cluster, healthy DataNodes that store a small volume of data are preferred.
- Task nodes are computing nodes and do not store cluster data, so node data migration is not involved. When shrinking a cluster, the system prefers nodes whose health status is Bad, Unknown, or Partially Healthy. You can view the health status of nodes on the instance management page after logging in to MRS Manager.
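To make the selection policy concrete, the following minimal sketch ranks removal candidates the way the policy describes: unhealthy Task nodes first, and healthy Core nodes holding the least data first. It is an illustration of the documented preferences, not the actual MRS scheduler:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Node:
    name: str
    health: str          # "Good", "Partially Healthy", "Bad", or "Unknown"
    data_volume_gb: int  # data stored on the node (relevant for Core nodes)

# Less healthy Task nodes are preferred for removal, per the policy above.
TASK_REMOVAL_PRIORITY = {"Bad": 0, "Unknown": 1, "Partially Healthy": 2, "Good": 3}

def task_nodes_to_remove(nodes: List[Node], count: int) -> List[Node]:
    return sorted(nodes, key=lambda n: TASK_REMOVAL_PRIORITY[n.health])[:count]

def core_nodes_to_remove(nodes: List[Node], count: int) -> List[Node]:
    # Prefer healthy nodes holding the least data, keeping migration cheap.
    healthy = [n for n in nodes if n.health == "Good"]
    return sorted(healthy, key=lambda n: n.data_volume_gb)[:count]

tasks = [Node("task1", "Good", 0), Node("task2", "Bad", 0)]
cores = [Node("core1", "Good", 800), Node("core2", "Good", 120)]
print([n.name for n in task_nodes_to_remove(tasks, 1)])  # ['task2']
print([n.name for n in core_nodes_to_remove(cores, 1)])  # ['core2']
```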

Cluster shrinking verification policies

Component decommissioning restrictions vary. Cluster shrinking is allowed only after all component decommissioning restrictions are complied with. Table 4-39 describes the component decommissioning restrictions.

Table 4-39 Component decommissioning restrictions

Component Decommissioning Restriction

HDFS/DataNode
Restriction: After cluster shrinking, the number of nodes must not be fewer than the number of HDFS replicas, and the total volume of HDFS data must not exceed 80% of the total HDFS capacity of the shrunk cluster.
Cause: This ensures that there is sufficient space to store existing data and that some space remains reserved.

HBase/RegionServer
Restriction: The total available memory of RegionServers on the remaining nodes must be greater than 1.2 times the memory used by RegionServers on the nodes to be deleted.
Cause: Regions on a decommissioned node are migrated to other nodes, so the available memory of the other nodes must be sufficient to bear the migrated regions.

Kafka/Broker
Restriction: After shrinking, the number of nodes must not be fewer than the maximum number of topic replicas, and the used Kafka disk space must not exceed 80% of the total Kafka disk space of the cluster.
Cause: This avoids insufficient disk space after cluster shrinking.

Storm/Supervisor
Restriction: The number of slots in the shrunk cluster must be sufficient to run the submitted jobs.
Cause: This prevents resources from being insufficient to execute streaming processing tasks.

MapReduce ServiceUser Guide 4 Cluster Operation Guide

Issue 01 (2018-09-06) 86

Page 97: User Guide · MapReduce Service

Component Decommissioning Restriction

Flume/FlumeServer

Restriction: If FlumeServer is installed and Flume tasks have beenconfigured on a node, the node cannot be deleted.Cause: This prevents the deployed service applications from beingmistakenly deleted.
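For example, before shrinking Core nodes in an analysis cluster, you can check the HDFS restriction in Table 4-39 from a Master node. The following is a minimal sketch, assuming an SSH session on a Master node with the preinstalled cluster client; the 80% threshold mirrors the restriction above:

#!/bin/bash
# Load the cluster client environment (preinstalled on MRS nodes).
source /opt/client/bigdata_env
# Report configured capacity, used space, replication, and per-DataNode usage.
hdfs dfsadmin -report
# Summarize the total volume of data currently stored in HDFS.
hdfs dfs -du -s /

If the used space reported here would exceed 80% of the capacity remaining after the selected nodes are removed, the decommission is likely to be rejected.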

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Resize Cluster and go to the Resize Cluster page.

This operation can be performed only on a running cluster in which all nodes are running.

Step 4 Set Node Type to Core Node or Task Node, and configure the Nodes After Resize parameter.

Step 5 On the Shrink Node page, click OK.

Step 6 In the Information dialog box, click OK.

Cluster shrinking is explained as follows:

- During shrinking: The cluster status is Shrinking. Submitted jobs continue to be executed and you can submit new jobs, but you are not allowed to shrink or terminate the cluster again. You are advised not to restart the cluster or modify the cluster configuration.
- Successful shrinking: The cluster status is Running. The resources used after node reduction are charged.
- Failed shrinking: The cluster status is Running. You are allowed to execute jobs and shrink the cluster again.

After the cluster shrink is successful, you can view node information on the cluster information page.

----End

4.6.6 Performing Auto Scaling for a Cluster

Auto scaling can automatically adjust computing resources based on service requirements and the policies preset by users, so the number of Task nodes can increase or decrease with service load changes, ensuring stable service running.

NOTE

You can perform auto scaling for clusters charged in On-demand mode rather than clusters charged in Yearly/Monthly mode.

Background

Auto scaling rules:


- A user can set a maximum of five rules each for expanding and shrinking a cluster.
- The system evaluates the rules in sequence, and cluster expansion rules take priority over cluster shrinking rules. Order the rules by importance, with the most important rule first, to prevent rules from being repeatedly triggered by an unexpected result of a cluster expansion or shrinking.
- Comparison operators are Greater than, Greater than or equal to, Less than, and Less than or equal to.
- Cluster expansion or shrinking is triggered only after the configured metric threshold has been reached for 5n consecutive minutes (the default value of n is 1).
- After each cluster expansion or shrinking, there is a cooling time. The default cooling time is 20 minutes and the minimum cooling time is 0 minutes.
- In each cluster expansion or shrinking, at least one node and at most 100 nodes can be added or removed.
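For example, a hypothetical expansion rule built from the metrics in Table 4-40 and the rule fields described in the procedure below might be: Rule Name: expand-on-low-memory; If: YARNMemoryAvailablePercentage Less than or equal to 20; Last: 5 minutes; Add: 1 node; Cooling Time: 20 minutes. With this rule, once the available YARN memory stays at or below 20% for five consecutive minutes, one Task node is added, and no further scaling is triggered during the next 20 minutes.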

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Unfold the cluster details and click the button next to Task Node Auto Scaling. The Auto Scaling page is displayed.

Step 4 Configure auto scaling rules.

You can adjust the number of nodes to configure auto scaling rules. Node quantity adjustment affects prices. Adjust nodes with caution.

Figure 4-4 Auto Scaling

- Auto Scaling: indicates whether to enable auto scaling. Auto scaling is disabled by default. After you enable it, you can configure the following parameters.

- Nodes: Enter the minimum and maximum numbers of nodes. The value range is 0 to 500 and applies to all expansion and shrinking rules.


- Auto Scaling Rule: To enable Auto Scaling, configure expansion or shrinking rules.

Configuration procedure:

a. Select Expand or Shrink.

b. Click Add Rule. The Add Rule page is displayed.

Figure 4-5 Add Rule

c. Configure the Rule Name, If, Last, Add, and Cooling Time parameters.

d. Click OK.

You can view the rules you configured in the Expand or Shrink area on the Auto Scaling page.

- Select I agree to authorize MRS to expand or shrink nodes based on the above policy.

Step 5 Click OK.

----End

Related Information

When adding rules, you can refer to Table 4-40 to configure auto scaling metrics.

Table 4-40 Auto scaling metrics

Streaming cluster:
- StormSlotAvailable (Integer): Number of available Storm slots. Value range: 0 to 2147483647.
- StormSlotAvailablePercentage (Percentage): Percentage of available Storm slots, that is, the proportion of available slots to total slots. Value range: 0 to 100.
- StormSlotUsed (Integer): Number of used Storm slots. Value range: 0 to 2147483647.
- StormSlotUsedPercentage (Percentage): Percentage of used Storm slots, that is, the proportion of used slots to total slots. Value range: 0 to 100.

Analysis cluster:
- YARNAppPending (Integer): Number of pending tasks on YARN. Value range: 0 to 2147483647.
- YARNAppPendingRatio (Ratio): Ratio of pending tasks on YARN, that is, the ratio of pending tasks to running tasks on YARN. Value range: 0 to 2147483647.
- YARNAppRunning (Integer): Number of running tasks on YARN. Value range: 0 to 2147483647.
- YARNContainerAllocated (Integer): Number of containers allocated to YARN. Value range: 0 to 2147483647.
- YARNContainerPending (Integer): Number of pending containers on YARN. Value range: 0 to 2147483647.
- YARNContainerPendingRatio (Ratio): Ratio of pending containers on YARN, that is, the ratio of pending containers to running containers on YARN. Value range: 0 to 2147483647.
- YARNCPUAllocated (Integer): Number of virtual CPUs (vCPUs) allocated to YARN. Value range: 0 to 2147483647.
- YARNCPUAvailable (Integer): Number of available vCPUs on YARN. Value range: 0 to 2147483647.
- YARNCPUAvailablePercentage (Percentage): Percentage of available vCPUs on YARN, that is, the proportion of available vCPUs to total vCPUs. Value range: 0 to 100.
- YARNCPUPending (Integer): Number of pending vCPUs on YARN. Value range: 0 to 2147483647.
- YARNMemoryAllocated (Integer): Memory allocated to YARN, in MB. Value range: 0 to 2147483647.
- YARNMemoryAvailable (Integer): Available memory on YARN, in MB. Value range: 0 to 2147483647.
- YARNMemoryAvailablePercentage (Percentage): Percentage of available memory on YARN, that is, the proportion of available memory to total memory on YARN. Value range: 0 to 100.
- YARNMemoryPending (Integer): Pending memory on YARN. Value range: 0 to 2147483647.

NOTE

When the value type is percentage or ratio in Table 4-40, the value can be accurate to two decimal places. The percentage metric value is a decimal value with the percent sign (%) removed. For example, 16.80 represents 16.80%.

4.6.7 Terminating a Cluster

If you do not need an MRS cluster after the job execution is complete, you can terminate the MRS cluster. After the MRS cluster is terminated, no fee is charged.

Background

If a cluster is terminated before data processing and analysis are completed, data loss may occur. Therefore, exercise caution when terminating a cluster. If MRS cluster deployment fails, the cluster is automatically terminated.

Yearly/Monthly clusters cannot be terminated.

Procedure

Step 1 Log in to the MRS management console.

Step 2 In the navigation tree of the MRS management console, choose Cluster > Active Cluster.

Step 3 In the Operation column of the cluster that you want to terminate, click Terminate.

The cluster status changes from Running to Terminating. After the cluster is terminated, the cluster status changes to Terminated and the cluster is displayed in Cluster History. No fee will be charged accordingly.

----End


4.6.8 Deleting a Failed Task

This section describes how to delete a failed MRS task.

Background

If cluster creation, termination, capacity expansion, or capacity reduction fails, the failed task is displayed on the Manage Failed Task page. A failed cluster termination task is also displayed on the Cluster History page. If you do not need the failed task, you can delete it on the Manage Failed Task page.

Procedure

Step 1 Log in to the MRS management console.

Step 2 In the navigation tree of the MRS management console, choose Cluster > Active Cluster.

Step 3 Click the icon next to Failed Task.

The Manage Failed Task page is displayed.

Step 4 In the Operation column of the task that you want to delete, click Delete.

This operation deletes only a single failed task.

Step 5 You can click Delete All on the upper left of the task list to delete all tasks.

----End

4.6.9 Managing Jobs in an Active Cluster

For details about managing jobs in an active cluster, see Managing Jobs.

4.6.10 Managing Data Files

After Kerberos authentication is disabled, you can create directories, delete directories, and import, export, or delete files on the File Management page.

Prerequisites

You have administrator rights on MRS Manager.

Background

Data to be processed by MRS is stored in either OBS or HDFS. OBS provides you with massive, highly reliable, and secure data storage capabilities at a low cost. You can view, manage, and use data through OBS Console or OBS Browser. In addition, you can use the REST APIs to manage or access data. The REST APIs can be used alone or integrated with service programs.

Before creating jobs, upload the local data to OBS for computing and analysis. MRS allows data to be imported from OBS to HDFS for computing and analysis. After the analysis and computing are complete, you can either store the data in HDFS or export it to OBS. HDFS and OBS can store compressed data in the bz2 or gz format.


Importing Data

MRS supports data import from the OBS system to HDFS. This function is recommended if the data size is small, because the upload speed decreases as the file size increases.

Both files and folders containing files can be imported. The operations are as follows:

1. Log in to the MRS management console.
2. Choose Cluster > Active Cluster, select a cluster, and click its name to switch to the cluster information page.
3. Click File Management and go to the File Management tab page.
4. Select HDFS File List.
5. Click the data storage directory, for example, bd_app1.
   bd_app1 is just an example. The storage directory can be any directory on the page. You can create a directory by clicking Create Folder. The name of the created directory must meet the following requirements:
   - Contains a maximum of 255 characters, and the full path contains a maximum of 1023 characters.
   - Cannot be empty.
   - Cannot contain special characters (/:*?"<|>\;&,'$).
   - Cannot start or end with a period (.).

6. Click Import Data to configure the paths for HDFS and OBS.

NOTE

When configuring the OBS or HDFS path, click Browse, select the file path, and click OK.

- The path for OBS
  - Must start with s3a://. s3a:// is used by default.
  - Files and programs encrypted by the KMS cannot be imported.
  - Empty folders cannot be imported.
  - Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
  - Directories and file names cannot start or end with spaces, but can have spaces between other characters.
  - The full path of OBS contains a maximum of 1023 characters.
- The path for HDFS
  - It starts with /user by default.
  - Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
  - Directories and file names cannot start or end with spaces, but can have spaces between other characters.
  - The full path of HDFS contains a maximum of 1023 characters.
  - The parent HDFS directory in HDFS File List is displayed in the text box for the HDFS path by default when data is imported.
7. Click OK.


View the upload progress in File Operation Record. The data import operation is run as a Distcp job by MRS. You can check whether the Distcp job is successfully executed in Job Management > Job.
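If you prefer the command line, a similar copy can be run manually with the Hadoop client on a Master node. The following is a minimal sketch, not the exact command the console issues; the bucket name, paths, endpoint, and credentials are placeholders you must replace:

#!/bin/bash
# Load the cluster client environment.
source /opt/client/bigdata_env
# Copy a directory from OBS into HDFS as a distributed copy (Distcp) job.
hadoop distcp -Dfs.s3a.endpoint=<obs-endpoint> \
  -Dfs.s3a.access.key=<your-ak> \
  -Dfs.s3a.secret.key=<your-sk> \
  s3a://yourbucket/input /user/bd_app1

Like the console operation, this runs as a MapReduce job, so its progress can also be followed in the YARN application list.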

Exporting Data

After data is processed and analyzed, you can either store the data in HDFS or export it to the OBS system.

Both files and folders containing files can be exported. The operations are as follows:

1. Log in to the MRS management console.
2. Choose Cluster > Active Cluster, select a cluster, and click its name to switch to the cluster information page.
3. Click File Management and go to the File Management tab page.
4. Select HDFS File List.
5. Click the data storage directory, for example, bd_app1.
6. Click Export Data and configure the paths for HDFS and OBS.

NOTE

When configuring the OBS or HDFS path, click Browse, select the file path, and click OK.

- The path for OBS
  - Must start with s3a://. s3a:// is used by default.
  - Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
  - Directories and file names cannot start or end with spaces, but can have spaces between other characters.
  - The full path of OBS contains a maximum of 1023 characters.
- The path for HDFS
  - It starts with /user by default.
  - Directories and file names can contain letters, Chinese characters, digits, hyphens (-), or underscores (_), but cannot contain special characters (;|&><'$*?\).
  - Directories and file names cannot start or end with spaces, but can have spaces between other characters.
  - The full path of HDFS contains a maximum of 1023 characters.
  - The parent HDFS directory in HDFS File List is displayed in the text box for the HDFS path by default when data is exported.

NOTE

Ensure that the exported folder is not empty. If an empty folder is exported to the OBS system, the folder is exported as a file. After the folder is exported, its name is changed, for example, from test to test-$folder$, and its type is file.

7. Click OK.
   View the export progress in File Operation Record. The data export operation is run as a Distcp job by MRS. You can check whether the Distcp job is successfully executed in Job Management > Job.


Viewing File Operation Records

When importing or exporting data on the MRS management console, you can choose File Management > File Operation Record to view the import or export progress.

Table 4-41 lists the parameters in file operation records.

Table 4-41 Parameters in file operation records

- Created: Time when data import or export is started.
- Source Path: Source path of data. In data import, Source Path is the OBS path; in data export, Source Path is the HDFS path.
- Target Path: Target path of data. In data import, Target Path is the HDFS path; in data export, Target Path is the OBS path.
- Status: Status of the data import or export operation. Possible values: Running, Completed, Terminated, and Abnormal.
- Duration (min): Total time used by data import or export, in minutes.
- Result: Data import or export result. Possible values: Successful and Failed.
- Operation: View Log. You can click View Log to view log information of a job. For details, see Viewing Job Configurations and Logs.

4.6.11 Viewing the Alarm List

The alarm list provides information about all alarms in the MRS cluster. Examples of alarms include host faults, disk usage exceeding the threshold, and component abnormalities.

In Alarm on the MRS management console, you can only view basic information about alarms that are not cleared in MRS Manager. If you want to view alarm details or manage alarms, log in to MRS Manager. For details, see Alarm Management.

Alarms are listed in chronological order by default in the alarm list, with the most recent alarms displayed at the top.

Table 4-42 describes alarm parameters.


Table 4-42 Alarm parameters

- Severity: Alarm severity. Possible values: Critical, Major, Warning, and Minor.
- Service: Name of the service that reports the alarm.
- Description: Alarm description.
- Generated: Alarm generation time.

Table 4-43 Button description

- Severity drop-down list: Select an alarm severity to filter alarms.
  - All: displays all alarms.
  - Critical: displays Critical alarms.
  - Major: displays Major alarms.
  - Warning: displays Warning alarms.
  - Minor: displays Minor alarms.
- Refresh button: Click it to manually refresh the alarm list.

4.6.12 Configuring Message Notification

You can connect MRS to SMN to provide basic SMS notification and email notification functions in specific scenarios.

Scenario

On the MRS management console, you can enable or disable the notification service on the Alarm tab page of the cluster information page. The functions in the following scenarios can be implemented only after the required cluster function is enabled:

- After a user subscribes to the notification service, the MRS management plane notifies the user of the success or failure of cluster expansion, shrinking, termination, and auto scaling by email or SMS message.
- The management plane checks alarms about the MRS cluster and sends a notification to the user's tenants if an alarm is critical and affects service use.


- If any operation such as deletion, shutdown, specifications modification, restart, or OS update is performed on an ECS in a cluster, the MRS cluster malfunctions. The management plane notifies the user when it detects that the user's VM is in any of the preceding states.

Procedure

Creating a Topic

A topic is a specified event for message publication and notification subscription. It serves as a message sending channel, where publishers and subscribers can interact with each other.

1. Log in to the management console.
2. Choose Application > Simple Message Notification. The SMN console is displayed.
3. In the navigation pane, choose Topic Management > Topic. The Topic page is displayed.
4. On the Topic page, click Create Topic. The Create Topic dialog box is displayed.
5. In Topic Name, enter a topic name. In Display Name, enter description information.

Adding Subscriptions to a Topic

To deliver messages published to a topic to subscribers, you must add subscription endpoints to the topic. After you add subscription endpoints, SMN automatically sends a confirmation message to the subscribers. The subscribers must confirm the subscription within 48 hours so that they can receive notification messages. Otherwise, the confirmation message becomes invalid and needs to be sent again.

1. Log in to the management console.
2. Choose Application > Simple Message Notification. The SMN console is displayed.
3. In the left navigation pane, choose Topic Management > Topic. The Topic page is displayed.
4. In the topic list, select a topic to which you want to add a subscription. In the Operation column of the topic, click Add Subscription. The Add Subscription dialog box is displayed.
   The possible values of the Protocol parameter are SMS, Email, HTTP, HTTPS, FunctionStage, FunctionGraph, and DMS.
   Endpoints (the endpoint address for SMS, email, HTTP, and HTTPS) can be entered in batches. When you add endpoints in batches, each endpoint occupies one line, and you can enter a maximum of 10 endpoints.

5. Click OK.

The newly added subscription is displayed in the subscription list.

Sending Messages to Subscribers

1. Log in to the MRS management console.
2. Choose Active Cluster. Click the name of a running cluster to access the cluster information page.


3. Click the Alarm tab.

4. Click Configure Message Notification. The Configure Message Notification page is displayed.

5. Enable Message Notification and select a topic.

6. Click OK.

4.6.13 O&M Authorization

If you need technical support personnel to help you with troubleshooting, you can use the O&M authorization function to authorize technical support personnel to access your local host for fault location.

Procedure

1. Log in to the MRS management console.

2. Click the icon in the upper-left corner of the management console and select Region and Project.

3. In the navigation tree of the MRS management console, choose Clusters > Active Clusters, select a cluster, and click its name to switch to the cluster information page.

4. Click the O&M Management tab. The O&M Management page is displayed.

5. Click O&M Authorization to authorize technical support personnel to access your local host.

6. After troubleshooting, click Cancel Authorization to cancel the access permission for the technical support personnel.

4.6.14 Sharing Logs

If you need technical support personnel to help you with troubleshooting, you can use the log sharing function to provide logs in a specific period of time to technical support personnel for fault location.


Procedure

1. Log in to the MRS management console.

2. Click the icon in the upper-left corner of the management console and select Region and Project.

3. In the navigation tree of the MRS management console, choose Clusters > Active Clusters, select a cluster, and click its name to switch to the cluster information page.

4. Click the O&M Management tab. The O&M Management page is displayed.
5. Click Share Log. The Share Log page is displayed.
6. In Start Time and End Time, select the date and time.

NOTE

- Set Start Time and End Time according to technical support personnel's suggestions.
- End Time must be later than Start Time. Otherwise, logs cannot be filtered by time.

4.7 Managing Historical Clusters

You can query basic information and jobs of a cluster in the Terminated or Failed state.

4.7.1 Viewing Basic Information About a Historical Cluster

To view historical clusters, choose Cluster > Cluster History. Select a cluster and click its name to switch to the cluster information page. You can click Cluster Details to view the configurations, deployed nodes, and other basic information.

Table 4-44, Table 4-45, Table 4-46, Table 4-47, Table 4-48, and Table 4-49 describe the information about cluster configurations and nodes.

Table 4-44 Cluster configuration information

- Cluster Name: Cluster name. This parameter is set when a cluster is created.
- Cluster Status: Status of a cluster.
- Cluster Details: Cluster details include basic information, node information, and component information. You can click Cluster Details to hide the information.

Table 4-45 Basic information

- Cluster Version: MRS version. Currently, MRS 1.5.1, MRS 1.6.3, MRS 1.7.1, and MRS 1.7.2 are supported. This parameter is set when a cluster is created.
- Cluster Type: MRS 1.3.0 or later provides two types of clusters. This parameter is set when a cluster is created.
  - Analysis cluster: used for offline data analysis and provides Hadoop components.
  - Streaming cluster: used for streaming tasks and provides stream processing components.
- Cluster ID: Unique identifier of a cluster. This parameter is automatically assigned when a cluster is created.
- Created: Time when MRS starts charging MRS clusters of the customer.
- Billing Mode: Billing mode of a cluster. Currently, Yearly/Monthly and On-demand are supported.
- AZ: AZ of the region in the cluster. This parameter is set when a cluster is created. AZs are associated with each region as follows:
  - CN North-Beijing1: AZ1 and AZ2
  - CN East-Shanghai2: AZ1, AZ2, and AZ3
  - CN South-Guangzhou: AZ1, AZ2, and AZ3
- VPC: VPC information. This parameter is set when a cluster is created. A VPC is a secure, isolated, and logical network environment.
- Subnet: Subnet information. This parameter is set when a cluster is created. A subnet provides dedicated network resources that are isolated from other networks, improving network security.
- ClusterManager IP Address: Floating IP address for accessing MRS Manager. This parameter is displayed only after Kerberos authentication is enabled.
- Key Pair: Key pair name. This parameter is set when a cluster is created. This parameter is not displayed if you set Login Mode to Password.

Table 4-46 Node information

- Master Node: Information about the Master node. Format: [ECS type-instance specification | node quantity]
- Core Node: Information about a Core node. Format: [ECS type-instance specification | node quantity]
- Task Node: Information about a Task node. Format: [ECS type-instance specification | node quantity]
- Active Master Node IP Address: IP address of the active Master node in a cluster, which is also the IP address of the active management node of MRS Manager.
- Task Node Auto Scaling: Auto scaling can automatically adjust computing resources based on customers' service requirements and preset policies. With auto scaling, the number of instances in MRS tasks can increase or decrease as the service load increases or decreases, ensuring smooth operation of services.

Table 4-47 Component information

- Hadoop Version: Hadoop version.
- Spark Version: Spark version. Only a Spark cluster displays this version. Because Spark and Hive must be used together, the Spark and Hive versions are displayed at the same time.
- HBase Version: HBase version. Only an HBase cluster displays this version.
- Hive Version: Hive version. Only a Hive cluster displays this version.
- Hue Version: Hue version. For MRS 1.3.0, this parameter is displayed only after Kerberos authentication is enabled. For MRS 1.5.0 or later versions, this parameter is displayed regardless of Kerberos authentication.
- Loader Version: Loader version. This parameter is displayed when the MRS version is MRS 1.5.0 or later.
- Kafka Version: Kafka version. Only a Kafka cluster displays this version.
- Storm Version: Storm version. Only a Storm cluster displays this version.
- Flume Version: Flume version. This parameter is displayed when the MRS version is MRS 1.5.0 or later.
- Kerberos Authentication: Indicates whether Kerberos authentication is enabled when logging in to MRS Manager.
- Logging: Indicates whether the tenant has enabled the log collection function.

Table 4-48 Node description

- Resize Cluster: For details about adding or deleting Core or Task nodes, see Expanding a Cluster or Shrinking a Cluster. This applies to MRS 1.6.0 or later. Resize Cluster is unavailable and capacity expansion or reduction is not allowed in any of the following situations:
  - The cluster is not in the running state.
  - The number of Core or Task nodes exceeds the maximum value (500).
  - The cluster billing mode is not on-demand.
- Add Node: For details about adding Core nodes to a cluster, see Expanding a Cluster. This applies to MRS 1.6.0 or earlier. Add Node is unavailable and capacity expansion is not allowed in any of the following situations:
  - The cluster is not in the running state.
  - The number of Core nodes exceeds the maximum value (100).
  - The cluster billing mode is not on-demand.
- Name: Name of a cluster node.
- Status: Status of a cluster node.
- Type: Node type.
  - Master: A Master node in an MRS cluster manages the cluster, assigns MapReduce executable files to Core nodes, traces the execution status of each job, and monitors DataNode running status.
  - Core: A Core node in a cluster processes data and stores processed data in HDFS.
  - Task: A Task node in a cluster is used for computing and does not store persistent data. Task nodes are optional, and the number of Task nodes can be zero. (Task nodes are supported by MRS 1.6.0 or later.)
- IP Address: IP address of a cluster node.
- Specifications: Instance specifications of a node, determined by the CPU, memory, and disks used.
  NOTE: More advanced instance specifications allow better data processing, although they have a higher cluster cost.
- Default Security Group: Security group name for Master and Core/Task nodes, automatically assigned when a cluster is created. This is the default security group. Do not modify or delete it, because doing so will cause a cluster exception.

Table 4-49 Button description

- Refresh button: Click it to manually refresh the node list.

4.7.2 Viewing Job Configurations in a Historical Cluster

On the Cluster History page, users can query only clusters in the Failed or Terminated state and their job information.

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Cluster History, select a cluster, and click its name to switch to the cluster information page.

Step 3 Select Job Management.

Step 4 In the Operation column corresponding to the selected job, click View.


The View Job Information window that is displayed shows the configuration of the selected job.

----End

4.8 Managing Jobs

You can query, add, and delete MRS jobs on the Job Management tab page only after Kerberos Authentication is set to Disable.

4.8.1 Introduction to Jobs

A job is an executable program provided by MRS to process and analyze user data. All added jobs are displayed in Job Management, where you can add, query, and manage jobs.

Job Types

An MRS cluster allows you to create and manage the following jobs:

- MapReduce: provides the capability to process massive data quickly and in parallel. It is a distributed data processing mode and execution environment. MRS supports the submission of MapReduce JAR programs.
- Spark: functions as a memory-based distributed computing framework. MRS supports the submission of Spark, Spark Script, and Spark SQL jobs.
  - Spark: submits a Spark program, executes the Spark application, and computes and processes user data.
  - Spark Script: submits a Spark Script script and executes Spark SQL statements in batches.
  - Spark SQL: uses Spark SQL statements (similar to SQL statements) to query and analyze user data in real time.
- Hive: functions as an open-source data warehouse constructed on Hadoop. MRS supports the submission of Hive Script scripts and executes HiveQL statements in batches.

If you fail to create a job in a Running cluster, check the component health status on the cluster management page. For details, see Viewing the System Overview.

Job List

Jobs are listed in chronological order by default in the job list, with the most recent jobs displayed at the top. Table 4-50 describes the parameters of the job list.

Table 4-50 Parameters of the job list

- Name: Job name. This parameter is set when a job is added.
- ID: Unique identifier of a job. This parameter is automatically assigned when a job is added.
- Type: Job type. Possible types include:
  - Distcp (data import and export)
  - MapReduce
  - Spark
  - Spark Script
  - Spark SQL
  - Hive Script
  NOTE: After you import or export data on the File Management page, you can view the Distcp job on the Job Management page.
- Status: Job status. Possible values: Running, Completed, Terminated, and Abnormal.
  NOTE: By default, each cluster supports a maximum of 10 running jobs.
- Result: Execution result of a job. Possible values: Successful and Failed.
  NOTE: You cannot execute a successful or failed job, but you can add or copy the job. After setting job parameters, you can submit the job again.
- Created: Time when a job starts.
- Duration (min): Duration of executing a job, from the time the job is started to the time the job is completed or stopped, in minutes.


- Operation:
  - View Log: You can click View Log to view log information of a job. For details, see Viewing Job Configurations and Logs.
  - View: You can click View to view job details. For details, see Viewing Job Configurations and Logs.
  - More:
    - Stop: You can click Stop to stop a running job. For details, see Stopping Jobs.
    - Copy: You can click Copy to copy and add a job. For details, see Replicating Jobs.
    - Delete: You can click Delete to delete a job. For details, see Deleting Jobs.
  NOTE:
  - Spark SQL jobs cannot be stopped.
  - Deleted jobs cannot be recovered. Therefore, exercise caution when deleting a job.
  - If you configure the system to save job logs to an HDFS or OBS path, the system compresses the logs and saves them to the specified path after job execution is complete. In this case, the job remains in the Running state after execution is complete and changes to the Completed state after the logs are successfully saved. The time required for saving the logs depends on the log size and generally takes a few minutes.

Table 4-51 Button description

- Job state drop-down list: Select a job state to filter jobs.
  - All (Num): displays all jobs.
  - Completed (Num): displays jobs in the Completed state.
  - Running (Num): displays jobs in the Running state.
  - Terminated (Num): displays jobs in the Terminated state.
  - Abnormal (Num): displays jobs in the Abnormal state.
- Search: Enter a job name in the search bar and click the search button to search for a job.
- Refresh button: Click it to manually refresh the job list.

4.8.2 Adding a Jar or Script Job

You can submit developed programs to MRS, execute them, and obtain the execution result. This section describes how to create a job.


Prerequisites

You have completed the procedure described in Background.

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Job Management and go to the Job Management tab page.

Step 4 On the Job tab page, click Create and go to the Create Job page.

Table 4-52 describes job configuration information.

Table 4-52 Job configuration information

- Type: Job type. Possible types include:
  - MapReduce
  - Spark
  - Spark Script
  - Hive Script
  NOTE: To add jobs of the Spark and Hive types, you need to select the Spark and Hive components when creating a cluster, and the cluster must be in the running state. Spark Script jobs support Spark SQL only, and Spark supports Spark Core and Spark SQL.
- Name: Job name. This parameter consists of 1 to 64 characters, including letters, digits, hyphens (-), or underscores (_). It cannot be null.
  NOTE: Identical job names are allowed but not recommended.
- Program Path: Address of the JAR file of the program for executing jobs.
  NOTE: When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
  This parameter cannot be null and must meet the following requirements:
  - A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. The address cannot be empty or full of spaces.
  - The path varies depending on the file system:
    - OBS: The path must start with s3a://, for example, s3a://wordcount/program/hadoop-mapreduce-examples-2.7.x.jar.
    - HDFS: The path must start with /user.
  - Spark Script must end with .sql; MapReduce and Spark must end with .jar. sql and jar are case-insensitive.
- Parameters: Key parameters for executing jobs. This parameter is assigned by an internal function. MRS is only responsible for inputting the parameter. Separate parameters with spaces.
  Format: package name.class name
  A maximum of 2047 characters are allowed, but special characters (;|&>',<$) are not allowed. This parameter can be empty.
  NOTE: When you enter parameters containing sensitive information, for example, a password for login, you can add an at sign (@) before the parameters to encrypt the parameter values and prevent persistence of sensitive information in the form of plaintext. When you view job information on the MRS management console, sensitive information is displayed as asterisks (*).
  Example: username=admin @password=admin_123
- Import From: Address for inputting data.
  NOTE: When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
  The path varies depending on the file system:
  - OBS: The path must start with s3a://.
  - HDFS: The path must start with /user.
  A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.
- Export To: Address for outputting data.
  NOTE: When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
  The path varies depending on the file system:
  - OBS: The path must start with s3a://.
  - HDFS: The path must start with /user.
  A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.
- Log path: Address for storing job logs that record the job running status.
  NOTE: When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
  The path varies depending on the file system:
  - OBS: The path must start with s3a://.
  - HDFS: The path must start with /user.
  A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

NOTE

- The OBS path supports s3a://, and s3a:// is used by default.
- Files and programs encrypted by the KMS cannot be imported if the OBS path is used.
- The full path of HDFS and OBS contains a maximum of 1023 characters.

Step 5 Confirm job configuration information and click OK.

After jobs are added, you can manage them.

NOTE

By default, each cluster supports a maximum of 10 running jobs.

----End

4.8.3 Submitting a Spark SQL Statement

This section describes how to use Spark SQL. You can submit a Spark SQL statement to query and analyze data on the MRS management console page. To submit multiple statements, separate them from each other using semicolons (;).

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.


Step 3 Click Job Management and go to the Job Management tab page.

Step 4 Select Spark SQL. The Spark SQL job page is displayed.

Step 5 Enter the Spark SQL statement for table creation.

When entering Spark SQL statements, ensure that they have no more than 10,000 characters.

Syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path];

Use either of the following methods to create a table:

- Method 1: Create an src_data table and write data in every row. The data is stored in the /user/guest/input directory.
  create external table src_data(line string) row format delimited fields terminated by '\\n' stored as textfile location '/user/guest/input/';
- Method 2: Create an src_data1 table and load data into it.
  create table src_data1 (eid int, name String, salary String, destination String) row format delimited fields terminated by ',';
  load data inpath '/tttt/test.txt' into table src_data1;

NOTE

Data from OBS cannot be loaded to tables created using Method 2.

Step 6 Enter the Spark SQL statement for table query.

Syntax:

SELECT col_name FROM table_name;

Example:

select * from src_data;

Step 7 Enter the Spark SQL statement for table deletion.

Syntax:

DROP TABLE [IF EXISTS] table_name;

Example:

drop table src_data;

Step 8 Click Check to check the statement correctness.

Step 9 Click Submit.

After submitting Spark SQL statements, you can check whether the execution is successful in Last Execution Result and view detailed execution results in Last Query Result Set.

----End


4.8.4 Viewing Job Configurations and Logs

This section describes how to view job configurations and logs.

Background

- You can view configurations of all jobs.
- For clusters created on an MRS version earlier than 1.0.7, logs of completed jobs in the clusters cannot be viewed. For clusters created on MRS 1.0.7 or later, logs of all jobs can be viewed.

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Job Management.

Step 4 In the Operation column corresponding to the selected job, click View.

The View Job Information window that is displayed shows the configuration of the selected job.

Step 5 Select a MapReduce job, and click View Log in the Operation column corresponding to the selected job.

In the page that is displayed, log information of the selected job is displayed.

The MapReduce job is only an example. You can view log information about MapReduce, Spark, Spark Script, and Hive Script jobs regardless of their status.

Each tenant can submit 10 jobs and query 10 logs concurrently.

----End

4.8.5 Stopping Jobs

This section describes how to stop running MRS jobs.

Background

Spark SQL jobs cannot be stopped. After a job is stopped, its status changes to Terminated, and it cannot be executed again.

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Job Management.

Step 4 Select a running job and choose More > Stop in the Operation column corresponding to the selected job.


The job status changes from Running to Terminated.

NOTE

When you submit a job on the Spark SQL page, you can click Cancel to stop the job.

----End

4.8.6 Replicating Jobs

This section describes how to replicate MRS jobs.

Background

Currently, all types of jobs except for Spark SQL and Distcp jobs can be replicated.

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Job Management.

Step 4 In the Operation column corresponding to the to-be-replicated job, choose More > Copy.

The Copy Job dialog box is displayed.

Step 5 Set job parameters, and click OK.

The job configuration parameters are the same as those described in Table 4-52 in Adding a Jar or Script Job.

After being successfully submitted, a job changes to the Running state by default. You do not need to manually execute the job.


----End

4.8.7 Deleting Jobs

This section describes how to delete MRS jobs.

Background

Jobs can be deleted one by one or in batches. The deletion operation is irreversible. Exercise caution when performing this operation.

Procedure

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 Click Job Management.

Step 4 In the Operation column corresponding to the selected job, choose More > Delete.

This operation deletes only a single job.

Step 5 You can select multiple jobs and click Delete on the upper left of the job list.

This operation deletes multiple jobs at a time.

----End


4.9 Querying Operation Logs

The Operation Log page records cluster and job operations. Logs are typically used to quickly locate faults in case of cluster exceptions, helping you resolve problems.

Operation Types

Currently, two types of operations are recorded in the logs only after Kerberos authentication is disabled. You can filter and search for a desired type of operations.

- Cluster: creating, terminating, shrinking, and expanding a cluster
- Job: creating, stopping, and deleting a job

Log Parameters

Logs are listed in chronological order by default in the log list, with the most recent logs displayed at the top.

Table 4-54 describes parameters in logs.

Table 4-54 Description of parameters in logs

- Operation Type: Operation type. Possible types: Cluster and Job.
- IP Address: IP address where an operation is executed.
  NOTE: If MRS cluster deployment fails, the cluster is automatically terminated, and the operation log of the terminated cluster does not contain the user's IP Address information.
- Operation Details: Operation content. The content can contain a maximum of 2048 characters.
- Operation Time: Operation time. For terminated clusters, only those terminated within the last six months are displayed. If you want to view clusters terminated six months ago, contact technical support engineers.


Table 4-55 Button description

- Operation type drop-down list: Select an operation type to filter logs.
  - All: displays all logs.
  - Cluster: displays logs of Cluster operations.
  - Job: displays logs of Job operations.
- Time filter: Filter logs by time.
  1. Click the button.
  2. Specify the date and time.
  3. Click OK.
  Enter the query start time in the left box and the end time in the right box. The end time must be later than the start time; otherwise, logs cannot be filtered by time.
- Search: Enter keywords in Operation Details and click the search button to search for logs.
- Refresh button: Click it to manually refresh the log list.

4.10 Managing Cluster Tags

Tags are used to identify clusters. Adding tags to clusters can help you identify and manage your cluster resources.

You can add a maximum of 10 tags to a cluster when creating the cluster or add them on the details page of the created cluster.

A tag consists of a tag key and a tag value. Table 4-56 provides tag key and value requirements.

Table 4-56 Tag key and value requirements

- Key (example: Organization):
  - A tag key cannot be left blank.
  - A tag key must be unique in a cluster.
  - A tag key contains a maximum of 36 characters.
  - A tag key cannot contain special characters (=*<>\,|/) or start or end with spaces.
- Value (example: Apache):
  - A tag value contains a maximum of 43 characters.
  - A tag value cannot contain special characters (=*<>\,|/) or start or end with spaces.
  - This parameter can be left blank.

Adding Tags to a Cluster

You can perform the following operations to add tags to a cluster when creating the cluster.

1. Log in to the management console.
2. Choose EI Enterprise Intelligence > MapReduce Service.
3. Click Create Cluster. The Create Cluster page is displayed.
4. On the Advanced Settings tab page, select Set now. Enter the key and value of a tag to be added.
5. Click Add Tag.
   You can add a maximum of 10 tags to a cluster and use intersections of tags to search for the target cluster.

Searching for the Target Cluster

On the Active Cluster page, search for the target cluster by tag key or tag value.

1. Log in to the management console.
2. Choose EI Enterprise Intelligence > MapReduce Service.
3. In the upper right corner of the Active Cluster page, click Search by tag to access the search page.
4. Enter the tag of the cluster to be searched.
   You can select a tag key or tag value from their drop-down lists. When the tag key or tag value is exactly matched, the system can automatically locate the target cluster. If you enter multiple tags, their intersections are used to search for the cluster.
5. Click Search. The system searches for the target cluster by tag key or value.

MapReduce ServiceUser Guide 4 Cluster Operation Guide

Issue 01 (2018-09-06) 117

Page 128: User Guide · MapReduce Service

Managing Tags

You can view, add, modify, and delete tags on the Tag tab page of the cluster.

1. Log in to the management console.
2. Choose EI Enterprise Intelligence > MapReduce Service.
3. On the Active Cluster page, click the name of a cluster for which you want to manage tags. The cluster details page is displayed.
4. Click the Tag tab and view, add, modify, and delete tags on the tab page.
   - View: On the Tag tab page, you can view details about tags of the cluster, including the number of tags and the key and value of each tag.
   - Add: Click Add Tag in the upper left corner. In the displayed Add Tag dialog box, enter the key and value of the tag to be added, and click OK.
   - Modify: In the Operation column of the tag, click Edit. On the displayed Edit Tag page, enter the new tag key and value and click OK.
   - Delete: In the Operation column of the tag, click Delete. After confirmation, click OK on the displayed Delete Tag page.

NOTE

MRS cluster tag updates will be synchronized to every ECS in the cluster. You are advised not to modify ECS tags on the ECS console to prevent inconsistency between ECS tags and MRS cluster tags. If the number of tags of an ECS in the MRS cluster reaches the upper limit, you cannot create any tag for the MRS cluster.

4.11 Bootstrap Actions

4.11.1 Introduction to Bootstrap Actions

Bootstrap actions let you run your own scripts on specified cluster nodes before or after the big data components are started. You can run bootstrap actions to install third-party software, modify the cluster running environment, and perform other customizations. If you choose to run bootstrap actions when expanding a cluster, the bootstrap actions are run on the newly added nodes in the same way.

MRS runs the script you specify as user root. You can run the su - XXX command in the script to switch the user.

NOTE

The bootstrap action scripts must be executed as user root. Improper use of the scripts may affect cluster availability. Therefore, exercise caution when performing this operation.

MRS determines the result based on the return code after the execution of the bootstrap action script. If the return code is 0, the script is executed successfully; if the return code is not 0, the execution fails. If a bootstrap action script fails to be executed on a node, the corresponding bootstrap action fails. In this case, you can set Action upon Failure to choose whether to continue to execute the subsequent scripts. Example 1: If a script fails to be executed and Action upon Failure is set to Stop, subsequent scripts are not executed and cluster creation or capacity expansion fails. Example 2: If you set Action upon Failure to Continue for all scripts during cluster creation, all the scripts are executed regardless of whether they succeed or fail, and the startup process completes.

You can add a maximum of 18 bootstrap actions, which will be executed before or after the cluster components are started, in the order you specify. The bootstrap actions performed before or after component startup must be completed within 60 minutes. Otherwise, the cluster creation or capacity expansion will fail.
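Because MRS judges a bootstrap action only by its return code, ending the script with an explicit exit status makes failures unambiguous. The following is a minimal sketch of such a script; the package name is an illustrative placeholder (not an MRS component), and it assumes a yum-based node image:

#!/bin/bash
# Hypothetical example: install a third-party tool on each node.
# Exit 0 so that MRS treats the bootstrap action as successful; any
# non-zero code marks it as failed and triggers Action upon Failure.
if yum install -y <your-tool>; then
    exit 0
else
    exit 1
fi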

4.11.2 Preparing the Bootstrap Action Script

Currently, bootstrap actions support Linux shell scripts only. Script files must end with .sh.

Uploading the Installation Packages and Files to an OBS Bucket

Before compiling a script, you need to upload all required installation packages, configuration packages, and relevant files to an OBS bucket in the same region. Because the networks of different regions are isolated from each other, MRS VMs cannot download OBS files from other regions. For example, MRS VMs in the CN North-Beijing1 region cannot download files from OBS buckets in the CN East-Shanghai2 region.

How to Specify a Script to Download Files from the OBS Bucket

You can specify in the script the files to be downloaded from OBS. If you upload files to a private bucket, you need to run the hadoop fs command to download them. The following example downloads the s3a://yourbucket/myfile.tar.gz file to the local host and decompresses it to the /your-dir directory.

#!/bin/bash
source /opt/client/bigdata_env
hadoop fs -Dfs.s3a.endpoint=<obs-endpoint> -Dfs.s3a.access.key=<your-ak> -Dfs.s3a.secret.key=<your-sk> -copyToLocal s3a://yourbucket/myfile.tar.gz ./
mkdir -p /<your-dir>
tar -zxvf myfile.tar.gz -C /<your-dir>

NOTE

- The Hadoop client has been preinstalled on each MRS node. You can run the hadoop fs command to download data from or upload data to OBS.

- Regions and Endpoints lists the obs-endpoint of each region.

- The installation packages used in Sample Scripts have been uploaded to a publicly readable OBS bucket. Therefore, the sample scripts can run the curl command to download the installation packages, as sketched below.
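For example, a minimal sketch of such a download with curl, where the URL and package name are placeholders for the actual address of a publicly readable OBS bucket:

#!/bin/bash
# Download an installation package from a publicly readable OBS bucket.
# https://<public-bucket-url>/mypackage.tar.gz is a placeholder address.
curl -o /tmp/mypackage.tar.gz https://<public-bucket-url>/mypackage.tar.gz
tar -zxvf /tmp/mypackage.tar.gz -C /opt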

Uploading the Script to an OBS Bucket

After script compilation, upload the script to the OBS bucket in the same region. At the time you specify, each node in the cluster downloads the script from OBS and executes it as user root.

4.11.3 Adding a Bootstrap Action

1. Log in to the management console.


2. Choose EI Enterprise Intelligence > MapReduce Service.
3. Click Create Cluster. The Create Cluster page is displayed.
4. On the Advanced Settings tab page, select Configure Now. The Bootstrap Action tab page is displayed.
5. On the Bootstrap Action tab page, click Add Now.
6. Click Add. The page shown in the following figure is displayed.

Figure 4-6 Adding a bootstrap action

Table 4-57 Parameters

Parameter Description

Name: Name of a bootstrap action script.
The value can contain only digits, letters, spaces, hyphens (-), and underscores (_) and must not start with a space. The value can contain 1 to 64 characters.
NOTE: A name must be unique in the same cluster. You can set the same name for different clusters.

Script Path: Script path. The value can be an OBS bucket path or a local VM path.
- An OBS bucket path must start with s3a:// and end with .sh. For example, the path of the sample script for installing Zeppelin is s3a://mrs-samples-bootstrap-cn-north-1/zeppelin/zeppelin_install.sh.
- A local VM path must start with a slash (/) and end with .sh.


Execution Node: Select the type of node on which the bootstrap action script is executed.
NOTE:
- If you select Master, you can choose whether to run the script only on the active Master node by enabling or disabling the switch.
- If you enable the switch, the script runs only on the active Master node; if you disable it, the script runs on all Master nodes. This switch is disabled by default.

Parameters: Bootstrap action script parameters.

Execution Time: Select the time when the bootstrap action script is executed. Currently, two options are available: Before component start and After component start.

Action upon Failure: Indicates whether to continue executing subsequent scripts and creating the cluster after the script fails to be executed.
NOTE: You are advised to set this parameter to Continue in the debugging phase so that the cluster can continue to be installed and started regardless of whether the bootstrap action succeeds.

7. Click OK. The following information is displayed.

Figure 4-7 Bootstrap action information

NOTE

After the bootstrap action is successfully added, you can edit or delete it in the Operation column.

4.11.4 Viewing Execution Records

You can view the execution result of a bootstrap action on the Bootstrap Action tab page of the cluster details page.

Viewing the Execution Result

1. Log in to the management console.
2. Choose EI Enterprise Intelligence > MapReduce Service.
3. In the left navigation pane, choose Clusters > Active Clusters. Click the cluster you want to query.
   The cluster details page is displayed.


4. On the cluster details page, click the Bootstrap Action tab. Information about the bootstrap actions added during cluster creation is displayed, as shown in the following figure.

Figure 4-8 Bootstrap action information

NOTE

- You can select Before component start or After component start in the upper right corner to query information about the related bootstrap actions.

- The last execution result is listed here. For a newly created cluster, the records of bootstrap actions executed during cluster creation are listed. If a cluster is expanded, the records of bootstrap actions executed on the newly added nodes are listed.

Viewing Execution Logs

You can log in to each node to view the run logs stored in /var/log/Bootstrap. If you add bootstrap actions before and after component start, you can distinguish the bootstrap action logs of the two phases based on the timestamps.

If the ECS where bootstrap actions are performed has been reclaimed, you need to log in to OBS to view the logs. MRS collects the logs of each node to the OBS bucket you specified when creating the cluster.

You are advised to print detailed logs in the script so that you can view the detailed run result. MRS redirects the standard output and error output of the script to the log directory of the bootstrap action.
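For example, a minimal sketch of this logging pattern, in which the setup command is a placeholder:

#!/bin/bash
# Timestamped log lines are written to standard output and error output,
# which MRS redirects to the bootstrap action log directory, so they can
# later be found under /var/log/Bootstrap on the node.
log() { echo "$(date '+%Y-%m-%d %H:%M:%S') $1"; }

log "[INFO] Starting custom setup"
if ! /opt/setup/do_setup.sh; then    # placeholder for your own command
    log "[ERROR] Custom setup failed" >&2
    exit 1
fi
log "[INFO] Custom setup finished"
exit 0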

4.11.5 Sample Scripts

Zeppelin

Zeppelin is a web-based notebook that supports interactive data analysis. For more information, visit the Zeppelin official website at http://zeppelin.apache.org/.

This sample script is used to automatically install Zeppelin. Select the corresponding script path based on the region where the cluster is to be created. Enter the script path in Script Path on the Bootstrap Action page when adding a bootstrap action during cluster creation. You do not need to enter parameters for this script. In typical Zeppelin usage, you only need to run the script on the active Master node.

- CN North-Beijing1: s3a://mrs-samples-bootstrap-cn-north-1/zeppelin/zeppelin_install.sh

- CN East-Shanghai2: s3a://mrs-samples-bootstrap-cn-east-2/zeppelin/zeppelin_install.sh

- CN South-Guangzhou: s3a://mrs-samples-bootstrap-cn-south-1/zeppelin/zeppelin_install.sh


After the bootstrap action is complete, use either of the following methods to verify that Zeppelin is correctly installed.

Method 1: Log in to the active Master node as user root and run /home/apache/zeppelin-0.7.3-bin-all/bin/zeppelin-daemon.sh status. If the message "Zeppelin is running [ OK ]" is displayed, the installation is successful.

Method 2: Start a Windows ECS in the same VPC. Access port 7510 of the active Master node in the cluster. If the Zeppelin page is displayed, the installation is successful.
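For reference, the check in Method 1 can be run as follows on the active Master node, using the installation path given above:

/home/apache/zeppelin-0.7.3-bin-all/bin/zeppelin-daemon.sh status
# Expected output: Zeppelin is running                          [ OK ]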

Presto

Presto is an open-source distributed SQL query engine, which is applicable to interactive analysis and query. For more information, visit the official website at http://prestodb.io/.

The sample script can be used to automatically install Presto. The script path is as follows:

- CN North-Beijing1: s3a://mrs-samples-bootstrap-cn-north-1/presto/presto_install.sh

- CN East-Shanghai2: s3a://mrs-samples-bootstrap-cn-east-2/presto/presto_install.sh

- CN South-Guangzhou: s3a://mrs-samples-bootstrap-cn-south-1/presto/presto_install.sh

In typical Presto usage, you are advised to install dualroles on the active Master node and worker on the Core nodes. You are advised to add the bootstrap action scripts and configure the parameters as follows:

Table 4-58 Bootstrap action script parameters

Script 1
Name: install dualroles
Script Path: Select the path of the presto_install.sh script based on the region.
Execution Node: Active Master
Parameters: dualroles
Execution Time: After component start
Action upon Failure: Continue

Script 2
Name: install worker
Script Path: Select the path of the presto_install.sh script based on the region.
Execution Node: Core
Parameters: worker
Execution Time: After component start
Action upon Failure: Continue

After the bootstrap action is complete, you can start a Windows ECS in the same VPC as the cluster and access port 7520 of the active Master node to view the Presto web page.

You can also log in to the active Master node to try Presto and run the following commands as user root:


Command for loading the environment variables:

# source /opt/client/bigdata_env

Command for viewing the process status:

# /home/apache/presto/presto-server-0.201/bin/launcher status

Command for connecting to Presto and performing operations:

# /home/apache/presto/presto-server-0.201/bin/presto --server localhost:7520 --catalog tpch --schema sf100

presto:sf100> select * from nation;

presto:sf100> select count(*) from customer;

Superset

Superset is a modern, enterprise-level, web-based BI tool. For more information, visit the Superset official website at https://superset.incubator.apache.org/.

This sample script is used to automatically install Superset. Select the corresponding script path based on the region where the cluster is to be created. Enter the script path in Script Path on the Bootstrap Action page when adding a bootstrap action during cluster creation. You do not need to enter parameters for this script. In typical Superset usage, you only need to run the script on the active Master node.

- CN North-Beijing1: s3a://mrs-samples-bootstrap-cn-north-1/superset/superset_install.sh

- CN East-Shanghai2: s3a://mrs-samples-bootstrap-cn-east-2/superset/superset_install.sh

- CN South-Guangzhou: s3a://mrs-samples-bootstrap-cn-south-1/superset/superset_install.sh

After the bootstrap action is complete, use either of the following methods to verify that Superset is correctly installed.

Method 1: Remotely log in to the active Master node as user root and run the lsof -i:38088 command. If the command output contains LISTEN, the installation is successful.

Method 2: Start a Windows ECS in the same VPC. Access port 38088 of the active Master node in the cluster. If the Superset page is displayed, the installation is successful.
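For reference, the check in Method 1 can be run as follows on the active Master node:

lsof -i:38088
# Output containing LISTEN indicates that Superset is listening on port 38088.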


5 Remote Operation Guide

5.1 Overview

This section describes remote login, MRS cluster node types, and node functions.

MRS cluster nodes support remote login. The following remote login methods are available:

- GUI login: Use the remote login function provided by the ECS management console to log in to the Linux interface of the Master node.

- SSH login: Applies to Linux ECSs only. You can use a remote login tool (such as PuTTY) to log in to an ECS. To use this method, you must assign an elastic IP address (EIP) to the ECS. For details about applying for and binding an EIP for Master nodes, see "Assigning an EIP and Binding It to an ECS" under the "Management" section in the VPC User Guide.

NOTICE

If you use a key pair to log in to nodes in a cluster of a version earlier than MRS 1.6.2, you need to log in to them as user linux. For details, see Logging In to a Linux ECS Using a Key Pair (SSH).
If you use a key pair to log in to nodes in a cluster of MRS 1.6.2 or later, you need to log in to them as user root. For details, see Logging In to a Linux ECS Using a Key Pair (SSH).
If you use a password to log in to nodes, follow the instructions in Logging In to a Linux ECS Using a Password (SSH). For nodes in a cluster of a version earlier than MRS 1.6.2, you cannot use a password to log in to a Linux ECS (SSH).
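For example, a hypothetical SSH login from a Linux client, assuming an EIP of 203.0.113.10 and a key file named mykey.pem (both placeholder values):

# Clusters of MRS 1.6.2 or later: log in as user root
ssh -i mykey.pem root@203.0.113.10
# Clusters of versions earlier than MRS 1.6.2: log in as user linux
ssh -i mykey.pem linux@203.0.113.10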

In an MRS cluster, a node is an ECS. Table 5-1 describes node types and functions.

MapReduce ServiceUser Guide 5 Remote Operation Guide

Issue 01 (2018-09-06) 125

Page 136: User Guide · MapReduce Service

Table 5-1 Cluster node types

Node Type Function

Master node: Management node of an MRS cluster. It manages and monitors the cluster.
In the navigation tree of the MRS management console, choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page. On the Node tab page, view the Name column. The node that contains master1 in its name is the Master1 node. The node that contains master2 in its name is the Master2 node.
You can log in to a Master node either using VNC on the ECS management console or using SSH. After logging in to the Master node, you can access Core nodes without entering passwords.
The system automatically deploys the Master nodes in active/standby mode and supports the high availability (HA) feature for MRS cluster management. If the active management node fails, the standby management node switches to the active state and takes over services.
To determine whether the Master1 node is the active management node, see Viewing Active and Standby Nodes.

Core node: Working node of an MRS cluster. It processes and analyzes data and stores the processed data on HDFS.

5.2 Logging In to a Master Node

This section describes how to log in to the Master nodes of a cluster using the GUI and SSH.

5.2.1 Logging In to an ECS Using VNC

This section describes how to log in to an ECS using VNC on the ECS management console. This login method is mainly used for emergency O&M. In other scenarios, it is recommended that you log in to the ECS using SSH.

Logging In to an ECS

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page.

Step 3 In Node, click a node name to log in to the ECS management console.

Step 4 In the upper right corner, click Remote Login.

Step 5 If the system prompts Press CTRL+ALT+DELETE to log on, click Send CtrlAltDel in the upper right part of the remote login operation interface for login.

MapReduce ServiceUser Guide 5 Remote Operation Guide

Issue 01 (2018-09-06) 126

Page 137: User Guide · MapReduce Service

Step 6 Enter the username and password for logging in to the Master node as prompted.

1. For clusters of versions earlier than MRS 1.6.2, the login mode supports key pairs only. For the initial login, use username linux and the default password cloud.1234. If you have changed the default password, use the new password. You are advised to change the password upon your initial login.

2. For clusters of MRS 1.6.2 or later, if you select Password in Login Mode during cluster creation, as shown in Figure 5-1, you need to enter root in Username and the password you set during cluster creation in Password.

Figure 5-1 Selecting password as the login mode

3. For clusters of MRS 1.6.2 or later, if you select Key Pair in Login Mode during cluster creation, perform the following operations for login.

a. After the cluster is created, bind an EIP to the Master node of the cluster. For details, see Virtual Private Cloud > User Guide > Network Components > Elastic IP Address > Assigning an EIP and Binding It to an ECS.

b. Use user root and the key file to log in to the Master node in SSH mode.

c. Run the passwd root command to set a password for user root.

d. Go back to the login interface, and enter root and the password set in Step 6.3.c to log in to the node.

Step 7 For details about remote login to an ECS, see "Logging In to an ECS Using VNC" in the ECS User Guide (Getting Started > Logging In to an ECS > Logging In to an ECS Using VNC).

----End


5.2.2 Logging In to a Linux ECS Using a Key Pair (SSH)

This section describes how to log in to a Linux ECS using a key pair.

For details about logging in to a Linux ECS using a key pair, see "Logging In to a Linux ECS Using a Key Pair (SSH)" in the ECS User Guide (Getting Started > Logging In to an ECS > Logging In to a Linux ECS Using a Key Pair (SSH)).

5.2.3 Logging In to a Linux ECS Using a Password (SSH)

This section describes how to log in to a Linux ECS using a password.

For details about logging in to a Linux ECS using a password, see "Logging In to a Linux ECS Using a Password (SSH)" in the ECS User Guide (Getting Started > Logging In to an ECS > Logging In to a Linux ECS Using a Password (SSH)).

5.3 Viewing Active and Standby Nodes

This section describes how to confirm the active and standby management nodes of MRS Manager on the Master1 node.

Background

You can log in to other nodes in a cluster from the Master node. After logging in to the Master node, you can confirm the active and standby management nodes of MRS Manager and run commands on the corresponding management nodes.

In active/standby mode, a switchover can be implemented between Master1 and Master2. For this reason, Master1 may not be the active management node of MRS Manager.

Procedure

Step 1 Confirm the Master node of an MRS cluster.

1. Log in to the MRS management console, choose Cluster > Active Cluster, select a running cluster, and click its name to switch to the cluster information page. View the basic information of the specified cluster.

2. On the Node tab page, view the Name column. The node that contains master1 in its name is the Master1 node. The node that contains master2 in its name is the Master2 node.

Step 2 Confirm the active and standby management nodes of MRS Manager.

1. Log in to the Master1 node as user linux. For details, see Logging In to an ECS Using VNC.
   The Master node supports Cloud-init. The preset username and password for Cloud-init are linux and cloud.1234, respectively. If you have changed the password, log in to the node using the new password. See "How Do I Log In to an ECS Once All Images Support Cloud-Init?" in the Elastic Cloud Server User Guide (FAQs > Login FAQs > How Do I Log In to an ECS Once All Images Support Cloud-Init?).

2. Run the following commands to switch to user omm:
   sudo su - root
   su - omm


3. Run the following command to confirm the active and standby management nodes:
   sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh | grep Actived
   In the following example output, node-master2-LJXDj is the name of the active management node.
   192-168-1-17 node-master2-LJXDj V100R001C01 2016-10-01 06:58:41 active normal Actived

NOTE

If the Master1 node to which you have logged in is the standby management node and you need to log in to the active management node, run the following command:

ssh Name of the active management node

For example, run the following command: ssh node-master2-LJXDj

----End

5.4 Client Management

5.4.1 Updating the Client

Scenario

An MRS cluster provides a client for users to connect to servers, query task results, and manage data. Before using the MRS client or modifying service configuration parameters and restarting the services on MRS Manager, users must prepare the client configuration file and update the client.

During cluster creation, the original client is installed and saved in the /opt/client directory on all nodes in the cluster by default. After the cluster is created, only the client on the Master nodes can be used directly; the client on the Core nodes must be updated before being used.

Procedure

Step 1 Log in to MRS Manager.

Step 2 Click Service, and click Download Client.

Set Client Type to Only configuration files, set Download Path to Server, and click OK to generate the client configuration file. The generated file is saved in the /tmp/MRS-client directory on the active management node by default. You can modify the file save path as required.

Step 3 On the MRS management console, click Active Cluster.

Step 4 In the cluster list, click the specified cluster name and view the Active Master Node IP Address.

Active Master Node IP Address is the IP address of the active Master node in a cluster, which is also the IP address of the active management node of MRS Manager.

Step 5 Locate the active management node based on the IP address and log in to the active management node as user linux using VNC. For details, see Logging In to an ECS Using VNC.


For clusters of versions earlier than MRS 1.6.2, the Master node supports Cloud-init. The preset username and password for Cloud-init are linux and cloud.1234, respectively. If you have changed the password, log in to the node using the new password. It is recommended that you change the password upon the first login.

For clusters of MRS 1.6.2 or later, the Master node supports Cloud-init. The preset username for Cloud-init is root and the password is the one you set during cluster creation.

Step 6 Run the following command to switch the user:

sudo su - omm

Step 7 Run the following command to go to the client directory:

cd /opt/client

Step 8 Run the following command to update the client configuration:

sh refreshConfig.sh Client installation directory Full path of the client configuration file package

For example:

sh refreshConfig.sh /opt/client /tmp/MRS-client/MRS_Services_Client.tar

If the following information is displayed, the configuration is updated successfully.

ReFresh components client config is complete.
Succeed to refresh components client config.

----End

Fully Updating the Original Client of the Active Master Node

Scenario

During cluster creation, the original client is installed and saved in the /opt/client directory on all nodes in the cluster by default.

- For an MRS cluster in non-security mode, you will use the pre-installed client on a Master node to submit a job on the management console page.

- You can also use the pre-installed client on the Master node to connect to a server, view task results, and manage data.

After installing a patch on the cluster, you need to update the client on the Master node to ensure that the functions of the built-in client are available.

Procedure

Step 1 Log in to MRS Manager.

Step 2 Click Service, and click Download Client.

Set Client Type to All client files, set Download Path to Server, and click OK to generate the client configuration file. The generated file is saved in the /tmp/MRS-client directory on the active management node by default. You can modify the file save path as required.

Step 3 On the MRS management console, click Active Cluster.

Step 4 In the cluster list, click the specified cluster name and view the Active Master Node IP Address.


Active Master Node IP Address is the IP address of the active Master node in a cluster, which is also the IP address of the active management node of MRS Manager.

Step 5 Locate the active management node based on the IP address and log in to the active management node as user linux using VNC. For details, see Logging In to an ECS Using VNC.

For clusters of versions earlier than MRS 1.6.2, the Master node supports Cloud-init. The preset username and password for Cloud-init are linux and cloud.1234, respectively. If you have changed the password, log in to the node using the new password. It is recommended that you change the password upon the first login.

For clusters of MRS 1.6.2 or later, the Master node supports Cloud-init. The preset username for Cloud-init is root and the password is the one you set during cluster creation.

Step 6 On the ECS, switch to user root and copy the installation package to the /opt directory.

sudo su - root

cp /tmp/MRS-client/MRS_Services_Client.tar /opt

Step 7 Run the following command in the /opt directory to decompress the package and obtain the verification file and the client configuration packages:

tar -xvf MRS_Services_Client.tar

Step 8 Run the following command to verify the file package:

sha256sum -c MRS_Services_ClientConfig.tar.sha256

The following information is displayed:

MRS_Services_ClientConfig.tar: OK

Step 9 Run the following command to decompress MRS_Services_ClientConfig.tar.

tar -xvf MRS_Services_ClientConfig.tar

Step 10 Run the following command to move the original client to the /opt/client_bak directory:

mv /opt/client /opt/client_bak

Step 11 Run the following command to install the client to a new directory. The client path must be /opt/client.

sh /opt/MRS_Services_ClientConfig/install.sh /opt/client

If the output contains the following information, the client is installed successfully.

Components client installation is complete.

Step 12 Run the following command to modify the user and user group of the /opt/client directory:

chown omm:wheel /opt/client -R

Step 13 Run the following command to configure the environment variables (the client in this procedure is installed in /opt/client):

source /opt/client/bigdata_env

Step 14 If Kerberos authentication is enabled for the cluster, run the following command to authenticate the user. If Kerberos authentication is not enabled for the cluster, skip this step.

kinit MRS cluster user


Example: kinit admin

Step 15 Run a client command of the component.

Run the following command to view the HDFS directory:

hdfs dfs -ls /

----End

Fully Updating the Original Client of the Standby Master Node

Step 1 Repeat Step 1 to Step 5 to log in to the standby Master node, and run the following command to switch to user omm:

sudo su - omm

Step 2 Run the following command on the standby Master node to copy the downloaded client package from the active Master node:

scp omm@master1_host_name:/home/linux/MRS_Services_Client.tar /tmp/MRS-client/

Step 3 Repeat Step 6 to Step 15 to update the client of the standby Master node.

----End

5.4.2 Using the Client on a Cluster Node

Scenario

After the client is updated, users can use the client on a Master node or a Core node in the cluster.

Prerequisites

The client has been updated on the active management node.

Procedure

- Using the client on a Master node:

a. On the active management node where the client is updated, that is, a Master node, run the sudo su - omm command to switch the user. Run the following command to go to the client directory:
   cd /opt/client

b. Run the following command to configure the environment variables:
   source bigdata_env

c. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step.
   kinit MRS cluster user
   For example, kinit admin.

d. Run a component client command.
   For example, run hdfs dfs -ls / to view files in the HDFS root directory.


- Using the client on a Core node:

a. Update the client on the active management node.

b. Locate the active management node based on the IP address and log in to the active management node as user linux using VNC. For details, see Logging In to an ECS Using VNC.

c. On the active management node, run the following command to switch the user:

sudo su - omm

d. On the MRS management console, view IP Address on the Node tab page of the specified cluster.

Record the IP address of the Core node on which the client is to be used.

e. On the active management node, run the following command to copy the packageto a Core node:

scp -p /tmp/MRS-client/MRS_Services_Client.tar IP address of the Core node:/opt/client

f. For clusters of versions earlier than MRS 1.6.2, log in to the Core node as user linux. For clusters of MRS 1.6.2 or later, log in to the Core node as user root. For details, see Logging In to an ECS Using VNC.

For clusters of versions earlier than MRS 1.6.2, the node supports Cloud-init. The preset username and password for Cloud-init are linux and cloud.1234, respectively. If you have changed the password, log in to the node using the new password. It is recommended that you change the password upon the first login.

For clusters of MRS 1.6.2 or later, the node supports Cloud-init. The preset username for Cloud-init is root and the password is the one you set during cluster creation.

g. On the Core node, run the following command to switch the user:

sudo su - omm

h. Run the following command to update the client configuration:

sh /opt/client/refreshConfig.sh Client installation directory Full path of the client configuration file package

For example:

sh /opt/client/refreshConfig.sh /opt/client /opt/client/MRS_Services_Client.tar

i. Run the following commands to go to the client directory and configure the environment variables:

cd /opt/client

source bigdata_env

j. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step.

kinit MRS cluster user

For example, kinit admin.

k. Run a component client command.

For example, run hdfs dfs -ls / to view files in the HDFS root directory.


5.4.3 Using the Client on Another Node of a VPC

Scenario

After the client is prepared, users can use the client on a node outside the MRS cluster.

NOTE

If the client has been installed on the node outside the MRS cluster but must be updated, update the client using the same account that was used to install the client, for example, the root account.

Prerequisites

- An ECS has been prepared. For details about the OS and version of the ECS, see Table 5-2.

Table 5-2 Reference list

SuSE:
- Recommended: SUSE Linux Enterprise Server 11 SP4 (SUSE 11.4)
- Available: SUSE Linux Enterprise Server 11 SP3 (SUSE 11.3)
- Available: SUSE Linux Enterprise Server 11 SP1 (SUSE 11.1)
- Available: SUSE Linux Enterprise Server 11 SP2 (SUSE 11.2)

RedHat:
- Recommended: Red Hat-6.6-x86_64 (Red Hat 6.6)
- Available: Red Hat-6.4-x86_64 (Red Hat 6.4)
- Available: Red Hat-6.5-x86_64 (Red Hat 6.5)
- Available: Red Hat-6.7-x86_64 (Red Hat 6.7)

CentOS:
- Available: CentOS-6.4 (CentOS 6.4)
- Available: CentOS-6.5 (CentOS 6.5)
- Available: CentOS-6.6 (CentOS 6.6)
- Available: CentOS-6.7 (CentOS 6.7)
- Available: CentOS-7.2 (CentOS 7.2)

For example, a user can select the image CentOS 7.2 64bit (40 GB) to prepare the OS for an ECS. In addition, sufficient disk space must be allocated for the ECS, for example, 40 GB.

- The ECS and the MRS cluster are in the same VPC.

- The IP address configured for the NIC of the ECS is in the same network segment as the IP address of the MRS cluster.

- The security group of the ECS is the same as that of the Master node of the MRS cluster. If this requirement is not met, modify the ECS security group or configure the inbound and outbound rules of the ECS security group to allow access from all security groups of the MRS cluster nodes.


For details about how to create an ECS that meets these requirements, see "Creating an ECS" under the "Getting Started" chapter in the Elastic Cloud Server User Guide.

- To enable users to log in to a Linux ECS using a password (SSH), see "Logging In to a Linux ECS Using a Password (SSH)" in the Elastic Cloud Server User Guide (Getting Started > Logging In to an ECS > Logging In to a Linux ECS Using a Password (SSH)).

Procedure

Step 1 Create an ECS that meets the requirements in the prerequisites.

Step 2 Log in to MRS Manager.

Step 3 Click Service, and click Download Client.

Step 4 In Client Type, select All client files.

Step 5 In Download Path, select Remote host.

Step 6 Set Host IP Address to the IP address of the ECS, set Host Port to 22, and set Save Path to /home/linux.
- If the default port 22 for logging in to an ECS using SSH has been changed, set Host Port to the new port.
- Save Path can contain a maximum of 256 characters.

Step 7 For clusters of versions earlier than MRS 1.6.2, set Login User to linux. For clusters of MRS 1.6.2 or later, set Login User to root.

If other users are used, ensure that the users have read, write, and execute permissions on the save path.

Step 8 In SSH Private Key, select and upload the private key used for creating the ECS.

Step 9 Click OK to start downloading the client to the ECS.

If the following information is displayed, the client package is successfully saved. Click Close.

Client files downloaded to the remote host successfully.

NOTE

Generating a client occupies a large amount of disk I/O. You are advised not to download a client when the cluster is being installed, started, or patched, or is in another unstable state.

Step 10 Log in to the ECS using VNC. See "Logging In to a Linux ECS Using VNC" in the Elastic Cloud Server User Guide (Getting Started > Logging In to an ECS > Logging In to a Linux ECS Using VNC).

All images support Cloud-Init. For clusters of versions earlier than MRS 1.6.2, the preset username and password for Cloud-Init are linux and cloud.1234, respectively. If you have changed the password, log in to the ECS using the new password. For clusters of MRS 1.6.2 or later, the preset username for Cloud-Init is root and the password is the one you set during cluster creation. See "How Do I Log In to an ECS Once All Images Support Cloud-Init?" in the ECS FAQs. It is recommended that you change the password upon the first login.

Step 11 On the ECS, switch to user root and copy the installation package to the /opt directory.

sudo su - root


cp /home/linux/MRS_Services_Client.tar /opt

Step 12 Run the following command in the /opt directory to decompress the package and obtain the verification file and the configuration package of the client:

tar -xvf MRS_Services_Client.tar

Step 13 Run the following command to verify the configuration package of the client:

sha256sum -c MRS_Services_ClientConfig.tar.sha256

The command output is as follows:

MRS_Services_ClientConfig.tar: OK

Step 14 Run the following command to decompress MRS_Services_ClientConfig.tar:

tar -xvf MRS_Services_ClientConfig.tar

Step 15 Run the following command to install the client to a new directory, for example, /opt/hadoopclient. The directory is automatically generated during installation.

sh /opt/MRS_Services_ClientConfig/install.sh /opt/hadoopclient

If the following information is displayed, the client is successfully installed:

Components client installation is complete.

Step 16 Check whether the IP address of the ECS node can connect to the IP address of the cluster Master node.

For example, run the following command: ping Master node IP address.

- If yes, go to Step 17.

- If no, check whether the VPC and security group are correct and whether the ECS and the MRS cluster are in the same VPC and security group, and then go to Step 17.

Step 17 Run the following command to configure the environment variable:

source /opt/hadoopclient/bigdata_env

Step 18 If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled for the current cluster, skip this step.

kinit MRS cluster user

For example, kinit admin.

Step 19 Run the client command of the component.

For example, run the following command to query the HDFS directory.

hdfs dfs -ls /

----End


6 MRS Manager Operation Guide

6.1 MRS Manager Introduction

Overview

MRS manages and analyzes massive data and helps users rapidly obtain desired data from both structured and unstructured data. However, the structures of open-source components are complicated, and component installation, configuration, and management are time-consuming and labor-intensive.

MRS Manager provides a unified enterprise-level platform for managing big data clusters. It provides the following functions:

- Cluster monitoring: enables you to quickly learn the health status of hosts and services.

- Graphical indicator monitoring and customization: enables you to obtain key system information in time.

- Service property configuration: meets your service performance requirements.

- Cluster, service, and role instance operations: enables you to start or stop services and clusters with just one click.

MRS Manager Interface

MRS Manager provides a unified cluster management platform to help users rapidly run and maintain clusters.

Table 6-1 describes the functions of operation entries.


Table 6-1 Function description of MRS Manager operation entries

Operation Entry Function Description

Dashboard: Shows the status and key monitoring indicators of all services, as well as the host status, in histograms, line charts, and tables. Users can customize a dashboard for the key monitoring indicators and drag it to any position on the interface. Data on the dashboard can be updated automatically.

Service: Provides service monitoring, operation guidance, and configuration, which help users manage services in a unified manner.

Host: Provides host monitoring and operation guidance to help users manage hosts in a unified manner.

Alarm: Provides alarm query and guidance to clear alarms, which enables users to quickly identify product faults and potential risks, ensuring proper system running.

Audit: Queries and exports audit logs to help users know all users' activities and operations.

Tenant: Provides a unified tenant management platform.

System: Enables users to manage monitoring and alarm configurations as well as backup.

On the page of a subfunction of System, you can use the System shortcut menu to go to another subfunction page.

- Table 6-2 describes the System shortcut menu of a common cluster.
- Table 6-3 describes the System shortcut menu of a security cluster.

The following describes how to use the System shortcut menu to go to a function page.

Step 1 On MRS Manager, click System.

Step 2 On the System page, click a link of any function to go to the function page.

For example, in the Backup and Restoration area, click Back Up Data to go to the Back Up Data page.

Step 3 Move the cursor to the left boundary of the browser window. The black System shortcut menu is unfolded. After you move the cursor away, the shortcut menu is folded again.

Step 4 On the shortcut menu, click a function link to go to the function page.

For example, choose Maintenance > Export Log. The Export Log page is displayed.

----End

MapReduce ServiceUser Guide 6 MRS Manager Operation Guide

Issue 01 (2018-09-06) 138

Page 149: User Guide · MapReduce Service

Table 6-2 System shortcut menu of a common cluster

Backup and Restoration: Back Up Data, Restore Data
Maintenance: Export Log, Export Audit Log, Check Health Status
Monitoring and Alarm: Configure Syslog, Configure Alarm Threshold, Configure SNMP, Configure Monitoring Metric Dump, Configure Resource Contribution Ranking
Resource: Configure Static Service Pool
Permission: Change OMS Database Password
Patch: Manage Patch

Table 6-3 System shortcut menu of a security cluster

Backup and Restoration: Back Up Data, Restore Data
Maintenance: Export Log, Export Audit Log, Check Health Status
Monitoring and Alarm: Configure Syslog, Configure Alarm Threshold, Configure SNMP, Configure Monitoring Metric Dump, Configure Resource Contribution Ranking
Permission: Manage User, Manage User Group, Manage Role, Configure Password Policy, Change OMS Database Password
Patch: Manage Patch

Reference Information

MRS is a data analysis service on the public cloud. It is used for the management and analysis of massive data.

MRS uses the MRS Manager portal to manage big data components, for example, components in the Hadoop ecosystem. Table 6-4 details the differences between MRS on the public cloud and on the MRS Manager portal.

Table 6-4 Differences

Concept: MapReduce Service
MRS on the public cloud: Indicates the data analysis service on the public cloud. This service includes components such as Hive, Spark, Yarn, HDFS, and ZooKeeper.
MRS Manager: Indicates the MapReduce component in the Hadoop ecosystem.

6.2 Accessing MRS Manager

Scenario

MRS Manager supports MRS cluster monitoring, configuration, and management. You can open the MRS Manager page from the MRS console.

For clusters with Kerberos authentication disabled, you can open MRS Manager on the MRS console. For clusters with Kerberos authentication enabled, see Accessing MRS Manager Supporting Kerberos Authentication to learn how to access MRS Manager.

Procedure

Step 1 Log in to the Management Console of the public cloud, and click MapReduce Service.

Step 2 Click Cluster. In the Active Cluster list, click the specified cluster name to switch to the cluster information page.

Step 3 Click View to open MRS Manager.

If you access MRS Manager after successfully logging in to the MRS console, you do not need to enter the password again because user admin is used for login by default.

----End


6.3 Accessing MRS Manager Supporting Kerberos Authentication

Scenario

After users create MRS clusters that support Kerberos authentication, they can manage running clusters on MRS Manager.

This section describes how to prepare a work environment on the public cloud platform for accessing MRS Manager.

Impact on the System

Site trust must be added to the browser when you access MRS Manager for the first time. Otherwise, MRS Manager cannot be accessed.

Prerequisites

You have obtained the password of user admin. The password of user admin is specified by the user during MRS cluster creation.

Procedure

Step 1 On the MRS management console, click Cluster.

Step 2 In the Active Cluster list, click the specified cluster name.

Record the AZ, VPC, and Cluster Manager IP Address of the cluster, and the Default Security Group of the Master node.

Step 3 On the ECS management console, create a new ECS.
- Ensure that the AZ, VPC, and Security Group of the ECS are the same as those of the cluster to be accessed.
- Select a Windows public image, for example, the Windows Server 2012 R2 Standard 64bit (40 GB) standard image.
- For details about other parameter configurations, see Elastic Cloud Server > Quick Start > Purchasing and Logging In to a Windows ECS.

NOTE

If the security group of the ECS is different from the Default Security Group of the Master node, you can modify the configuration using either of the following methods:

- Change the security group of the ECS to the default security group of the Master node. For details, see Changing the Security Group in Elastic Cloud Server User Guide > Management > Modifying ECS Specifications.

- Add two security group rules to the security groups of the Master node and Core node to ensure that the ECS can access the cluster, and set the protocol to TCP. Set Port Range of the two rules to 28443 and 20009, respectively. For details, see Virtual Private Cloud User Guide > Security > Security Group > Adding a Security Group Rule.

Step 4 On the VPC management console, apply for an EIP and bind it to the ECS.


See Virtual Private Cloud > User Guide > Network Components > Elastic IP Address > Assigning an EIP and Binding It to an ECS.

Step 5 Log in to the ECS.

The account, password, EIP, and security group configuration rules of the Windows system are required for logging in to the ECS. For details about how to log in to the ECS, see Elastic Cloud Server User Guide > ECS Instances > Logging In to a Windows ECS.

Step 6 On the Windows remote desktop, use your browser to access MRS Manager.

For example, you can use Internet Explorer 11 in the Windows 2012 OS.

In the browser address bar, enter https://Cluster Manager IP Address:28443/web. Enter the name and password of the MRS cluster user, for example, user admin.
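For example, assuming a Cluster Manager IP Address of 192.168.1.100 (a placeholder value), you would enter:

https://192.168.1.100:28443/web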

NOTE

- If you access MRS Manager with another MRS cluster username, change the password upon your first access. The new password must meet the requirements of the current password complexity policies. For details, contact the administrator.

- By default, a user is locked after entering an incorrect password five consecutive times. The user is automatically unlocked after 5 minutes.

Step 7 If you want to exit MRS Manager, move the cursor to the icon in the upper-right corner and click Log Out.

----End

Related Operations

Configuring the mapping between node names and IP addresses

Step 1 Log in to MRS Manager and click Host.

Record the Host Name and OM IP Address of all nodes in a cluster.

Step 2 In the work environment, use Notepad to open the hosts file and add the mapping relationship between node names and IP addresses to the file.

Each mapping relationship occupies an independent line. The following is an example:

192.168.4.127 node-core-Jh3ER
192.168.4.225 node-master2-PaWVE
192.168.4.19 node-core-mtZ81
192.168.4.33 node-master1-zbYN8
192.168.4.233 node-core-7KoGY

Save the configurations and exit.

----End

6.4 Viewing Running Tasks in a Cluster

Scenario

After you trigger a task on MRS Manager, the task running process and progress are displayed. After the task window is closed, you need to use the task management function to open the task window again.


By default, MRS Manager keeps the records of the latest 10 running tasks, such as restarting services, synchronizing service configurations, and performing health checks.

Procedure

Step 1 On the MRS Manager portal, click the task management icon to open Task List.

You can view the following information under Task List: Name, Status, Progress, Start Time, and End Time.

Step 2 Click the name of a specified task and view details about the task execution process.

----End

6.5 Monitoring Management

6.5.1 Viewing the System Overview

Scenario

You can view basic statistics about services and clusters on the MRS Manager portal.

Procedure

Step 1 On the MRS Manager portal, choose Dashboard > Real-Time Monitoring.
- The Health Status and Roles of each service are displayed in Service Summary.
- The following statistics about host indicators are displayed:
  - Cluster Host Health Status
  - Host Network Read Speed Distribution
  - Host Network Write Speed Distribution
  - Cluster Disk Information
  - Host Disk Usage Distribution
  - Cluster Memory Usage
  - Host Memory Usage Distribution
  - Host CPU Usage Distribution
  - Average Cluster CPU Usage
  Click Customize to display customized statistics.

Step 2 Set an interval for automatic page refreshing, or click the refresh icon to refresh immediately.

The following parameters are supported:

- Refresh every 30 seconds: refreshes the page once every 30 seconds.
- Refresh every 60 seconds: refreshes the page once every 60 seconds.
- Stop refreshing: stops page refreshing.


NOTE

Selecting Full screen maximizes the Real-Time Monitoring window.

----End

6.5.2 Configuring a Monitoring History Report

Scenario

On MRS Manager, the nodes where roles are deployed in a cluster can be classified into management nodes, control nodes, and data nodes. The change trends of key host monitoring indicators on each type of node can be calculated and displayed as curve charts in reports based on user-defined periods. If a host belongs to multiple node types, the indicator statistics are collected multiple times.

You can view, customize, and export node monitoring indicator reports on MRS Manager.

Procedure

Step 1 View a monitoring indicator report.

1. On MRS Manager, click Dashboard.
2. Click Historical Report to view the report.

By default, the report displays the monitoring indicator statistics of the previous day.

NOTE

Selecting Full screen maximizes the Historical Report window.

Step 2 Customize a monitoring indicator report.

1. Click Customize and select the monitoring indicators to be displayed on MRS Manager. The following indicators are supported, and the page displays a maximum of six customized indicators:
   - Cluster Network Read Speed Statistics
   - Cluster Disk Write Speed Statistics
   - Cluster Disk Usage Statistics
   - Cluster Disk Information
   - Cluster Disk Read Speed Statistics
   - Cluster Memory Usage Statistics
   - Cluster Network Write Speed Statistics
   - Cluster CPU Usage Statistics

2. Click OK to save the settings and view the selected indicators.

NOTE

Click Clear to deselect all the indicators.

Step 3 Export a monitoring indicator report.

1. Select a period.
   The options are Last day, Last week, Last month, Last quarter, and Last half year. You can define the start time and end time in Time Range.


2. Click Export to generate a report file for the selected cluster monitoring indicators in the specified period, and select a storage location to save the file.

NOTE

To view the curve charts of the monitoring indicators in a specified period, click View.

----End

6.5.3 Managing Service and Host Monitoring

Scenario

On MRS Manager, you can manage the status and indicator information of all services (including role instances) and hosts.

- Status information, including operation, health, configuration, and role instance status.

- Information about key monitoring indicators of services.

- Monitoring indicator exports.

NOTE

You can set an interval for automatic page refreshing or click the refresh icon to refresh the page immediately.

The following parameters are supported:

- Refresh every 30 seconds: refreshes the page once every 30 seconds.

- Refresh every 60 seconds: refreshes the page once every 60 seconds.

- Stop refreshing: stops page refreshing.

Managing Service Monitoring

Step 1 On MRS Manager, click Service to view the status of all services.

The service list includes Service, Operating Status, Health Status, Configuration Status, Roles, and Operation.

- Table 6-5 describes service operating status.

Table 6-5 Service operating status

Status Description

Started Indicates that the service is started.

Stopped Indicates that the service is stopped.

Failed to start Indicates that the service fails to be started.

Failed to stop Indicates that the service fails to be stopped.

Unknown Indicates the initial service status after the background system restarts.

- Table 6-6 describes service health status.

MapReduce ServiceUser Guide 6 MRS Manager Operation Guide

Issue 01 (2018-09-06) 145

Page 156: User Guide · MapReduce Service

Table 6-6 Service health status

Status Description

Good Indicates that all role instances in the service are running properly.

Bad Indicates that at least one role instance in the service is in the Bad state or that the dependent service is abnormal.

Unknown Indicates that all role instances in the service are in the Unknown state.

Concerning Indicates that the background system is restarting the service.

Partially Healthy Indicates that the service that this service depends on is abnormal, and the interfaces of the abnormal service cannot be invoked externally.

- Table 6-7 describes service configuration status.

Table 6-7 Service configuration status

Status Description

Synchronized Indicates that the latest configuration has taken effect.

Expired Indicates that the latest configuration has not taken effect after the parameter modification. You need to restart the related services.

Failed Indicates that communication is abnormal or data cannot be read or written during the parameter configuration. Try clicking Synchronize Configuration to recover the previous configuration.

Configuring Indicates that the parameter is being configured.

Unknown Indicates that the current configuration status cannot be obtained.

By default, services are displayed in ascending order by Service. You can click Service, Operating Status, Health Status, or Configuration Status to change the display mode.

Step 2 Click the target service in the service list to view its status and indicator information.

Step 3 Customize monitoring indicators and export customized monitoring information.

1. In the Real-Time Statistics area, click Customize to customize key monitoring indicators.

2. Click History to display the page on which you can query historical monitoring information.

3. Select a time period, and click View to display the monitoring data in the specified time period.


4. Click Export to export the displayed indicator information.

----End

Managing Role Instance Monitoring

Step 1 On MRS Manager, click Service, and click the target service in the service list.

Step 2 Click Instance to view the role instance status.

The role instance list includes Role, Host Name, OM IP Address, Business IP Address, Rack, Operating Status, Health Status, and Configuration Status.

- Table 6-8 describes role instance operating status.

Table 6-8 Role instance operating status

Status Description

Started Indicates that the role instance is started.

Stopped Indicates that the role instance is stopped.

Failed to start Indicates that the role instance fails to be started.

Failed to stop Indicates that the role instance fails to be stopped.

Decommissioning Indicates that the role instance is being decommissioned.

Decommissioned Indicates that the role instance has been decommissioned.

Recommissioning Indicates that the role instance is being re-commissioned.

Unknown Indicates the initial role instance status after the background system restarts.

- Table 6-9 describes role instance health status.

Table 6-9 Role instance health status

Status Description

Good Indicates that the role instance is running properly.

Bad Indicates that the role instance is running abnormally. For example, a port cannot be accessed because the PID does not exist.

Unknown Indicates that the host on which the role instance is running does not connect to the background system.

Concerning Indicates that the background system is restarting the role instance.

- Table 6-10 describes role instance configuration status.


Table 6-10 Role instance configuration status

Status Description

Synchronized Indicates that the latest configuration has taken effect.

Expired Indicates that the latest configuration has not taken effect after the parameter modification. You need to restart the related services.

Failed Indicates that communication is abnormal or data cannot be read or written during the parameter configuration. Try clicking Synchronize Configuration to recover the previous configuration.

Configuring Indicates that the parameter is being configured.

Unknown Indicates that the configuration status cannot be obtained.

By default, roles are displayed in ascending order by Role. You can click Role, Host Name, OM IP Address, Business IP Address, Rack, Operating Status, Health Status, or Configuration Status to change the display mode.

You can filter all instances of the same role in Role.

Click Advanced Search, set search criteria in the role search area, and click Search to view specified role information. You can click Reset to reset the search criteria. Fuzzy search is supported.

Step 3 Click the target role instance in the role instance list to view its status and indicator information.

Step 4 Customize monitoring indicators and export customized monitoring information. The operation process is the same as that for exporting service monitoring indicators.

----End

Managing Host Monitoring

Step 1 On MRS Manager, click Host.

The host list includes Host Name, OM IP Address, Business IP Address, Rack, Network Speed, Operating Status, Health Status, Disk Usage, Memory Usage, and CPU Usage.

- Table 6-11 describes the host operating status.

Table 6-11 Host operating status

Status Description

Normal The host and service roles on the host are running properly.

Isolated The host is isolated by the user, and service roles on the host are stopped.


- Table 6-12 describes the host health status.

Table 6-12 Host health status

Status Description

Good Indicates that the host can properly send heartbeats.

Bad Indicates that the host fails to send heartbeats due to timeout.

Unknown Indicates the initial status of the host when it is being added.

By default, hosts are displayed in ascending order by Host Name. You can click Host Name, OM IP Address, Business IP Address, Rack, Network Speed, Operating Status, Health Status, Disk Usage, Memory Usage, or CPU Usage to change the display mode.

Click Advanced Search, set search criteria in the host search area, and click Search to view specified host information. You can click Reset to reset search criteria. Fuzzy search is supported.

Step 2 Click the target host in the host list to view its status and indicator information.

Step 3 Customize monitoring indicators and export customized monitoring information.

1. In the Real-Time Statistics area, click Customize to customize key monitoring indicators.

2. Click History to display the page in which you can query historical monitoring information.

3. Select a time period, and click View to display the monitoring data in the specified time period.

4. Click Export to export the displayed indicator information.

----End

6.5.4 Managing Resource Distribution

Scenario

You can query the top value curves, bottom value curves, or average data curves of key service and host monitoring indicators, that is, the resource distribution information, on MRS Manager. MRS Manager allows you to view the monitoring data of the last hour.

You can also modify the resource distribution on MRS Manager to display both the top and bottom value curves in service and host resource distribution figures.

Resource distribution of some monitoring indicators is not recorded.

Procedure

- View the resource distribution of service monitoring indicators.

a. On MRS Manager, click Service.


b. Select the specific service in the service list.

c. Click Resource Distribution.

Select key service indicators in Metric. MRS Manager displays the resource distribution data of the selected service indicators in the last hour.

- View the resource distribution of host monitoring indicators.

a. On MRS Manager, click Host.

b. Click the specific host in the host list.

c. Click Resource Distribution.

Select key host indicators from Metric. MRS Manager displays the resource distribution data of the selected indicators in the last hour.

- Configure resource distribution.

a. On MRS Manager, click System.

b. In Configuration, click Configure Resource Contribution Ranking under Monitoring and Alarm.

Modify the displayed resource distribution quantity.

- Set Number of Top Resources to the number of top values.

- Set Number of Bottom Resources to the number of bottom values.

NOTE

The sum of the number of top and bottom values must not be greater than five.

c. Click OK to save the settings.

The message Number of top and bottom resources saved successfully is displayed in the upper-right corner.

6.5.5 Configuring Monitoring Metric Dumping

Scenario

You can configure interconnection parameters on MRS Manager to save monitoring indicator data to a specified FTP server using the FTP or SFTP protocol. In this way, MRS clusters can interconnect with third-party systems. The FTP protocol does not encrypt data, which creates potential security risks. The SFTP protocol is recommended.

MRS Manager supports the collection of all monitoring indicator data in managed clusters. The collection period can be 30, 60, or 300 seconds. Depending on the collection period, the data is stored in different monitoring files on the FTP server. Monitoring files are named in the following pattern: Cluster name_metric_Monitoring indicator data collection period_File saving time.log.
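For example, for a cluster named mrs_demo with a 60-second collection period, a monitoring file might be named mrs_demo_metric_60_20180906103000.log (the timestamp format here is an assumption for illustration only).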

Prerequisites

- The corresponding ECS of the dump server and the Master node of the MRS cluster are deployed on the same VPC.

- The Master node can access the IP address and specific ports of the dump server.

- The FTP service of the dump server is running properly.
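You can verify these prerequisites from the active Master node before configuring the dump. The following commands are a minimal sketch; the IP address 192.168.0.10, port 22, and user dumpuser are placeholder values, and the nc tool is assumed to be available on the node:

ping 192.168.0.10                      # the dump server must be reachable
nc -zv 192.168.0.10 22                 # the configured FTP/SFTP port must be open
sftp -oPort=22 dumpuser@192.168.0.10   # a test login confirms the service is running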


Procedure

Step 1 On MRS Manager, click System.

Step 2 In Configuration, click Configure Monitoring Metric Dump under Monitoring and Alarm.

Step 3 Set the dump parameters listed in Table 6-13.

Table 6-13 Dump parameters

Parameter Description

Dump Monitoring Metric Mandatory. Specifies whether to enable the monitoring metric dump function. The switch is either enabled or disabled.

FTP IP Address Mandatory. Specifies the IP address of the FTP server for storing monitoring files after the interconnection of monitoring indicator data is enabled.

FTP Port Mandatory. Specifies the port for connecting to the FTP server.

FTP Username Mandatory. Specifies the username for logging in to the FTP server.

FTP Password Mandatory. Specifies the password for logging in to the FTP server.

Save Path Mandatory. Specifies the save path of monitoring files on the FTP server.

Dump Interval (s) Mandatory. Specifies the interval for saving monitoring files to the FTP server, in seconds.

Dump Mode Mandatory. Specifies the protocol used to send monitoring files. The options are FTP and SFTP.

SFTP Public Key Optional. Specifies the public key of the FTP server. This parameter is valid after Dump Mode is set to SFTP. You are advised to set this parameter; otherwise, security risks may exist.

Step 4 Click OK. The parameters are set.

----End


6.6 Alarm Management

6.6.1 Viewing and Manually Clearing an Alarm

Scenario

You can view and manually clear an alarm on MRS Manager.

Generally, the system automatically clears an alarm when the fault that generated the alarm is rectified. If the alarm is not cleared automatically after the fault is rectified, and if the alarm has no impact on the system, you can manually clear the alarm.

On the MRS Manager portal, you can view the most recent 100,000 alarms, including those that have either been manually or automatically cleared, or not cleared. If the number of cleared alarms exceeds 100,000 and is about to reach 110,000, the system automatically dumps the earliest 10,000 cleared alarms to the dump path ${BIGDATA_HOME}/OMSV100R001C00x8664/workspace/data on the active management node. The directory will be automatically generated when alarms are dumped for the first time.

NOTE

You can set an interval for automatic page refreshing or click the refresh button to refresh the page immediately.

The following parameters are supported:

- Refresh every 30 seconds: refreshes the page once every 30 seconds.

- Refresh every 60 seconds: refreshes the page once every 60 seconds.

- Stop refreshing: stops page refreshing.

Procedure

Step 1 On MRS Manager, click Alarm and view the alarm information.

- By default, alarms are displayed in descending order by Generated On. You can click Alarm ID, Alarm Name, Severity, Generated On, Location, or Operation to change the display mode.

- You can filter out all alarms of the same severity in Severity, including cleared alarms and uncleared alarms.

- You can click the corresponding severity icon to filter out alarms whose severity is Critical, Major, Minor, or Warning.

Step 2 Click Advanced Search. In the displayed alarm search area, set search criteria and click Search to view the information about specified alarms. Click Reset to reset search criteria.

NOTE

You can set Start Time and End Time to specify the time range when alarms are generated.

Rectify the fault by referring to the help information in Alarm Reference. If the alarms are generated by other cloud services on which MRS depends, contact the maintenance personnel of the relevant cloud services.

Step 3 If the alarm needs to be manually cleared, click Clear Alarm.


NOTE

If you want to clear multiple alarms, select those you want to clear and click Clear Alarm to clear them in batches. A maximum of 300 alarms can be cleared each time.

----End

6.6.2 Configuring an Alarm Threshold

Scenario

You can configure an alarm threshold to learn the indicator health status. After Send Alarm is selected, the system sends an alarm message when the monitored data reaches the alarm threshold. You can view the alarm information in Alarm.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In Configuration, click Configure Alarm Threshold under Monitoring and Alarm.

Step 3 Click an indicator, for example, CPU Usage, and click Create Rule.

Step 4 Set the parameters for monitoring indicator rules.

Table 6-14 Parameters for monitoring indicator rules

Parameter Value Description

Rule Name CPU_MAX (example) Specifies the rule name.

Reference Date 3/18/2017 (example) Specifies the date on which the reference indicator history is generated.

Threshold Type Max. value or Min. value Specifies whether the maximum or minimum value of the indicator is used for setting the threshold.

- If this parameter is set to Max. value, an alarm is generated when the actual value of the indicator is greater than the threshold.

- If this parameter is set to Min. value, an alarm is generated when the actual value of the indicator is smaller than the threshold.


Alarm Severity Critical, Major, Minor, or Warning Specifies the alarm severity.

Time Range From 00:00 to 23:59 (example) Specifies the period in which the rule takes effect.

Threshold 80 (example) Specifies the threshold of the monitoring indicator in the rule.

Date Workday, Weekend, or Other Specifies the days on which the rule takes effect.

Add Date 11/06 (example) This parameter takes effect when you set Date to Other. You can select multiple dates.

Step 5 Click OK. The message Rule saved successfully is displayed in the upper-right corner.

Send Alarm is selected by default. MRS Manager checks whether the values of monitoring indicators meet the threshold requirements. If the number of consecutive checks in which the values do not meet the threshold requirements exceeds the value of Trigger Count, an alarm is sent. The value of Trigger Count can be customized. Check Period (s) specifies the interval at which MRS Manager checks monitoring indicators. The sketch below illustrates this check logic.
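The following shell sketch illustrates the check logic described above; the indicator command, threshold, check period, and trigger count are placeholder values, not the actual MRS Manager implementation:

#!/bin/bash
# Sketch: read an indicator every CHECK_PERIOD seconds and raise an alarm
# after TRIGGER_COUNT consecutive threshold violations.
THRESHOLD=80        # example threshold (%)
CHECK_PERIOD=300    # example check period (s)
TRIGGER_COUNT=10    # example trigger count
violations=0
while true; do
  # Placeholder indicator: cumulative CPU usage derived from /proc/stat,
  # using the same formula as elsewhere in this guide.
  usage=$(awk -v OFMT='%.0f' 'NR==1{for(i=2;i<=NF;i++)t+=$i; print 100-($5+$6)*100/t}' /proc/stat)
  if [ "$usage" -gt "$THRESHOLD" ]; then
    violations=$((violations + 1))
  else
    violations=0
  fi
  if [ "$violations" -ge "$TRIGGER_COUNT" ]; then
    echo "ALARM: indicator exceeded ${THRESHOLD} for ${TRIGGER_COUNT} consecutive checks"
    violations=0
  fi
  sleep "$CHECK_PERIOD"
done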

Step 6 In the row that contains the newly added rule, click Apply in the Operation column. If a dialog box indicating that the rule is applied successfully is displayed in the upper-right corner, the rule is added, and the icon turns green to indicate that the operation is complete. To cancel a rule, click Cancel in the Operation column. If a dialog box indicating that the rule is canceled successfully is displayed in the upper-right corner, the rule is canceled.

----End

6.6.3 Configuring Syslog Northbound Interface

Scenario

You can configure the northbound interface so that alarms generated on MRS Manager can be reported to your monitoring O&M system using Syslog.

NOTICE

The Syslog protocol is not encrypted. Therefore, data can easily be stolen during transmission. This represents a significant security risk.


Prerequisites

- The corresponding ECS of the interconnected server and the Master node of the MRS cluster are deployed on the same VPC.

- The Master node can access the IP address and specific ports of the interconnected server.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In Configuration, click Configure Syslog under Monitoring and Alarm.

The switch of the Syslog Service is disabled by default. Click the switch to enable the Syslog service.

Step 3 On the displayed page, set Syslog parameters listed in Table 6-15:

Table 6-15 Description of Syslog parameters

Area Parameter Description

Syslog Protocol Server IP Address Specifies the IP address of the interconnected server.

Server Port Specifies the port number for interconnection.

Protocol Specifies the protocol type. Possible values are TCP and UDP.

Severity Specifies the message severity. Possible values are Informational, Emergency, Alert, Critical, Error, Warning, Notice, and Debug.

Facility Specifies the module where the log is generated.

Identifier Specifies the product. The default value is MRS Manager.


Report Message Report Format Specifies the message format of alarms. For details about the format requirements, see the help information on the WebUI.

Report Alarm Type Specifies the type of alarms to be reported. Possible values are:

- Fault: Syslog alarm information is reported when Manager generates an alarm.

- Clear: Syslog alarm information is reported when Manager clears an alarm.

- Event: Syslog alarm information is reported when Manager generates an event.

Report Alarm Severity Specifies the severity of alarms to be reported. Possible values are Warning, Minor, Major, and Critical.

Uncleared Alarm Reporting Periodic Uncleared Alarm Reporting Specifies whether uncleared alarms are reported periodically. The switch of Periodic Uncleared Alarm Reporting is disabled by default. Click the switch to enable the function.

Report Interval (min) Specifies the interval for periodic alarm reporting. This parameter is available only when Periodic Uncleared Alarm Reporting is enabled. The interval is measured in minutes and the default value is 15. The value range is 5 to 1440.


Heartbeat Settings Heartbeat Report Specifies whether periodic reporting of Syslog heartbeat messages is enabled. The switch of Heartbeat Report is disabled by default. Click the switch to enable the function.

Heartbeat Period (min) Specifies the interval for periodic heartbeat reporting. This parameter is available only when Heartbeat Report is enabled. The interval is measured in minutes and the default value is 15. The value range is 1 to 60.

Heartbeat Packet Specifies the heartbeat report content. This parameter is available only when Heartbeat Report is enabled. The identifier cannot be empty. The value can contain a maximum of 256 characters, and only numbers, letters, underscores (_), vertical bars (|), colons (:), commas (,), periods (.), and spaces are allowed.

NOTE

When heartbeat packets are reported periodically, reporting may be interrupted in scenarios where clusters automatically recover from faults, such as an active/standby management node switchover. Wait until the recovery is complete.

Step 4 Click OK to complete the settings.

----End
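To confirm that alarms reach the interconnected server, you can listen on the configured port there before triggering a test alarm. This is a minimal sketch; the port 514 is a placeholder and must match the Server Port and Protocol configured above:

nc -ul 514    # print incoming Syslog datagrams when Protocol is UDP
nc -l 514     # listen on a TCP port when Protocol is TCP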

6.6.4 Configuring SNMP Northbound Interface

Scenario

You can integrate the alarm and monitoring data of MRS Manager into the network management system (NMS) using the Simple Network Management Protocol (SNMP).


Prerequisites

- The corresponding ECS of the interconnected server and the Master node of the MRS cluster are deployed on the same VPC.

- The Master node can access the IP address and specific ports of the interconnected server.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In Configuration, click Configure SNMP under Monitoring and Alarm.

The switch of the SNMP Service is disabled by default. Click the switch to enable the SNMP service.

Step 3 On the displayed page, set SNMP parameters listed in Table 6-16:

Table 6-16 Description of SNMP parameters

Parameter Description

Version Specifies the version of the SNMP protocol. Possible values are:

- v2c: an earlier version of SNMP with low security

- v3: the latest version of SNMP with higher security than SNMPv2c

SNMPv3 is recommended.

Local Port Specifies the local port number. The default value is 20000. The value ranges from 1025 to 65535.

Read-Only Community Specifies the read-only community name. This parameter is valid when Version is set to v2c.

Read-Write Community Specifies the write community name. This parameter is valid when Version is set to v2c.

Security Username Specifies the SNMP security username. This parameter is valid when Version is set to v3.

Authentication Protocol Specifies the authentication protocol. You are advised to set this parameter to SHA. This parameter is valid when Version is set to v3.

Authentication Password Specifies the authentication key. This parameter is valid when Version is set to v3.

Confirm Password Used to confirm the authentication key. This parameter is valid when Version is set to v3.

Encryption Protocol Specifies the encryption protocol. You are advised to set this parameter to AES256. This parameter is valid when Version is set to v3.

Encryption Password Specifies the encryption key. This parameter is valid when Version is set to v3.


Confirm Password Used to confirm the encryption key. This parameter is valid when Version is set to v3.

NOTE

- The values of Authentication Password and Encryption Password must contain 8 to 16 characters. At least three of the following types of character must be used: uppercase letters, lowercase letters, digits, and special characters. The passwords must be different and cannot be the same as the security username or the security username written backwards.

- For security purposes, periodically change the values of Authentication Password and Encryption Password if the SNMP protocol is used.

- If SNMPv3 is used, a security user is locked if authentication fails five consecutive times within 5 minutes. The user is unlocked 5 minutes later.

Step 4 Click Create Trap Target under Trap Target, and set the following parameters in the Create Trap Target dialog box:

- Target Symbol: Specifies the ID of the Trap target. This is generally the ID of the network management system or host that receives the Trap. The value consists of 1 to 255 characters, including letters and digits.

- Target IP Address: Specifies the target IP address. The value can be a class A, B, or C IP address that can communicate with the IP address of the management plane on the management node.

- Target Port: Specifies the port that receives the Trap. The value must be the same as that on the peer end and ranges from 0 to 65535.

- Trap Community: Specifies the trap community name. This parameter is valid when Version is set to v2c.

Click OK to finish the settings and exit the Create Trap Target dialog box.

Step 5 Click OK.

----End
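To verify the interconnection from the NMS side, you can query the agent with a standard SNMP tool. This is a minimal sketch assuming the net-snmp utilities are installed on the NMS host; the IP address, port, community, user, and passwords are placeholders that must match the configuration above:

# SNMPv2c: walk the agent using the read-only community.
snmpwalk -v 2c -c public_ro 192.168.0.20:20000 .1.3.6.1
# SNMPv3: query with the configured security user, authentication, and encryption.
snmpwalk -v 3 -l authPriv -u snmpuser -a SHA -A 'AuthPass_01' -x AES -X 'PrivPass_01' 192.168.0.20:20000 .1.3.6.1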

6.7 Alarm Reference

6.7.1 ALM-12001 Audit Log Dump Failure

Description

Cluster audit logs need to be dumped to a third-party server because of the local historical data backup policy. Audit logs can be successfully dumped if the dump server meets the configuration conditions. This alarm is generated when the audit log dump fails because the disk space of the dump directory on the third-party server is insufficient or because a user has changed the username, password, or dump directory of the dump server.


Attribute

Alarm ID Alarm Severity Automatically Cleared

12001 Minor Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The system can only store a maximum of 50 dump files locally. If the fault persists on the dump server, the local audit log may be lost.

Possible Causes

- The network connection is abnormal.

- The username, password, or dump directory of the dump server does not meet the configuration conditions.

- The disk space of the dump directory is insufficient.

Procedure

Step 1 Check whether the username, password, and dump directory are correct.

1. Check on the dump configuration page of MRS Manager to see if they are correct.
– If yes, go to Step 3.
– If no, go to Step 1.2.

2. Change the username, password, or dump directory, and click OK.

3. Wait 2 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Reset the dump rule.

1. On MRS Manager, choose System > Dump Audit Log.

2. Reset dump rules, set the parameters properly, and click OK.

3. Wait 2 minutes and check whether the alarm is cleared.


– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.2 ALM-12002 HA Resource Is Abnormal

Description

The high availability (HA) software periodically checks the WebService floating IP addresses and databases of Manager. This alarm is generated when any of these is abnormal.

This alarm is cleared when the HA software detects that the floating IP addresses or databases are in the normal state.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12002 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

RESName Specifies the resource for which the alarm is generated.

Impact on the System

If the WebService floating IP addresses of Manager are abnormal, users cannot log in to or use MRS Manager. If databases are abnormal, all core services and related service processes, such as alarm and monitoring functions, are affected.


Possible Causes

- The floating IP address is abnormal.

- An exception occurs in the database.

Procedure

Step 1 Check the status of the floating IP address on the active management node.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view its host address and resource name in the alarm details.

2. Log in to the active management node. Run the following commands to switch the user:
sudo su - root
su - omm

3. Go to the ${BIGDATA_HOME}/om-0.0.1/sbin/ directory and run the status-oms.sh script to check whether the floating IP address of the active Manager is normal. In the command output, locate the row where ResName is floatip and check whether the following information is displayed. For example:
10-10-10-160 floatip Normal Normal Single_active
– If yes, go to Step 2.
– If no, go to Step 1.4.

4. Contact the public cloud O&M personnel to check whether the NIC configured with the floating IP address exists.
– If yes, go to Step 2.
– If no, go to Step 1.5.

5. Contact the public cloud O&M personnel to rectify NIC faults. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Check the database status of the active and standby management nodes.

1. Log in to the active and standby management nodes, run the sudo su - root and su - ommdba commands to switch to user ommdba, and then run the gs_ctl query command. Check whether the following information is displayed in the command output.
Command output of the active management node:
Ha state:
LOCAL_ROLE: Primary
STATIC_CONNECTIONS: 1
DB_STATE: Normal
DETAIL_INFORMATION: user/password invalid
Senders info: No information
Receiver info: No information
Command output of the standby management node:
Ha state:
LOCAL_ROLE: Standby
STATIC_CONNECTIONS: 1
DB_STATE: Normal
DETAIL_INFORMATION: user/password invalid
Senders info: No information
Receiver info: No information
– If yes, go to Step 2.3.
– If no, go to Step 2.2.

2. Contact the public cloud O&M personnel to check for and rectify network faults.
– If yes, go to Step 2.3.


– If no, go to Step 3.

3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.3 ALM-12004 OLdap Resource Is Abnormal

Description

This alarm is generated when the Ldap resource in Manager is abnormal and is cleared after the Ldap resource in Manager recovers and the alarm handling is complete.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12004 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The OLdap resources are abnormal and the Manager authentication service is unavailable. As a result, security authentication and user management functions cannot be provided for upper-layer web services. Users may be unable to log in to Manager.

Possible Causes

The LdapServer process in Manager is abnormal.


Procedure

Step 1 Check whether the LdapServer process in Manager is in the normal state.

1. Log in to the active management node.

2. Run ps -ef | grep slapd to check whether the LdapServer resource process whose configuration file is located in the ${BIGDATA_HOME}/om-0.0.1/ directory is running properly. You can determine that the resource is normal by checking the following information:

a. Run sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh. You can view that ResHAStatus of the OLdap process is Normal.

b. Run ps -ef | grep slapd. You can view the slapd process occupying port 21750.

– If yes, go to Step 2.
– If no, go to Step 3.

Step 2 Run kill -2 PID of the LdapServer process and wait 20 seconds. The HA software starts the OLdap process automatically. Check whether the status of the OLdap resource is in the normal state; a sketch of this recovery step follows.

- If yes, no further action is required.

- If no, go to Step 3.
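The following commands are a minimal sketch of this recovery step, assuming a single slapd process; verify the PID before sending the signal:

# Find the PID of the slapd (LdapServer) process.
pid=$(ps -ef | grep slapd | grep -v grep | awk '{print $2}')
# Send SIGINT so that the HA software restarts the OLdap process.
kill -2 "$pid"
sleep 20
# Recheck the OLdap resource status (ResHAStatus should be Normal).
sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh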

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.4 ALM-12005 OKerberos Resource Is Abnormal

Description

The alarm module monitors the status of the Kerberos resource in Manager. This alarm is generated when the Kerberos resource is abnormal and is cleared after the Kerberos resource recovers and the alarm handling is complete.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12005 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The Kerberos resources are abnormal and the Manager authentication service is unavailable. As a result, the security authentication function cannot be provided for upper-layer web services. Users may be unable to log in to Manager.

Possible Causes

The OLdap resource on which the OKerberos depends is abnormal.

Procedure

Step 1 Check whether the OLdap resource on which the OKerberos depends is abnormal in Manager.

1. Log in to the active management node.

2. Run the following command to check whether the OLdap resource managed by the HA software is in the normal state:
sh ${BIGDATA_HOME}/OMSV100R001C00x8664/workspace0/ha/module/hacom/script/status_ha.sh
The OLdap resource is in the normal state when it is in the Active_normal state on the active node and in the Standby_normal state on the standby node.
– If yes, go to Step 3.
– If no, go to Step 2.

Step 2 See ALM-12004 OLdap Resource Is Abnormal for further assistance. After the OLdap resource status recovers, check whether the OKerberos resource is in the normal state.

- If yes, no further action is required.

- If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.5 ALM-12006 Node Fault

Description

Controller checks the NodeAgent status every 30 seconds. This alarm is generated when Controller fails to receive the status report of a NodeAgent three consecutive times and is cleared when Controller can properly receive the status report of the NodeAgent.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12006 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

Services on the node are unavailable.

Possible Causes

The network is disconnected or the hardware is faulty.

Procedure

Step 1 Check whether the network is disconnected or the hardware is faulty.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view its host address in the alarm details.

2. Log in to the active management node.

3. Run the following command to check whether the faulty node is reachable:
ping IP address of the faulty host
a. If yes, go to Step 2.
b. If no, go to Step 1.4.

4. Contact the public cloud O&M personnel to check whether a network fault occurs and rectify the fault.


– If yes, go to Step 2.
– If no, go to Step 1.6.

5. Rectify the network fault and check whether the alarm is cleared from the alarm list.
– If yes, no further action is required.
– If no, go to Step 1.6.

6. Contact the public cloud O&M personnel to check whether a hardware fault (for example, a CPU fault or memory fault) occurs on the node.
– If yes, go to Step 1.7.
– If no, go to Step 2.

7. Repair the faulty components and restart the node. Check whether the alarm is cleared from the alarm list.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.6 ALM-12007 Process Fault

Description

The process health check module checks the process status every 5 seconds. This alarm is generated when the process health check module detects that the process connection status is Bad three consecutive times and is cleared when the process can be connected.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12007 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.


HostName Specifies the host for which the alarm is generated.

Impact on the System

The service provided by the process is unavailable.

Possible Causes

- The instance process is abnormal.

- The disk space is insufficient.

Procedure

Step 1 Check whether the instance process is abnormal.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view its host name and service name in the alarm details.

2. On the Alarm page, check whether alarm ALM-12006 Node Fault is generated.
– If yes, go to Step 1.3.
– If no, go to Step 1.4.

3. See the procedure in ALM-12006 Node Fault to handle the alarm.

4. Check whether the installation directory user, user group, and permission of the alarm role are correct. The user, user group, and permission must be omm:ficommon 750.
– If yes, go to Step 1.6.
– If no, go to Step 1.5.

5. Run the following commands to set the permission to 750 and User:Group to omm:ficommon:
chmod 750 <folder_name>
chown omm:ficommon <folder_name>

6. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check whether the disk space is insufficient.

1. On MRS Manager, check whether the alarm list contains ALM-12017 Insufficient Disk Capacity.
– If yes, go to Step 2.2.
– If no, go to Step 3.

2. See the procedure in ALM-12017 Insufficient Disk Capacity to handle the alarm.

3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, go to Step 2.4.
– If no, go to Step 3.


4. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.7 ALM-12010 Manager Heartbeat Interruption Between the Active and Standby Nodes

Description

This alarm is generated when the active Manager does not receive a heartbeat signal from the standby Manager for 7 seconds. It is cleared when the active Manager receives a heartbeat signal from the standby Manager.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12010 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Local Manager HA Name Specifies a local Manager HA.

Peer Manager HA Name Specifies a peer Manager HA.

Impact on the System

When the active Manager process is abnormal, an active/standby failover cannot be performed, and services are affected.


Possible Causes

The link between the active and standby Managers is abnormal.

Procedure

Step 1 Check whether the network between the active and standby Manager servers is in the normal state.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view the IP address of the standby Manager server in the alarm details.

2. Log in to the active management node.

3. Run the following command to check whether the standby Manager is reachable:
ping heartbeat IP address of the standby Manager
– If yes, go to Step 2.
– If no, go to Step 1.4.

4. Contact the public cloud O&M personnel to check whether the network is faulty.
– If yes, go to Step 1.5.
– If no, go to Step 2.

5. Rectify the network fault and check whether the alarm is cleared from the alarm list.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.8 ALM-12011 Manager Data Synchronization Exception Between the Active and Standby Nodes

Description

This alarm is generated when the standby Manager fails to synchronize files with the active Manager and is cleared when they succeed in synchronizing files.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12011 Critical Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Local Manager HA Name Specifies a local Manager HA.

Peer Manager HA Name Specifies a peer Manager HA.

Impact on the System

Because the configuration files on the standby Manager are not updated, some configurations will be lost after an active/standby switchover. Manager and some components may not run properly.

Possible Causes

The link between the active and standby Managers is interrupted.

Procedure

Step 1 Check whether the network between the active and standby Manager servers is in the normal state.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view the IP address of the standby Manager in the alarm details.

2. Log in to the active management node. Run the following command to check whether the standby Manager is reachable:

ping IP address of the standby Manager

a. If yes, go to Step 2.

b. If no, go to Step 1.3.

3. Contact the public cloud O&M personnel to check whether the network is faulty.

– If yes, go to Step 1.4.

– If no, go to Step 2.

4. Rectify the network fault and check whether the alarm is cleared from the alarm list.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.


2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.9 ALM-12012 NTP Service Is Abnormal

Description

This alarm is generated when the NTP service on the current node fails to synchronize time with the NTP service on the active OMS node. It is cleared when they succeed in synchronizing time.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12012 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The time on the node is inconsistent with the time on other nodes in the cluster. Therefore, some MRS applications on the node may not run properly.

Possible Causes

- The NTP service on the current node cannot start properly.

- The current node fails to synchronize time with the NTP service on the active OMS node.

- The key value authenticated by the NTP service on the current node is inconsistent with that on the active OMS node.

- The time offset between the node and the NTP service on the active OMS node is large.


Procedure

Step 1 Check the NTP service on the current node.

1. Check whether the ntpd process is running on the node. Log in to the node and run the sudo su - root command to switch the user. Then run the following command to check whether the command output contains the ntpd process:
ps -ef | grep ntpd | grep -v grep
– If yes, go to Step 2.1.
– If no, go to Step 1.2.

2. Run service ntp start to start the NTP service.

3. Wait 10 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check whether the current node can synchronize time properly with the NTP service on the active OMS node.

1. Check whether the node can synchronize time with the NTP service on the active OMS node based on Additional Info of the alarm.
– If yes, go to Step 2.2.
– If no, go to Step 3.

2. Check whether the synchronization with the NTP service on the active OMS node is faulty.
Log in to the alarm node and run the sudo su - root command to switch the user. Then run the ntpq -np command.
In the command output, if an asterisk (*) exists before the IP address of the NTP service on the active OMS node, the synchronization is in the normal state. The command output is as follows:
remote refid st t when poll reach delay offset jitter
==============================================================================
*10.10.10.162 .LOCL. 1 u 1 16 377 0.270 -1.562 0.014
If no asterisk (*) exists before the IP address of the NTP service on the active OMS node and the value of refid is .INIT., the synchronization is abnormal. The command output is as follows:
remote refid st t when poll reach delay offset jitter
==============================================================================
10.10.10.162 .INIT. 1 u 1 16 377 0.270 -1.562 0.014
– If yes, go to Step 2.3.
– If no, go to Step 3.

3. Rectify the fault, wait 10 minutes, and then check whether the alarm is cleared.
An NTP synchronization failure is usually related to the system firewall. If the firewall can be disabled, disable it and check whether the fault is rectified. If the firewall cannot be disabled, check the firewall configuration policies and ensure that UDP port 123 is enabled, following the specific firewall configuration policies of each system. A sketch for a common Linux firewall follows this list.
– If yes, no further action is required.
– If no, go to Step 3.
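The following commands are a minimal sketch for checking and opening UDP port 123 with iptables; whether iptables is the firewall in use depends on the operating system, so adapt them to your environment:

# List current INPUT rules and look for UDP port 123.
iptables -L INPUT -n | grep 123
# Allow incoming NTP traffic on UDP port 123 (assumption: iptables is in use).
iptables -I INPUT -p udp --dport 123 -j ACCEPT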

Step 3 Check whether the key value authenticated by the NTP service on the current node is consistent with that on the active OMS node.


Run cat /etc/ntp/ntpkeys to check whether the authentication code with a key value index of 1 is the same as the value of the NTP service on the active OMS node.

- If yes, go to Step 4.1.

- If no, go to Step 5.

Step 4 Check whether the time offset between the node and the NTP service on the active OMS node is large.

1. Check whether the time offset is large in Additional Info of the alarm.

– If yes, go to Step 4.2.

– If no, go to Step 5.

2. On the Host page, select the host of the node, and choose More > Stop All Roles to stop all the services on the node.

If the time on the alarm node is earlier than that on the NTP service of the active OMS node, adjust the time on the alarm node to be the same as that on the NTP service of the active OMS node. After doing so, choose More > Start All Roles to start services on the node.

If the time on the alarm node is later than that on the NTP service of the active OMS node, wait until the time offset has elapsed and then adjust the time on the alarm node. After doing so, choose More > Start All Roles to start services on the node.

NOTE

If you do not wait until the time offset has elapsed, data loss may occur.

3. Wait 10 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 5.

Step 5 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.10 ALM-12016 CPU Usage Exceeds the Threshold

Description

The system checks the CPU usage every 30 seconds and compares it with the threshold. This alarm is generated when the CPU usage exceeds the threshold several times (configurable, 10 times by default) consecutively.

This alarm is cleared when the average CPU usage is less than or equal to 90% of the threshold.


Attribute

Alarm ID Alarm Severity Automatically Cleared

12016 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Service processes respond slowly or become unavailable.

Possible Causes

- The alarm threshold or Trigger Count is configured inappropriately.

- The CPU configuration does not meet service requirements. As a result, the CPU usage reaches the upper limit.

Procedure

Step 1 Check whether the alarm threshold or Trigger Count is appropriate.

1. Log in to MRS Manager.

2. Choose System > Configure Alarm Threshold > Device > Host > CPU Usage > CPU Usage and change the alarm threshold based on the actual CPU usage.

3. Choose System > Configure Alarm Threshold > Device > Host > CPU Usage > CPU Usage and change Trigger Count based on the actual CPU usage.

NOTE

This option defines the alarm check phase. Interval indicates the alarm check period and Trigger Count indicates the number of times the CPU usage exceeds the threshold. An alarm is generated if the CPU usage exceeds the threshold several times consecutively.

4. Wait 2 minutes and check whether the alarm is cleared.

– If yes, no further action is required.


– If no, go to Step 2.

Step 2 Expand the system capacity.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view the IP address of the alarm node in the alarm details.

2. Log in to the alarm node.

3. Run the following command to check the system CPU usage (the formula reports 100 minus the idle and iowait share of total CPU time):
cat /proc/stat | awk 'NR==1' | awk '{for(i=2;i<=NF;i++)j+=$i; print 100 - ($5+$6) * 100 / j;}'

4. If the CPU usage exceeds the threshold, expand the CPU capacity.

5. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.11 ALM-12017 Insufficient Disk Capacity

Description

The system checks the host disk usage every 30 seconds and compares it with the threshold. This alarm is generated when the host disk usage exceeds the specified threshold and is cleared when the host disk usage is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12017 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.


PartitionName Specifies the disk partition for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Service processes become unavailable.

Possible Causes

The disk configuration does not meet service requirements. As a result, the disk usage reaches the upper limit.

Procedure

Step 1 Log in to MRS Manager and check whether the alarm threshold is appropriate.

- If yes, go to Step 2.

- If no, go to Step 1.1.

1. Choose System > Configure Alarm Threshold > Device > Disk > Disk Usage > Disk Usage and change the alarm threshold based on the actual disk usage.

2. Wait 2 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Check whether the disk is a system disk.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view its host name and disk partition information in the alarm details.

2. Log in to the alarm node.

3. Run the df -h command to check the system disk partition usage. Based on the disk partition name obtained in Step 2.1, check whether the disk is mounted to one of the following directories: /, /boot, /home, /opt, /tmp, /var, /var/log, and /srv/BigData.
– If yes, the disk is a system disk. Go to Step 3.1.
– If no, the disk is not a system disk. Go to Step 2.4.

4. Run the df -h command to check the system disk partition usage. Determine the role of the disk based on the disk partition name obtained in Step 2.1.

5. Check whether the disk is used by HDFS or Yarn.
– If yes, expand the disk capacity for the Core node. Go to Step 2.6.
– If no, go to Step 4.

6. Wait 2 minutes and check whether the alarm is cleared.
– If yes, no further action is required.


– If no, go to Step 3.

Step 3 Check whether a large file is written to the disk.

1. Run the find / -xdev -size +500M -exec ls -l {} \; command to view files larger than 500 MB on the node. Check whether these files are written to the disk.
– If yes, go to Step 3.2.
– If no, go to Step 4.

2. Process the large files and check whether the alarm is cleared after 2 minutes.
– If yes, no further action is required.
– If no, go to Step 4.

3. Expand the disk capacity.

4. Wait 2 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.12 ALM-12018 Memory Usage Exceeds the Threshold

Description

The system checks the memory usage every 30 seconds and compares it with the threshold. This alarm is generated when the host memory usage exceeds the threshold and is cleared when it is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12018 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.


HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Service processes respond slowly or become unavailable.

Possible Causes

Memory configuration does not meet service requirements. As a result, the memory usage reaches the upper limit.

Procedure

Step 1 Expand the system capacity.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view the IP address of the alarm host in the alarm details.

2. Log in to the alarm node.

3. Run the free -m | grep Mem\: | awk '{printf("%s,", ($3-$6-$7) * 100 / $2)}' command to check the system memory usage.

4. If the memory usage exceeds the threshold, expand the memory capacity.

5. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.13 ALM-12027 Host PID Usage Exceeds the Threshold

Description

The system checks the PID usage every 30 seconds and compares it with the threshold. This alarm is generated when the PID usage exceeds the threshold and is cleared when it is less than or equal to the threshold.


Attribute

Alarm ID Alarm Severity Automatically Cleared

12027 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

No PID is available for new processes and service processes are unavailable.

Possible Causes

- Too many processes are running on the node.

- The value of pid_max needs to be increased.

- The system is abnormal.

Procedure

Step 1 Increase the value of pid_max.

1. On MRS Manager, click the alarm in the real-time alarm list. In the Alarm Details area, obtain the IP address of the host that generated the alarm.

2. Log in to the alarm node.

3. Run the cat /proc/sys/kernel/pid_max command to check the value of pid_max. A sketch for estimating the current PID usage follows this list.

4. If the PID usage exceeds the threshold, run the following command to double the value of pid_max:
echo new pid_max value > /proc/sys/kernel/pid_max
For example:
echo 65536 > /proc/sys/kernel/pid_max

5. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.


– If no, go to Step 2.
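The following commands are a minimal sketch for estimating the current PID usage on the node; counting the numeric entries under /proc approximates the number of allocated PIDs:

# Number of processes (numeric directories under /proc) versus the configured maximum.
pids=$(ls /proc | grep -c '^[0-9][0-9]*$')
max=$(cat /proc/sys/kernel/pid_max)
echo "PID usage: $((pids * 100 / max))%"   # rough percentage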

Step 2 Check whether the system environment is abnormal.

1. Contact the public cloud O&M personnel to check whether the operating system is abnormal.
– If yes, go to Step 2 to rectify the fault.
– If no, go to Step 3.

2. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.14 ALM-12028 Number of Processes in the D State on the Host Exceeds the Threshold

Description

The system checks the number of processes of user omm that are in the D state on the host every 30 seconds and compares the number with the threshold. This alarm is generated when the number of processes in the D state exceeds the threshold and is cleared when the number is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12028 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.


Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Excessive system resources are used and service processes respond slowly.

Possible Causes

The host responds slowly to I/O (disk I/O and network I/O) requests and a process is in the D state.

Procedure

Step 1 Check the process that is in the D state.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view theIP address of the alarm host in the alarm details.

2. Log in to the alarm node.3. Run the following command to switch the user:

sudo su - rootsu - omm

4. Run the following command to view the PID of the process of user omm that is in the Dstate:ps -elf | grep -v "\[thread_checkio\]" | awk 'NR!=1 {print $2, $3, $4}' | grep omm |awk -F' ' '{print $1, $3}' | grep D | awk '{print $2}'

5. Check whether no command output is displayed.– If yes, the service process is running properly. Go to 1.7.– If no, go to 1.6.

6. Switch to user root and run the reboot command to restart the alarm host.Restarting the host is risky. Ensure that the service process runs properly after the restart.

7. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.
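As a simpler equivalent of the check in 4, assuming the standard procps ps, the following lists the PIDs of omm processes currently in the D state:

# state column "D" marks uninterruptible sleep (usually waiting on I/O)
ps -eo state,user,pid | awk '$1 == "D" && $2 == "omm" {print $3}'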

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

MapReduce ServiceUser Guide 6 MRS Manager Operation Guide

Issue 01 (2018-09-06) 182

Page 193: User Guide · MapReduce Service

6.7.15 ALM-12031 User omm or Password Is About to Expire

Description

At 00:00 every day, the system starts checking whether user omm and its password are about to expire every 8 hours. This alarm is generated when the user or password is going to expire in 15 days.

It is cleared when the validity period of user omm is changed or the password is reset and the alarm handling is complete.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12031 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The node trust relationship is unavailable and Manager cannot manage services.

Possible Causes

User omm or its password is about to expire.

Procedure

Step 1 Check whether user omm and its password in the system are valid.

1. Log in to the faulty node.
2. Run the following command to view information about user omm and its password:
chage -l omm
3. Check whether the user and password are about to expire based on the system message.

a. View the value of Password expires to check whether the password is about to expire.


b. View the value of Account expires to check whether the user is about to expire.

NOTE

If the parameter value is never, the user and password are valid permanently; if the value is a date, check whether the user and password are going to expire within 15 days.

– If yes, go to Step 1.4.
– If no, go to Step 2.

4. Modify the validity period:
– Run the following command to set a validity period for user omm:
chage -E 'specified date' omm
– Run the following command to set the number of validity days for user omm:
chage -M 'number of days' omm
5. Check whether the alarm is cleared automatically in the next periodic check.

– If yes, no further action is required.
– If no, go to Step 2.
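A concrete instance of the commands in 4; the date and day count below are placeholders to be chosen according to your security policy:

chage -E '2025-12-31' omm    # account valid until this date (illustrative)
chage -M 90 omm              # password valid for 90 days (illustrative)
chage -l omm                 # verify the new settings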

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.16 ALM-12032 User ommdba or Password Is About to Expire

Description

At 00:00 every day, the system starts checking whether user ommdba and its password are about to expire every 8 hours. This alarm is generated when the user or password is going to expire in 15 days.

It is cleared when the validity period of user ommdba is changed or the password is reset and the alarm handling is complete.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12032 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The OMS database cannot be managed and data cannot be accessed.

Possible Causes

User ommdba or its password is about to expire.

Procedure

Step 1 Check whether user ommdba and its password in the system are valid.

1. Log in to the faulty node.
2. Run the following command to view information about user ommdba and its password:
chage -l ommdba
3. Check whether the user and password are about to expire based on the system message.

a. View the value of Password expires to check whether the password is about to expire.

b. View the value of Account expires to check whether the user is about to expire.

NOTE

If the parameter value is never, the user and password are valid permanently; if the value is a date, check whether the user and password are going to expire within 15 days.

– If yes, go to Step 1.4.
– If no, go to Step 2.

4. Modify the validity period configuration:
– Run the following command to set a validity period for user ommdba:
chage -E 'specified date' ommdba
– Run the following command to set the number of validity days for user ommdba:
chage -M 'number of days' ommdba
5. Check whether the alarm is cleared automatically in the next periodic check.

– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Collect fault information.


1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.17 ALM-12033 Slow Disk Fault

Description

The system runs the iostat command every second to monitor the disk I/O indicator. This alarm is generated when the svctm value exceeds 100 ms more than 30 times in 60 seconds, which indicates that the disk is faulty.

This alarm is automatically cleared after the disk is replaced.
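To spot-check the same indicator manually, extended disk statistics can be sampled with iostat; note that the svctm column is only reported by older sysstat releases (it has been removed from newer ones):

# Five one-second samples of extended device statistics
iostat -x 1 5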

Attribute

Alarm ID Alarm Severity Automatically Cleared

12033 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

DiskName Specifies the disk for which the alarm is generated.

Impact on the System

Service performance and service processing capabilities deteriorate. For example, DBService active/standby synchronization is affected and the service becomes unavailable.

Possible Causes

The disk is aged or has bad sectors.


Procedure

Contact the public cloud O&M personnel and send the collected log information.

Related Information

N/A

6.7.18 ALM-12034 Periodic Backup Failure

Description

This alarm is generated when a periodic backup task fails to be executed and is cleared when the next backup task is executed successfully.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12034 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

TaskName Specifies the task.

Impact on the System

No backup package is available, so the system cannot be restored if faults occur.

Possible Causes

The alarm cause depends on the task details. Handle the alarm according to the logs and alarm details.

Procedure

Contact the public cloud O&M personnel and send the collected log information.


Related Information

N/A

6.7.19 ALM-12035 Unknown Data Status After Recovery Task Failure

Description

If a recovery task fails, the system automatically rolls back. If the rollback fails, data may be lost. When this occurs, an alarm is generated. This alarm is cleared when the next recovery task is executed successfully.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12035 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

TaskName Specifies the task.

Impact on the System

Data may be lost or the data status may be unknown, both of which may affect services.

Possible Causes

The alarm cause depends on the task details. Handle the alarm according to the logs and alarm details.

Procedure

Contact the public cloud O&M personnel and send the collected log information.

Related Information

N/A


6.7.20 ALM-12037 NTP Server Is Abnormal

Description

This alarm is generated when the NTP server is abnormal and is cleared after the NTP server recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12037 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the IP address of the NTP server for which the alarm is generated.

Impact on the System

If the NTP server configured on the active OMS node is abnormal, the active OMS node fails to synchronize time with the NTP server and a time offset may be generated in the cluster.

Possible Causes

l The NTP server network is faulty.
l The NTP server authentication fails.
l The NTP server time cannot be obtained.
l The time obtained from the NTP server is not being continuously updated.

Procedure

Step 1 Check the NTP server network.

1. On the MRS Manager portal, view the real-time alarm list and locate the target alarm.
2. In the Alarm Details area, view the additional information to check whether the NTP server is successfully pinged.
– If yes, go to Step 2.
– If no, go to Step 1.3.


3. Contact the public cloud O&M personnel to check the network configuration and ensure that the network between the NTP server and the active OMS node is in the normal state. Then, check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Check whether the NTP server authentication fails.

1. Log in to the active management node.

2. Run the ntpq -np command to check whether the NTP server authentication fails. If refid of the NTP server is .AUTH., the authentication fails.

– If yes, go to Step 5.

– If no, go to Step 3.
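For reference, the check in Step 2 looks like the following; in the peer table that ntpq prints, the refid column reads .AUTH. on authentication failure, and a leading * marks the peer currently selected for synchronization:

# Numeric, non-resolving peer listing
ntpq -np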

Step 3 Check whether the time can be obtained from the NTP server.

1. View the additional information of the alarm to check whether the time can be obtained from the NTP server.

– If yes, go to Step 4.

– If no, go to Step 3.2.

2. Contact the public cloud O&M personnel to rectify the NTP server fault. After the NTP server is in the normal state, check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 4.

Step 4 Check whether the time obtained from the NTP server is being continuously updated.

1. View the additional information of the alarm to check whether the time obtained from the NTP server is being continuously updated.

– If yes, go to Step 5.

– If no, go to Step 4.2.

2. Contact the provider of the NTP server to rectify the NTP server fault. After the NTP server is in the normal state, check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 5.

Step 5 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.21 ALM-12038 Monitoring Indicator Dump Failure

Description

This alarm is generated when dump fails after monitoring indicator dump is configured on MRS Manager and is cleared when dump is successful.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12038 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The upper-layer management system fails to obtain monitoring indicators from the MRS Manager system.

Possible Causes

l The server cannot be connected.
l The save path on the server cannot be accessed.
l The monitoring indicator file fails to be uploaded.

Procedure

Step 1 Contact the public cloud O&M personnel to check whether the network connection between the MRS Manager system and the server is in the normal state.
l If yes, go to Step 3.
l If no, go to Step 2.

Step 2 Contact the public cloud O&M personnel to restore the network and check whether the alarm is cleared.
l If yes, no further action is required.
l If no, go to Step 3.


Step 3 Choose System > Configure Monitoring Metric Dump and check whether the FTP username, password, port, dump mode, and public key that are configured on the configuration page for monitoring indicator dumping are consistent with those on the server.

l If yes, go to Step 5.

l If no, go to Step 4.

Step 4 Enter the correct configuration information, click OK, and check whether the alarm is cleared.

l If yes, no further action is required.

l If no, go to Step 5.

Step 5 Choose System > Configure Monitoring Metric Dump and check the configuration items, including FTP Username, Save Path, and Dump Mode.

l If the dumping mode is FTP, go to Step 6.

l If the dumping mode is SFTP, go to Step 7.

Step 6 Log in to the server in FTP mode. In the default path, check whether the user specified by FTP Username has the read and write permissions on the relative path Save Path.

l If yes, go to Step 9.

l If no, go to Step 8.

Step 7 Log in to the server in SFTP mode. In the default path, check whether the user specified by FTP Username has the read and write permissions on the absolute path Save Path.

l If yes, go to Step 9.

l If no, go to Step 8.

Step 8 Add the read and write permission and check whether the alarm is cleared.

l If yes, no further action is required.

l If no, go to Step 9.

Step 9 Log in to the server and check whether the save path has sufficient disk space.

l If yes, go to Step 11.

l If no, go to Step 10.

Step 10 Delete any unnecessary files or go to the configuration page for monitoring indicator dumping to change the save path. Check whether the alarm is cleared.

l If yes, no further action is required.

l If no, go to Step 11.

Step 11 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.22 ALM-12039 GaussDB Data Is Not Synchronized

Description

The system checks the data synchronization status between the active and standby GaussDBs every 10 seconds. This alarm is generated when the synchronization status cannot be queried six times consecutively or when the synchronization status is abnormal.

This alarm is cleared when data synchronization is normal.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12039 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Local GaussDB HA IP Specifies the HA IP address of the local GaussDB.

Peer GaussDB HA IP Specifies the HA IP address of the peer GaussDB.

SYNC_PERSENT Specifies the synchronization percentage.

Impact on the System

If the active instance becomes abnormal while data is not synchronized between the active and standby GaussDBs, data may be lost or abnormal.

Possible Causes

l The network between the active and standby nodes is unstable.

l The standby GaussDB is abnormal.

l The disk space of the standby node is full.


Procedure

Step 1 Log in to MRS Manager, click Alarm, locate the row that contains the alarm, and view the IP address of the standby GaussDB in the alarm details.

Step 2 Log in to the active management node.

Step 3 Run the following command to check whether the standby GaussDB is reachable:

ping heartbeat IP address of the standby GaussDB

l If yes, go to Step 6.
l If no, go to Step 4.

Step 4 Contact the public cloud O&M personnel to check whether the network is faulty.
l If yes, go to Step 5.
l If no, go to Step 6.

Step 5 Rectify the network fault and check whether the alarm is cleared from the alarm list.
l If yes, no further action is required.
l If no, go to Step 6.

Step 6 Log in to the standby GaussDB node.

Step 7 Run the following command to switch the user:

sudo su - root

su - omm

Step 8 Go to the ${BIGDATA_HOME}/om-0.0.1/sbin/ directory.

Run the following command to check whether the resource status of the standby GaussDB is normal:

sh status-oms.sh

In the command output, check whether the following information is displayed in the row where ResName is gaussDB:
10_10_10_231 gaussDB Standby_normal Normal Active_standby

l If yes, go to Step 9.
l If no, go to Step 15.

Step 9 Log in to the standby GaussDB node.

Step 10 Run the following command to switch the user:

sudo su - root

su - omm

Step 11 Run the echo ${BIGDATA_DATA_HOME}/dbdata_om command to obtain the GaussDB data directory.

Step 12 Run the df -h command to check the system disk partition usage.

Step 13 Check whether the disk on which the GaussDB data directory is mounted is full.


l If yes, go to Step 14.
l If no, go to Step 15.
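Steps 11 to 13 can be combined into a single check; this sketch assumes the BIGDATA_DATA_HOME environment variable is set in the current shell, as it is on MRS nodes:

# Usage of the filesystem holding the GaussDB data directory
df -h "${BIGDATA_DATA_HOME}/dbdata_om"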

Step 14 Contact the public cloud O&M personnel to expand the disk capacity. After capacity expansion, wait 2 minutes and check whether the alarm is cleared.
l If yes, no further action is required.
l If no, go to Step 15.

Step 15 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.23 ALM-12040 Insufficient System Entropy

Description

At 00:00:00 every day, the system checks the entropy five times consecutively. First, the system checks whether either the rng-tools or haveged tool is enabled and correctly configured. If not, the system checks the current entropy. This alarm is generated when the entropy is less than 500 in the five checks.

This alarm is cleared in any of the following scenarios:

l True random number mode is configured.
l Random numbers are configured in pseudo-random number mode.
l Neither the true random number mode nor pseudo-random number mode is configured, but the entropy is greater than or equal to 500 in at least one of the five checks.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12040 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.


HostName Specifies the host for which the alarm is generated.

Impact on the System

Decryption failures occur and functions related to decryption are affected, for example, DBService installation.

Possible Causes

The haveged or rngd service is abnormal.

Procedure

Step 1 On the MRS Manager portal, click Alarm.

Step 2 View detailed alarm information to obtain the value of the HostName field.

Step 3 Log in to the node for which the alarm is generated. Run the sudo su - root command to switch the user.

Step 4 Run the /bin/rpm -qa | grep -w "haveged" command. If the command is executed successfully, run the /sbin/service haveged status | grep "running" command and view the command output.
l If the command is executed successfully, the haveged service is correctly installed and configured, and is running properly. Go to Step 8.
l If the command is not executed successfully, the haveged service is not running properly. Go to Step 5.

Step 5 Run the /bin/rpm -qa | grep -w "rng-tools" command. If the command is executed successfully, run the ps -ef | grep -v "grep" | grep rngd | tr -d " " | grep "\-o/dev/random" | grep "\-r/dev/urandom" command and view the command output.
l If the command is executed successfully, the rngd service is correctly installed and configured, and is running properly. Go to Step 8.
l If the command is not executed successfully, the rngd service is not running properly. Go to Step 6.

Step 6 Manually configure the system entropy. For details, see Related Information.

Step 7 Wait until 00:00:00, at which time the system checks the entropy again. Check whether the alarm is cleared automatically.
l If yes, no further action is required.
l If no, go to Step 8.

Step 8 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

Manually check the system entropy.

Log in to the node and run the sudo su - root command to switch the user. Run the cat /proc/sys/kernel/random/entropy_avail command to check whether the system entropy is greater than or equal to 500. If the system entropy is less than 500, you can reset it by using one of the following methods:

l Using the haveged tool (true random number mode): Contact the public cloud O&M personnel to install the tool and start it.
l Using the rng-tools tool (pseudo-random number mode): Contact the public cloud O&M personnel to install the tool.
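A quick manual sample of the entropy pool, mirroring the five consecutive checks the alarm performs:

# Values below 500 in all samples would trigger this alarm
for i in 1 2 3 4 5; do cat /proc/sys/kernel/random/entropy_avail; sleep 1; done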

6.7.24 ALM-13000 ZooKeeper Service Unavailable

Description

The system checks the ZooKeeper service status every 30 seconds. This alarm is generated when the ZooKeeper service is unavailable and is cleared when the ZooKeeper service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

13000 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

l ZooKeeper fails to provide coordination services for upper-layer components.
l Components dependent on ZooKeeper may not run properly.

Possible Causes

l A ZooKeeper instance is abnormal.
l The disk capacity is insufficient.


l The network is faulty.
l The DNS is installed on the ZooKeeper node.

Procedure

Check the ZooKeeper service instance status.

Step 1 On MRS Manager, choose Service > ZooKeeper > quorumpeer.

Step 2 Check whether the ZooKeeper instances are normal.
l If yes, go to Step 6.
l If no, go to Step 3.

Step 3 Select instances whose status is not good and choose More > Restart Instance.

Step 4 Check whether the instance status is good after restart.
l If yes, go to Step 5.
l If no, go to Step 19.

Step 5 On the Alarm tab, check whether the alarm is cleared.
l If yes, no further action is required.
l If no, go to Step 6.

Check the disk status.

Step 6 On MRS Manager, choose Service > ZooKeeper > quorumpeer, and check the host information of the ZooKeeper instance on each node.

Step 7 On MRS Manager, click Host.

Step 8 In the Disk Usage column, check whether the disk space of each node that contains ZooKeeper instances is insufficient (where disk usage exceeds 80%).
l If yes, go to Step 9.
l If no, go to Step 11.

Step 9 Expand disk capacity. For details, see ALM-12017 Insufficient Disk Capacity.

Step 10 On the Alarm tab, check whether the alarm is cleared.
l If yes, no further action is required.
l If no, go to Step 11.

Check the network status.

Step 11 On the Linux node that contains the ZooKeeper instance, run the ping command to check whether the host names of other nodes that contain the ZooKeeper instance can be pinged successfully.
l If yes, go to Step 15.
l If no, go to Step 12.

Step 12 Modify /etc/hosts to add the host name and IP address mappings (example entries are shown below).
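For illustration only, mapping entries in /etc/hosts take the following form; the host names and IP addresses below are hypothetical and must be replaced with the real values of your ZooKeeper nodes:

# IP address    host name (hypothetical examples)
192.168.0.11  node-ana-core0001
192.168.0.12  node-ana-core0002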

Step 13 Run the ping command again to check whether the host names of other nodes that contain the ZooKeeper instance can be pinged successfully.


l If yes, go to Step 14.

l If no, go to Step 19.

Step 14 On the Alarm tab, check whether the alarm is cleared.

l If yes, no further action is required.

l If no, go to Step 15.

Check the DNS.

Step 15 Check whether the DNS is installed on the node that contains the ZooKeeper instance. On the Linux node that contains the ZooKeeper instance, run the cat /etc/resolv.conf command to check whether the file is empty.

l If yes, go to Step 16.

l If no, go to Step 19.

Step 16 Run the service named status command to check whether the DNS is started.

l If yes, go to Step 17.

l No, go to Step 19.

Step 17 Run the service named stop command to stop the DNS service. If "Shutting down name server BIND waiting for named to shut down (28s)" is displayed, the DNS service is stopped successfully. Comment out any content in /etc/resolv.conf.

Step 18 On the Alarm tab, check whether the alarm is cleared.

l If yes, no further action is required.

l If no, go to Step 19.

Step 19 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.25 ALM-13001 Available ZooKeeper Connections Are Insufficient

Description

The system checks ZooKeeper connections every 30 seconds. This alarm is generated when the system detects that the number of used ZooKeeper instance connections exceeds the threshold (80% of the maximum connections).

This alarm is cleared when the number of used ZooKeeper instance connections is less than the threshold.


Attribute

Alarm ID Alarm Severity Automatically Cleared

13001 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Available ZooKeeper connections are insufficient. When the connection usage reaches 100%, external connections cannot be handled.

Possible Causes

l The number of connections to the ZooKeeper node exceeds the threshold.
l Connection leakage occurs on some connection processes.
l The maximum number of connections does not meet the requirement of the actual scenario.

Procedure

Step 1 Check the connection status.

1. On the MRS Manager portal, choose Alarm > ALM-13001 Available ZooKeeper Connections Are Insufficient > Location. Check the IP address of the alarm node.

2. Obtain the PID of the ZooKeeper process. Log in to the alarm node and run the pgrep -f proc_zookeeper command.

3. Check whether the PID can be successfully obtained.
– If yes, go to Step 1.4.
– If no, go to Step 2.

4. Obtain all the IP addresses connected to the ZooKeeper instance and the number of connections, and check the 10 IP addresses with the most connections. Run the following command based on the obtained PID and IP address:
lsof -i | grep $pid | awk '{print $9}' | cut -d : -f 2 | cut -d \> -f 2 | awk '{a[$1]++} END {for(i in a){print i,a[i] | "sort -r -g -k 2"}}' | head -10
($pid is the PID obtained in the preceding step.)

5. Check whether the node IP addresses and the number of connections are successfully obtained.

– If yes, go to Step 1.6.

– If no, go to Step 2.

6. Obtain the IDs of the ports connected to the process. Run the following command based on the obtained PID and IP address:
lsof -i | grep $pid | awk '{print $9}' | cut -d \> -f 2 | grep $IP | cut -d : -f 2
($pid and $IP are the PID and IP address obtained in the preceding step.)

7. Check whether the port ID is successfully obtained.

– If yes, go to Step 1.8.

– If no, go to Step 2.

8. Obtain the ID of the connected process. Log in to each IP address and run the following command based on the obtained port ID:
lsof -i | grep $port
($port is the port ID obtained in the preceding step.)

9. Check whether the process ID is successfully obtained.

– If yes, go to Step 1.10.

– If no, go to Step 2.

10. Check whether connection leakage occurs on the process based on the obtained process ID.

– If yes, go to Step 1.11.

– If no, go to Step 2.

11. Close the process where connection leakage occurs and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

12. On the MRS Manager portal, choose Service > ZooKeeper > Service Configuration > All > quorumpeer > Performance and change the value of maxCnxns to 20000 or more.

13. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.
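As a condensed form of the pipeline in 4, the following sketch counts established connections per remote IP for the ZooKeeper process (assuming $pid holds the PID obtained with pgrep -f proc_zookeeper):

# Top 10 remote IPs by connection count for the given PID
lsof -nP -i | grep "$pid" | awk '{print $9}' | cut -d'>' -f2 | cut -d: -f1 | sort | uniq -c | sort -rn | head -10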

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.26 ALM-13002 ZooKeeper Heap Memory or Direct Memory Usage Exceeds the Threshold

Description

The system checks the memory usage of the ZooKeeper service every 30 seconds. This alarm is generated when the memory usage of a ZooKeeper instance exceeds the threshold (80% of the maximum memory).

The alarm is cleared when the memory usage is less than the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

13002 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

If the available memory for the ZooKeeper service is insufficient, a memory overflow occurs and the service breaks down.

Possible Causes

l The ZooKeeper instance on the node uses too much memory.
l The memory is improperly allocated.

Procedure

Step 1 Check the memory usage.

1. On the MRS Manager portal, choose Alarm > ALM-13002 ZooKeeper Memory Usage Exceeds the Threshold > Location. Check the IP address of the instance that generated the alarm.


2. On the MRS Manager portal, choose Service > ZooKeeper > Instance > quorumpeer (the IP address checked) > Customize > Heap and Direct Memory of ZooKeeper. Check the heap usage.

3. Check whether the used heap memory of ZooKeeper reaches 80% of the maximum heap memory specified for ZooKeeper.

– If yes, go to Step 1.4.

– If no, go to Step 1.6.

4. On the MRS Manager portal, choose Service > ZooKeeper > Service Configuration > All > quorumpeer > System. Increase the value of -Xmx in GC_OPTS as required.

5. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 1.6.

6. On the MRS Manager portal, choose Service > ZooKeeper > Instance > quorumpeer (the IP address checked) > Customize > Heap and Direct Memory of ZooKeeper. Check the direct buffer memory usage.

7. Check whether the used direct buffer memory of ZooKeeper reaches 80% of the maximum direct buffer memory specified for ZooKeeper.

– If yes, go to Step 1.8.

– If no, go to Step 2.

8. On the MRS Manager portal, choose Service > ZooKeeper > Service Configuration > All > quorumpeer > System. Increase the value of -XX:MaxDirectMemorySize in GC_OPTS as required.

9. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.
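For orientation only, a GC_OPTS value adjusted in 4 and 8 might take the following shape; the heap and direct memory sizes below are illustrative placeholders, not recommendations:

# Illustrative fragment; size to your cluster's actual needs
GC_OPTS="-Xms2G -Xmx4G -XX:MaxDirectMemorySize=512M"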

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.27 ALM-14000 HDFS Service Unavailable

Description

The system checks the service status of NameService every 30 seconds. This alarm is generated when the HDFS service becomes unavailable because all NameService services are abnormal.

This alarm is cleared when the HDFS service recovers because at least one NameService service is in the normal state.


Attribute

Alarm ID Alarm Severity Automatically Cleared

14000 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

HDFS fails to provide services for HDFS service-based upper-layer components, such as HBase and MapReduce. As a result, users cannot read or write files.

Possible Causes

l The ZooKeeper service is abnormal.
l All NameService services are abnormal.

Procedure

Step 1 Check the ZooKeeper service status.

1. Log in to MRS Manager, choose Service, and check whether the health status of the ZooKeeper service is Good.
– If yes, go to Step 1.2.
– If no, go to Step 2.1.

2. Rectify the health status. For details, see ALM-13000 ZooKeeper Service Unavailable. Check whether the health status of the ZooKeeper service is Good.
– If yes, go to Step 1.3.
– If no, go to Step 3.

3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Handle the NameService service exception alarm.

1. Log in to MRS Manager. On the Alarm page, check whether all NameService services have abnormal alarms.


– If yes, go to Step 2.2.
– If no, go to Step 3.

2. See ALM-14010 NameService Service Is Abnormal to handle abnormal NameService services and check whether each alarm is cleared.
– If yes, go to Step 2.3.
– If no, go to Step 3.

3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.28 ALM-14001 HDFS Disk Usage Exceeds the Threshold

Description

The system checks the disk usage of the HDFS cluster every 30 seconds and compares it with the threshold. This alarm is generated when the HDFS disk usage exceeds the threshold and is cleared when the usage is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14001 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NSName Specifies the NameService service for which the alarm is generated.


Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The performance of writing data to HDFS is affected.

Possible Causes

The disk space configured for the HDFS cluster is insufficient.

Procedure

Step 1 Check the disk capacity and delete unnecessary files.

1. On the MRS Manager portal, choose Service > HDFS. The Service Status page is displayed.

2. In the Real-Time Statistics area, view the value of the monitoring indicator Percentage of HDFS Capacity to check whether the HDFS disk usage exceeds the threshold.

– If yes, go to Step 1.3.

– If no, go to Step 3.

3. Use the client on the cluster node and run the hdfs dfsadmin -report command to check whether the value of DFS Used% is less than 100% minus the threshold.

– If yes, go to Step 1.5.

– If no, go to Step 3.

4. Use the client on the cluster node and run the hdfs dfs -rm -r file or directory command to delete unnecessary files.

5. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.1.
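Putting 3 and 4 together, a minimal sketch; the directory in the delete command is a hypothetical example and must be replaced with a path you have confirmed is unneeded:

hdfs dfsadmin -report | grep "DFS Used%"    # overall HDFS usage
hdfs dfs -rm -r /tmp/obsolete_dataset       # remove an unneeded path (hypothetical example)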

Step 2 Expand the system.

1. Expand the disk capacity.

2. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.29 ALM-14002 DataNode Disk Usage Exceeds the Threshold

Description

The system checks the DataNode disk usage every 30 seconds and compares it with the threshold. This alarm is generated when the value of Percentage of DataNode Capacity exceeds the threshold and is cleared when the value is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14002 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The performance of writing data to HDFS is affected.

Possible Causes

l The disk space configured for the HDFS cluster is insufficient.
l Data skew occurs among DataNodes.

Procedure

Step 1 Check the cluster disk capacity.

1. Log in to MRS Manager. On the Alarm page, check whether alarm ALM-14001 HDFS Disk Usage Exceeds the Threshold exists.


– If yes, go to Step 1.2.
– If no, go to Step 2.1.

2. Follow the procedures in ALM-14001 HDFS Disk Usage Exceeds the Threshold to handle the alarm and check whether the alarm is cleared.
– If yes, go to Step 1.3.
– If no, go to Step 3.

3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check the balance status of DataNodes.

1. Use the client on the cluster node and run the hdfs dfsadmin -report command to view the value of DFS Used% on the DataNode that generated the alarm. Compare this value with those on other DataNodes and check whether the difference between the values is greater than 10.
– If yes, go to Step 2.2.
– If no, go to Step 3.

2. If data skew occurs, use the client on the cluster node and run the hdfs balancer -threshold 10 command.

3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.
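The balancer can run for a long time on a large cluster, so it is commonly started in the background; a minimal sketch (the log path is arbitrary):

nohup hdfs balancer -threshold 10 > /tmp/balancer.log 2>&1 &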

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.30 ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold

Description

The system checks the number of lost blocks every 30 seconds and compares it with the threshold. This alarm is generated when the number of lost blocks exceeds the threshold and is cleared when the number is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14003 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NSName Specifies the NameService service for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Data stored in HDFS is lost. HDFS may enter the safe mode and cannot provide write services. Lost block data cannot be restored.

Possible Causes

l The DataNode instance is abnormal.
l Data is deleted.

Procedure

Step 1 Check the DataNode instance.

1. On the MRS Manager portal, choose Service > HDFS > Instance.
2. Check whether the status of all DataNode instances is Good.

– If yes, go to Step 3.
– If no, go to Step 1.3.

3. Restart the DataNode instance. Check whether the DataNode instance restarts successfully.
– If yes, go to Step 2.2.
– If no, go to Step 2.1.

Step 2 Delete the damaged file.

1. Use the client on the cluster node. Run the hdfs fsck / -delete command to delete the lost file. Then rewrite the file and recover the data.

2. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.
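To see exactly which files are affected before deleting them, fsck can list the corrupt file blocks first; a minimal sketch using standard fsck options:

hdfs fsck / -list-corruptfileblocks    # list files with missing or corrupt blocks
hdfs fsck / -delete                    # then remove the affected files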


Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.31 ALM-14004 Number of Damaged HDFS Blocks Exceeds the Threshold

Description

The system checks the number of damaged blocks every 30 seconds and compares it with the threshold. This alarm is generated when the number of damaged blocks exceeds the threshold and is cleared when the number is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14004 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NSName Specifies the NameService service for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Data is damaged and HDFS fails to read files.


Possible Causes

l The DataNode instance is abnormal.
l Data verification information is damaged.

Procedure

Contact the public cloud O&M personnel and send the collected log information.

Related Information

N/A

6.7.32 ALM-14006 Number of HDFS Files Exceeds the Threshold

Description

The system checks the number of HDFS files every 30 seconds and compares it with the threshold. This alarm is generated when the number of HDFS files exceeds the threshold and is cleared when the number is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14006 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NSName Specifies the NameService service for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Disk storage space is insufficient, which may result in data import failure. The performance of the HDFS system is affected.


Possible Causes

The number of HDFS files exceeds the threshold.

Procedure

Step 1 Check whether unnecessary files exist in the system.

1. Use the client on the cluster node and run the hdfs dfs -ls file or directory command to check whether the files in the directory can be deleted.
– If yes, go to Step 1.2.
– If no, go to Step 2.1.

2. Run the hdfs dfs -rm -r file or directory command. Delete unnecessary files, wait 5 minutes, and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.
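Before deleting, it can help to locate where the file count concentrates; hdfs dfs -count prints the directory count, file count, content size, and path (the paths below are examples):

# The second output column is the file count per directory
hdfs dfs -count /user/* /tmp/*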

Step 2 Check the number of files in the system.

1. On the MRS Manager portal, choose System > Configure Alarm Threshold.
2. In the navigation tree on the left, choose Service > HDFS > HDFS File > Total Number of Files.
3. In the right pane, modify the threshold in the rule based on the number of current HDFS files.
To check the number of HDFS files, choose Service > HDFS, click Customize in the Real-Time Statistics area on the right, and select the HDFS File monitoring item.

4. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.33 ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold

Description

The system checks the HDFS NameNode memory usage every 30 seconds and compares it with the threshold. This alarm is generated when the HDFS NameNode memory usage exceeds the threshold and is cleared when it is less than or equal to the threshold.


Attribute

Alarm ID Alarm Severity Automatically Cleared

14007 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The HDFS NameNode memory usage is too high, which affects the data read/write performance of HDFS.

Possible Causes

The HDFS NameNode memory is insufficient.

Procedure

Step 1 Delete unnecessary files.

1. Use the client on the cluster node and run the hdfs dfs -rm -r file or directory command to delete unnecessary files.

2. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.34 ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold

Description

The system checks the HDFS DataNode memory usage every 30 seconds and compares it with the threshold. This alarm is generated when the HDFS DataNode memory usage exceeds the threshold and is cleared when it is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14008 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The HDFS DataNode memory usage is too high, which affects the data read/write performance of HDFS.

Possible Causes

The HDFS DataNode memory is insufficient.

Procedure

Step 1 Delete unnecessary files.

1. Use the client on the cluster node and run the hdfs dfs -rm -r file or directory command to delete unnecessary files.


2. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.35 ALM-14009 Number of Dead DataNodes Exceeds the Threshold

Description

The system checks the number of faulty DataNodes in the HDFS cluster every 30 seconds and compares it with the threshold. This alarm is generated when the number of faulty DataNodes in the HDFS cluster exceeds the threshold and is cleared when the number is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14009 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.


Impact on the System

Faulty DataNodes cannot provide HDFS services.

Possible Causes

l DataNodes are faulty or overloaded.
l The network between the NameNode and the DataNode is disconnected or busy.
l NameNodes are overloaded.

Procedure

Step 1 Check whether DataNodes are faulty.

1. Use the client on the cluster node and run the hdfs dfsadmin -report command to check whether DataNodes are faulty.
– If yes, go to Step 1.2.
– If no, go to Step 2.1.

2. On the MRS Manager portal, choose Service > HDFS > Instance to check whether any DataNode is stopped.
– If yes, go to Step 1.3.
– If no, go to Step 2.1.

3. Select the DataNode instance, and choose More > Restart Instance to restart it. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.
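For a quick summary of DataNode liveness as recorded by the NameNode (the exact wording of the report varies slightly across Hadoop versions):

hdfs dfsadmin -report | grep -iE "live datanodes|dead datanodes"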

Step 2 Check the status of the network between the NameNode and the DataNode.

1. Log in to the faulty DataNode using its service IP address. Run the ping IP address of the NameNode command to check whether the network between the DataNode and the NameNode is abnormal.
– If yes, go to Step 2.2.
– If no, go to Step 3.1.

2. Rectify the network fault. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.

Step 3 Check whether the DataNode is overloaded.

1. On the MRS Manager portal, click Alarm and check whether alarm ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold exists.
– If yes, go to Step 3.2.
– If no, go to Step 4.1.

2. Follow the procedures in ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold to handle the alarm and check whether the alarm is cleared.
– If yes, go to Step 3.3.
– If no, go to Step 4.1.

3. Wait 5 minutes and check whether the alarm is cleared.


– If yes, no further action is required.
– If no, go to Step 4.1.

Step 4 Check whether the NameNode is overloaded.

1. On the MRS Manager portal, click Alarm and check whether alarm ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold exists.
– If yes, go to Step 4.2.
– If no, go to Step 5.

2. Follow the procedures in ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold to handle the alarm and check whether the alarm is cleared.
– If yes, go to Step 4.3.
– If no, go to Step 5.

3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 5.

Step 5 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.36 ALM-14010 NameService Service Is Abnormal

Description

The system checks the NameService service status every 180 seconds. This alarm is generated when the NameService service is unavailable and is cleared when the NameService service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14010 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.


RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NSName Specifies the name of NameService for which the alarm is generated.

Impact on the System

HDFS fails to provide services for upper-layer components based on the NameService service, such as HBase and MapReduce. As a result, users cannot read or write files.

Possible Causes

l The JournalNode is faulty.
l The DataNode is faulty.
l The disk capacity is insufficient.
l The NameNode enters safe mode.

Procedure

Step 1 Check the status of the JournalNode instance.

1. On the MRS Manager portal, click Service.
2. Click HDFS.
3. Click Instance.
4. Check whether the Health Status of the JournalNode is Good.

– If yes, go to Step 2.1.
– If no, go to Step 1.5.

5. Select the faulty JournalNode and choose More > Restart Instance. Check whether the JournalNode successfully restarts.
– If yes, go to Step 1.6.
– If no, go to Step 5.

6. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check the status of the DataNode instance.

1. On the MRS Manager portal, click Service.
2. Click HDFS.
3. In Operation and Health Summary, check whether the Health Status of all DataNodes is Good.


– If yes, go to Step 3.1.

– If no, go to Step 2.4.

4. Click Instance. On the DataNode management page, select the faulty DataNode, and choose More > Restart Instance. Check whether the DataNode successfully restarts.

– If yes, go to Step 2.5.

– If no, go to Step 3.1.

5. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 4.1.

Step 3 Check disk status.

1. On the MRS Manager portal, click Host.

2. In the Disk Usage column, check whether disk space is insufficient.

– If yes, go to Step 3.3.

– If no, go to Step 4.1.

3. Expand the disk capacity.

4. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 4.1.

Step 4 Check whether NameNode is in safe mode.

1. Use the client on the cluster node, and run the hdfs dfsadmin -safemode get command to check whether Safe mode is ON is displayed.
Information after Safe mode is ON is alarm information and is displayed based on actual conditions.

– If yes, go to Step 4.2.

– If no, go to Step 5.

2. Use the client on the cluster node and run the hdfs dfsadmin -safemode leave command.

3. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 5.

Step 5 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.37 ALM-14011 HDFS DataNode Data Directory Is Not Configured Properly

Description

The DataNode parameter dfs.datanode.data.dir specifies DataNode data directories. This alarm is generated in any of the following scenarios:

l A configured data directory cannot be created.

l A data directory uses the same disk as other critical directories in the system.

l Multiple directories use the same disk.

This alarm is cleared when the DataNode data directory is configured properly and this DataNode is restarted.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14011 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

If the DataNode data directory is mounted on critical directories such as the root directory, the disk space of the root directory will be used up after running for a long time. This causes a system fault.

If the DataNode data directory is not configured properly, HDFS performance will deteriorate.

Possible Causes

- The DataNode data directory fails to be created.
- The DataNode data directory uses the same disk as critical directories, such as / or /boot.
- Multiple directories in the DataNode data directory use the same disk.


Procedure

Step 1 Check the alarm cause and information about the DataNode that generated the alarm.

1. On the MRS Manager portal, click Alarm. In the alarm list, click the alarm.
2. In the Alarm Details area, view Alarm Cause. In HostName of Location, obtain the host name of the DataNode that generated the alarm.

Step 2 Delete directories that do not comply with the disk plan from the DataNode data directory.

1. Choose Service > HDFS > Instance. In the instance list, click the DataNode instance on the alarm node.
2. Click Instance Configuration and view the value of the DataNode parameter dfs.datanode.data.dir.
3. Check whether all DataNode data directories are consistent with the disk plan.
– If yes, go to Step 2.4.
– If no, go to Step 2.7.
4. Modify the DataNode parameter dfs.datanode.data.dir and delete the incorrect directories.
5. Choose Service > HDFS > Instance and restart the DataNode instance.
6. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.7.

7. Log in to the DataNode that generated the alarm.
– If the alarm cause is "The DataNode data directory fails to be created", go to Step 3.1.
– If the alarm cause is "The DataNode data directory uses the same disk as critical directories, such as / or /boot", go to Step 4.1.
– If the alarm cause is "Multiple directories in the DataNode data directory use the same disk", go to Step 5.1.

Step 3 Check whether the DataNode data directory is created.

1. Run the following commands to switch the user:
sudo su - root
su - omm
2. Run the ls command to check whether the directories exist in the DataNode data directory.
– If yes, go to Step 7.
– If no, go to Step 3.3.
3. Run the mkdir data directory command to create the directory. Check whether the directory is successfully created.
– If yes, go to Step 6.1.
– If no, go to Step 3.4.

4. On the MRS Manager portal, click Alarm to check whether alarm ALM-12017 Insufficient Disk Capacity exists.
– If yes, go to Step 3.5.
– If no, go to Step 3.6.
5. Adjust the disk capacity and check whether alarm ALM-12017 Insufficient Disk Capacity is cleared. For details, see ALM-12017 Insufficient Disk Capacity.
– If yes, go to Step 3.3.
– If no, go to Step 7.

6. Check whether user omm has the rwx or x permission on all upper-layer directories of the directory. For example, for /tmp/abc/, user omm has the x permission on the tmp directory and the rwx permission on the abc directory.
– If yes, go to Step 6.1.
– If no, go to Step 3.7.
7. Run the chmod u+rwx path or chmod u+x path command as user root to assign the rwx or x permission to user omm. Go to Step 3.3. (A consolidated command sketch follows this step.)
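The checks and fixes in Step 3 are collected below as a single command sequence. This is a sketch only; /srv/BigData/hadoop/data1/dn stands in for one hypothetical entry of dfs.datanode.data.dir and must be replaced with the actual configured path:

# Switch to the cluster's internal user (Step 3.1).
sudo su - root
su - omm
# Check whether the configured data directory exists (Step 3.2).
ls -ld /srv/BigData/hadoop/data1/dn
# If it does not exist, try to create it (Step 3.3).
mkdir /srv/BigData/hadoop/data1/dn
# If creation fails for permission reasons, grant access as user root (Step 3.7).
exit
chmod u+rwx /srv/BigData/hadoop/data1/dn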

Step 4 Check whether the DataNode data directory uses the same disk as other critical directories in the system.

1. Run the df command to obtain the disk mounting information of each directory in the DataNode data directory.
2. Check whether the directories are mounted to critical directories, such as / or /boot.
– If yes, go to Step 4.3.
– If no, go to Step 6.1.
3. Change the value of the DataNode parameter dfs.datanode.data.dir and delete the directories that use the same disk as critical directories.
4. Go to Step 6.1.

Step 5 Check whether multiple directories in the DataNode data directory use the same disk.

1. Run the df command to obtain the disk mounting information of each directory in the DataNode data directory. Record the mount point of each directory in the command output. (A df sketch covering Step 4 and Step 5 follows this step.)
2. Modify the DataNode parameter dfs.datanode.data.dir to retain only one of the directories mounted to the same disk.
3. Go to Step 6.1.
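The disk checks in Step 4 and Step 5 can be performed in one pass with df, as sketched below; the two directory paths are hypothetical examples of dfs.datanode.data.dir entries:

# Show the filesystem and mount point backing each configured data directory.
df -h /srv/BigData/hadoop/data1/dn /srv/BigData/hadoop/data2/dn
# A data directory whose "Mounted on" value is / or /boot shares a disk with a
# critical system directory (Step 4); two data directories with the same
# "Mounted on" value share one disk (Step 5). Both cases require changing
# dfs.datanode.data.dir.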

Step 6 Restart the DataNode and check whether the alarm is cleared.

1. On the MRS Manager portal, choose Service > HDFS > Instance and restart the DataNode instance.
2. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 7.

Step 7 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.38 ALM-14012 HDFS JournalNode Data Is Not Synchronized

Description

On the active NameNode, the system checks data synchronization on all JournalNodes in the cluster every 5 minutes. This alarm is generated when data on a JournalNode is not synchronized with that on other JournalNodes.

This alarm is cleared in 5 minutes after data on the JournalNodes is synchronized.

Attribute

Alarm ID Alarm Severity Automatically Cleared

14012 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

IP Specifies the service IP address of the JournalNode instance for which the alarm is generated.

Impact on the System

If data on more than half of the JournalNodes is not synchronized, the NameNode cannot work correctly, making the HDFS service unavailable.

Possible Causes

- The JournalNode instance has not been started or has been stopped.
- The JournalNode instance is working incorrectly.
- The network of the JournalNode is unreachable.

Procedure

Step 1 Check whether the JournalNode instance has been started.

1. Log in to MRS Manager and click Alarm. In the alarm list, click the alarm.
2. In the Alarm Details area, check Location and obtain the IP address of the JournalNode that generated the alarm.
3. Choose Service > HDFS > Instance. In the instance list, click the JournalNode that generated the alarm and check whether Operating Status of the node is Started.
– If yes, go to Step 2.1.
– If no, go to Step 1.4.

4. Select the JournalNode instance and choose More > Start Instance to start it.
5. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.

Step 2 Check whether the JournalNode instance is working correctly.

1. Check whether Health Status of the JournalNode instance is Good.
– If yes, go to Step 3.1.
– If no, go to Step 2.2.
2. Select the JournalNode instance and choose More > Start Instance to start it.
3. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.

Step 3 Check whether the network of the JournalNode is reachable.

1. On the MRS Manager portal, choose Service > HDFS > Instance to check the service IP address of the active NameNode.
2. Log in to the active NameNode.
3. Run the ping service IP address of the JournalNode command to check whether a timeout occurs or whether the network between the active NameNode and the JournalNode is unreachable (a minimal form of this check is sketched after this step).
– If yes, go to Step 3.4.
– If no, go to Step 4.
4. Contact public cloud O&M personnel to rectify the network fault. Wait 5 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.
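A minimal form of the reachability check in Step 3; 192.168.0.30 is a hypothetical stand-in for the JournalNode's service IP address:

# From the active NameNode, send a few probes to the JournalNode.
ping -c 4 192.168.0.30
# 100% packet loss or "Destination Host Unreachable" in the output indicates
# the unreachable-network condition described above.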

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.39 ALM-16000 Percentage of Sessions Connected to the HiveServer to Maximum Number Allowed Exceeds the Threshold

Description

The system checks the percentage of sessions connected to the HiveServer to the maximum number allowed every 30 seconds. This indicator can be viewed on the Hive service monitoring page. This alarm is generated when the percentage exceeds the specified threshold and is automatically cleared when the percentage is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

16000 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

New connections cannot be created.

Possible Causes

Too many clients are connected to the HiveServer.

Procedure

Step 1 Increase the maximum number of connections to Hive.

1. Log in to the MRS Manager portal.
2. Choose Service > Hive > Service Configuration, and set Type to All.
3. Increase the value of the hive.server.session.control.maxconnections configuration item. Suppose the value of the configuration item is A, the threshold is B, and the number of sessions connected to the HiveServer is C. Adjust the value of the configuration item so that A x B > C (a worked example follows this procedure). The number of sessions connected to the HiveServer can be viewed on the Hive service monitoring page.
4. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.
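A worked instance of the A x B > C rule above, with illustrative numbers only: if the threshold B is 90% and the Hive monitoring page shows C = 100 connected sessions, then hive.server.session.control.maxconnections must satisfy A x 0.9 > 100, that is, A must be at least 112. Setting A to 120, for example, leaves some headroom for the alarm to clear.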

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.40 ALM-16001 Hive Warehouse Space Usage Exceeds the Threshold

Description

The system checks the usage of Hive data warehouse space every 30 seconds. The indicator Percentage of HDFS Space Used by Hive to the Available Space can be viewed on the Hive service monitoring page. This alarm is generated when the usage of Hive warehouse space exceeds the specified threshold and is cleared when the usage is less than or equal to the threshold.

You can reduce the warehouse space usage by expanding the warehouse capacity or releasing used space.

Attribute

Alarm ID Alarm Severity Automatically Cleared

16001 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger condition Generates an alarm when the actual indicator value exceeds the specified threshold.


Impact on the System

The system fails to write data, which causes data loss.

Possible Causes

- The maximum available HDFS capacity for Hive is too small.
- The system disk space is insufficient.
- Data nodes break down.

Procedure

Step 1 Expand the system configuration.

1. Analyze the cluster HDFS capacity usage and increase the maximum available HDFS capacity for Hive.
Log in to MRS Manager, choose Service > Hive > Service Configuration, and set Type to All. Increase the value of the hive.metastore.warehouse.size.percent configuration item. Suppose the value of the configuration item is A, the total HDFS storage space is B, the threshold is C, and the HDFS space used by Hive is D. Adjust the value of the configuration item so that A x B x C > D (a worked example follows this procedure step). The total HDFS storage space can be viewed on the HDFS monitoring page, and the HDFS space used by Hive can be viewed on the Hive service monitoring page.
2. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.
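A worked instance of the A x B x C > D rule above, with made-up values: if the total HDFS storage space B is 10 TB, the alarm threshold C is 80%, and Hive already uses D = 4 TB, then hive.metastore.warehouse.size.percent must satisfy A x 10 TB x 0.8 > 4 TB, that is, A > 0.5 (50%).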

Step 2 Expand the system capacity.

1. Add nodes.
2. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.

Step 3 Check whether data nodes are in the normal state.

1. Log in to MRS Manager and click Alarm.
2. Check whether alarm ALM-12006 Node Fault, ALM-12007 Process Fault, or ALM-14002 DataNode Disk Usage Exceeds the Threshold exists.
– If yes, go to Step 3.3.
– If no, go to Step 4.
3. Follow the procedures in ALM-12006 Node Fault, ALM-12007 Process Fault, or ALM-14002 DataNode Disk Usage Exceeds the Threshold to handle the alarm.
4. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.


2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.41 ALM-16002 Successful Hive SQL Operations Are Lower than the Threshold

Description

Every 30 seconds, the system checks the percentage of successfully executed HiveQL statements. Percentage of successfully executed HiveQL statements = Number of HiveQL statements successfully executed by Hive in a specified period/Total number of HiveQL statements executed by Hive. This indicator can be viewed on the Hive service monitoring page.

This alarm is generated when the percentage of successfully executed HiveQL statements is lower than the specified threshold and is cleared when the percentage is greater than or equal to the threshold.

The name of the host where the alarm is generated can be obtained from the alarm location information. The host IP address is the IP address of the HiveServer node.

Attribute

Alarm ID Alarm Severity Automatically Cleared

16002 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger condition Generates an alarm when the actual indicator value exceeds the specified threshold.


Impact on the System

The system configuration and performance cannot meet service processing requirements.

Possible Causes

- A syntax error occurs in HiveQL commands.
- The HBase service is abnormal when a Hive on HBase task is being performed.
- Basic services that Hive depends on, such as HDFS, Yarn, and ZooKeeper, are abnormal.

Procedure

Step 1 Check whether the HiveQL commands comply with syntax.

1. Use the Hive client to log in to the HiveServer node where the alarm is generated. Query the HiveQL syntax standard provided by Apache, and check whether the HiveQL commands are correct. For details, see https://cwiki.apache.org/confluence/display/hive/languagemanual.
– If yes, go to Step 2.1.
– If no, go to Step 1.2.

NOTE

To view the user who ran an incorrect statement, download the HiveServerAudit logs of the HiveServer node where this alarm is generated. Set Start Time and End Time to 10 minutes before and after the alarm generation time respectively. Open the log file and search for the Result=FAIL keyword to filter the log information about the incorrect statement, and then view the user who ran the incorrect statement according to UserName in the log information.

2. Enter correct HiveQL statements, and check whether they can be properly executed (one way to test a single statement is sketched after this step).
– If yes, go to Step 4.5.
– If no, go to Step 2.1.
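One quick way to verify that a single statement is syntactically valid is to run it non-interactively from the Hive client, as sketched below; this assumes the client environment has already been loaded, and SHOW DATABASES is only a stand-in for the statement under test:

# Run one HiveQL statement; an error message or non-zero exit code points to a
# syntax or service problem.
hive -e "SHOW DATABASES;"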

Step 2 Check whether the HBase service is abnormal.

1. Check whether a Hive on HBase task is being performed.
– If yes, go to Step 2.2.
– If no, go to Step 3.1.
2. Check whether the HBase service is in the normal state in the service list.
– If yes, go to Step 3.1.
– If no, go to Step 2.3.
3. Check the alarms displayed on the alarm page and clear them according to Alarm Help.
4. Enter correct HiveQL statements, and check whether they can be properly executed.
– If yes, go to Step 4.5.
– If no, go to Step 3.1.

Step 3 Check whether the Spark service is abnormal.

1. Check whether the Spark service is in the normal state in the service list.
– If yes, go to Step 4.1.
– If no, go to Step 3.2.
2. Check the alarms displayed on the alarm page and clear them according to Alarm Help.
3. Enter correct HiveQL statements, and check whether the commands can be properly executed.

– If yes, go to Step 4.5.

– If no, go to Step 4.1.

Step 4 Check whether HDFS, Yarn, and ZooKeeper are in the normal state.

1. On the MRS Manager portal, click Service.

2. In the service list, check whether services such as HDFS, Yarn, and ZooKeeper are in the normal state.

– If yes, go to Step 4.5.

– If no, go to Step 4.3.

3. Check the alarms displayed on the alarm page and clear them according to Alarm Help.

4. Enter correct HiveQL statements, and check whether the commands can be properly executed.

– If yes, go to Step 4.5.

– If no, go to Step 5.

5. Wait 5 minutes and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 5.

Step 5 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.42 ALM-16004 Hive Service Unavailable

Description

The system checks the Hive service status every 30 seconds. This alarm is generated when the Hive service is unavailable and is cleared when the Hive service is in the normal state.

Attribute

Alarm ID Alarm Severity Automatically Cleared

16004 Critical Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The system cannot provide data loading, query, and extraction services.

Possible Causes

- The Hive service unavailability may be related to basic services, such as ZooKeeper, HDFS, Yarn, and DBService, or caused by faults of the Hive processes.
  – The ZooKeeper, HDFS, Yarn, or DBService services are abnormal.
  – The Hive service process is faulty. If the alarm is caused by a Hive process fault, the alarm report has a delay of about 5 minutes.
- The network communication between the Hive service and basic services is interrupted.

Procedure

Step 1 Check the HiveServer/MetaStore process status.

1. On MRS Manager, choose Service > Hive > Instance. In the Hive instance list, check whether all HiveServer/MetaStore instances are in the Unknown state.
– If yes, go to Step 1.2.
– If no, go to Step 2.
2. Above the Hive instance list, choose More > Restart Instance to restart the HiveServer/MetaStore process.
3. In the alarm list, check whether alarm ALM-16004 Hive Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Check the ZooKeeper service status.

1. In the alarm list on MRS Manager, check whether alarm ALM-12007 Process Fault is generated.
– If yes, go to Step 2.2.
– If no, go to Step 3.
2. In the Alarm Details area of ALM-12007 Process Fault, check whether ServiceName is ZooKeeper.


– If yes, go to Step 2.3.
– If no, go to Step 3.
3. Rectify the fault by following the steps provided in ALM-12007 Process Fault.
4. In the alarm list, check whether alarm ALM-16004 Hive Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Check the HDFS service status.

1. In the alarm list on MRS Manager, check whether alarm ALM-14000 HDFS Service Unavailable is generated.
– If yes, go to Step 3.2.
– If no, go to Step 4.
2. Rectify the fault by following the steps provided in ALM-14000 HDFS Service Unavailable.
3. In the alarm list, check whether alarm ALM-16004 Hive Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 4.

Step 4 Check the Yarn service status.

1. In the alarm list on MRS Manager, check whether alarm ALM-18000 Yarn Service Unavailable is generated.
– If yes, go to Step 4.2.
– If no, go to Step 5.
2. Rectify the fault by following the steps provided in ALM-18000 Yarn Service Unavailable.
3. In the alarm list, check whether alarm ALM-16004 Hive Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 5.

Step 5 Check the DBService service status.

1. In the alarm list on MRS Manager, check whether alarm ALM-27001 DBService Unavailable is generated.
– If yes, go to Step 5.2.
– If no, go to Step 6.
2. Rectify the fault by following the steps provided in ALM-27001 DBService Unavailable.
3. In the alarm list, check whether alarm ALM-16004 Hive Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 6.

Step 6 Check the network connection between Hive and ZooKeeper, HDFS, Yarn, and DBService.


1. On MRS Manager, choose Service > Hive.

2. Click Instance.

The HiveServer instance list is displayed.

3. Click Host Name in the row of HiveServer.

The HiveServer host status page is displayed.

4. Record the IP address under Summary.

5. Use the IP address obtained in Step 6.4 to log in to the host that runs HiveServer.

6. Run the ping command to check whether the network connection is in the normal state between the host that runs HiveServer and the hosts that run the ZooKeeper, HDFS, Yarn, and DBService services.

– If yes, go to Step 7.

– If no, go to Step 6.7.

The method of obtaining the IP addresses of the hosts running the ZooKeeper, HDFS, Yarn, and DBService services is the same as the method used to obtain the HiveServer IP address.

7. Contact public cloud O&M personnel to recover the network.

8. In the alarm list, check whether the alarm ALM-16004 Hive Service Unavailable iscleared.

– If yes, no further action is required.

– If no, go to Step 7.

Step 7 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.43 ALM-18000 Yarn Service Unavailable

Description

The alarm module checks the Yarn service status every 30 seconds. This alarm is generated when the Yarn service is unavailable and is cleared when the Yarn service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

18000 Critical Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

- The cluster cannot provide the Yarn service.
- Users cannot run new applications.
- Submitted applications cannot be run.

Possible Causes

- The ZooKeeper service is abnormal.
- The HDFS service is abnormal.
- There is no active ResourceManager node in the Yarn cluster.
- All NodeManager nodes in the Yarn cluster are abnormal.

Procedure

Step 1 Check the ZooKeeper service status.

1. In the alarm list on MRS Manager, check whether alarm ALM-13000 ZooKeeper Service Unavailable is generated.
– If yes, go to Step 1.2.
– If no, go to Step 2.1.
2. Rectify the fault by following the steps provided in ALM-13000 ZooKeeper Service Unavailable, and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check the HDFS service status.

1. In the alarm list on MRS Manager, check whether an alarm related to HDFS is generated.
– If yes, go to Step 2.2.
– If no, go to Step 3.1.
2. Click Alarm, and handle HDFS alarms according to Alarm Help. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.


Step 3 Check the ResourceManager node status in the Yarn cluster.

1. On MRS Manager, choose Service > Yarn.

2. In Yarn Summary, check whether there is an active ResourceManager node in the Yarn cluster.

– If yes, go to Step 4.1.

– If no, go to Step 5.

Step 4 Check the NodeManager node status in the Yarn cluster.

1. On MRS Manager, choose Service > Yarn > Instance.

2. Check Health Status of NodeManager, and check whether there are unhealthy nodes.

– If yes, go to Step 4.3.

– If no, go to Step 5.

3. Rectify the fault by following the steps provided in ALM-18002 NodeManager Heartbeat Lost or ALM-18003 NodeManager Unhealthy. After the fault is rectified, check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 5.

Step 5 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.44 ALM-18002 NodeManager Heartbeat Lost

Description

The system checks the number of lost NodeManager nodes every 30 seconds and compares the number with the threshold. This alarm is generated when the value of the Lost Nodes indicator exceeds the threshold and is cleared when the value is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

18002 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

- The lost NodeManager node cannot provide the Yarn service.
- The number of containers decreases, so the cluster performance deteriorates.

Possible Causes

- NodeManager is forcibly deleted without decommissioning.
- All NodeManager instances are stopped or the NodeManager process is faulty.
- The host where the NodeManager node resides is faulty.
- The network between the NodeManager and ResourceManager is disconnected or busy.

Procedure

Contact the public cloud O&M personnel and send the collected log information.

Related Information

N/A

6.7.45 ALM-18003 NodeManager Unhealthy

Description

The system checks the number of abnormal NodeManager nodes every 30 seconds and compares the number with the threshold. This alarm is generated when the value of the Unhealthy Nodes indicator exceeds the threshold and is cleared when the value is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

18003 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

- The faulty NodeManager node cannot provide the Yarn service.
- The number of containers decreases, so the cluster performance deteriorates.

Possible Causes

- The hard disk space of the host where the NodeManager node resides is insufficient.
- User omm does not have the permission to access a local directory on the NodeManager node.

Procedure

Contact the public cloud O&M personnel and send the collected log information.

Related Information

N/A

6.7.46 ALM-18006 MapReduce Job Execution Timeout

Description

The alarm module checks MapReduce job execution every 30 seconds. This alarm is generated when the execution of a submitted MapReduce job times out. It must be manually cleared.

Attribute

Alarm ID Alarm Severity Automatically Cleared

18006 Major No


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Because the execution times out, no execution result can be obtained.

Possible Causes

The specified time period is shorter than the execution time. (Executing a MapReduce job takes a long time.)

Procedure

Step 1 Check whether time is improperly set.

Set -Dapplication.timeout.interval to a larger value, or do not set the parameter (a submission sketch follows this step). Execute the MapReduce job again and check whether it is executed successfully.
- If yes, go to Step 2.4.
- If no, go to Step 2.1.
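A sketch of the resubmission described in Step 1, assuming the standard Hadoop examples JAR shipped with the client; the JAR path, the input and output paths, and the 600000 ms value are illustrative assumptions only:

# Resubmit the job with a larger timeout passed as a generic -D option.
yarn jar /opt/client/HDFS/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount -Dapplication.timeout.interval=600000 /tmp/mr_input /tmp/mr_output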

Step 2 Check the Yarn service status.

1. In the alarm list on MRS Manager, check whether alarm ALM-18000 Yarn Service Unavailable is generated.
– If yes, go to Step 2.2.
– If no, go to Step 3.
2. Rectify the fault by following the steps provided in ALM-18000 Yarn Service Unavailable.
3. Run the MapReduce job command again to check whether the MapReduce job can be executed.
– If yes, go to Step 2.4.
– If no, go to Step 4.


4. In the alarm list, manually clear the alarm in the Operation column. No further action is required.

Step 3 Adjust the timeout threshold.

On MRS Manager, choose System > Configure Alarm Threshold > Service > Yarn > Timed Out Tasks, and increase the maximum number of timeout tasks allowed by the current threshold rule. Check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to Step 4.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.47 ALM-19000 HBase Service Unavailable

Description

The alarm module checks the HBase service status every 30 seconds. This alarm is generated when the HBase service is unavailable and is cleared when the HBase service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

19000 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

Operations such as reading or writing data and creating tables cannot be performed.


Possible Causes

- The ZooKeeper service is abnormal.
- The HDFS service is abnormal.
- The HBase service is abnormal.
- The network is abnormal.

Procedure

Step 1 Check the ZooKeeper service status.

1. In the service list on MRS Manager, check whether the health status of ZooKeeper is Good.
– If yes, go to Step 2.1.
– If no, go to Step 1.2.
2. In the alarm list, check whether alarm ALM-13000 ZooKeeper Service Unavailable exists.
– If yes, go to Step 1.3.
– If no, go to Step 2.1.
3. Rectify the fault by following the steps provided in ALM-13000 ZooKeeper Service Unavailable.
4. Wait several minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check the HDFS service status.

1. In the alarm list, check whether alarm ALM-14000 HDFS Service Unavailable exists.
– If yes, go to Step 2.2.
– If no, go to Step 3.
2. Rectify the fault by following the steps provided in ALM-14000 HDFS Service Unavailable.

3. Wait several minutes and check whether the alarm is cleared.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.48 ALM-19006 HBase Replication Synchronization Failed

Description

This alarm is generated when disaster recovery (DR) data fails to be synchronized to a standby cluster. It is cleared when DR data is successfully synchronized.


Attribute

Alarm ID Alarm Severity Automatically Cleared

19006 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

HBase data in a cluster fails to be synchronized to the standby cluster, causing data inconsistency between the active and standby clusters.

Possible Causes

- The HBase service on the standby cluster is abnormal.
- The network is abnormal.

Procedure

Step 1 Check whether the alarm is automatically cleared.

1. Log in to MRS Manager of the active cluster, and click Alarm.
2. In the alarm list, click the alarm and obtain the alarm generation time from Generated On in Alarm Details. Check whether the alarm persists for over 5 minutes.
– If yes, go to Step 2.1.
– If no, go to Step 1.3.
3. Wait 5 minutes and check whether the alarm is automatically cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check the HBase service status of the standby cluster.

1. Log in to MRS Manager of the active cluster, and click Alarm.
2. In the alarm list, click the alarm and obtain HostName from Location in Alarm Details.
3. Log in to the node where the HBase client resides in the active cluster. Run the following commands to switch the user:
sudo su - root


su - omm
4. Run the status 'replication', 'source' command to check the replication synchronization status of the faulty node (the full shell session is sketched at the end of this step). The replication synchronization status of a node is displayed as follows:
10-10-10-153:
SOURCE: PeerID=abc, SizeOfLogQueue=0, ShippedBatches=2, ShippedOps=2, ShippedBytes=320, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=0, TimeStampsOfLastShippedOp=Mon Jul 18 09:53:28 CST 2016, Replication Lag=0, FailedReplicationAttempts=0
SOURCE: PeerID=abc1, SizeOfLogQueue=0, ShippedBatches=1, ShippedOps=1, ShippedBytes=160, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=16788, TimeStampsOfLastShippedOp=Sat Jul 16 13:19:00 CST 2016, Replication Lag=16788, FailedReplicationAttempts=5
5. Obtain the PeerID of any record whose FailedReplicationAttempts value is greater than 0. In the preceding output, data on faulty node 10-10-10-153 fails to be synchronized to the standby cluster whose PeerID is abc1.
6. Run the list_peers command to find the cluster and the HBase instance corresponding to the PeerID.
PEER_ID CLUSTER_KEY STATE TABLE_CFS
abc1 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase2 ENABLED
abc 10.10.10.110,10.10.10.119,10.10.10.133:24002:/hbase ENABLED
Here, /hbase2 indicates that data is synchronized to the HBase2 instance of the standby cluster.
7. In the service list on MRS Manager of the standby cluster, check whether the health status of the HBase instance obtained in Step 2.6 is Good.
– If yes, go to Step 3.1.
– If no, go to Step 2.8.
8. In the alarm list, check whether alarm ALM-19000 HBase Service Unavailable exists.
– If yes, go to Step 2.9.
– If no, go to Step 3.1.
9. Rectify the fault by following the steps provided in ALM-19000 HBase Service Unavailable.
10. Wait several minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.
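For reference, the shell session used in this step looks as follows; /opt/client is a hypothetical client installation path that varies by deployment:

# Switch to the cluster's internal user and load the HBase client environment.
sudo su - root
su - omm
source /opt/client/bigdata_env
# In the HBase shell, inspect replication sources and the configured peers.
hbase shell
status 'replication', 'source'
list_peers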

Step 3 Check the network connection between RegionServers on active and standby clusters.

1. Log in to MRS Manager of the active cluster, and click Alarm.
2. In the alarm list, click the alarm and obtain HostName from Location in Alarm Details.
3. Log in to the faulty RegionServer node.
4. Run the ping command to check whether the network connection between the faulty RegionServer node and the host where the RegionServer of the standby cluster resides is in the normal state.
– If yes, go to Step 4.
– If no, go to Step 3.5.

5. Contact public cloud O&M personnel to recover the network.


6. After the network recovers, check whether the alarm is cleared in the alarm list.

– If yes, no further action is required.

– If no, go to Step 4.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.49 ALM-25000 LdapServer Service Unavailable

Description

The system checks the LdapServer service status every 30 seconds. This alarm is generated when the system detects that both the active and standby LdapServer services are abnormal. It is cleared when one or both LdapServer services are normal.

Attribute

Alarm ID Alarm Severity Automatically Cleared

25000 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

No operation can be performed for the KrbServer and LdapServer users in the cluster. For example, users, user groups, or roles cannot be added, deleted, or modified, and user passwords cannot be changed on MRS Manager. Authentication for existing users in the cluster is not affected.


Possible Causes

- The node where the LdapServer service resides is faulty.
- The LdapServer process is abnormal.

Procedure

Step 1 Check whether the nodes where the two SlapdServer instances of the LdapServer service reside are faulty.

1. On MRS Manager, choose Service > LdapServer > Instance to go to the LdapServer instance page. Obtain the host names of the nodes where the two SlapdServer instances reside.
2. On the Alarm page of MRS Manager, check whether alarm ALM-12006 Node Fault is generated.
– If yes, go to Step 1.3.
– If no, go to Step 2.1.
3. Check whether the host name in the alarm is consistent with a host name in Step 1.1.
– If yes, go to Step 1.4.
– If no, go to Step 2.1.
4. Rectify the fault by following the steps provided in ALM-12006 Node Fault.
5. In the alarm list, check whether alarm ALM-25000 LdapServer Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 2 Check whether the LdapServer process is in the normal state.

1. On the Alarm page of MRS Manager, check whether alarm ALM-12007 Process Fault is generated.
– If yes, go to Step 2.2.
– If no, go to Step 3.
2. Check whether the service name and host name in the alarm are consistent with those of LdapServer.
– If yes, go to Step 2.3.
– If no, go to Step 3.
3. Rectify the fault by following the steps provided in ALM-12007 Process Fault.
4. In the alarm list, check whether alarm ALM-25000 LdapServer Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.50 ALM-25004 Abnormal LdapServer Data Synchronization

Description

This alarm is generated when LdapServer data on Manager is inconsistent or when LdapServer data differs between LdapServer and Manager. It is cleared when the data becomes consistent.

Attribute

Alarm ID Alarm Severity Automatically Cleared

25004 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

LdapServer data inconsistency occurs when LdapServer data on Manager or in the cluster is damaged. The LdapServer process with damaged data cannot provide services externally, and the authentication functions of Manager and the cluster are affected.

Possible Causes

- The network of the node where the LdapServer process resides is faulty.
- The LdapServer process is abnormal.
- An OS restart has damaged data on LdapServer.

Procedure

Step 1 Check whether the network where the LdapServer nodes reside is faulty.

1. On MRS Manager, click Alarm. Record the IP address of HostName in Location of the alarm as IP1. If multiple alarms exist, record the IP addresses as IP1, IP2, and IP3.


2. Contact O&M personnel and log in to the node corresponding to IP1. Run the pingcommand on the node to check whether the IP address of the management plane of theactive OMS node can be pinged.– If yes, go to Step 1.3.– If no, go to Step 2.1.

3. Contact public cloud O&M personnel to recover the network and check whether alarmALM-25004 Abnormal LdapServer Data Synchronization is cleared.– If yes, no further action is required.– If no, go to Step 2.1.

Step 2 Check whether the LdapServer process is in the normal state.

1. On the Alarm page of MRS Manager, check whether alarm ALM-12004 OLdap Resource Is Abnormal is generated.
– If yes, go to Step 2.2.
– If no, go to Step 2.4.
2. Rectify the fault by following the steps provided in ALM-12004 OLdap Resource Is Abnormal.
3. Check whether alarm ALM-25004 Abnormal LdapServer Data Synchronization is cleared.
– If yes, no further action is required.
– If no, go to Step 2.4.
4. On the Alarm page of MRS Manager, check whether alarm ALM-12007 Process Fault of LdapServer is generated.
– If yes, go to Step 2.5.
– If no, go to Step 3.1.
5. Rectify the fault by following the steps provided in ALM-12007 Process Fault.
6. Check whether alarm ALM-25004 Abnormal LdapServer Data Synchronization is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.

Step 3 Check whether the OS restart damages data on LdapServer.

1. On MRS Manager, click Alarm. Record the IP address of HostName in Location of the alarm as IP1. If multiple alarms exist, record the IP addresses as IP1, IP2, and IP3. Choose Service > LdapServer > Service Configuration and record the LdapServer port number as PORT. If the IP address in the alarm location information is the IP address of the standby OMS node, the port is the default port 21750.
2. Log in to the node corresponding to IP1 as user omm and run the ldapsearch -H ldaps://IP1:PORT -x -LLL -b dc=hadoop,dc=com command to check whether errors are displayed in the queried information. If the IP address is that of the standby OMS node, run export LDAPCONF=${CONTROLLER_HOME}/ldapserver/ldapserver/local/conf/ldap.conf before running the preceding command. (The sequence is repeated in the sketch after this step.)
– If yes, go to Step 3.3.
– If no, go to Step 4.
3. Recover the LdapServer and OMS nodes using the backup data taken before the alarm was generated. For details, see section "Recovering Manager Data" in the Administrator Guide.


NOTE

To restore data, use the OMS data and LdapServer data backed up at the same time. Otherwise, the service and operation may fail. To recover data when services are running properly, you are advised to manually back up the latest management data and then recover the data. Otherwise, Manager data produced between the backup and recovery points in time will be lost.

4. Check whether alarm ALM-25004 Abnormal LdapServer Data Synchronization is cleared.

– If yes, no further action is required.

– If no, go to Step 4.
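The query from Step 3.2 in full, for reference; IP1 and PORT are placeholders for the values recorded in Step 3.1, and the export line applies only when querying the standby OMS node:

# Only on the standby OMS node: point the LDAP client at the local configuration first.
export LDAPCONF=${CONTROLLER_HOME}/ldapserver/ldapserver/local/conf/ldap.conf
# As user omm, query the directory; errors in the output indicate damaged LdapServer data.
ldapsearch -H ldaps://IP1:PORT -x -LLL -b dc=hadoop,dc=com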

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.51 ALM-25500 KrbServer Service Unavailable

Description

The system checks the KrbServer service status every 30 seconds. This alarm is generated when the KrbServer service is abnormal and is cleared when the KrbServer service is in the normal state.

Attribute

Alarm ID Alarm Severity Automatically Cleared

25500 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.


Impact on the System

- No operation can be performed for the KrbServer component in the cluster.
- KrbServer authentication of other components will be affected.
- The health status of components that depend on KrbServer in the cluster is Bad.

Possible Causes

- The node where the KrbServer service resides is faulty.
- The OLdap service is unavailable.

Procedure

Step 1 Check whether the node where the KrbServer service resides is faulty.

1. On MRS Manager, choose Service > KrbServer > Instance to go to the KrbServer instance page. Obtain the host name of the node where the KrbServer service resides.
2. On the Alarm page of MRS Manager, check whether alarm ALM-12006 Node Fault is generated.
– If yes, go to Step 1.3.
– If no, go to Step 2.1.
3. Check whether the host name in the alarm is consistent with the host name in Step 1.1.
– If yes, go to Step 1.4.
– If no, go to Step 2.1.
4. Rectify the fault by following the steps provided in ALM-12006 Node Fault.
5. In the alarm list, check whether alarm ALM-25500 KrbServer Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 2 Check whether the OLdap service is unavailable.

1. On the Alarm page of MRS Manager, check whether alarm ALM-12004 OLdap Resource Is Abnormal is generated.
– If yes, go to Step 2.2.
– If no, go to Step 3.
2. Rectify the fault by following the steps provided in ALM-12004 OLdap Resource Is Abnormal.
3. In the alarm list, check whether alarm ALM-25500 KrbServer Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 3.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.52 ALM-27001 DBService Unavailable

Description

The alarm module checks the DBService status every 30 seconds. This alarm is generated when the system detects that DBService is unavailable and is cleared when DBService recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

27001 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The database service is unavailable and cannot provide data import or query functions for upper-layer services, which results in service exceptions.

Possible Causes

- The floating IP address does not exist.
- There is no active DBServer instance.
- The active and standby DBServer processes are abnormal.

Procedure

Step 1 Check whether the floating IP address exists in the cluster environment.

1. On MRS Manager, choose Service > DBService > Instance.
2. Check whether the active instance exists.

– If yes, go to Step 1.3.


– If no, go to Step 2.1.
3. Select the active DBServer instance and record the IP address.
4. Log in to the host that corresponds to the preceding IP address, and run the ifconfig command to check whether the DBService floating IP address exists on the node.
– If yes, go to Step 1.5.
– If no, go to Step 2.1.
5. Run the ping floating IP address command to check whether the DBService floating IP address can be pinged.
– If yes, go to Step 1.6.
– If no, go to Step 2.1.
6. Log in to the host that corresponds to the DBService floating IP address, and run the ifconfig interface down command to delete the floating IP address. (A sketch of these checks follows this step.)
7. On MRS Manager, choose Service > DBService > More > Restart Service to restart DBService. Check whether DBService is restarted successfully.
– If yes, go to Step 1.8.
– If no, go to Step 2.1.
8. Wait about 2 minutes and check whether the alarm is cleared from the alarm list.
– If yes, no further action is required.
– If no, go to Step 3.1.
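The floating IP checks from this step as one sequence; 192.168.0.100 and eth0:2 are hypothetical examples of the floating IP address and its interface, which must be read from the actual environment:

# On the active DBServer host, list interfaces and confirm whether the floating IP address is configured.
ifconfig
# Check whether the floating IP address answers.
ping -c 4 192.168.0.100
# If required by Step 1.6, remove the stale floating IP address before restarting DBService.
ifconfig eth0:2 down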

Step 2 Check the status of the active DBServer instance.

1. Select the DBServer instance whose role status is abnormal and record the IP address.
2. On the Alarm page, check whether alarm ALM-12007 Process Fault occurs in the DBServer instance on the host that corresponds to the IP address.
– If yes, go to Step 2.3.
– If no, go to Step 4.
3. Follow the procedures in ALM-12007 Process Fault to handle the alarm.
4. Wait about 5 minutes and check whether the alarm is cleared from the alarm list.
– If yes, no further action is required.
– If no, go to Step 4.

Step 3 Check the status of the active and standby DBServers.

1. Log in to the host that corresponds to the DBService floating IP address, and run the sudo su - root and su - omm commands to switch to user omm. Run the cd ${BIGDATA_HOME}/FusionInsight/dbservice/ command to go to the installation directory of DBService.
2. Run the sh sbin/status-dbserver.sh command to view the status of DBService's active and standby HA processes. Check whether the status can be viewed successfully.
– If yes, go to Step 3.3.
– If no, go to Step 4.
3. Check whether the active and standby HA processes are normal.
– If yes, go to Step 4.
– If no, go to Step 3.4.
4. On MRS Manager, choose Service > DBService > More > Restart Service to restart DBService, and check whether DBService is restarted successfully.


– If yes, go to Step 3.5.

– If no, go to Step 4.

5. Wait about 2 minutes and check whether the alarm is cleared from the alarm list.

– If yes, no further action is required.

– If no, go to Step 4.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.53 ALM-27003 DBService Heartbeat Interruption Between the Active and Standby Nodes

Description

This alarm is generated when the active or standby DBService node does not receive heartbeat messages from the peer node. It is cleared when the heartbeat recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

27003 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Local DBService HA Name Specifies a local DBService HA.

Peer DBService HA Name Specifies a peer DBService HA.


Impact on the System

During the DBService heartbeat interruption, only one node can provide services. If this node is faulty, no standby node is available for failover and the services become unavailable.

Possible Causes

The link between the active and standby DBService nodes is abnormal.

Procedure

Step 1 Check whether the network between the active and standby DBService servers is in thenormal state.

1. In the alarm list on MRS Manager, locate the row that contains the alarm, and view the IP address of the standby DBService server in the alarm details.

2. Log in to the active DBService server.

3. Run the ping heartbeat IP address of the standby DBService command to check whether the standby DBService server is reachable.

– If yes, go to Step 2.

– If no, go to Step 1.4.

4. Contact the network administrator to check whether the network is faulty.

– If yes, go to Step 1.5.

– If no, go to Step 2.

5. Rectify the network fault and check whether the alarm is cleared from the alarm list.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.54 ALM-27004 Data Inconsistency Between Active and Standby DBServices

Description

The system checks the data synchronization status between the active and standby DBServices every 10 seconds. This alarm is generated when the synchronization status cannot be queried six times consecutively or when the synchronization status is abnormal. This alarm is cleared when data synchronization is normal.


Attribute

Alarm ID Alarm Severity Automatically Cleared

27004 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Local DBService HA Name Specifies a local DBService HA.

Peer DBService HA Name Specifies a peer DBService HA.

SYNC_PERSENT Specifies the synchronization percentage.

Impact on the System

Data may be lost or become abnormal if the active instance becomes abnormal.

Possible Causes

- The network between the active and standby nodes is unstable.
- The standby DBService is abnormal.
- The disk space of the standby node is full.

Procedure

Step 1 Check whether the network between the active and standby nodes is in the normal state.

1. Log in to MRS Manager, click Alarm, click the row where the alarm is located in the alarm list, and view the IP address of the standby DBService in the alarm details.
2. Log in to the active DBService node.
3. Run the ping heartbeat IP address of the standby DBService command to check whether the standby DBService node is reachable.
– If yes, go to Step 2.1.
– If no, go to Step 1.4.
4. Contact the public cloud O&M personnel to check whether the network is faulty.
– If yes, go to Step 1.5.
– If no, go to Step 2.1.


5. Rectify the network fault and check whether the alarm is cleared from the alarm list.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check whether the standby DBService is in the normal state.

1. Log in to the standby DBService node.2. Run the following command to switch the user:

sudo su - rootsu - omm

3. Go to the ${DBSERVER_HOME}/sbin directory and run the ./status-dbserver.shcommand to check whether the GaussDB resource status of the standby DBService is inthe normal state. In the command output, check whether the following information isdisplayed in the row where ResName is gaussDB:For example:10_10_10_231 gaussDB Standby_normal Normal Active_standby

– If yes, go to Step 3.1.
– If no, go to Step 4.
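The check in Step 2 can be run end to end as follows (a minimal sketch assembled from the commands above; the host name in the sample output is the example value from this section):

sudo su - root
su - omm
cd ${DBSERVER_HOME}/sbin
./status-dbserver.sh
# Expected gaussDB row for a healthy standby, for example:
# 10_10_10_231 gaussDB Standby_normal Normal Active_standby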

Step 3 Check whether the disk space of the standby node is insufficient.

1. Use PuTTY to log in to the standby DBService node as user root.
2. Run the su - omm command to switch to user omm.
3. Go to the ${DBSERVER_HOME} directory and run the following commands to obtain the DBService data directory:
cd ${DBSERVER_HOME}
source .dbservice_profile
echo ${DBSERVICE_DATA_DIR}
4. Run the df -h command to check the system disk partition usage.
5. Check whether the DBService data directory space is full.

– If yes, go to Step 3.6.
– If no, go to Step 4.

6. Perform an upgrade and expand the capacity.
7. After capacity expansion, wait 2 minutes and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.
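The disk check in Steps 3.3 to 3.5 can be condensed into the following sketch (df -h with a path argument reports only the file system that holds that path):

sudo su - root
su - omm
cd ${DBSERVER_HOME}
source .dbservice_profile
df -h ${DBSERVICE_DATA_DIR}   # a Use% of 100% means the DBService data partition is full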

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.55 ALM-28001 Spark Service Unavailable

Description

The system checks the Spark service status every 30 seconds. This alarm is generated when the Spark service is unavailable and is cleared when the Spark service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

28001 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The Spark tasks submitted by users fail to be executed.

Possible Causes

Any of the following services is abnormal:

- KrbServer
- LdapServer
- ZooKeeper
- HDFS
- Yarn
- Hive

Procedure

Step 1 Check whether service unavailability alarms exist in services that Spark depends on.

1. On MRS Manager, click Alarm.
2. Check whether any of the following alarms exists in the alarm list:


a. ALM-25500 KrbServer Service Unavailable

b. ALM-25000 LdapServer Service Unavailable

c. ALM-13000 ZooKeeper Service Unavailable

d. ALM-14000 HDFS Service Unavailable

e. ALM-18000 Yarn Service Unavailable

f. ALM-16004 Hive Service Unavailable

– If yes, go to Step 1.3.

– If no, go to Step 2.

3. Handle the alarms using the troubleshooting methods provided in the alarm help.

After all the alarms are cleared, wait a few minutes and check whether this alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.56 ALM-26051 Storm Service Unavailable

Description

The system checks the Storm service availability every 30 seconds. This alarm is generated if the Storm service becomes unavailable after all Nimbus nodes in a cluster become abnormal.

This alarm is cleared after the Storm service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

26051 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.


Parameter Description

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

- The cluster cannot provide the Storm service.
- Users cannot run new Storm tasks.

Possible Causes

- The Kerberos component is faulty.
- ZooKeeper is faulty or suspended.
- The active and standby Nimbus nodes in the Storm cluster are abnormal.

Procedure

Step 1 Check the Kerberos component status. For clusters without Kerberos authentication, skip this step and go to Step 2.

1. On MRS Manager, click Service.
2. Check whether the health status of the Kerberos service is Good.
– If yes, go to Step 2.1.
– If no, go to Step 1.3.

3. Rectify the fault by following instructions in ALM-25500 KrbServer Service Unavailable.

4. Perform Step 1.2 again.

Step 2 Check the ZooKeeper component status.

1. Check whether the health status of the ZooKeeper service is Good.
– If yes, go to Step 3.1.
– If no, go to Step 2.2.

2. If the ZooKeeper service is stopped, start it. For other problems, follow the instructions in ALM-13000 ZooKeeper Service Unavailable.

3. Perform Step 2.1 again.

Step 3 Check the status of the active and standby Nimbus nodes.

1. Choose Service > Storm > Nimbus.
2. In Role, check whether only one active Nimbus node exists.
– If yes, go to Step 4.1.
– If no, go to Step 3.3.

3. Select the two Nimbus instances and choose More > Restart Instance. Check whether the restart is successful.


– If yes, go to Step 3.4.
– If no, go to Step 4.1.

4. Log in to MRS Manager again and choose Service > Storm > Nimbus. Check whether the health status of Nimbus is Good.
– If yes, go to Step 3.5.
– If no, go to Step 4.1.

5. Wait 30 seconds and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.1.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.57 ALM-26052 Number of Available Supervisors in Storm Is Lower Than the Threshold

Description

The system checks the number of supervisors every 60 seconds and compares it with the threshold. This alarm is generated if the number of supervisors is lower than the threshold.

To modify the threshold, users can choose System > Threshold Configuration on MRS Manager.

This alarm is cleared if the number of supervisors is greater than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

26052 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.


Parameter Description

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

- Existing tasks in the cluster cannot be executed.
- The cluster can receive new Storm tasks but cannot execute them.

Possible Causes

Supervisors are abnormal in the cluster.

Procedure

Step 1 Check the supervisor status.

1. Choose Service > Storm > Supervisor.

2. In Role, check whether the cluster has supervisor instances that are in the Bad or Concerning state.

– If yes, go to Step 1.3.

– If no, go to Step 2.1.

3. Select the supervisor instances that are in the Bad or Concerning state and choose More > Restart Instance.

– If the restart is successful, go to Step 1.4.

– If the restart fails, go to Step 2.1.

4. Wait 30 seconds and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.1.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.58 ALM-26053 Slot Usage of Storm Exceeds the Threshold

Description

The system checks the slot usage of Storm every 60 seconds and compares it with the threshold. This alarm is generated if the slot usage exceeds the threshold.

To modify the threshold, users can choose System > Threshold Configuration on MRS Manager.

This alarm is cleared if the slot usage is lower than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

26053 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Users cannot run new Storm tasks.

Possible Causes

- Supervisors are abnormal in the cluster.
- Supervisors are normal but have poor processing capability.

Procedure

Step 1 Check the supervisor status.

1. Choose Service > Storm > Supervisor.
2. In Role, check whether the cluster has supervisor instances that are in the Bad or Concerning state.


– If yes, go to Step 1.3.
– If no, go to Step 2.1 or Step 3.1.

3. Select the supervisor instances that are in the Bad or Concerning state and choose More > Restart Instance.
– If the restart is successful, go to Step 1.4.
– If the restart fails, go to Step 4.1.

4. Wait a moment and then check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1 or Step 3.1.

Step 2 Increase the number of slots for the supervisors.

1. On MRS Manager, choose Service > Storm > Supervisor > Service Configuration > Type > All.

2. Increase the value of supervisor.slots.ports to increase the number of slots for each supervisor (see the example after this step). Then restart the instances.

3. Wait a moment and then check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.1.
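For reference, in standard Storm configuration supervisor.slots.ports is a list of worker ports, and each port provides one slot. A hedged example in storm.yaml syntax (the port numbers are the common Storm defaults, not values mandated by MRS):

# Four ports = four slots per supervisor
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703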

Step 3 Expand the capacity of the supervisors.

1. Add nodes.
2. Wait a moment and then check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.1.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.59 ALM-26054 Heap Memory Usage of Storm Nimbus Exceeds the Threshold

Description

The system checks the heap memory usage of Storm Nimbus every 30 seconds and compares it with the threshold. This alarm is generated if the heap memory usage exceeds the threshold (80% by default).

To modify the threshold, users can choose System > Threshold Configuration > Service > Storm on MRS Manager.

This alarm is cleared if the heap memory usage is lower than or equal to the threshold.


Attribute

Alarm ID Alarm Severity Automatically Cleared

26054 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Frequent memory garbage collection or memory overflow may occur, affecting submission of Storm services.

Possible Causes

The heap memory usage is high or the heap memory is improperly allocated.

Procedure

Step 1 Check the heap memory usage.

1. On MRS Manager, choose Alarm > ALM-26054 Heap Memory Usage of Storm Nimbus Exceeds the Threshold > Location. Query the HostName of the alarmed instance.

2. On MRS Manager, choose Service > Storm > Instance > Nimbus (corresponding to the HostName of the alarmed instance) > Customize > Heap Memory Usage of Nimbus.

3. Check whether the heap memory usage of Nimbus has reached the threshold (80%).
– If yes, go to Step 1.4.
– If no, go to Step 2.1.

4. Adjust the heap memory. On MRS Manager, choose Service > Storm > Service Configuration > All > Nimbus > System. Increase the value of -Xmx in NIMBUS_GC_OPTS (see the sketch after this step). Click Save Configuration. Select Restart the affected services or instances and click OK.


5. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.
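NIMBUS_GC_OPTS is a JVM option string, so increasing -Xmx raises the Nimbus heap ceiling. A hedged example (the sizes are purely illustrative and must be chosen to fit the node's memory):

# Hypothetical values: raise the heap ceiling to 2 GB
NIMBUS_GC_OPTS="-Xms2G -Xmx2G"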

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.60 ALM-38000 Kafka Service Unavailable

Description

The system checks the Kafka service availability every 30 seconds. This alarm is generated when the Kafka service becomes unavailable.

This alarm is cleared after the Kafka service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

38000 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The cluster cannot provide the Kafka service and users cannot run new Kafka tasks.

Possible Causes

- The KrbServer component is faulty.


- The ZooKeeper component is faulty or fails to respond.

- The Broker node in the Kafka cluster is abnormal.

Procedure

Step 1 Check the KrbServer component status. For clusters without Kerberos authentication, skip this step and go to Step 2.

1. On MRS Manager, click Service.

2. Check whether the health status of the KrbServer service is Good.

– If yes, go to Step 2.1.

– If no, go to Step 1.3.

3. Rectify the fault by following instructions in ALM-25500 KrbServer Service Unavailable.

4. Perform Step 1.2 again.

Step 2 Check the ZooKeeper component status.

1. Check whether the health status of the ZooKeeper service is Good.

– If yes, go to Step 3.1.

– If no, go to Step 2.2.

2. If the ZooKeeper service is stopped, start it. For other problems, follow the instructions in ALM-13000 ZooKeeper Service Unavailable.

3. Perform Step 2.1 again.

Step 3 Check the Broker status.

1. Choose Service > Kafka > Broker.

2. In Role, check whether all instances are normal.

– If yes, go to Step 3.4.

– If no, go to Step 3.3.

3. Select all instances of Broker and choose More > Restart Instance.

– If the restart is successful, go to Step 3.4.

– If the restart fails, go to Step 4.1.

4. Choose Service > Kafka. Check whether the health status of Kafka is Good.

– If yes, go to Step 3.5.

– If no, go to Step 4.1.

5. Wait 30 seconds and check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 4.1.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.61 ALM-38001 Insufficient Kafka Disk Space

Description

The system checks the Kafka disk usage every 60 seconds and compares it with the threshold. This alarm is generated when the disk usage exceeds the threshold.

To modify the threshold, users can choose System > Threshold Configuration on MRS Manager.

This alarm is cleared when the Kafka disk usage is lower than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

38001 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

PartitionName Specifies the disk partition where the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Kafka fails to write data to the disks.

Possible Causes

- The Kafka disk configurations (such as disk count and disk size) are insufficient for the data volume.
- The data retention period is long and historical data occupies a large space.


- Services are improperly planned. As a result, data is unevenly distributed and some disks are full.

Procedure

1. Log in to MRS Manager and click Alarm.
2. In the alarm list, click the alarm and view the HostName and PartitionName of the alarm in Location of Alarm Details.
3. In Hosts, click the host obtained in 2.
4. Check whether the Disk area contains the PartitionName of the alarm.
– If yes, go to 5.
– If no, manually clear the alarm and no further action is required.
5. In the Disk area, check whether the usage of the alarmed partition has reached 100%.
– If yes, go to 6.
– If no, go to 8.

6. In Instance, choose Broker > Instance Configuration. On the Instance Configuration page that is displayed, set Type to All and query the data directory parameter log.dirs.

7. Choose Service > Kafka > Instance. On the Kafka Instance page that is displayed, stop the Broker instance corresponding to that in 2. Then log in to the alarm node and manually delete the data directory queried in 6. After all subsequent operations are complete, start the Broker instance.

8. Choose Service > Kafka > Service Configuration. The Kafka Configuration page is displayed.

9. Check whether disk.adapter.enable is true.
– If yes, go to 11.
– If no, change the value to true and go to 10.

10. Check whether the adapter.topic.min.retention.hours parameter, indicating the minimum data retention period, is properly configured.
– If yes, go to 11.
– If no, set it to a proper value and go to 11.

NOTE

If the retention period cannot be adjusted for certain topics, the topics can be added to disk.adapter.topic.blacklist.

11. Wait 10 minutes and check whether the disk usage is reduced.
– If yes, wait until the alarm is cleared.
– If no, go to 12.

12. Go to the Kafka Topic Monitor page and query the data retention period configured for Kafka. Determine whether the retention period needs to be shortened based on service requirements and data volume.
– If yes, go to 13.
– If no, go to 14.

13. Find the topics with great data volumes based on the disk partition obtained in 2. Log in to the Kafka client and manually shorten the data retention period for these topics using the following command:


kafka-topics.sh --zookeeper ZooKeeper address:24002/kafka --alter --topic Topic name --config retention.ms=Retention period
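As a worked example (the topic name and ZooKeeper address are placeholders), the following command limits a topic's retention to 24 hours; retention.ms is expressed in milliseconds, and 24 x 3600 x 1000 = 86400000:

kafka-topics.sh --zookeeper 192.168.1.10:24002/kafka --alter --topic example-topic --config retention.ms=86400000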

14. Check whether partitions are properly configured for topics. For example, if the number of partitions for a topic with a large data volume is smaller than the number of disks, data may be unevenly distributed to the disks and the usage of some disks will reach the upper limit.

NOTE

To identify topics with great data volumes, log in to the relevant nodes that are obtained in 2, go to the data directory (the directory specified by log.dirs in 6, before any modification), and check the disk space occupied by the partitions of the topics.

– If the partitions are improperly configured, go to 15.
– If the partitions are properly configured, go to 16.

15. On the Kafka client, add partitions to the topics:
kafka-topics.sh --zookeeper ZooKeeper address:24002/kafka --alter --topic Topic name --partitions=Number of new partitions

NOTE

It is advised to set the number of new partitions to a multiple of the number of Kafka disks.

This operation may not quickly clear the alarm. Data will be gradually balanced among the disks.
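As a worked example (placeholder topic and address): if each broker has three Kafka disks, expanding a topic to six partitions keeps the partition count a multiple of the disk count, as advised above:

kafka-topics.sh --zookeeper 192.168.1.10:24002/kafka --alter --topic example-topic --partitions 6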

16. Check whether the cluster capacity needs to be expanded.
– If yes, add nodes to the cluster and go to 17.
– If no, go to 17.

17. Wait a moment and then check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to 18.

18. Contact the public cloud O&M personnel.

Related Information

N/A

6.7.62 ALM-38002 Heap Memory Usage of Kafka Exceeds the Threshold

Description

The system checks the heap memory usage of Kafka every 30 seconds. This alarm is generated when the heap memory usage of Kafka exceeds the threshold (80%).

This alarm is cleared when the heap memory usage is lower than the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

38002 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Memory overflow may occur, causing service crashes.

Possible Causes

The heap memory usage is high or the heap memory is improperly allocated.

Procedure

Step 1 Check the heap memory usage.

1. On MRS Manager, choose Alarm > ALM-38002 Heap Memory Usage of Kafka Exceeds the Threshold > Location. Query the IP address of the alarmed instance.

2. On MRS Manager, choose Service > Kafka > Instance > Broker (corresponding to the IP address of the alarmed instance) > Customize > Kafka Heap Memory Resource Percentage.

3. Check whether the heap memory usage of Kafka has reached the threshold (80%).

– If yes, go to Step 1.4.

– If no, go to Step 2.

4. On MRS Manager, choose Service > Kafka > Service Configuration > All > Broker > Environment. Increase the value of KAFKA_HEAP_OPTS as required (see the sketch after this step).

5. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.
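KAFKA_HEAP_OPTS takes standard JVM heap flags. A hedged example (the sizes are illustrative and must be matched to the broker load and available node memory):

# Setting -Xms equal to -Xmx avoids heap resizing pauses
KAFKA_HEAP_OPTS="-Xmx6G -Xms6G"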

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.63 ALM-24000 Flume Service Unavailable

Description

The alarm module checks the Flume service status every 180 seconds. This alarm is generated when the Flume service is abnormal.

This alarm is cleared when the Flume service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

24000 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

Flume cannot work and data transmission is interrupted.

Possible Causes

- HDFS is unavailable.
- LdapServer is unavailable.

Procedure

Step 1 Check the HDFS status.

On MRS Manager, check whether alarm ALM-14000 HDFS Service Unavailable is reported.

- If yes, clear the alarm according to the handling suggestions of "ALM-14000 HDFS Service Unavailable".
- If no, go to Step 2.


Step 2 Check the LdapServer status.

On MRS Manager, check whether alarm ALM-25000 LdapServer Service Unavailable is reported.

- If yes, clear the alarm according to the handling suggestions of "ALM-25000 LdapServer Service Unavailable".
- If no, go to Step 3.1.

Step 3 Check whether the HDFS and LdapServer services are stopped.

1. In the service list on MRS Manager, check whether the HDFS and LdapServer services are stopped.
– If yes, start the HDFS and LdapServer services and go to Step 3.2.
– If no, go to Step 4.1.

2. Check whether the "ALM-24000 Flume Service Unavailable" alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 4.1.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.64 ALM-24001 Flume Agent Is Abnormal

Description

This alarm is generated when the Flume agent monitoring module detects that the Flume agent process is abnormal.

This alarm is cleared when the Flume agent process recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

24001 Minor Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.


Parameter Description

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

Functions of the alarmed Flume agent instance are abnormal. Data transmission tasks of the instance are suspended. In real-time data transmission, data will be lost.

Possible Causes

- The JAVA_HOME directory does not exist or the Java permission is incorrect.
- The permission of the Flume agent directory is incorrect.

Procedure

Step 1 Check the Flume agent's configuration file.

1. Log in to the host where the faulty node resides. Run the following command to switch to user root:
sudo su - root

2. Run the cd Flume installation directory/fusioninsight-flume-1.6.0/conf/ command to go to Flume's configuration directory.

3. Run the cat ENV_VARS command. Check whether the JAVA_HOME directory exists and whether the Flume agent user has the execute permission of Java.
– If yes, go to Step 2.1.
– If no, go to Step 1.4.

4. Specify the correct JAVA_HOME directory and grant the Flume agent user the execute permission of Java (see the sketch after this step). Then go to Step 2.4.
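The checks in Step 1 can be scripted as the following sketch (the installation path and the flume_user name are hypothetical placeholders, not fixed values):

sudo su - root
cd /opt/FlumeClient/fusioninsight-flume-1.6.0/conf/   # hypothetical installation path
grep JAVA_HOME ENV_VARS                               # confirm JAVA_HOME is set and note its value
ls -ld "$JAVA_HOME"/bin/java                          # confirm the Java binary exists and is executable
su - flume_user -c "$JAVA_HOME/bin/java -version"     # confirm the Flume agent user can run Java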

Step 2 Check the permission of the Flume agent directory.

1. Log in to the host where the faulty node resides. Run the following command to switch to user root:
sudo su - root

2. Run the following command to access the installation directory of the Flume agent:
cd Flume agent installation directory

3. Run the ls -al * -R command. Check whether the owner of all files is the Flume agent user.
– If yes, go to Step 3.1.
– If no, run the chown command and change the owner of the files to the Flume agent user (see the sketch after this step). Then go to Step 2.4.
4. Check whether the alarm is cleared.

– If yes, no further action is required.


– If no, go to Step 3.1.
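The ownership check and repair in Step 2 can be done with find and chown (a sketch; flume_user and flume_group are hypothetical names standing in for the actual Flume agent user and group):

sudo su - root
cd /opt/FlumeClient/flume_agent        # hypothetical Flume agent installation directory
find . ! -user flume_user -ls          # list files not owned by the Flume agent user
chown -R flume_user:flume_group .      # reset ownership recursively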

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.65 ALM-24003 Flume Client Connection Failure

Description

The alarm module monitors the port connection status on the Flume server. This alarm is generated when the Flume server fails to receive a connection message from the Flume client in 3 consecutive minutes.

This alarm is cleared when the Flume server receives a connection message from the Flume client.

Attribute

Alarm ID Alarm Severity Automatically Cleared

24003 Major Yes

Parameters

Parameter Description

ClientIP Specifies the IP address of the Flume client.

ServerIP Specifies the IP address of the Flume server.

ServerPort Specifies the port on the Flume server.

Impact on the System

The communication between the Flume client and server fails. The Flume client cannot send data to the Flume server.

Possible Causes

- The network between the Flume client and server is faulty.
- The Flume client's process is abnormal.


- The Flume client is incorrectly configured.

Procedure

Step 1 Check the network between the Flume client and server.

1. Log in to the host where the alarmed Flume client resides. Run the following command to switch to user root:
sudo su - root

2. Run the ping Flume server IP address command to check whether the network between the Flume client and server is normal.

– If yes, go to Step 2.1.

– If no, go to Step 4.1.

Step 2 Check whether the Flume client's process is normal.

1. Log in to the host where the alarmed Flume client resides. Run the following command to switch to user root:
sudo su - root

2. Run the ps -ef | grep flume | grep client command to check whether the Flume client process exists.

– If yes, go to Step 3.1.

– If no, go to Step 4.1.

Step 3 Check the Flume client configuration.

1. Log in to the host where the alarmed Flume client resides. Run the following command to switch to user root:
sudo su - root

2. Run the cd Flume installation directory/fusioninsight-flume-1.6.0/conf/ command to go to Flume's configuration directory.

3. Run the cat properties.properties command to query the current configuration file of the Flume client.

4. Check whether the properties.properties file is correctly configured according to the configuration description of the Flume agent.

– If yes, go to Step 3.5.

– If no, go to Step 4.1.

5. Modify the properties.properties configuration file.

6. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 4.1.

Step 4 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.66 ALM-24004 Flume Fails to Read Data

Description

The alarm module monitors the Flume source status. This alarm is generated when the duration for which the Flume source fails to read data exceeds the threshold.

Users can modify the threshold as required.

This alarm is cleared when the source reads data successfully.

Attribute

Alarm ID Alarm Severity Automatically Cleared

24004 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

ComponentType Specifies the component type for which the alarm is generated.

ComponentName Specifies the component name for which the alarm is generated.

Impact on the System

Data collection is stopped.

Possible Causes

- The Flume source is faulty.
- The network is faulty.

Procedure

Step 1 Check whether the Flume source is normal.


1. Check whether the Flume source is the spoolDir type.
– If yes, go to Step 1.2.
– If no, go to Step 1.3.

2. Query the spoolDir directory and check whether all files have been sent.
– If yes, no further action is required.
– If no, go to Step 1.5.

3. Check whether the Flume source is the Kafka type.
– If yes, go to Step 1.4.
– If no, go to Step 1.5.

4. Log in to the Kafka client and run the following commands to check whether all topic data configured for the Kafka source has been consumed:
cd /opt/client/Kafka/kafka/bin
./kafka-consumer-groups.sh --bootstrap-server Kafka cluster IP address:21007 --new-consumer --describe --group example-group1 --command-config ../config/consumer.properties
– If yes, no further action is required.
– If no, go to Step 1.5.

5. On MRS Manager, choose Service > Flume > Instance.
6. Click the Flume instance of the faulty node and check whether the value of the Source Speed Metrics is 0.
– If yes, go to Step 2.1.
– If no, no further action is required.

Step 2 Check the status of the network between the Flume source and faulty node.

1. Check whether the Flume source is the avro type.
– If yes, go to Step 2.3.
– If no, go to Step 3.1.

2. Log in to the host where the faulty node resides. Run the following command to switch to user root:
sudo su - root

3. Run the ping Flume source IP address command to check whether the Flume source can be pinged.
– If yes, go to Step 3.1.
– If no, go to Step 2.4.

4. Contact the network administrator to repair the network.
5. Wait for a while and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.67 ALM-24005 Data Transmission by Flume Is Abnormal

Description

The alarm module monitors the capacity of Flume channels. This alarm is generated when the duration for which a channel is full, or the number of times that a source fails to send data to the channel, exceeds the threshold.

Users can set the threshold as required by modifying the channelfullcount parameter.

This alarm is cleared when the channel space is released.

Attribute

Alarm ID Alarm Severity Automatically Cleared

24005 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

ComponentType Specifies the component type for which the alarm is generated.

ComponentName Specifies the component name for which the alarm is generated.

Impact on the System

If the usage of the Flume channel continues to grow, the data transmission time increases. When the usage reaches 100%, the Flume agent process is suspended.

Possible Causes

- The Flume sink is faulty.
- The network is faulty.


Procedure

Step 1 Check whether the Flume sink is normal.

1. Check whether the Flume sink is the HDFS type.
– If yes, go to Step 1.2.
– If no, go to Step 1.3.

2. On MRS Manager, check whether alarm ALM-14000 HDFS Service Unavailable is reported and whether the HDFS service is stopped.
– If the alarm is reported, clear it according to the handling suggestions of ALM-14000 HDFS Service Unavailable; if the HDFS service is stopped, start it. Then go to Step 1.7.
– If the alarm is not reported and the HDFS service is running properly, go to Step 1.7.

3. Check whether the Flume sink is the HBase type.
– If yes, go to Step 1.4.
– If no, go to Step 1.7.

4. On MRS Manager, check whether alarm ALM-19000 HBase Service Unavailable is reported and whether the HBase service is stopped.
– If the alarm is reported, clear it according to the handling suggestions of ALM-19000 HBase Service Unavailable; if the HBase service is stopped, start it. Then go to Step 1.7.
– If the alarm is not reported and the HBase service is running properly, go to Step 1.7.

5. Check whether the Flume sink is the Kafka type.
– If yes, go to Step 1.6.
– If no, go to Step 1.7.

6. On MRS Manager, check whether alarm ALM-38000 Kafka Service Unavailable is reported and whether the Kafka service is stopped.
– If the alarm is reported, clear it according to the handling suggestions of ALM-38000 Kafka Service Unavailable; if the Kafka service is stopped, start it. Then go to Step 1.7.
– If the alarm is not reported and the Kafka service is running properly, go to Step 1.7.

7. On MRS Manager, choose Service > Flume > Instance.
8. Click the Flume instance of the faulty node and check whether the value of the Sink Speed Metrics is 0.
– If yes, go to Step 2.1.
– If no, no further action is required.

Step 2 Check the status of the network between the Flume sink and faulty node.

1. Check whether the Flume sink is the avro type.
– If yes, go to Step 2.3.
– If no, go to Step 3.1.

2. Log in to the host where the faulty node resides. Run the following command to switch to user root:


sudo su - root
3. Run the ping Flume sink IP address command to check whether the Flume sink can be pinged.
– If yes, go to Step 3.1.
– If no, go to Step 2.4.

4. Contact the network administrator to repair the network.
5. Wait for a while and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.

Step 3 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.68 ALM-12041 Permission of Key Files Is Abnormal

Description

The system checks the permission, users, and user groups of key directories or files every hour. This alarm is generated when any of these is abnormal.

This alarm is cleared when the problem that causes abnormal permission, users, or user groups is solved.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12041 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.


Parameter Description

PathName Specifies the file path or file name.

Impact on the System

System functions are unavailable.

Possible Causes

The user has manually modified the file permission, user information, or user groups, or the system has experienced an unexpected power-off.

Procedure

Step 1 Check the file permission.

1. On MRS Manager, click Alarm.
2. In the details of the alarm, query the HostName (name of the alarmed host) and PathName (path or name of the involved file).
3. Log in to the alarm node.
4. Run the ll PathName command to query the current user, permission, and user group of the file or path.
5. Go to the ${BIGDATA_HOME}/nodeagent/etc/agent/autocheck directory and run the vi keyfile command. Search for the name of the involved file and query the correct permission of the file.
6. Compare the actual permission of the file with the permission obtained in Step 1.5. If they are different, change the actual permission, user information, and user group to the correct values (see the sketch after this list).

7. Wait until the next system check is complete and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.
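Steps 1.4 to 1.6 amount to comparing the live attributes of the file against the keyfile baseline and resetting any mismatch. A sketch (the path, mode, user, and group below are hypothetical examples):

ll /opt/example/key.cfg                                                # actual permission, user, and group
grep key.cfg ${BIGDATA_HOME}/nodeagent/etc/agent/autocheck/keyfile     # expected values recorded in keyfile
chmod 640 /opt/example/key.cfg                                         # reset the mode to the expected value
chown omm:wheel /opt/example/key.cfg                                   # reset owner and group (hypothetical values)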

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.69 ALM-12042 Key File Configurations Are Abnormal

Description

The system checks key file configurations every hour. This alarm is generated when any key configuration is abnormal.


This alarm is cleared when the configuration becomes normal.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12042 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

PathName Specifies the file path or file name.

Impact on the System

Functions related to the file are abnormal.

Possible Causes

The user has manually modified the file configurations or the system has experienced an unexpected power-off.

Procedure

Step 1 Check the file configurations.

1. On MRS Manager, click Alarm.
2. In the details of the alarm, query the HostName (name of the alarmed host) and PathName (path or name of the involved file).
3. Log in to the alarm node.
4. Manually check and modify the file configurations according to the criteria in Related Information.
5. Wait until the next system check is complete and check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.


2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

- Checking /etc/fstab
  Check whether partitions configured in /etc/fstab exist in /proc/mounts and whether swap partitions configured in /etc/fstab match those in /proc/swaps.
- Checking /etc/hosts
  Run the cat /etc/hosts command. If any of the following situations exists, the file configurations are abnormal.
  – The /etc/hosts file does not exist.
  – The host name is not configured in the file.
  – The IP address of the host is duplicate.
  – The IP address of the host does not exist in the ifconfig list.
  – An IP address in the file is used by multiple hosts.
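The /etc/fstab check can be approximated with the following sketch, which prints the configured and the actually mounted entries side by side for manual comparison:

awk '$1 !~ /^#/ {print $2}' /etc/fstab   # mount points (and swap entries) configured in fstab
awk '{print $2}' /proc/mounts            # mount points currently mounted
cat /proc/swaps                          # swap partitions currently active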

6.7.70 ALM-23001 Loader Service Unavailable

Description

The system checks the Loader service availability every 60 seconds. This alarm is generated when the Loader service is unavailable and is cleared when the Loader service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

23001 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

Data loading, import, and conversion are unavailable.


Possible Causes

- The services that Loader depends on are abnormal.
  – ZooKeeper is abnormal.
  – HDFS is abnormal.
  – DBService is abnormal.
  – Yarn is abnormal.
  – MapReduce is abnormal.
- The network is faulty. Loader cannot communicate with its dependent services.
- Loader is running improperly.

Procedure

Step 1 Check the ZooKeeper status.

1. On MRS Manager, choose Service > ZooKeeper. Check whether the health status of ZooKeeper is normal.
– If yes, go to Step 1.3.
– If no, go to Step 1.2.

2. Choose More > Restart Service to restart ZooKeeper. After ZooKeeper starts, check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 1.3.

3. On MRS Manager, check whether alarm ALM-12007 Process Fault is reported.
– If yes, go to Step 1.4.
– If no, go to Step 2.1.

4. In Alarm Details of alarm ALM-12007 Process Fault, check whether ServiceName is ZooKeeper.
– If yes, go to Step 1.5.
– If no, go to Step 2.1.

5. Clear the alarm according to the handling suggestions of ALM-12007 Process Fault.
6. Check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 2.1.

Step 2 Check the HDFS status.

1. On MRS Manager, check whether alarm ALM-14000 HDFS Service Unavailable is reported.
– If yes, go to Step 2.2.
– If no, go to Step 3.1.

2. Clear the alarm according to the handling suggestions of ALM-14000 HDFS Service Unavailable.

3. Check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 3.1.


Step 3 Check the DBService status.

1. On MRS Manager, choose Service > DBService. Check whether the health status of DBService is normal.
– If yes, go to Step 4.1.
– If no, go to Step 3.2.

2. Choose More > Restart Service to restart DBService. After DBService starts, check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 4.1.

Step 4 Check the MapReduce status.

1. On MRS Manager, choose Service > MapReduce. Check whether the health status of MapReduce is normal.
– If yes, go to Step 5.1.
– If no, go to Step 4.2.

2. Choose More > Restart Service to restart MapReduce. After MapReduce starts, check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 5.1.

Step 5 Check the Yarn status.

1. On MRS Manager, choose Service > Yarn. Check whether the health status of Yarn is normal.
– If yes, go to Step 5.3.
– If no, go to Step 5.2.

2. Choose More > Restart Service to restart Yarn. After Yarn starts, check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 5.3.

3. On MRS Manager, check whether alarm ALM-18000 Yarn Service Unavailable is reported.
– If yes, go to Step 5.4.
– If no, go to Step 6.1.

4. Clear the alarm according to the handling suggestions of ALM-18000 Yarn Service Unavailable.

5. Check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 6.1.

Step 6 Check the network connections between Loader and its dependent components.

1. On MRS Manager, choose Service > Loader.
2. Click Instance. The Sqoop instance list is displayed.
3. Record the management IP addresses of all Sqoop instances.
4. Log in to the hosts using the IP addresses obtained in Step 6.3. Run the following commands to switch the user:


sudo su - root
su - omm

5. Run the ping command to check whether the network connection between the hosts where the Sqoop instances reside and the dependent components is normal. (The dependent components include ZooKeeper, DBService, HDFS, MapReduce, and Yarn. The method to obtain the IP addresses of the dependent components is the same as that used to obtain the IP addresses of the Sqoop instances.)
– If yes, go to Step 7.1.
– If no, go to Step 6.6.

6. Contact the network administrator to repair the network.
7. Check whether alarm ALM-23001 Loader Service Unavailable is cleared.
– If yes, no further action is required.
– If no, go to Step 7.1.

Step 7 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Contact the public cloud O&M personnel and send the collected log information.

----End

6.7.71 ALM-12357 Failed to Export Audit Logs to the OBS

Description

If the user has configured audit log export to the OBS on MRS Manager, the system regularly exports audit logs to the OBS. This alarm is generated when the system fails to access the OBS.

This alarm is cleared when the system exports audit logs to the OBS successfully.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12357 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.


Impact on the System

The local system saves a maximum of seven compressed service audit log files. If this alarm persists, local service audit logs may be lost.

The local system saves a maximum of 50 management audit log files (each file contains 100,000 records). If this alarm persists, local management audit logs may be lost.

Possible Causes

- Connection to the OBS server fails.
- The specified OBS bucket does not exist.
- The user AK/SK information is invalid.
- The local OBS configuration cannot be obtained.

Procedure

Step 1 Log in to the OBS server and check whether the OBS server can be properly accessed.

- If yes, go to Step 3.
- If no, go to Step 2.

Step 2 Contact the maintenance personnel to repair the OBS. Then check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 3.

Step 3 On MRS Manager, choose System > Export Audit Log. Check whether the AK/SK information, bucket name, and path are correct.

- If yes, go to Step 5.
- If no, go to Step 4.

Step 4 Correct the information. Then check whether the alarm is cleared when the export task is executed again.

NOTE

To check alarm clearance quickly, you can set the start time of audit log collection to 10 or 30 minutes later than the current time. After checking the result, restore the original start time.

- If yes, no further action is required.
- If no, go to Step 5.

Step 5 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.72 ALM-12014 Partition Lost

Description

The system checks the partition status periodically. This alarm is generated when the system detects that a partition to which service directories are mounted is lost (because the device is removed or goes offline, or the partition is deleted).

This alarm must be manually cleared.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12014 Major No

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

DirName Specifies the directory for which the alarm is generated.

PartitionName Specifies the device partition for which the alarm is generated.

Impact on the System

Service data fails to be written into the partition, and the service system runs abnormally.

Possible Causes

- The hard disk is removed.
- The hard disk is offline, or a bad sector exists on the hard disk.

Procedure

Step 1 On MRS Manager, click Alarm, and click the alarm in the real-time alarm list.

Step 2 In the Alarm Details area, obtain HostName, PartitionName and DirName from Location.


Step 3 Check whether the disk of PartitionName on HostName is inserted into the correct server slot.

- If yes, go to Step 4.
- If no, go to Step 5.

Step 4 Contact hardware engineers to remove the faulty disk.

Step 5 Use PuTTY to log in to the HostName node where an alarm is reported and check whether there is a line containing DirName in the /etc/fstab file.

- If yes, go to Step 6.
- If no, go to Step 7.

Step 6 Run the vi /etc/fstab command to edit the file and delete the line containing DirName.
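If you prefer a non-interactive edit, the line can also be removed with grep (a sketch; /srv/BigData/data1 stands in for the DirName value from the alarm):

cp /etc/fstab /etc/fstab.bak                               # keep a backup first
grep -v "/srv/BigData/data1" /etc/fstab.bak > /etc/fstab   # drop the line containing DirName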

Step 7 Contact hardware engineers to insert a new disk. For details, see the hardware product document of the relevant model. If the faulty disk is in a RAID group, configure the RAID group. For details, see the configuration methods of the relevant RAID controller card.

Step 8 Wait 20 to 30 minutes (the waiting time depends on the disk size), and run the mount command to check whether the disk has been mounted to the DirName directory.

- If yes, manually clear the alarm. No further operation is required.
- If no, go to Step 9.

Step 9 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.73 ALM-12015 Partition Filesystem Readonly

Description

The system checks the partition status periodically. This alarm is generated when the system detects that a partition to which service directories are mounted enters the read-only mode (due to a bad sector or a faulty file system).

This alarm is cleared when the system detects that the partition to which service directories are mounted exits from the read-only mode (because the file system is restored to read/write mode, the device is removed, or the device is formatted).

Attribute

Alarm ID Alarm Severity Automatically Cleared

12015 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

DirName Specifies the directory for which the alarm is generated.

PartitionName Specifies the device partition for which the alarm is generated.

Impact on the System

Service data fails to be written into the partition, and the service system runs abnormally.

Possible Causes

The hard disk is faulty. For example, a bad sector exists.

Procedure

Step 1 On MRS Manager, click the alarm in the real-time alarm list.

Step 2 In the Alarm Details area, obtain HostName and PartitionName from Location. HostName is the node where the alarm is reported, and PartitionName is the partition of the faulty disk.

Step 3 Contact hardware engineers to check whether the disk is faulty. If the disk is faulty, remove it from the server.

Step 4 After the disk is removed, alarm ALM-12014 Partition Lost is reported. Handle the alarm. For details, see ALM-12014 Partition Lost. After the alarm ALM-12014 Partition Lost is cleared, alarm ALM-12015 Partition Filesystem Readonly is automatically cleared.

----End

Related Information

N/A

6.7.74 ALM-12043 DNS Resolution Duration Exceeds the Threshold

Description

The system checks the DNS resolution duration every 30 seconds and compares the actual DNS resolution duration with the threshold (the default threshold is 20,000 ms). This alarm is


generated when the system detects that the DNS resolution duration exceeds the threshold several times (2 times by default) consecutively.

To change the threshold, choose System > Threshold Configuration > Device > Host > Network Status > DNS Name Resolution Duration > DNS Name Resolution Duration.

When the hit number is 1, this alarm is cleared when the DNS resolution duration is less than or equal to the threshold. When the hit number is not 1, this alarm is cleared when the DNS resolution duration is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12043 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

- Kerberos-based secondary authentication is slow.
- The ZooKeeper service is abnormal.
- The node is faulty.

Possible Causes

- The node is configured with the DNS client.
- The node is equipped with the DNS server and the DNS server is started.

Procedure

Check whether the node is configured with the DNS client.

Step 1 On MRS Manager, click Alarm.

Step 2 Check the value of HostName in the detailed alarm information to obtain the name of the host involved in this alarm.

Step 3 Use PuTTY to log in to the node for which the alarm is generated as user root.


Step 4 Run the cat /etc/resolv.conf command to check whether the DNS client is installed.

If information similar to the following is displayed, the DNS client is installed and started:

nameserver 10.2.3.4
nameserver 10.2.3.4

- If yes, go to Step 5.
- If no, go to Step 7.

Step 5 Run the vi /etc/resolv.conf command to comment out the following content using number signs (#) and save the file:
# nameserver 10.2.3.4
# nameserver 10.2.3.4

Step 6 Check whether this alarm is cleared after 5 minutes.
- If yes, no further action is required.
- If no, go to Step 7.
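As an alternative to editing the file with vi, the nameserver lines can be commented out in one command (a sketch using standard sed; back up the file first):

cp /etc/resolv.conf /etc/resolv.conf.bak
sed -i 's/^nameserver/# nameserver/' /etc/resolv.conf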

Check whether the node is equipped with the DNS server and the DNS server is started.

Step 7 Run the service named status command to check whether the DNS server is installed on the node:

If information similar to the following is displayed, the DNS server is installed and started:

Checking for nameserver BIND
version: 9.6-ESV-R7-P4
CPUs found: 8
worker threads: 8
number of zones: 17
debug level: 0
xfers running: 0
xfers deferred: 0
soa queries in progress: 0
query logging is ON
recursive clients: 4/0/1000
tcp clients: 0/100
server is up and running

- If yes, go to Step 8.
- If no, go to Step 10.

Step 8 Run the service named stop command to stop the DNS service.

Step 9 Check whether this alarm is cleared after 5 minutes.
- If yes, no further action is required.
- If no, go to Step 10.

Collect fault information.

Step 10 On MRS Manager, choose System > Export Log.

Step 11 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.75 ALM-12045 Network Read Packet Dropped Rate Exceeds the Threshold

Description

The system checks the network read packet dropped rate every 30 seconds and compares the actual packet dropped rate with the threshold (the default threshold is 0.5%). This alarm is generated when the system detects that the network read packet dropped rate exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Network Reading > Network Read Packet Rate Information > Read Packet Dropped Rate.

When the hit number is 1, this alarm is cleared when the network read packet dropped rate is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the network read packet dropped rate is less than or equal to 90% of the threshold.

Alarm detection is disabled by default. If you want to enable this function, check whether alarm sending can be enabled based on section "Check the system environment."
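For a rough manual cross-check of this metric on a node, the kernel's per-NIC counters can be read directly. The sketch below is illustrative only: it assumes a Linux /sys filesystem, uses eth0 as a placeholder NIC name, and computes a cumulative since-boot rate rather than the 30-second deltas that MRS Manager samples.

#!/bin/bash
# Sketch: approximate the read (RX) packet dropped rate from kernel counters.
# Replace eth0 with the NetworkCardName value from the alarm details.
NIC=${1:-eth0}
STATS=/sys/class/net/$NIC/statistics

RX_PACKETS=$(cat "$STATS/rx_packets")
RX_DROPPED=$(cat "$STATS/rx_dropped")

# Dropped rate (%) = dropped / (received + dropped) x 100, cumulative since boot.
awk -v d="$RX_DROPPED" -v p="$RX_PACKETS" 'BEGIN {
    total = d + p
    printf "RX dropped rate: %.4f%%\n", (total > 0) ? d * 100 / total : 0
}'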

Attribute

Alarm ID Alarm Severity Automatically Cleared

12045 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NetworkCardName Specifies the network port for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The service performance deteriorates or services time out.


Precautions: In SUSE (kernel 3.0 or later) or Red Hat 7.2, because the system kernel modifies the mechanism for counting read and discarded packets, this alarm may be generated even when the network is normal. Services are not adversely affected. You are advised to check whether the alarm is caused by this problem based on section "Check the system environment."

Possible Causes

- An OS exception occurs.
- The NIC is configured in active/standby bond mode.
- The alarm threshold is set improperly.
- The network is abnormal.

Procedure

Check the network packet dropped rate.

Step 1 Use PuTTY to log in to any node for which the alarm is not generated in the cluster as user omm and run the ping IP address -c 100 command to check whether network packet loss occurs.

# ping 10.10.10.12 -c 5
PING 10.10.10.12 (10.10.10.12) 56(84) bytes of data.
64 bytes from 10.10.10.12: icmp_seq=1 ttl=64 time=0.033 ms
64 bytes from 10.10.10.12: icmp_seq=2 ttl=64 time=0.034 ms
64 bytes from 10.10.10.12: icmp_seq=3 ttl=64 time=0.021 ms
64 bytes from 10.10.10.12: icmp_seq=4 ttl=64 time=0.033 ms
64 bytes from 10.10.10.12: icmp_seq=5 ttl=64 time=0.030 ms
--- 10.10.10.12 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4001ms
rtt min/avg/max/mdev = 0.021/0.030/0.034/0.006 ms

NOTE

- IP address: indicates the value of HostName in the alarm location information. To query the values of OM IP and Business IP, click Host on MRS Manager.

- -c: indicates the number of checks. The default value is 100.

- If yes, go to Step 10.
- If no, go to Step 2.

Check the system environment.

Step 2 Use PuTTY to log in as user omm to the active OMS node or the node for which the alarm is generated.

Step 3 Run the cat /etc/*-release command to check the OS type.

- If EulerOS is used, go to Step 4.

# cat /etc/*-release
EulerOS release 2.0 (SP2)
EulerOS release 2.0 (SP2)

- If SUSE is used, go to Step 5.

# cat /etc/*-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3

- If another OS is used, go to Step 10.

Step 4 Run the cat /etc/euleros-release command to check whether the OS version is EulerOS 2.2.

# cat /etc/euleros-release
EulerOS release 2.0 (SP2)

- If yes, the alarm sending function cannot be enabled. Go to Step 6.
- If no, go to Step 10.

Step 5 Run the cat /proc/version command to check whether the SUSE kernel version is 3.0 or later.

# cat /proc/version
Linux version 3.0.101-63-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Tue Jun 23 16:02:31 UTC 2015 (4b89d0c)

- If yes, the alarm sending function cannot be enabled. Go to Step 6.
- If no, go to Step 10.

Step 6 Log in to MRS Manager and choose System > Configuration > Threshold Configuration.

Step 7 In the navigation tree of the Threshold Configuration page, choose Network Reading > Network Read Packet Rate Information > Read Packet Dropped Rate. In the area on the right, check whether Send Alarm is selected.

- If yes, the alarm sending function has been enabled. Go to Step 8.
- If no, the alarm sending function has been disabled. Go to Step 9.

Step 8 In the area on the right, deselect Send Alarm to disable the checking of Network Read Packet Dropped Rate Exceeds the Threshold.

Step 9 On the Alarm page of MRS Manager, search for the 12045 alarm. If the alarm is not cleared automatically, clear it manually. No further action is required.

NOTE

The ID of alarm Network Read Packet Dropped Rate Exceeds the Threshold is 12045.

Check whether the NIC has configured the active/standby bond mode.

Step 10 Use PuTTY to log in to the alarm node as user omm. Run the ls -l /proc/net/bonding command to check whether the directory /proc/net/bonding exists on the alarm node.

- If yes, the NIC is configured in active/standby bond mode, as shown in the following. Go to Step 11.

# ls -l /proc/net/bonding/
total 0
-r--r--r-- 1 root root 0 Oct 11 17:35 bond0

- If no, the NIC is not configured in active/standby bond mode, as shown in the following. Go to Step 13.

# ls -l /proc/net/bonding/
ls: cannot access /proc/net/bonding/: No such file or directory

Step 11 Run the cat /proc/net/bonding/bond0 command and check whether the value of Bonding Mode is fault-tolerance.

NOTE

bond0 indicates the name of the bond configuration file. Use the file name queried in Step 10 in practice.

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Slave queue ID: 0

- If yes, the NIC is configured in active/standby bond mode. Go to Step 12.
- If no, the NIC is not configured in active/standby bond mode. Go to Step 13.

Step 12 Check whether the NIC specified by the NetworkCardName parameter in the alarm details is the standby NIC.

- If yes, manually clear the alarm on the Alarms page, because an alarm on the standby NIC cannot be automatically cleared. No further action is required.
- If no, go to Step 13.

NOTE

Method of determining whether an NIC is the standby one: In the /proc/net/bonding/bond0 configuration file, check whether the NIC name in the NetworkCardName parameter is the same as a Slave Interface but different from Currently Active Slave (the current active NIC). If so, the NIC is a standby one.
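The check described in this note can also be scripted. The sketch below is a non-authoritative illustration: the bond file path and eth0 are example values taken from the output shown above.

#!/bin/bash
# Sketch: decide whether a NIC is a standby slave of an active/standby bond.
BOND_FILE=${1:-/proc/net/bonding/bond0}   # example bond configuration file
NIC=${2:-eth0}                            # example NIC from NetworkCardName

ACTIVE=$(awk -F': ' '/Currently Active Slave/ {print $2}' "$BOND_FILE")
if grep -q "Slave Interface: $NIC" "$BOND_FILE" && [ "$NIC" != "$ACTIVE" ]; then
    echo "$NIC is a standby slave (active slave is $ACTIVE)"
else
    echo "$NIC is not a standby slave"
fi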

Check whether the threshold is set properly.

Step 13 Log in to MRS Manager and check whether the alarm threshold is set properly. (By default, 0.5% is a proper value. However, users can configure the value as required.)

- If yes, go to Step 16.
- If no, go to Step 14.

Step 14 Based on the actual usage condition, choose System > Threshold Configuration > Device > Host > Network Reading > Network Read Packet Rate Information > Read Packet Dropped Rate to modify the alarm threshold.

For details, see Figure 6-1.


Figure 6-1 Setting alarm thresholds

Step 15 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 16.

Check whether the network is normal.

Step 16 Contact the system administrator to check whether the network is abnormal.

- If yes, go to Step 17 to rectify the network fault.
- If no, go to Step 18.

Step 17 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 18.

Collect fault information.

Step 18 On MRS Manager, choose System > Export Log.

Step 19 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.76 ALM-12046 Network Write Packet Dropped Rate Exceeds the Threshold

Description

The system checks the network write packet dropped rate every 30 seconds and compares the actual packet dropped rate with the threshold (the default threshold is 0.5%). This alarm is generated when the system detects that the network write packet dropped rate exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Network Writing > Network Write Packet Rate Information > Write Packet Dropped Rate.

When the hit number is 1, this alarm is cleared when the network write packet dropped rate is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the network write packet dropped rate is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12046 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NetworkCardName Specifies the network port for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The service performance deteriorates or services time out.


Possible Causes

- The alarm threshold is set improperly.
- The network is abnormal.

Procedure

Check whether the threshold is set properly.

Step 1 Log in to MRS Manager and check whether the alarm threshold is set properly. (By default, 0.5% is a proper value. However, users can configure the value as required.)

- If yes, go to Step 4.
- If no, go to Step 2.

Step 2 Based on the actual usage condition, choose System > Threshold Configuration > Device > Host > Network Writing > Network Write Packet Rate Information > Write Packet Dropped Rate to modify the alarm threshold.

Step 3 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 4.

Check whether the network is normal.

Step 4 Contact the system administrator to check whether the network is abnormal.

- If yes, go to Step 5 to rectify the network fault.
- If no, go to Step 6.

Step 5 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 6.

Collect fault information.

Step 6 On MRS Manager, choose System > Export Log.

Step 7 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.77 ALM-12047 Network Read Packet Error Rate Exceeds the Threshold

Description

The system checks the network read packet error rate every 30 seconds and compares the actual packet error rate with the threshold (the default threshold is 0.5%). This alarm is generated when the system detects that the network read packet error rate exceeds the threshold several times consecutively (5 times by default).


To change the threshold, choose System > Threshold Configuration > Device > Host > Network Reading > Network Read Packet Rate Information > Read Packet Error Rate.

When the hit number is 1, this alarm is cleared when the network read packet error rate is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the network read packet error rate is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12047 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NetworkCardName Specifies the network port for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Communication is intermittently interrupted and services time out.

Possible Causes

- The alarm threshold is set improperly.
- The network is abnormal.

Procedure

Check whether the threshold is set properly.

Step 1 Log in to MRS Manager and check whether the alarm threshold is set properly. (By default, 0.5% is a proper value. However, users can configure the value as required.)

- If yes, go to Step 4.
- If no, go to Step 2.

Step 2 Based on the actual usage condition, choose System > Threshold Configuration > Device > Host > Network Reading > Network Read Packet Rate Information > Read Packet Error Rate to modify the alarm threshold.

Step 3 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 4.

Check whether the network is normal.

Step 4 Contact the system administrator to check whether the network is abnormal.

- If yes, go to Step 5 to rectify the network fault.
- If no, go to Step 6.

Step 5 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 6.

Collect fault information.

Step 6 On MRS Manager, choose System > Export Log.

Step 7 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.78 ALM-12048 Network Write Packet Error Rate Exceeds the Threshold

Description

The system checks the network write packet error rate every 30 seconds and compares the actual packet error rate with the threshold (the default threshold is 0.5%). This alarm is generated when the system detects that the network write packet error rate exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Network Writing > Network Write Packet Rate Information > Write Packet Error Rate.

When the hit number is 1, this alarm is cleared when the network write packet error rate is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the network write packet error rate is less than or equal to 90% of the threshold.


Attribute

Alarm ID Alarm Severity Automatically Cleared

12048 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NetworkCardName Specifies the network port for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Communication is intermittently interrupted and services time out.

Possible Causes

- The alarm threshold is set improperly.
- The network is abnormal.

Procedure

Check whether the threshold is set properly.

Step 1 Log in to MRS Manager and check whether the alarm threshold is set properly. (By default, 0.5% is a proper value. However, users can configure the value as required.)

- If yes, go to Step 4.
- If no, go to Step 2.

Step 2 Based on the actual usage condition, choose System > Threshold Configuration > Device > Host > Network Writing > Network Write Packet Rate Information > Write Packet Error Rate to modify the alarm threshold.

Step 3 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 4.

Check whether the network is normal.

Step 4 Contact the system administrator to check whether the network is abnormal.

- If yes, go to Step 5 to rectify the network fault.
- If no, go to Step 6.

Step 5 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 6.

Collect fault information.

Step 6 On MRS Manager, choose System > Export Log.

Step 7 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.79 ALM-12049 Network Read Throughput Rate Exceeds the Threshold

Description

The system checks the network read throughput rate every 30 seconds and compares the actual throughput rate with the threshold (the default threshold is 80%). This alarm is generated when the system detects that the network read throughput rate exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Network Reading > Network Read Throughput Rate > Read Throughput Rate.

When the hit number is 1, this alarm is cleared when the network read throughput rate is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the network read throughput rate is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12049 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NetworkCardName Specifies the network port for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The service system runs improperly or is unavailable.

Possible Causes

- The alarm threshold is set improperly.
- The network port rate cannot meet the current service requirements.

Procedure

Check whether the threshold is set properly.

Step 1 Log in to MRS Manager and check whether the alarm threshold is set properly. (By default, 80% is a proper value. However, users can configure the value as required.)

- If yes, go to Step 4.
- If no, go to Step 2.

Step 2 Based on the actual usage condition, choose System > Threshold Configuration > Device > Host > Network Reading > Network Read Throughput Rate > Read Throughput Rate to modify the alarm threshold.

Step 3 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 4.

Check whether the network port rate can meet the service requirements.

Step 4 On MRS Manager, click the alarm in the real-time alarm list. In the Alarm Details area, obtain the IP address of the host and the network port name for which the alarm is generated.

Step 5 Use PuTTY to log in to the host for which the alarm is generated as user root.


Step 6 Run the ethtool network port name command to check the maximum speed of the current network port.

NOTE

In a VM environment, you cannot run a command to query the network port rate. It is recommended that you contact the system administrator to confirm whether the network port rate meets the requirements.
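Where ethtool does report a link speed, the current throughput can be compared against it manually. The sketch below is purely illustrative: it samples the RX byte counter over one second, and eth0 is a placeholder for the alarmed port.

#!/bin/bash
# Sketch: estimate the current RX throughput of a port and print its link speed.
NIC=${1:-eth0}

BYTES1=$(cat /sys/class/net/$NIC/statistics/rx_bytes)
sleep 1
BYTES2=$(cat /sys/class/net/$NIC/statistics/rx_bytes)

# Throughput in Mbit/s over the one-second sampling window.
echo "RX throughput: $(( (BYTES2 - BYTES1) * 8 / 1000000 )) Mbit/s"
ethtool "$NIC" | grep Speed    # negotiated maximum, if the driver reports it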

Step 7 If the network read throughput rate exceeds the threshold, contact the system administrator to increase the network port rate.

Step 8 Check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 9.

Collect fault information.

Step 9 On MRS Manager, choose System > Export Log.

Step 10 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.80 ALM-12050 Network Write Throughput Rate Exceeds the Threshold

Description

The system checks the network write throughput rate every 30 seconds and compares the actual throughput rate with the threshold (the default threshold is 80%). This alarm is generated when the system detects that the network write throughput rate exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Network Writing > Network Write Throughput Rate > Write Throughput Rate.

When the hit number is 1, this alarm is cleared when the network write throughput rate is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the network write throughput rate is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12050 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

NetworkCardName Specifies the network port for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

The service system runs improperly or is unavailable.

Possible Causes

- The alarm threshold is set improperly.
- The network port rate cannot meet the current service requirements.

Procedure

Check whether the threshold is set properly.

Step 1 Log in to MRS Manager and check whether the alarm threshold is set properly. (By default, 80% is a proper value. However, users can configure the value as required.)

- If yes, go to Step 4.
- If no, go to Step 2.

Step 2 Based on the actual usage condition, choose System > Threshold Configuration > Device > Host > Network Writing > Network Write Throughput Rate > Write Throughput Rate to modify the alarm threshold.

Step 3 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 4.

Check whether the network port rate can meet the service requirements.

Step 4 On MRS Manager, click the alarm in the real-time alarm list. In the Alarm Details area, obtain the IP address of the host and the network port name for which the alarm is generated.

Step 5 Use PuTTY to log in to the host for which the alarm is generated as user root.


Step 6 Run the ethtool network port name command to check the maximum speed of the current network port.

NOTE

In a VM environment, you cannot run a command to query the network port rate. It is recommended that you contact the system administrator to confirm whether the network port rate meets the requirements.

Step 7 If the network write throughput rate exceeds the threshold, contact the system administrator to increase the network port rate.

Step 8 Check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 9.

Collect fault information.

Step 9 On MRS Manager, choose System > Export Log.

Step 10 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.81 ALM-12051 Disk Inode Usage Exceeds the Threshold

Description

The system checks the disk Inode usage every 30 seconds and compares the actual Inode usage with the threshold (the default threshold is 80%). This alarm is generated when the Inode usage exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Disk > Disk Inode Usage > Disk Inode Usage.

When the hit number is 1, this alarm is cleared when the disk Inode usage is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the disk Inode usage is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12051 Major Yes


Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

PartitionName Specifies the disk partition for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Data cannot be properly written to the file system.

Possible Causes

- Massive small files are stored in the disk.
- The system is abnormal.

Procedure

Massive small files are stored in the disk.

Step 1 On MRS Manager, click the alarm in the real-time alarm list. In the Alarm Details area, obtain the IP address of the host and the disk partition for which the alarm is generated.

Step 2 Use PuTTY to log in to the host for which the alarm is generated as user root.

Step 3 Run the df -i partition name command to check the current disk Inode usage.

Step 4 If the Inode usage exceeds the threshold, manually check the small files stored in the disk partition and confirm whether these small files can be deleted.

- If yes, delete the files and go to Step 5.
- If no, adjust the capacity. For details, see the FusionInsight HD Capacity Adjustment Guide. Go to Step 6.
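When deciding which directories to clean up, counting files per directory narrows the search. The following helper is an illustrative sketch only: /srv is a placeholder mount point, and the scan may take a while on partitions that hold very many files.

#!/bin/bash
# Sketch: show Inode usage of a partition and its first-level directories
# ranked by file count, to locate where the small files accumulate.
PARTITION=${1:-/srv}        # placeholder; use the PartitionName from the alarm

df -i "$PARTITION"          # current Inode usage of the partition

for dir in "$PARTITION"/*/; do
    printf "%s %s\n" "$(find "$dir" -xdev -type f | wc -l)" "$dir"
done | sort -rn | head -10  # ten directories with the most files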

Step 5 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 6.

Check whether the system environment is abnormal.

Step 6 Contact the operating system maintenance personnel to check whether the operating system is abnormal.

- If yes, go to Step 7 to rectify the fault.
- If no, go to Step 8.

Step 7 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 8.

Collect fault information.

Step 8 On MRS Manager, choose System > Export Log.

Step 9 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.82 ALM-12052 TCP Temporary Port Usage Exceeds the Threshold

Description

The system checks the TCP temporary port usage every 30 seconds and compares the actual usage with the threshold (the default threshold is 80%). This alarm is generated when the TCP temporary port usage exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Network Status > TCP Ephemeral Port Usage > TCP Ephemeral Port Usage.

When the hit number is 1, this alarm is cleared when the TCP temporary port usage is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the TCP temporary port usage is less than or equal to 90% of the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12052 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Services on the host cannot establish external connections, and therefore they are interrupted.

Possible Causes

- The temporary port range cannot meet the current service requirements.
- The system is abnormal.

Procedure

Expand the temporary port number range.

Step 1 On MRS Manager, click the alarm in the real-time alarm list. In the Alarm Details area, obtain the IP address of the host for which the alarm is generated.

Step 2 Use PuTTY to log in to the host for which the alarm is generated as user omm.

Step 3 Run the cat /proc/sys/net/ipv4/ip_local_port_range |cut -f 1 command to obtain the value of the start port and run the cat /proc/sys/net/ipv4/ip_local_port_range |cut -f 2 command to obtain the value of the end port. The total number of temporary ports is the value of the end port minus the value of the start port. If the total number of temporary ports is smaller than 28,232, the random port range of the OS is narrow. Contact the system administrator to increase the port range.

Step 4 Run the ss -ant 2>/dev/null | grep -v LISTEN | awk 'NR > 2 {print $4}' | cut -d ':' -f 2 | awk '$1 > "Value of the start port" {print $1}' | sort -u | wc -l command to calculate the number of used temporary ports.

Step 5 The formula for calculating the usage of the temporary ports is: Usage of the temporary ports = (Number of used temporary ports/Total number of temporary ports) x 100%. Check whether the temporary port usage exceeds the threshold.

- If yes, go to Step 7.
- If no, go to Step 6.
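Steps 3 to 5 can be combined into a single check. The sketch below mirrors the commands given above and is illustrative only; the numeric comparison in awk is forced so that port numbers are not compared as strings.

#!/bin/bash
# Sketch: compute TCP ephemeral port usage as described in Steps 3 to 5.
read START_PORT END_PORT < /proc/sys/net/ipv4/ip_local_port_range
TOTAL=$((END_PORT - START_PORT))

USED=$(ss -ant 2>/dev/null | grep -v LISTEN | awk 'NR > 2 {print $4}' \
       | cut -d ':' -f 2 | awk -v s="$START_PORT" '$1 + 0 > s + 0 {print $1}' \
       | sort -u | wc -l)

# Usage (%) = used temporary ports / total temporary ports x 100
awk -v u="$USED" -v t="$TOTAL" \
    'BEGIN { printf "Ephemeral port usage: %.2f%%\n", u * 100 / t }'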

Step 6 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 7.

Check whether the system environment is abnormal.

Step 7 Run the following command to export the connection information to a temporary file, and then view the frequently used ports in the port_result.txt file:

netstat -tnp > $BIGDATA_HOME/tmp/port_result.txt


netstat -tnp

Active Internet connections (w/o servers)

Proto Recv Send LocalAddress ForeignAddress State PID/ProgramName
tcp 0 0 10-120-85-154:45433 10-120-8:25009 CLOSE_WAIT 94237/java
tcp 0 0 10-120-85-154:45434 10-120-8:25009 CLOSE_WAIT 94237/java
tcp 0 0 10-120-85-154:45435 10-120-8:25009 CLOSE_WAIT 94237/java
...

Step 8 Run the following command to view the processes that occupy a large number of ports:

ps -ef |grep PID

NOTE

- PID is the process ID queried in Step 7.

- Run the following command to collect information about all processes and check the processes that occupy a large number of ports:

ps -ef > $BIGDATA_HOME/tmp/ps_result.txt

Step 9 After obtaining the administrator's approval, clear the processes that occupy a large number of ports. Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 10.

Collect fault information.

Step 10 On MRS Manager, choose System > Export Log.

Step 11 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.83 ALM-12053 File Handle Usage Exceeds the Threshold

Description

The system checks the file handle usage every 30 seconds and compares the actual usage with the threshold (the default threshold is 80%). This alarm is generated when the file handle usage exceeds the threshold several times consecutively (5 times by default).

To change the threshold, choose System > Threshold Configuration > Device > Host > Host Status > Host File Handle Usage > Host File Handle Usage.

When the hit number is 1, this alarm is cleared when the host file handle usage is less than or equal to the threshold. When the hit number is greater than 1, this alarm is cleared when the host file handle usage is less than or equal to 90% of the threshold.


Attribute

Alarm ID Alarm Severity Automatically Cleared

12053 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

I/O operations, such as opening a file or connecting to the network, cannot be performed, and programs become abnormal.

Possible Causes

- The number of file handles cannot meet the current service requirements.
- The system is abnormal.

Procedure

Increase the number of file handles.

Step 1 On MRS Manager, click the alarm in the real-time alarm list. In the Alarm Details area, obtain the IP address of the host for which the alarm is generated.

Step 2 Use PuTTY to log in to the host for which the alarm is generated as user root.

Step 3 Run the ulimit -n command to check the current maximum number of file handles of the system.

Step 4 If the file handle usage exceeds the threshold, contact the system administrator to increase the number of file handles.
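System-wide handle usage can also be read directly from /proc. The sketch below is illustrative: it uses the kernel's file-nr counters, whose three fields are the number of allocated handles, the number of allocated-but-unused handles, and the system-wide maximum.

#!/bin/bash
# Sketch: compute system-wide file handle usage from /proc/sys/fs/file-nr.
read ALLOCATED UNUSED MAX < /proc/sys/fs/file-nr

awk -v a="$ALLOCATED" -v u="$UNUSED" -v m="$MAX" 'BEGIN {
    printf "File handles in use: %d of %d (%.2f%%)\n", a - u, m, (a - u) * 100 / m
}'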

Step 5 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 6.


Check whether the system environment is abnormal.

Step 6 Contact the system administrator to check whether the operating system is abnormal.

- If yes, go to Step 7 to rectify the fault.
- If no, go to Step 8.

Step 7 Wait 5 minutes and check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 8.

Collect fault information.

Step 8 On MRS Manager, choose System > Export Log.

Step 9 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.84 ALM-12054 The Certificate File Is Invalid

Description

The system checks whether the certificate file is invalid (has expired or is not yet valid) at 23:00 every day. This alarm is generated when the certificate file is invalid.

This alarm is cleared if the status of the newly imported certificate is valid.

Attribute

Alarm ID Alarm Severity Automatically Cleared

12054 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.


Impact on the System

The system reminds users that the certificate file is invalid. If the certificate file expires, some functions are restricted and cannot be used properly.

Possible Causes

No HA root certificate or HA user certificate is imported, certificate import fails, or the certificate file is invalid.

Procedure

Locate the alarm cause.

Step 1 On MRS Manager, view the real-time alarm list and locate the target alarm.

In the Alarm Details area, view the additional information about the alarm.

- If CA Certificate is displayed in the additional information, use PuTTY to log in to the active OMS node as user omm and go to Step 2.

- If HA root Certificate is displayed in the additional information, check Location to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user omm and go to Step 3.

- If HA server Certificate is displayed in the additional information, check Location to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user omm and go to Step 4.

Check the validity period of the certificate file.

Step 2 Check whether the current system time is in the validity period of the CA certificate.

Run the openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/cert/root/ca.crt command to check the effective time and expiration time of the CA certificate.

- If yes, go to Step 7.
- If no, go to Step 5.

Step 3 Check whether the current system time is in the validity period of the HA root certificate.

Run the openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/root-ca.crt command to check the effective time and expiration time of the HA root certificate.

- If yes, go to Step 7.
- If no, go to Step 6.

Step 4 Check whether the current system time is in the validity period of the HA user certificate.

Run the openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/server.crt command to check the effective time and expiration time of the HA user certificate.

- If yes, go to Step 7.
- If no, go to Step 6.

The following is an example of the effective time and expiration time of an HA or CA certificate:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 97:d5:0e:84:af:ec:34:d8
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CountryName, ST=State, L=Locality, O=Organization, OU=IT, CN=HADOOP.COM
        Validity
            Not Before: Dec 13 06:38:26 2016 GMT    //The effective time.
            Not After : Dec 11 06:38:26 2026 GMT    //The expiration time.
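Instead of reading the Validity block by eye, openssl can answer the expiry question directly. The sketch below is illustrative, reusing the CA certificate path from Step 2; note that -checkend only tests the expiration date, so a certificate that is not yet valid still requires the -text output above.

#!/bin/bash
# Sketch: test whether a certificate has already expired.
CERT=${CONTROLLER_HOME}/security/cert/root/ca.crt   # path from Step 2

if openssl x509 -noout -checkend 0 -in "$CERT"; then
    echo "Certificate has not expired"
else
    echo "Certificate has expired"
fi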

Import the certificate file.

Step 5 Import a new CA certificate file.

Apply for or generate a CA certificate file and import it to the system. For details, see section Replacing HA Certificates in the Administrator Guide. Manually clear the alarm and check whether this alarm is generated again during the periodic check.

- If yes, go to Step 7.
- If no, no further action is required.

Step 6 Import a new HA certificate file.

Apply for or generate an HA certificate file and import it to the system. For details, see section Replacing HA Certificates in the Administrator Guide. Manually clear the alarm and check whether this alarm is generated again during the periodic check.

- If yes, go to Step 7.
- If no, no further action is required.

Collect fault information.

Step 7 On MRS Manager, choose System > Export Log.

Step 8 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.85 ALM-12055 The Certificate File Is About to Expire

Description

The system checks the certificate file at 23:00 every day. This alarm is generated if the time left before the certificate file expires is shorter than the threshold, that is, the certificate file is about to expire. For details about how to configure the alarm threshold duration, see section Configuring the Threshold for the Alarm Stating That the Certificate Is About to Expire in the Administrator Guide.

This alarm is cleared if the status of the newly imported certificate is valid.


Attribute

Alarm ID Alarm Severity Automatically Cleared

12055 Minor Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The system reminds users that the certificate file is about to expire. If the certificate file expires, some functions are restricted and cannot be used properly.

Possible Causes

The remaining validity period of the CA certificate, HA root certificate (root-ca.crt), or HA user certificate (server.crt) is smaller than the alarm threshold.

Procedure

Locate the alarm cause.

Step 1 On MRS Manager, view the real-time alarm list and locate the target alarm.

In the Alarm Details area, view the additional information about the alarm.

- If CA Certificate is displayed in the additional information, use PuTTY to log in to the active OMS node as user omm and go to Step 2.

- If HA root Certificate is displayed in the additional information, check Location to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user omm and go to Step 3.

- If HA server Certificate is displayed in the additional information, check Location to obtain the name of the host involved in this alarm. Then use PuTTY to log in to the host as user omm and go to Step 4.

Check the validity period of the certificate file.

Step 2 Check whether the remaining validity period of the CA certificate is smaller than the alarm threshold.

Run the openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/cert/root/ca.crt command to check the effective time and expiration time of the CA certificate.

- If yes, go to Step 5.
- If no, go to Step 7.

Step 3 Check whether the remaining validity period of the HA root certificate is smaller than the alarm threshold.

Run the openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/root-ca.crt command to check the effective time and expiration time of the HA root certificate.

- If yes, go to Step 6.
- If no, go to Step 7.

Step 4 Check whether the remaining validity period of the HA user certificate is smaller than the alarm threshold.

Run the openssl x509 -noout -text -in ${CONTROLLER_HOME}/security/certHA/server.crt command to check the effective time and expiration time of the HA user certificate.

- If yes, go to Step 6.
- If no, go to Step 7.

The following is an example of the effective time and expiration time of an HA or CA certificate:

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 97:d5:0e:84:af:ec:34:d8
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CountryName, ST=State, L=Locality, O=Organization, OU=IT, CN=HADOOP.COM
        Validity
            Not Before: Dec 13 06:38:26 2016 GMT    //The effective time.
            Not After : Dec 11 06:38:26 2026 GMT    //The expiration time.
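The same openssl -checkend test can express "expires within the threshold" by converting the threshold to seconds. The sketch below is illustrative; the 30-day value is an assumption and should be replaced with the threshold configured for this alarm.

#!/bin/bash
# Sketch: warn if a certificate expires within an assumed 30-day threshold.
CERT=${CONTROLLER_HOME}/security/certHA/server.crt  # HA user certificate from Step 4
THRESHOLD_DAYS=30                                   # assumption; use the configured value

if openssl x509 -noout -checkend $((THRESHOLD_DAYS * 86400)) -in "$CERT"; then
    echo "Certificate remains valid beyond the ${THRESHOLD_DAYS}-day threshold"
else
    echo "Certificate expires within ${THRESHOLD_DAYS} days"
fi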

Import the certificate file.

Step 5 Import a new CA certificate file.

Apply for or generate a CA certificate file and import it to the system. For details, see section Replacing HA Certificates in the Administrator Guide. Manually clear the alarm and check whether this alarm is generated again during the periodic check.

- If yes, go to Step 7.
- If no, no further action is required.

Step 6 Import a new HA certificate file.

Apply for or generate an HA certificate file and import it to the system. For details, see section Replacing HA Certificates in the Administrator Guide. Manually clear the alarm and check whether this alarm is generated again during the periodic check.

- If yes, go to Step 7.
- If no, no further action is required.

Collect fault information.


Step 7 On MRS Manager, choose System > Export Log.

Step 8 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.86 ALM-18008 Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold

Description

The system checks the heap memory usage of Yarn ResourceManager every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the heap memory usage of Yarn ResourceManager exceeds the threshold (80% of the maximum memory by default).

To change the threshold, choose System > Threshold Configuration > Service > Yarn. This alarm is cleared when the heap memory usage of Yarn ResourceManager is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

18008 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.


Impact on the System

Excessively high heap memory usage of Yarn ResourceManager deteriorates Yarn task submission and running performance, or even causes OOM, which makes the Yarn service unavailable.

Possible Causes

The heap memory of the Yarn ResourceManager instance is overused or inappropriately allocated.

Procedure

Check the heap memory usage.

Step 1 On MRS Manager, click Alarm and select the alarm whose Alarm ID is 18008. Then check the IP address and role name of the instance in Location.

Step 2 On MRS Manager, choose Service > Yarn > Instance > ResourceManager > Customize > Percentage of Used Heap Memory of the ResourceManager.

Step 3 Check whether the used heap memory of ResourceManager reaches 80% of the maximum heap memory specified for ResourceManager.

- If yes, go to Step 4.
- If no, go to Step 6.

Step 4 On MRS Manager, choose Service > Yarn > Service Configuration > All > ResourceManager > System. Increase the value of -Xmx in the GC_OPTS parameter as required, click Save Configuration, and select Restart the affected services or instance. Click OK to restart the role instance.
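As a purely illustrative example of this adjustment, a GC_OPTS value whose -Xmx has been raised might look like the following; the sizes are assumptions, not recommended values, and the other options should be kept as configured in your cluster.

# Hypothetical GC_OPTS value for ResourceManager; only -Xmx is being raised here.
GC_OPTS="-Xms2G -Xmx4G -XX:NewSize=512M -XX:MaxNewSize=1G"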

Step 5 Check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 6.

Collect fault information.

Step 6 On MRS Manager, choose System > Export Log.

Step 7 Select the following nodes from the Service drop-down list and click OK:

- NodeAgent
- Yarn

Step 8 Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

Step 9 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A


6.7.87 ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold

Description

The system checks the heap memory usage of MapReduce JobHistoryServer every 30 seconds and compares the actual usage with the threshold. The alarm is generated when the heap memory usage of MapReduce JobHistoryServer exceeds the threshold (80% of the maximum memory by default).

To change the threshold, choose System > Threshold Configuration > Service > MapReduce. This alarm is cleared when the heap memory usage of MapReduce JobHistoryServer is less than or equal to the threshold.

Attribute

Alarm ID Alarm Severity Automatically Cleared

18009 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Trigger Condition Generates an alarm when the actual indicator value exceeds the specified threshold.

Impact on the System

Excessively high heap memory usage of MapReduce JobHistoryServer deteriorates the performance of MapReduce log archiving, or even causes OOM, which makes the MapReduce service unavailable.

Possible Causes

The heap memory of the MapReduce JobHistoryServer instance is overused or inappropriately allocated.


Procedure

Check the memory usage.

Step 1 On MRS Manager, click Alarm and select the alarm whose Alarm ID is 18009. Then check the IP address and role name of the instance in Location.

Step 2 On MRS Manager, choose Service > MapReduce > Instance > JobHistoryServer > Customize > Percentage of Used Heap Memory of the JobHistoryServer.

Step 3 JobHistoryServer indicates the instance corresponding to the HostName for which the alarm is generated. Check its heap memory usage.

Step 4 Check whether the used heap memory of JobHistoryServer reaches 80% of the maximum heap memory specified for JobHistoryServer.

- If yes, go to Step 5.
- If no, go to Step 7.

Step 5 On MRS Manager, choose Service > MapReduce > Service Configuration > All > JobHistoryServer > System. Increase the value of -Xmx in the GC_OPTS parameter as required, click Save Configuration, and select Restart the affected services or instance. Click OK to restart the role instance.

Step 6 Check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 7.

Collect fault information.

Step 7 On MRS Manager, choose System > Export Log.

Step 8 Select the following nodes from the Service drop-down list and click OK:

- NodeAgent
- MapReduce

Step 9 Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

Step 10 Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.88 ALM-20002 Hue Service Unavailable

Description

The system checks the Hue service status every 60 seconds. This alarm is generated when the Hue service is unavailable and is cleared when the Hue service recovers.


Attribute

Alarm ID Alarm Severity Automatically Cleared

20002 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The system cannot provide data loading, query, and extraction services.

Possible Causes

- The internal KrbServer service on which the Hue service depends is abnormal.
- The internal DBService service on which the Hue service depends is abnormal.
- The network connection to the DBService is abnormal.

Procedure

Check whether the KrbServer is abnormal.

Step 1 On the MRS Manager home page, click Service. In the service list, check whether the KrbServer health status is Good.

- If yes, go to Step 4.
- If no, go to Step 2.

Step 2 Click Restart in the Operation column of the KrbServer to restart the KrbServer.

Step 3 Wait several minutes, and check whether ALM-20002 Hue Service Unavailable is cleared.

- If yes, no further action is required.
- If no, go to Step 4.

Check whether the DBService is abnormal.

Step 4 On the MRS Manager home page, click Service.

Step 5 In the service list, check whether the DBService health status is Good.

- If yes, go to Step 8.
- If no, go to Step 6.

Step 6 Click Restart in the Operation column of the DBService to restart the DBService.

NOTE

To restart the service, enter the MRS Manager administrator password and select Start or restart related services.

Step 7 Wait several minutes, and check whether ALM-20002 Hue Service Unavailable is cleared.

- If yes, no further action is required.
- If no, go to Step 8.

Check whether the network connection to the DBService is normal.

Step 8 Choose Service > Hue > Instance and record the IP address of the active Hue.

Step 9 Use PuTTY to log in to the active Hue.

Step 10 Run the ping command to check whether communication between the host that runs the active Hue and the hosts that run the DBService is normal. (Obtain the IP addresses of the hosts that run the DBService in the same way as that for obtaining the IP address of the active Hue.)

- If yes, go to Step 13.
- If no, go to Step 11.

Step 11 Contact the administrator to restore the network.

Step 12 Wait several minutes, and check whether ALM-20002 Hue Service Unavailable is cleared.

- If yes, no further action is required.
- If no, go to Step 13.

Collect fault information.

Step 13 On MRS Manager, choose System > Export Log.

Step 14 Select the following nodes from the Service drop-down list and click OK:

- Hue
- Controller

Step 15 Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

Restart the Hue service.

Step 16 On MRS Manager, choose Service > Hue.

Step 17 Choose More Actions > Restart service, and click OK.

Step 18 Check whether the alarm is cleared.

- If yes, no further action is required.
- If no, go to Step 19.

Step 19 Contact Technical Support.

Step 20 Contact the public cloud O&M personnel and send the collected log information.

----End


Related Information

N/A

6.7.89 ALM-43001 Spark Service Unavailable

Description

The system checks the Spark service status every 60 seconds. This alarm is generated when the Spark service is unavailable and is cleared when the Spark service recovers.

Attribute

Alarm ID Alarm Severity Automatically Cleared

43001 Critical Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

The tasks submitted by users fail to be executed.

Possible Causes

- The KrbServer service is abnormal.
- The LdapServer service is abnormal.
- The ZooKeeper service is abnormal.
- The HDFS service is abnormal.
- The Yarn service is abnormal.
- The corresponding Hive service is abnormal.

Procedure

Step 1 Check whether service unavailability alarms exist in services on which Spark depends.

1. On MRS Manager, click Alarm.


2. Check whether the following alarms exist in the alarm list:
   – ALM-25500 KrbServer Service Unavailable
   – ALM-25000 LdapServer Service Unavailable
   – ALM-13000 ZooKeeper Service Unavailable
   – ALM-14000 HDFS Service Unavailable
   – ALM-18000 Yarn Service Unavailable
   – ALM-16004 Hive Service Unavailable

   If yes, go to Step 1.3. If no, go to Step 2.

3. Handle the service unavailability alarms based on the troubleshooting methods provided in the alarm help. After all the service unavailability alarms are cleared, wait a few minutes and check whether this alarm is cleared.
   – If yes, no further action is required.
   – If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Select the following nodes from the Service drop-down list and click OK (Hive is the specific Hive service determined based on ServiceName in the alarm location information):
   – KrbServer
   – LdapServer
   – ZooKeeper
   – HDFS
   – Yarn
   – Hive
3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.
4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.90 ALM-43006 Heap Memory Usage of the JobHistory Process Exceeds the Threshold

Description

The system checks the heap memory usage of the JobHistory process every 30 seconds. The alarm is generated when the heap memory usage of the JobHistory process exceeds the threshold (90% of the maximum memory).


Attribute

Alarm ID Alarm Severity Automatically Cleared

43006 Major Yes

Parameters

Parameter Description

ServiceName Specifies the service for which the alarm is generated.

RoleName Specifies the role for which the alarm is generated.

HostName Specifies the host for which the alarm is generated.

Impact on the System

Excessively high heap memory usage of the JobHistory process deteriorates JobHistory running performance, or even causes OOM, which makes the JobHistory process unavailable.

Possible Causes

The heap memory of the JobHistory process is overused or inappropriately allocated.

Procedure

Step 1 Check heap memory usage.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43006. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JobHistory for which the alarm is generated to enter the Instance Status page. Then choose Customize > Statistics for the heap memory of the JobHistory Process and click OK.

3. Check whether the used heap memory of the JobHistory process reaches 90% of the maximum heap memory specified for JobHistory.
– If yes, go to Step 1.4.
– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JobHistory > Default. Increase the value of SPARK_DAEMON_MEMORY as required.

5. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.


Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Select Spark from the Service drop-down list and click OK.

3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.91 ALM-43007 Non-Heap Memory Usage of the JobHistory Process Exceeds the Threshold

Description

The system checks the non-heap memory usage of the JobHistory process every 30 seconds. The alarm is generated when the non-heap memory usage of the JobHistory process exceeds the threshold (90% of the maximum memory).

Attribute

Alarm ID: 43007

Alarm Severity: Major

Automatically Cleared: Yes

Parameters

ServiceName: Specifies the service for which the alarm is generated.

RoleName: Specifies the role for which the alarm is generated.

HostName: Specifies the host for which the alarm is generated.

Impact on the System

Excessively high non-heap memory usage of the JobHistory process deteriorates JobHistory running performance or even triggers an out-of-memory (OOM) error, which makes the JobHistory process unavailable.


Possible Causes

The non-heap memory of the JobHistory process is overused or inappropriately allocated.

Procedure

Step 1 Check non-heap memory usage.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43007. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JobHistory for which the alarm is generated to enter the Instance Status page. Then choose Customize > Statistics for the non-heap memory of the JobHistory Process and click OK.

3. Check whether the used non-heap memory of the JobHistory process reaches 90% of the maximum non-heap memory specified for JobHistory.

– If yes, go to Step 1.4.

– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JobHistory > Default. Increase the value of -XX:MaxMetaspaceSize in SPARK_DAEMON_JAVA_OPTS as required.

5. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Select Spark from the Service drop-down list and click OK.

3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.92 ALM-43008 Direct Memory Usage of the JobHistory Process Exceeds the Threshold

Description

The system checks the direct memory usage of the JobHistory process every 30 seconds. The alarm is generated when the direct memory usage of the JobHistory process exceeds the threshold (90% of the maximum memory).


Attribute

Alarm ID: 43008

Alarm Severity: Major

Automatically Cleared: Yes

Parameters

ServiceName: Specifies the service for which the alarm is generated.

RoleName: Specifies the role for which the alarm is generated.

HostName: Specifies the host for which the alarm is generated.

Impact on the System

Excessively high direct memory usage of the JobHistory process deteriorates JobHistory running performance or even triggers an out-of-memory (OOM) error, which makes the JobHistory process unavailable.

Possible Causes

The direct memory of the JobHistory process is overused or inappropriately allocated.

Procedure

Step 1 Check direct memory usage.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43008. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JobHistory for which the alarm is generated to enter the Instance Status page. Then choose Customize > Direct Memory of JobHistory and click OK.

3. Check whether the used direct memory of the JobHistory process reaches 90% of the maximum direct memory specified for JobHistory.
– If yes, go to Step 1.4.
– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JobHistory > Default. Increase the value of -XX:MaxDirectMemorySize in SPARK_DAEMON_JAVA_OPTS as required.

5. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.


Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Select Spark from the Service drop-down list and click OK.

3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.93 ALM-43009 JobHistory GC Time Exceeds the Threshold

Description

The system checks the garbage collection (GC) time of the JobHistory process every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (12 seconds for three consecutive checks). To change the threshold, choose System > Threshold Configuration > Service > Spark > Garbage Collection (GC) Time of JobHistory > Total GC time in milliseconds. This alarm is cleared when the JobHistory GC time is shorter than or equal to the threshold.
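The three-consecutive-checks rule can be illustrated with a small Python sketch; the GC-time samples are invented for the example, while in the real system one sample arrives per 60-second check:

```python
from collections import deque

THRESHOLD_MS = 12_000  # default threshold: 12 seconds of GC time
CONSECUTIVE = 3        # consecutive breaches needed to raise the alarm

recent = deque(maxlen=CONSECUTIVE)

def record_check(gc_time_ms: int) -> bool:
    """Record one periodic check; return True once the alarm condition is met."""
    recent.append(gc_time_ms > THRESHOLD_MS)
    return len(recent) == CONSECUTIVE and all(recent)

for sample in (9_000, 13_000, 14_000, 15_000):  # illustrative samples
    print(record_check(sample))  # False, False, False, True
```

The same logic applies to ALM-43013, which monitors the GC time of the JDBCServer process.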

Attribute

Alarm ID: 43009

Alarm Severity: Major

Automatically Cleared: Yes

Parameters

ServiceName: Specifies the service for which the alarm is generated.

RoleName: Specifies the role for which the alarm is generated.

HostName: Specifies the host for which the alarm is generated.

Impact on the System

If the GC time exceeds the threshold, the running performance of the JobHistory process is affected, and the JobHistory process may even become unavailable.


Possible Causes

The heap memory of the JobHistory process is overused or inappropriately allocated, causing GC to occur frequently.

Procedure

Step 1 Check the GC time.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43009. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JobHistory for which the alarm is generated to enter the Instance Status page. Then choose Customize > Garbage Collection (GC) Time of JobHistory and click OK.

3. Check whether the GC time is longer than 12 seconds.

– If yes, go to Step 1.4.

– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JobHistory > Default. Increase the value of SPARK_DAEMON_MEMORY as required.

5. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. In the Service drop-down list box, select Spark and click OK.

3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.94 ALM-43010 Heap Memory Usage of the JDBCServer Process Exceeds the Threshold

Description

The system checks the heap memory usage of the JDBCServer process every 30 seconds. The alarm is generated when the heap memory usage of the JDBCServer process exceeds the threshold (90% of the maximum memory).


Attribute

Alarm ID: 43010

Alarm Severity: Major

Automatically Cleared: Yes

Parameters

ServiceName: Specifies the service for which the alarm is generated.

RoleName: Specifies the role for which the alarm is generated.

HostName: Specifies the host for which the alarm is generated.

Impact on the System

Excessively high heap memory usage of the JDBCServer process deteriorates JDBCServer running performance or even triggers an out-of-memory (OOM) error, which makes the JDBCServer process unavailable.

Possible Causes

The heap memory of the JDBCServer process is overused or inappropriately allocated.

Procedure

Step 1 Check heap memory usage.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43010. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JDBCServer for which the alarm is generated to enter the Instance Status page. Then choose Customize > Statistics for the heap memory of the JDBCServer Process and click OK.

3. Check whether the used heap memory of the JDBCServer process reaches 90% of the maximum heap memory specified for JDBCServer.
– If yes, go to Step 1.4.
– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JDBCServer > Tuning. Increase the value of SPARK_DRIVER_MEMORY as required.

5. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.


Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Select Spark from the Service drop-down list and click OK.

3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.95 ALM-43011 Non-Heap Memory Usage of the JDBCServer Process Exceeds the Threshold

Description

The system checks the non-heap memory usage of the JDBCServer process every 30 seconds. The alarm is generated when the non-heap memory usage of the JDBCServer process exceeds the threshold (90% of the maximum memory).

Attribute

Alarm ID: 43011

Alarm Severity: Major

Automatically Cleared: Yes

Parameters

ServiceName: Specifies the service for which the alarm is generated.

RoleName: Specifies the role for which the alarm is generated.

HostName: Specifies the host for which the alarm is generated.

Impact on the System

Excessively high non-heap memory usage of the JDBCServer process deteriorates JDBCServer running performance or even triggers an out-of-memory (OOM) error, which makes the JDBCServer process unavailable.


Possible Causes

The non-heap memory of the JDBCServer process is overused or inappropriately allocated.

Procedure

Step 1 Check non-heap memory usage.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43011. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JDBCServer for which the alarm is generated to enter the Instance Status page. Then choose Customize > Statistics for the non-heap memory of the JDBCServer Process and click OK.

3. Check whether the used non-heap memory of the JDBCServer process reaches 90% of the maximum non-heap memory specified for JDBCServer.

– If yes, go to Step 1.4.

– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JDBCServer > Tuning. Increase the value of -XX:MaxMetaspaceSize in spark.driver.extraJavaOptions as required.

5. Check whether the alarm is cleared.

– If yes, no further action is required.

– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.

2. Select Spark from the Service drop-down list and click OK.

3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.

4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.96 ALM-43012 Direct Memory Usage of the JDBCServer Process Exceeds the Threshold

Description

The system checks the direct memory usage of the JDBCServer process every 30 seconds. The alarm is generated when the direct memory usage of the JDBCServer process exceeds the threshold (90% of the maximum memory).


Attribute

Alarm ID: 43012

Alarm Severity: Major

Automatically Cleared: Yes

Parameters

ServiceName: Specifies the service for which the alarm is generated.

RoleName: Specifies the role for which the alarm is generated.

HostName: Specifies the host for which the alarm is generated.

Impact on the System

Excessively high direct memory usage of the JDBCServer process deteriorates JDBCServer running performance or even triggers an out-of-memory (OOM) error, which makes the JDBCServer process unavailable.

Possible Causes

The direct memory of the JDBCServer process is overused or inappropriately allocated.

Procedure

Step 1 Check direct memory usage.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43012. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JDBCServer for which the alarm is generated to enter the Instance Status page. Then choose Customize > Direct Memory of JDBCServer and click OK.

3. Check whether the used direct memory of the JDBCServer process reaches 90% of the maximum direct memory specified for JDBCServer.
– If yes, go to Step 1.4.
– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JDBCServer > Tuning. Increase the value of -XX:MaxDirectMemorySize in spark.driver.extraJavaOptions as required.

5. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.


Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. Select Spark from the Service drop-down list and click OK.
3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.
4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.7.97 ALM-43013 JDBCServer GC Time Exceeds the Threshold

Description

The system checks the garbage collection (GC) time of the JDBCServer process every 60 seconds. This alarm is generated when the detected GC time exceeds the threshold (12 seconds for three consecutive checks). To change the threshold, choose System > Threshold Configuration > Service > Spark > Garbage Collection (GC) Time of JDBCServer > Total GC time in milliseconds. This alarm is cleared when the JDBCServer GC time is shorter than or equal to the threshold.

Attribute

Alarm ID: 43013

Alarm Severity: Major

Automatically Cleared: Yes

Parameters

ServiceName: Specifies the service for which the alarm is generated.

RoleName: Specifies the role for which the alarm is generated.

HostName: Specifies the host for which the alarm is generated.

Impact on the System

If the GC time exceeds the threshold, the running performance of the JDBCServer process is affected, and the JDBCServer process may even become unavailable.


Possible Causes

The heap memory of the JDBCServer process is overused or inappropriately allocated, causing GC to occur frequently.

Procedure

Step 1 Check the GC time.

1. On MRS Manager, click Alarm and select the alarm whose Alarm ID is 43013. Then check the IP address and role name of the instance in Location.

2. On MRS Manager, choose Service > Spark > Instance and click the JDBCServer for which the alarm is generated to enter the Instance Status page. Then choose Customize > Garbage Collection (GC) Time of JDBCServer and click OK.

3. Check whether the GC time is longer than 12 seconds.
– If yes, go to Step 1.4.
– If no, go to Step 2.

4. On MRS Manager, choose Service > Spark > Service Configuration, and set Type to All. Choose JDBCServer > Tuning. Increase the value of SPARK_DRIVER_MEMORY as required.

5. Check whether the alarm is cleared.
– If yes, no further action is required.
– If no, go to Step 2.

Step 2 Collect fault information.

1. On MRS Manager, choose System > Export Log.
2. In the Service drop-down list box, select Spark and click OK.
3. Set Start Time for log collection to 10 minutes before the alarm generation time and End Time to 10 minutes after the alarm generation time, and click Download.
4. Contact the public cloud O&M personnel and send the collected log information.

----End

Related Information

N/A

6.8 Object Management

6.8.1 Introduction

An MRS cluster contains different types of basic objects. Table 6-17 describes these objects.


Table 6-17 MRS basic objects

Service: Function set that can complete specific operations. Example: the KrbServer and LdapServer services.

Service instance: Specific instance of a service, often referred to as a service. Example: the KrbServer service.

Service role: Functional entity that forms a complete service, often referred to as a role. Example: KrbServer consists of the KerberosAdmin role and the KerberosServer role.

Role instance: Specific instance of a service role running on a host. Example: KerberosAdmin running on Host2 and KerberosServer running on Host3.

Host: Elastic Cloud Server (ECS) running a Linux OS. Example: Host1 to Host5.

Rack: Physical entity that contains multiple hosts connected to the same switch. Example: Rack1 contains Host1 to Host5.

Cluster: Logical entity that consists of multiple hosts and provides various services. Example: Cluster1 consists of five hosts (Host1 to Host5) and provides services such as KrbServer and LdapServer.

6.8.2 Querying Configurations

Scenario

On MRS Manager, users can query the configurations of services (including roles) and role instances.

Procedure

- Query service configurations.

a. On MRS Manager, click Service.
b. Select the target service from the service list.
c. Click Service Configuration.
d. Set Type to All. All configuration parameters of the service are displayed in the navigation tree. The root nodes in the navigation tree represent the service names and role names.
e. In the navigation tree, choose a parameter and change its value. You can also enter the parameter name in Search to search for it and view the result.

The parameters under both the service and role nodes are configuration parameters.

- Query role instance configurations.

a. On MRS Manager, click Service.


b. Select the target service from the service list.
c. Click the Instance tab.
d. Click the target role instance in the role instance list.
e. Click Instance Configuration.
f. Set Type to All. All configuration parameters of the service are displayed in the navigation tree. The root nodes in the navigation tree represent the service names and role names.
g. In the navigation tree, choose a parameter and change its value. You can also enter the parameter name in Search to search for it and view the result.

6.8.3 Managing Services

Scenario

On MRS Manager, users can perform the following operations:

- Start a service that is in the Stopped, Stopped_Failed, or Start_Failed state.
- Stop unused or abnormal services.
- Restart abnormal services or configure expired services to restore or enable the services.

Procedure

Step 1 On MRS Manager, click Service.

Step 2 Locate the row that contains the target service, and click Start, Stop, or Restart to start, stop, or restart the service.

Services are interrelated. If a service is started, stopped, or restarted, services dependent on it will be affected.

The services will be affected in the following ways:

- If a service is to be started, the lower-layer services dependent on it must be started first.
- If a service is stopped, the upper-layer services dependent on it are unavailable.
- If a service is restarted, the running upper-layer services dependent on it must be restarted.

----End

6.8.4 Configuring Service Parameters

Scenario

On MRS Manager, users can view and modify the default service configurations based on site requirements. Configurations can be imported and exported.

Impact on the System

- After the attributes of HBase, HDFS, Hive, Spark, Yarn, and MapReduce are configured, the client configurations need to be downloaded to update the files.
- The parameters of DBService cannot be modified if only one DBService role instance exists in the cluster.


Procedure

- Modify a service.

a. Click Service.

b. Select the target service from the service list.

c. Click the Service Configuration tab.

d. Set Type to All. All configuration parameters of the service are displayed in the navigation tree. The root nodes in the navigation tree represent the service names and role names.

e. In the navigation tree, choose a parameter and change its value. You can also enter the parameter name in Search to search for it and view the result.

You can click the restore icon to restore a parameter value.

NOTE

You can also use host groups to change role instance configurations in batches. Choose a role name in Role, and then choose <select hosts> in Host. Enter a name in Host Group Name, select the target host from All Hosts, and add it to Selected Hosts. Click OK to add it to the host group. The added host group can be selected from Host and is valid only on the current page; it is not retained after the page is refreshed.

f. Click Save Configuration, select Restart the affected services or instances, and click OK to restart the service.

Click Finish when the system displays Operation succeeded. The service is successfully started.

NOTE

If you do not restart Yarn after upgrading its queue configuration, you can choose More > Refresh the queue for the configuration to take effect.

- Export service configuration parameters.

a. Click Service.

b. Select a service.

c. Click Service Configuration.

d. Click Export Service Configuration. Select a path for saving the configuration files.

- Import service configuration parameters.

a. Click Service.

b. Select a service.

c. Click Service Configuration.

d. Click Import Service Configuration.

e. Select the target configuration file.

f. Click Save Configuration, and select Restart the affected services or instances. Click OK.

When Operation succeeded is displayed, click Finish. The service is started successfully.


6.8.5 Configuring Customized Service Parameters

Scenario

Each component of MRS supports all open source parameters. MRS Manager supports the modification of some parameters for key application scenarios. Some component clients may not include all parameters with open source features. To modify the component parameters that are not directly supported by MRS Manager, users can add new parameters for components by using the configuration customization function on MRS Manager. Newly added parameters are saved in component configuration files and take effect after the component is restarted.

Impact on the System

- After the service attributes are configured, the service needs to be restarted. The service cannot be accessed during the restart.
- After the attributes of HBase, HDFS, Hive, Spark, Yarn, and MapReduce are configured, the client configurations need to be downloaded to update the files.

Prerequisites

You have learned the meanings of the parameters to be added, the configuration files in which they take effect, and their impact on components.

Procedure

Step 1 On MRS Manager, click Service.

Step 2 Select the target service from the service list.

Step 3 Click Service Configuration.

Step 4 Set Type to All.

Step 5 In the navigation tree, choose Customization. The customized parameters of the current component are displayed on MRS Manager.

The configuration files that save the newly added customized parameters are displayed in Parameter File. Different configuration files may support open source parameters with the same names. After the parameters in different files are set to different values, the configuration effect depends on the sequence in which the configuration files are loaded by components. Service-level and role-level customized parameters are supported. Perform configuration based on the actual service requirements. Customized parameters for a single role instance are not supported.

Step 6 Based on the configuration files and parameter functions, enter the parameter names supported by components in Name and enter the parameter values in the Value column of the row where the parameters are located.

- You can click the add or delete icon to add or delete a customized parameter. A customized parameter can be deleted only after it has been added by clicking the add icon.
- You can click the restore icon to restore a parameter value.


Step 7 Click Save Configuration, select Restart the affected services or instances, and click OK to restart the service.

When Operation succeeded is displayed, click Finish. The service is started successfully.

----End

Task Example

Configuring Customized Hive Parameters

Hive depends on HDFS. By default, Hive accesses the HDFS client, and the configuration parameters that take effect are controlled by HDFS in a unified manner. For example, the HDFS parameter ipc.client.rpc.timeout affects the RPC timeout period for all clients that connect to the HDFS server. If you need to modify the timeout period for Hive connections to HDFS, you can use the configuration customization function. After this parameter is added to the core-site.xml file of Hive, it can be identified by the Hive service and replaces the HDFS configuration.
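As a hedged illustration of the result, the entry in Hive's core-site.xml would look roughly like the property embedded in this Python sketch; the XML layout follows the standard Hadoop configuration format, and the lookup helper is illustrative rather than part of MRS:

```python
import xml.etree.ElementTree as ET

# What the customized entry might look like once it lands in Hive's core-site.xml.
SAMPLE = """
<configuration>
  <property>
    <name>ipc.client.rpc.timeout</name>
    <value>150000</value>
  </property>
</configuration>
"""

def lookup(xml_text: str, key: str) -> str:
    """Return the value of a named property in a Hadoop-style config file."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == key:
            return prop.findtext("value") or ""
    return ""

print(lookup(SAMPLE, "ipc.client.rpc.timeout"))  # 150000 (milliseconds)
```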

Step 1 On MRS Manager, choose Service > Hive > Service Configuration.

Step 2 Set Type to All.

Step 3 In the navigation tree, choose Customization of the Hive service level. The service-level customized parameters supported by Hive are displayed on MRS Manager.

Step 4 In the Name column of the core.site.customized.configs parameter in core-site.xml, enter ipc.client.rpc.timeout, and enter the new parameter value in Value. For example, enter 150000. The unit is millisecond.

Step 5 Click Save Configuration, select Restart the affected services or instances, and click OK to restart the service.

When Operation succeeded is displayed, click Finish. The service is successfully started.

----End

6.8.6 Synchronizing Service Configurations

Scenario

If Configuration Status of any service is Expired or Failed, users can synchronize configurations for the cluster or service to recover its configuration status. If the configuration status of all services in the cluster is Failed, synchronize the cluster configurations with the background configurations.

Impact on the System

After synchronizing service configurations, users need to restart the service that had an expired configuration. The service is unavailable during the restart.

Procedure

Step 1 On MRS Manager, click Service.

Step 2 Select the target service from the service list.


Step 3 Click More in the upper pane, and select Synchronize Configuration from the drop-down list.

Step 4 In the dialog box that is displayed, select Restart services or instances whose configurations have expired, and click OK.

When Operation succeeded is displayed, click Finish. The service is started successfully.

----End

6.8.7 Managing Role Instances

Scenario

Users can start a role instance that is in the Stopped, Stopped_Failed, or Start_Failed state, stop an unused or abnormal role instance, or restart an abnormal role instance to recover its functions.

Procedure

Step 1 On MRS Manager, click Service.

Step 2 Select the target service from the service list.

Step 3 Click the Instance tab.

Step 4 Select the check box on the left of the target role instance.

Step 5 Choose More > Start Instance, Stop Instance, or Restart Instance to perform the requiredoperation.

----End

6.8.8 Configuring Role Instance Parameters

Scenario

View and modify default role instance configurations on MRS Manager. Parameters must be configured based on site requirements. Configurations can be imported and exported.

Impact on the System

After the attributes of HBase, HDFS, Hive, Spark, Yarn, and MapReduce are configured, the client configurations need to be downloaded to update the files.

Procedure

- Modify role instance configurations.

a. On MRS Manager, click Service.
b. Select the target service from the service list.
c. Click the Instance tab.
d. Click the target role instance in the role instance list.
e. Click the Instance Configuration tab.


f. Set Type to All. All configuration parameters of the role instances are displayed in the navigation tree.

g. In the navigation tree, choose a parameter and change its value. You can also enter the parameter name in Search to search for the parameter and view the result.

You can click the restore icon to restore a parameter value.

h. Click Save Configuration, select Restart the role instance, and click OK to restart the role instance.

When Operation succeeded is displayed, click Finish. The service is started successfully.

- Export configuration data of a role instance.

a. On MRS Manager, click Service.
b. Select a service.
c. Select a role instance or click Instance.
d. Select a role instance on a specified host.
e. Click Instance Configuration.
f. Click Export Instance Configuration to export the configuration data of a specified role instance, and choose a path for saving the configuration file.

- Import configuration data of a role instance.

a. Click Service.
b. Select a service.
c. Select a role instance or click Instance.
d. Select a role instance on a specified host.
e. Click Instance Configuration.
f. Click Import Instance Configuration to import configuration data of a specified role instance.
g. Click Save Configuration and select Restart the role instance. Click OK.

When Operation succeeded is displayed, click Finish. The service is started successfully.

6.8.9 Synchronizing Role Instance Configuration

Scenario

When the Configuration Status of a role instance is Expired or Failed, users can synchronize the configuration data of the role instance with the background configuration.

Impact on the System

After synchronizing a role instance configuration, you need to restart the role instance that had an expired configuration. The role instance is unavailable during the restart.

Procedure

Step 1 On MRS Manager, click Service and choose a service name.


Step 2 Click the Instance tab.

Step 3 Click the target role instance in the role instance list.

Step 4 Click More in the upper pane, and select Synchronize Configuration from the drop-down list.

Step 5 In the dialog box that is displayed, select Restart services or instances whose configurations have expired, and click OK to restart a role instance.

When Operation succeeded is displayed, click Finish. The service is started successfully.

----End

6.8.10 Decommissioning and Recommissioning Role Instances

Scenario

If a Core or Task node is faulty, the cluster status may become abnormal. In an MRS cluster, data can be stored on different Core nodes. Users can decommission the specified role instance on MRS Manager to stop the role instance from providing services. After fault rectification, users can recommission the role instance.

For MRS versions earlier than MRS 1.6.0, the following role instances can be decommissioned and recommissioned:

- DataNode role instance on HDFS
- NodeManager role instance on Yarn

For MRS 1.6.0 or later, the following role instances can be decommissioned and recommissioned:

- DataNode role instance on HDFS
- NodeManager role instance on Yarn
- RegionServer role instance on HBase
- Broker role instance on Kafka

Restrictions:

- If the number of DataNodes is less than or equal to the number of HDFS copies, decommissioning cannot be performed (see the sketch after this list). For example, if the number of HDFS copies is three and the number of DataNodes is less than four in the system, decommissioning cannot be performed. After the decommissioning has been running for 30 minutes, an error will be reported, which forces MRS Manager to exit the decommissioning.
- If the number of Kafka Broker instances is less than or equal to the number of Kafka Broker copies, decommissioning cannot be performed. For example, if the number of Kafka Broker copies is two and the number of nodes is less than three in the system, decommissioning cannot be performed. Role instance decommissioning will fail on MRS Manager and exit.
- To reuse a decommissioned role instance, users must recommission and restart it.
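A minimal Python sketch of the DataNode restriction stated above (the function and its arguments are illustrative, not an MRS API):

```python
def can_decommission_datanode(live_datanodes: int, hdfs_replicas: int) -> bool:
    """Decommissioning needs more DataNodes than HDFS copies, so the
    remaining nodes can still hold every replica."""
    return live_datanodes > hdfs_replicas

# With three copies, a three-DataNode cluster cannot decommission one;
# a four-DataNode cluster can.
print(can_decommission_datanode(3, 3))  # False
print(can_decommission_datanode(4, 3))  # True
```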

Procedure

Step 1 On MRS Manager, click Service.


Step 2 In the service list, click the target service.

Step 3 Click the Instance tab.

Step 4 Select the check box in front of the specified role instance name.

Step 5 Click More, and select Decommission Role Instance or Recommission from the drop-down list.

NOTE

If the target service is restarted in another browser or window while the instance decommissioning operation is in progress, MRS Manager displays a message indicating that the decommissioning is suspended and Operating Status is Started. However, the instance decommissioning is actually complete in the background. You need to decommission the instance again to synchronize the status.

----End

6.8.11 Managing a Host

Scenario

To check an abnormal or faulty host, users need to stop all host roles on MRS Manager. To recover host services after the host fault is rectified, restart all roles.

Procedure

Step 1 On MRS Manager, click Host.

Step 2 Select the check box of the target host.

Step 3 Choose More > Start All Roles or More > Stop All Roles to perform the required operation.

----End

6.8.12 Isolating a Host

Scenario

If a host is found to be abnormal or faulty, affecting cluster performance or preventing services from being provided, users can temporarily exclude that host from the available nodes in the cluster. In this way, the client can access other available nodes. In scenarios where patches are to be installed in a cluster, users can also exclude a specified node from patch installation.

Users can isolate a host manually on MRS Manager based on the actual service requirements or O&M plan. Only non-management nodes can be isolated.

Impact on the System

- After a host is isolated, all role instances on the host will be stopped. You cannot start, stop, or configure the host or any instances on the host.
- After a host is isolated, statistics of the monitoring status and indicator data of the host hardware and instances cannot be collected or displayed.


Procedure

Step 1 On MRS Manager, click Host.

Step 2 Select the check box of the host to be isolated.

Step 3 Choose More > Isolate Host.

Step 4 In Isolate Host, click OK.

When Operation succeeded is displayed, click Finish. The host is isolated successfully, and the value of Operating Status becomes Isolated.

NOTE

The isolation of a host can be canceled and the host can be added to the cluster again. For details, see Canceling Isolation of a Host.

----End

6.8.13 Canceling Isolation of a Host

Scenario

After a host fault is rectified, users must cancel the isolation of the host so that the host can be used properly.

Users can cancel the isolation of a host on MRS Manager.

Prerequisites

- The host status is Isolated.
- The host fault has been rectified.

Procedure

Step 1 On MRS Manager, click Host.

Step 2 Select the check box of the host whose isolation you want to cancel.

Step 3 Choose More > Cancel Host Isolation.

Step 4 In Cancel Host Isolation, click OK.

When Operation succeeded is displayed, click Finish. Host isolation is canceled successfully, and the value of Operating Status becomes Normal.

Step 5 Click the name of the host for which isolation has been canceled. Status of the host is displayed. Click Start All Roles.

----End

6.8.14 Starting and Stopping a Cluster

Scenarios

A cluster is a collection of service components. Users can start or stop all services in a cluster.


Procedure

Step 1 On MRS Manager, click Service.

Step 2 Click More above the service list and choose Start Cluster or Stop Cluster from the drop-down list.

----End

6.8.15 Synchronizing Cluster Configurations

Scenarios

If Configuration Status of any service is Expired or Failed, users can synchronize configurations to recover the configuration status.

- If the configuration status of all services in the cluster is Failed, synchronize the cluster configurations with the background configurations.
- If the configuration status of some services in the cluster is Failed, synchronize the specified service configurations with the background configurations.

Impact on the System

After synchronizing cluster configurations, users need to restart the service that has an expired configuration. The service is unavailable during the restart.

Procedure

Step 1 On MRS Manager, click Service.

Step 2 Click More above the service list, and choose Synchronize Configuration from the drop-down list.

Step 3 In the dialog box that is displayed, select Restart services or instances whose configurations have expired, and click OK.

When Operation succeeded is displayed, click Finish. The cluster is successfully started.

----End

6.8.16 Exporting Configuration Data of a Cluster

Scenarios

Users can export all configuration data of a cluster from MRS Manager to meet actual service requirements. The exported file can be used to rapidly update service configurations.

Procedure

Step 1 On MRS Manager, click Service.

Step 2 Click More above the service list, and choose Export Cluster Configuration from the drop-down list.


The exported file is used to update service configurations. For details, see Import service configuration parameters in Configuring Service Parameters.

----End

6.9 Log Management

6.9.1 Viewing and Exporting Audit Logs

Scenario

On MRS Manager, view and export audit logs for post-event tracing, fault cause locating, and responsibility classification of security events.

The system records the following log information:

- User activity information, such as user login and logout, and modifications to system user and system user group information
- Information about user operation instructions, such as cluster startup and shutdown, and software upgrades

Procedure

- View the audit logs.

a. On MRS Manager, click Audit to view the default audit logs. If the content of an audit log contains more than 256 characters, click the unfold button to unfold the audit details, and then click log file to download the complete log file.
– By default, audit logs are displayed in descending order by Occurred On. You can click Operation Type, Severity, Occurred On, User, Host, Service, Instance, or Operation Result to change the display mode.
– You can filter out all audit logs of the same severity in Severity, including both cleared and uncleared alarms.

Export the audit logs, which contain the following information:
– Sno: indicates the number of audit logs generated by MRS Manager. The number is incremented by 1 when a new audit log is generated.
– Operation Type: indicates the type of user operations. User operations are classified into the following scenarios: User_Manager, Cluster, Service, Host, Alarm, Collect Log, Auditlog, Backup And Restoration, and Tenant. User_Manager is supported only by clusters with Kerberos authentication enabled. Each scenario contains different operation types. For example, Alarm contains Export alarms, Cluster contains Start Cluster, and Tenant contains Add Tenant.
– Severity: indicates the security level of each audit log, including Critical, Major, Minor, and Information.
– Start Time: indicates the CET or CEST time when a user operation starts.
– End Time: indicates the CET or CEST time when a user operation ends.
– User IP Address: indicates the IP address used by a user.


– User: indicates the name of the user who performs the operation.
– Host: indicates the node where a user operation is performed. The information is not saved if the operation does not involve a node.
– Service: indicates the service on which a user operation is performed. The information is not saved if the operation does not involve a service.
– Instance: indicates the role instance on which a user operation is performed. The information is not saved if the operation does not involve a role instance.
– Operation Result: indicates the user operation result, including Successful, Failed, and Unknown.
– Content: indicates execution information of the user operation.

b. Click Advanced Search. In the audit log search area, set search criteria and click Search to view the desired audit logs. Click Reset to reset the search criteria.

NOTE

You can set Start Time and End Time to specify the time range when logs are generated.

- Export the audit logs.

In the audit log list, select the check box of a log and click Export, or click Export All.

6.9.2 Exporting Service Logs

Scenario

Export the logs of each service role from MRS Manager.

Prerequisites

- You have obtained the Access Key ID (AK) and Secret Access Key (SK) for the corresponding account. For details, see My Credential > User Guide > How Do I Manage Access Keys?.
- You have created a bucket in the Object Storage Service (OBS) system for the corresponding account. For details, see Object Storage Service > User Guide > Quick Start > Common Operations Using OBS Console > Creating a Bucket.

Procedure

Step 1 On MRS Manager, click System.

Step 2 Click Export Log under Maintenance.

Step 3 Click Service, set Host to the IP address of the host where the service is deployed, and set Start Time and End Time.

Step 4 In Export to, specify a path for saving logs. This parameter is available only for clusters with Kerberos authentication enabled.

- Local PC: stores logs in a local directory. If you select this option, go to Step 7.
- OBS: stores data in the OBS system and is used by default. If you select this option, go to Step 5.

Step 5 In OBS Path, specify the path where service logs are stored in the OBS system.


Fill in the full path. The path must not start with /. You do not need to create the path in advance because the system creates it automatically. The full OBS path contains a maximum of 900 bytes.
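A minimal Python sketch of the two constraints just stated (the example paths are illustrative):

```python
def is_valid_obs_path(path: str) -> bool:
    """Check an OBS export path: it must not start with '/' and
    the full path is limited to 900 bytes."""
    return not path.startswith("/") and len(path.encode("utf-8")) <= 900

print(is_valid_obs_path("mrs/service-logs/2018-09-06"))  # True
print(is_valid_obs_path("/mrs/service-logs"))            # False
```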

Step 6 In Bucket Name, enter the name of the created OBS bucket. In AK and SK, enter the Access Key ID and Secret Access Key for the account.

Step 7 Click OK to export logs.

----End

6.9.3 Configuring Audit Log Export Parameters

Scenario

If MRS audit logs are stored in the system for a long time, the disk space for data directories may become insufficient. You can set export parameters to automatically export audit logs to a specified directory on the OBS.

NOTE

The audit logs to be exported include both the service and management audit logs.

- Service audit logs are compressed automatically at 03:00 a.m. every day and saved in /var/log/Bigdata/audit/bk/ on the active management node. The file name format is <yyyy-MM-dd_HH-mm-ss>.tar.gz. A maximum of seven log files can be saved by default. After the number of files has reached the maximum value, the earliest file will be deleted when a new one is generated.
- When you export management audit logs, the logs generated between the time of the last successful export and the current time are exported. When the number of records in a management audit log reaches 100,000, the system dumps the first 90,000 records to a local file and retains the remaining 10,000 records in the database. The log file is dumped to ${BIGDATA_DATA_HOME}/dbdata_om/dumpData/iam/operatelog on the active management node. The file name format is OperateLog_store_YY_MM_DD_HH_MM_SS.csv. A maximum of 50 management audit log files can be saved.

Prerequisites

- You have obtained the AK and SK of the username. For details, see My Credential > User Guide > How Do I Manage Access Keys?.
- You have created a bucket in the OBS. For details, see Object Storage Service > User Guide > Quick Start > Common Operations Using OBS Console > Creating a Bucket.

Procedure

Step 1 On MRS Manager, click System.

Step 2 Click Export Audit Log in Maintenance.


Table 6-18 Parameters for exporting audit logs

Export Audit Log: Mandatory. Indicates whether to enable the audit log export function; a switch enables or disables it.

Start Time: Mandatory. Indicates the start time for audit log export. Example value: 7/24/2017 09:00:00.

Period: Mandatory. Indicates the interval for exporting audit logs, ranging from 1 day to 5 days. Example value: 1 day.

Bucket: Mandatory. Indicates the name of the OBS bucket to which audit logs are exported. Example value: mrs-bucket.

OBS Path: Mandatory. Indicates the OBS path for exporting audit logs. Example value: opt/omm/oms/auditLog.

AK: Mandatory. Indicates the user's Access Key ID (AK) information. Example value: XXX.

SK: Mandatory. Indicates the user's Secret Access Key (SK) information. Example value: XXX.

NOTE

The OBS path is divided into service_auditlog and manager_auditlog, which are used to save service and management audit logs, respectively.

----End

6.10 Health Check Management


6.10.1 Performing a Health Check

Scenario

To ensure that cluster parameters, configurations, and monitoring are correct and that the cluster can run stably for a long time, you can perform a health check during routine maintenance.

NOTE

A system health check includes MRS Manager, service-level, and host-level health checks:

- MRS Manager health checks focus on whether the unified management platform can provide management functions.
- Service-level health checks focus on whether components can provide services properly.
- Host-level health checks focus on whether host indicators are normal.

The system health check has three types of check items: Health Status, related alarms, and customized monitoring indicators for each check object. The health check results are not always the same as the Health Status on the portal.

Procedure

- Manually perform the health check for all services.

a. On MRS Manager, click Service.
b. Choose More > Start Cluster Health Check to start the health check for all services.

NOTE

- The cluster health checks include MRS Manager, service, and host status checks.
- To perform cluster health checks, you can also choose System > Check Health Status > Start Cluster Health Check on MRS Manager.
- To export the health check result, click Export Report in the upper left corner.

- Manually perform the health check for a service.

a. On MRS Manager, click Service, and click the target service in the service list.
b. Choose More > Start Service Health Check to start the health check for a specified service.

- Manually perform the health check for a host.

a. On MRS Manager, click Host.
b. Select the check box of the target host.
c. Choose More > Start Host Health Check to start the health check for the host.

- Perform an automatic health check.

a. On MRS Manager, click System.
b. Under Maintenance, click Check Health Status.
c. Click Configure Health Check to configure automatic health check items.

Periodic Health Check indicates whether to enable the automatic health check function. The switch of Periodic Health Check is disabled by default. Click the switch to enable the function and select Daily, Weekly, or Monthly as required. Click OK to save the configuration. The message Health check configuration saved successfully is displayed in the upper-right corner.


6.10.2 Viewing and Exporting a Check Report

Scenario

You can view the health check result on MRS Manager and export it for further analysis.

NOTE

A system health check includes MRS Manager, service-level, and host-level health checks:

- MRS Manager health checks focus on whether the unified management platform can provide management functions.
- Service-level health checks focus on whether components can provide services properly.
- Host-level health checks focus on whether host indicators are normal.

The system health check has three types of check items: Health Status, related alarms, and customized monitoring indicators for each check object. The health check results are not always the same as the Health Status on the portal.

Prerequisites

You have performed a health check.

Procedure

Step 1 On MRS Manager, click Service.

Step 2 Choose More > View Cluster Health Check Report to view the health check report of the cluster.

Step 3 Click Export Report on the health check report pane to export the report and view detailed information about check items.

----End

6.10.3 Configuring the Number of Health Check Reports to Be Reserved

Scenario

Health check reports of MRS clusters, services, and hosts may vary with the time and scenario. You can modify the number of health check reports to be reserved on MRS Manager for later comparison.

This setting is valid for health check reports of clusters, services, and hosts. Report files are saved in $BIGDATA_DATA_HOME/Manager/healthcheck on the active management node by default and are automatically synchronized to the standby management node.

Prerequisites

- You have specified service requirements and planned the save time and health check frequency.
- The disk spaces of the active and standby management nodes are sufficient.


Procedure

Step 1 On MRS Manager, choose System > Check Health Status > Configure Health Check.

Step 2 Set Max. Number of Health Check Reports to the number of health check reports to be reserved. The value ranges from 1 to 100 and the default is 50.

Step 3 Click OK to save the configuration. The message Health check configuration saved successfully is displayed in the upper-right corner.

----End

6.10.4 Managing Health Check Reports

Scenario

On MRS Manager, you can manage historical health check reports, including viewing, downloading, and deleting them.

Procedure

- Download a specified health check report.

a. Choose System > Check Health Status.
b. Locate the row that contains the target health check report and click Download File to download the report file.

- Download specified health check reports in batches.

a. Choose System > Check Health Status.

b. Select multiple health check reports and click Download File to download them.

- Delete a specified health check report.

a. Choose System > Check Health Status.

b. Locate the row that contains the target health check report and click Delete to delete the report file.

- Delete specified health check reports in batches.

a. Choose System > Check Health Status.

b. Select multiple health check reports and click Delete File to delete them.

6.10.5 DBService Health Check

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of DBService is in the normal state. If it is abnormal, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to ALM-27001.


Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists on the host. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.6 Flume Health Check

Service Health Status

Indicator name: Service Status

Indicator description: Check whether the Flume service is normal. If the service is abnormal, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-24000.

Alarm Check

Indicator name: Alarm information

Indicator description: Check whether an uncleared alarm exists on the host. If an uncleared alarm exists on the host, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.7 HBase Health Check

Number of RegionServers That Are Running Properly

Indicator name: Normal RegionServers

Indicator description: This indicator is used to check the number of RegionServers that are running properly in the HBase cluster.

Recovery guidance: If this indicator is abnormal, check whether the status of RegionServer is normal. If the status is abnormal, rectify the fault. Then you are advised to check whether the network connection is normal.

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of HBase isin the normal state. If it is abnormal, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, check whether the status of HMaster and RegionServer is normal. If the status is abnormal, rectify the fault. Then check whether the status of the ZooKeeper service is Bad. If the status is Bad, rectify the fault. On the HBase client, check whether the data in HBase tables can be correctly read. If the data cannot be correctly read, locate the cause of the data read failure. Finally, rectify the fault according to the alarm help.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists in the service. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.8 Host Health Check

Swap Usage

Indicator name: Swap Usage

Indicator description: This indicator is used to check the system swap usage. Swap usage = Used swap size/Total swap size. If the swap usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

Step 1 Check the swap usage on the node.

Log in to the unhealthy node and run free -m to view the total and used swap size. If the swap usage exceeds the threshold, go to Step 2.
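For reference, a minimal sketch of this check (the free -m output below is illustrative, with some columns omitted):

free -m
              total     used     free
Mem:          64322    51200    13122
Swap:         32767     9830    22937

In this sample, swap usage = 9830/32767, or about 30%; compare the result against the configured threshold.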

Step 2 Expand the system capacity, for example, by adding nodes.

----End

File Handle Usage

Indicator name: File Handle Usage

Indicator description: This indicator is used to check the usage of file handles in the system. Usage of file handles = Number of used handles/Total number of handles. If the file handle usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

Step 1 Check the file handle usage.

Log in to the unhealthy node and run cat /proc/sys/fs/file-nr. Check the first and third columns in the command output, which indicate the number of used handles and the total number of handles respectively. If the usage exceeds the threshold, go to Step 2.
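A minimal sketch of this check (the values are illustrative):

cat /proc/sys/fs/file-nr
3296    0       1620569

Here the file handle usage is 3296/1620569, or about 0.2%, which would be healthy under typical thresholds.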

Step 2 Check the system and analyze the file handle usage.

----End

NTP Offset

Indicator name: NTP Offset

MapReduce ServiceUser Guide 6 MRS Manager Operation Guide

Issue 01 (2018-09-06) 355

Page 366: User Guide · MapReduce Service

Indicator description: This indicator is used to check the NTP time offset. If the NTP time offset exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

Step 1 Check the NTP time offset.

Log in to the unhealthy node and run /usr/sbin/ntpq -np to view the information. The offset column indicates the time offset. If the time offset exceeds the threshold, go to Step 2.
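A sketch of what the output can look like (the server address and values are illustrative):

/usr/sbin/ntpq -np
     remote        refid      st t when poll reach   delay   offset  jitter
============================================================================
*10.1.1.10       .LOCL.       5 u   32   64  377    0.220   -1.235   0.170

The offset column is in milliseconds; compare its absolute value against the configured threshold.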

Step 2 Check whether the clock source configuration is correct. If it is incorrect or the problem persists, contact public cloud maintenance personnel to handle the problem.

----End

Average Load

Indicator name: Average Load

Indicator description: This indicator is used to check the system average load, which indicates the average number of processes in the running queue within a specified period. The average load is calculated from the load values obtained by the uptime command: (Load of 1 minute + Load of 5 minutes + Load of 15 minutes)/(3 x Number of CPUs). If the average load exceeds the threshold, the indicator is unhealthy.

Recovery guidance:

Step 1 Log in to the unhealthy node and run the uptime command. The last three columns in the command output indicate the load of 1 minute, 5 minutes, and 15 minutes, respectively. Calculate the system average load. If the load exceeds the threshold, go to Step 2.
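A worked sketch of the calculation, assuming an 8-CPU node (the uptime output is illustrative):

uptime
 17:25:30 up 10 days, 3:12, 2 users, load average: 6.10, 5.50, 5.20

Average load = (6.10 + 5.50 + 5.20) / (3 x 8) ≈ 0.70, which would be healthy under typical thresholds.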

Step 2 Expand the system capacity, for example, by adding nodes.

----End

Process in the D Status

Indicator name: Uninterruptible Sleep Process

Indicator description: This indicator is used to check for uninterruptible sleep processes, that is, processes in the D state. Generally, a process in the D state is waiting for I/O (such as disk I/O or network I/O) but an I/O exception has occurred. If any process in the D state exists in the system, the indicator is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12028.

Hardware Status

Indicator name: Hardware Status

Indicator description: This indicator is used to check the status of system hardware, including CPUs, memory, disks, power supply units (PSUs), and fans. This indicator obtains related hardware information using ipmitool sdr elist. If the hardware status is abnormal, the indicator is unhealthy.

Recovery guidance:


Step 1 Log in to the unhealthy node. Run ipmitool sdr elist to view the system hardware status. The last column in the command output indicates the hardware status (see the example output after the table). If the status is included in the following fault description table, the indicator is unhealthy.

Module: Processor
Fault description: IERR; Thermal Trip; FRB1/BIST failure; FRB2/Hang in POST failure; FRB3/Processor startup/init failure; Configuration Error; SM BIOS Uncorrectable CPU-complex Error; Disabled; Throttled; Uncorrectable machine check exception

Module: Power Supply
Fault description: Failure detected; Predictive failure; Power Supply AC lost; AC lost or out-of-range; AC out-of-range, but present; Config Error: Vendor Mismatch; Config Error: Revision Mismatch; Config Error: Processor Missing; Config Error: Power Supply Rating Mismatch; Config Error: Voltage Rating Mismatch; Config Error

Module: Power Unit
Fault description: 240VA power down; Interlock power down; AC lost; Soft-power control failure; Failure detected; Predictive failure

Module: Memory
Fault description: Uncorrectable ECC; Parity; Memory Scrub Failed; Memory Device Disabled; Correctable ECC logging limit reached; Configuration Error; Throttled; Critical Overtemperature

Module: Drive Slot
Fault description: Drive Fault; Predictive Failure; Parity Check In Progress; In Critical Array; In Failed Array; Rebuild In Progress; Rebuild Aborted

Module: Battery
Fault description: Low; Failed
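For reference, a sketch of typical ipmitool sdr elist output (the sensor names and readings are illustrative):

ipmitool sdr elist
CPU1 Status      | 01h | ok  |  3.1 | Presence detected
PS1 Status       | 5Bh | ok  | 10.1 | Presence detected
FAN1             | 30h | ok  | 29.1 | 4200 RPM

The last column carries the status text; if it matches an entry in the fault description table above, the indicator is unhealthy.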

Step 2 If this indicator is abnormal, contact public cloud maintenance personnel to handle the problem.

----End

Host Name

Indicator name: Hostname

Indicator description: This indicator is used to check whether a host name is set. If no host name is set, the indicator is unhealthy. If this indicator is abnormal, you are advised to set a host name properly.

Recovery guidance:

Step 1 Log in to the unhealthy node.

Step 2 Run the following command to change the host name so that the node host name is consistent with the planned host name:

hostname <host name>

For example, to change the host name to Bigdata-OM-01, run the hostname Bigdata-OM-01 command.

Step 3 Modify the host name configuration file.

Run the vi /etc/HOSTNAME command to edit the file, change the file content to Bigdata-OM-01, save the modification, and exit.
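Putting the two steps together, a minimal sketch using the guide's example name (the echo form is an equivalent alternative to editing the file in vi):

hostname Bigdata-OM-01              # apply immediately
echo Bigdata-OM-01 > /etc/HOSTNAME  # persist across reboots
hostname                            # verify the new name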

----End

Umask

Indicator name: Umask

Indicator description: This indicator is used to check whether the umask of user omm is correctly set. If the umask is not set to 0077, the indicator is unhealthy.

Recovery guidance:

Step 1 If this indicator is abnormal, you are advised to set the umask of user omm to 0077. Log in to the unhealthy node, and run su - omm to switch to user omm.


Step 2 Run vi ${BIGDATA_HOME}/.om_profile, set umask to 0077, save the modification, and exit.
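A sketch of the full sequence (the .om_profile path is taken from this guide):

su - omm
vi ${BIGDATA_HOME}/.om_profile      # set: umask 0077
source ${BIGDATA_HOME}/.om_profile  # reload the profile
umask                               # should print 0077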

----End

OMS HA Status

Indicator name: OMS HA Status

Indicator description: This indicator is used to check whether the status of OMS HA resources is normal. To view the status of OMS HA resources, run ${CONTROLLER_HOME}/sbin/status-oms.sh. If any module is abnormal, the indicator is unhealthy.

Recovery guidance:

Step 1 Log in to the active management node, run su - omm to switch to user omm, and run ${CONTROLLER_HOME}/sbin/status-oms.sh to view the OMS status.

Step 2 If floatip, okerberos, and oldap are abnormal, see ALM-12002, ALM-12004, and ALM-12005 respectively to resolve the problems.

Step 3 If other resources are abnormal, you are advised to view the logs of the faulty modules.

- If the controller resource is abnormal, view the /var/log/Bigdata/controller/controller.log log file of the faulty node.
- If the cep resource is abnormal, view the /var/log/Bigdata/omm/oms/cep/cep.log log file of the faulty node.
- If the aos resource is abnormal, view the /var/log/Bigdata/controller/aos/aos.log log file of the faulty node.
- If the feed_watchdog resource is abnormal, view the /var/log/Bigdata/watchdog/watchdog.log log file of the faulty node.
- If the httpd resource is abnormal, view the /var/log/Bigdata/httpd/error_log log file of the faulty node.
- If the fms resource is abnormal, view the /var/log/Bigdata/omm/oms/fms/fms.log log file of the faulty node.
- If the pms resource is abnormal, view the /var/log/Bigdata/omm/oms/pms/pms.log log file of the faulty node.
- If the iam resource is abnormal, view the /var/log/Bigdata/omm/oms/iam/iam.log log file of the faulty node.
- If the gaussDB resource is abnormal, view the /var/log/Bigdata/omm/oms/db/omm_gaussdba.log log file of the faulty node.
- If the ntp resource is abnormal, view the /var/log/Bigdata/omm/oms/ha/scriptlog/ha_ntp.log log file of the faulty node.
- If the tomcat resource is abnormal, view the /var/log/Bigdata/tomcat/catalina.log log file of the faulty node.
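For example, to review the most recent entries of a faulty module's log, the controller log path from the list above can be inspected as follows:

tail -n 100 /var/log/Bigdata/controller/controller.log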


Step 4 If the problem cannot be resolved by viewing logs, contact public cloud maintenance personnel and send the collected fault logs.

----End

Installation Directory and Data Directory Check

Indicator name: Installation Directory and Data Directory

Indicator description: This indicator first checks the lost+found directory in the root directory of the disk partition where the installation directory (/opt/Bigdata by default) is located. Related files are moved to the lost+found directory when an exception occurs on a node, so if files of user omm exist in that directory, files may have been lost; this part of the check detects such scenarios. The indicator then checks the installation directory (such as /opt/Bigdata) and the data directory (such as /srv/BigData). If files of non-omm users exist in either directory, the indicator is unhealthy.

Recovery guidance:

Step 1 Log in to the unhealthy node, and run su - omm to switch to user omm. Check whether files or folders of user omm exist in the lost+found directory.

If files of user omm exist, restore them to the correct directory and perform the check again. If files of user omm do not exist, go to Step 2.

Step 2 Check whether files or folders of non-omm users exist in the installation directory and data directory. If such files exist and are temporary files generated manually, clear them and perform the check again.

----End

CPU Usage

Indicator name: CPU Usage

Indicator description: This indicator is used to check whether the CPU usage exceeds the threshold. If the CPU usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12016.

Memory Usage

Indicator name: Memory Usage

Indicator description: This indicator is used to check whether the memory usage exceeds the threshold. If the memory usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12018.

Host Disk Usage

Indicator name: Host Disk Usage

Indicator description: This indicator is used to check whether the host disk usage exceeds the threshold. If the host disk usage exceeds the threshold, the indicator is unhealthy.


Recovery guidance: If the indicator is abnormal, the system generates an alarm. You are advised to handle the alarm according to ALM-12017.

Host Disk Write Rate

Indicator name: Host Disk Write Speed

Indicator description: This indicator is used to check the host disk write rate. The host disk write rate varies according to the service scenario; this indicator only reports the measured value. You need to determine whether the value is normal based on the service scenario.

Recovery guidance: Determine whether the disk write rate is normal based on the service scenario.

Host Disk Read Rate

Indicator name: Host Disk Read Speed

Indicator description: This indicator is used to check the host disk read rate. The host disk read rate varies according to the service scenario; this indicator only reports the measured value. You need to determine whether the value is normal based on the service scenario.

Recovery guidance: Determine whether the disk read rate is normal based on the service scenario.

Host Service Plane Network Status

Indicator name: Host service plane network status

Indicator description: This indicator is used to check the network connectivity of the cluster host service plane. If the host service plane network is disconnected, the indicator is unhealthy.

Recovery guidance: If the network is a single-plane network, check the IP address of the single plane. If the network is a dual-plane network, the recovery procedure is as follows:

Step 1 Check the network connectivity between the service plane IP addresses of the active and standby management nodes.

If the network is abnormal, go to Step 3.

If the network is normal, go to Step 2.

Step 2 Check the network connectivity between the IP addresses of the active management node and the faulty node in the cluster.

Step 3 If the network is abnormal, contact public cloud maintenance personnel to resolve the network problem.

----End

Host Status

Indicator name: Host Status

Indicator description: This indicator is used to check whether the host status is normal. If a node is faulty, the indicator is unhealthy.


Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to ALM-12006.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists on the host. If an uncleared alarm exists, the host is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.9 HDFS Health Check

Average Packet Sending Time

Indicator name: Average transfer time of sending packets statistics

Indicator description: This indicator specifies the average time for a DataNode in HDFS to send packets. If the average packet sending time exceeds 2,000,000 nanoseconds, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, check whether the cluster network speed is normal and whether the memory or CPU usage is too high. You also need to check whether the HDFS load in the cluster is too high.

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of HDFS is normal. If a node is faulty, the service is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to check whether the status of the KrbServer, LdapServer, and ZooKeeper services is Bad. If the service status is Bad, rectify the fault. Then check whether a file write failure occurs because HDFS SafeMode is ON, use the HDFS client to check whether data can be written into HDFS, and find the causes of any HDFS data write failure. Finally, rectify the fault according to the alarm help.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists in the service. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, rectify the fault according to the alarm help.

6.10.10 Hive Health Check

Maximum Number of Sessions Allowed by HiveServer

Indicator name: Maximum Number of Sessions Allowed by the HiveServer

MapReduce ServiceUser Guide 6 MRS Manager Operation Guide

Issue 01 (2018-09-06) 362

Page 373: User Guide · MapReduce Service

Indicator description: This indicator is used to check the maximum number of sessions allowed by Hive.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Number of Sessions Connected to HiveServer

Indicator name: Number of Sessions Connected to the HiveServer

Indicator description: This indicator is used to check the number of Hive connections.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of Hive is in the normal state. If the service status is abnormal, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists on the host. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.11 Kafka Health Check

Number of Available Broker Nodes

Indicator name: Number of Brokers

Indicator description: Check the number of available Broker nodes in the cluster. If the number is less than two, the check result is unhealthy.

Recovery guidance: If the indicator is abnormal, go to the Kafka service instance page, click the host name of the unavailable Broker, and check the host health status in the summary area. If the status is good, see ALM-12007 Process Fault to handle the alarm. If the status is not good, see ALM-12006 Node Fault to handle the alarm.

Service Health Status

Indicator name: Service Status

Indicator description: Check the status of the Kafka service. If the status is abnormal, the check result is unhealthy.


Recovery guidance: If the indicator is abnormal, see Kafka Service Unavailable to handle the alarm.

Alarm Check

Indicator name: Alarm information

Indicator description: Check whether an uncleared alarm exists in the service. If an uncleared alarm exists on the host, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.12 KrbServer Health Check

KerberosAdmin Service Availability Check

Indicator name: KerberosAdmin Service Availability

Indicator description: This indicator is used to check the status of the KerberosAdmin service. If the service status is abnormal, the KerberosAdmin service is unavailable.

Recovery guidance: If this indicator is abnormal, the possible cause is that the node where the KerberosAdmin service is located is faulty or the SlapdServer service is unavailable. During the KerberosAdmin service recovery, try the following operations:

Step 1 Check whether the node where the KerberosAdmin service is located is faulty.

Step 2 Check whether the SlapdServer service is unavailable.

----End

KerberosServer Service Availability Check

Indicator name: KerberosServer Service Availability

Indicator description: This indicator is used to check the status of the KerberosServer service. If the service status is abnormal, the KerberosServer service is unavailable.

Recovery guidance: If this indicator is abnormal, the possible cause is that the node where the KerberosServer service is located is faulty or the SlapdServer service is unavailable. During the KerberosServer service recovery, try the following operations:

Step 1 Check whether the node where the KerberosServer service is located is faulty.

Step 2 Check whether the SlapdServer service is unavailable.

----End

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check the status of the KrbServer service. If the service status is abnormal, the KrbServer service is unavailable.


Recovery guidance: If this indicator is abnormal, the possible cause is that the node where the KrbServer service is located is faulty or the LdapServer service is unavailable. For details, see ALM-25500.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check the alarm information of the KrbServer service. If any alarm exists, the KrbServer service may be abnormal.

Recovery guidance: If this indicator is abnormal, see the related alarm document to handle the alarm.

6.10.13 LdapServer Health Check

SlapdServer Service Availability Check

Indicator name: SlapdServer Service Availability

Indicator description: This indicator is used to check the status of the SlapdServer service. If the service status is abnormal, the SlapdServer service is unavailable.

Recovery guidance: If this indicator is abnormal, the possible cause is that the node where the SlapdServer service is located is faulty or the SlapdServer process is faulty. During the SlapdServer service recovery, try the following operations:

Step 1 Check whether the node where the SlapdServer service is located is faulty. For details, see ALM-12006.

Step 2 Check whether the SlapdServer process is running properly. For details, see ALM-12007.

----End

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check the status of the LdapServer service. If the service status is abnormal, the LdapServer service is unavailable.

Recovery guidance: If this indicator is abnormal, the possible cause is that the node where the active LdapServer service is located is faulty or the active LdapServer process is faulty. For details, see ALM-25000.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check the alarm information of the LdapServer service. If any alarm exists, the LdapServer service may be abnormal.

Recovery guidance: If this indicator is abnormal, see the related alarm document to handle the alarm.


6.10.14 Loader Health Check

ZooKeeper Health Status

Indicator name: ZooKeeper Health Status

Indicator description: Check whether the ZooKeeper service is normal. If the ZooKeeper service is unhealthy, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

HDFS Health Status

Indicator name: HDFS Health Status

Indicator description: Check whether the HDFS service is normal. If the HDFS service is unhealthy, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

DBService Health Status

Indicator name: DBService Health Status

Indicator description: Check whether the DBService service is normal. If the DBService service is unhealthy, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Yarn Health Status

Indicator name: Yarn Health Status

Indicator description: Check whether the Yarn service is normal. If the Yarn service is unhealthy, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

MapReduce Health Status

Indicator name: MapReduce Health Status

Indicator description: Check whether the MapReduce service is normal. If the MapReduce service is unhealthy, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Loader Process Health Status

Indicator name: Loader Process Health Status

MapReduce ServiceUser Guide 6 MRS Manager Operation Guide

Issue 01 (2018-09-06) 366

Page 377: User Guide · MapReduce Service

Indicator description: Check whether the Loader process status is normal. If the process status is abnormal, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Service Health Status

Indicator name: Service Status

Indicator description: Check whether the Loader service is normal. If the service is abnormal, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Alarm Check

Indicator name: Alarm information

Indicator description: Check whether an uncleared alarm exists on the host. If an uncleared alarm exists on the host, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.15 MapReduce Health Check

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of MapReduce is in the normal state. If the service status is abnormal, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists in the service. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.16 OMS Health Check

OMS Status Check

Indicator name: OMS Status

Indicator description: The OMS status check includes the HA status check and the resource status check. The values of the HA status include active, standby, and NULL, indicating the active node, standby node, and unknown status, respectively. The values of the resource status include normal, abnormal, and NULL. If the HA status is NULL, the OMS is unhealthy. If the resource status is NULL or abnormal, the OMS is unhealthy.

Table 6-19 OMS status description

HA status:
- active indicates that the node is an active node.
- standby indicates that the node is a standby node.
- NULL indicates that the status is unknown.

Resource status:
- normal indicates that all resources are in the normal state.
- abnormal indicates that all resources are abnormal.
- NULL indicates that the status is unknown.

Recovery guidance:

Step 1 Log in to the active management node, and run su - omm to switch to user omm. Run ${CONTROLLER_HOME}/sbin/status-oms.sh to view the status of OMS.

Step 2 If the HA status is NULL, the system may be restarting. NULL is an intermediate status, and the HA status will automatically change to a normal value.

Step 3 If the resource status is abnormal, certain MRS Manager component resources are abnormal. Check whether the status of components such as acs, aos, cep, controller, feed_watchdog, fms, gaussDB, httpd, iam, ntp, okerberos, oldap, pms, and tomcat is normal.

Step 4 If any MRS Manager component resource is abnormal, see the information about the MRS Manager component status check to rectify the fault.

----End

Manager Component Status Check

Indicator name: Manager Component Status

Indicator description: The Manager component status check includes the component resource running status check and the resource HA status check. The values of the resource running status include Normal, Abnormal, and others, and the values of the resource HA status include Normal, Exception, and others. The Manager components include acs, aos, cep, controller, feed_watchdog, floatip, fms, gaussDB, heartBeatCheck, httpd, iam, ntp, okerberos, oldap, pms, and tomcat. When the running status and HA status are not Normal, the MRS Manager components are unhealthy.


Table 6-20 Manager component status description

Resource running status:
- Normal indicates that the resource is running properly.
- Abnormal indicates that the resource is running abnormally.
- Stopped indicates that the resource is stopped.
- Unknown indicates that the resource status is unknown.
- Starting indicates that the resource is being started.
- Stopping indicates that the resource is being stopped.
- Active_normal indicates that the resource is running properly as the active module.
- Standby_normal indicates that the resource is running properly as the standby module.
- Raising_active indicates that the resource is being raised to the active module.
- Lowing_standby indicates that the resource is being lowered to the standby module.
- No_action indicates that no action is performed.
- Repairing indicates that the resource is being repaired.
- NULL indicates that the resource status is unknown.

Resource HA status:
- Normal indicates that the resource HA is normal.
- Exception indicates that the resource HA is faulty.
- Non_steady indicates that the resource HA is not steady.
- Unknown indicates that the resource HA status is unknown.
- NULL indicates that the resource HA status is null.

Recovery guidance:

Step 1 Log in to the active management node, and run su - omm to switch to user omm. Run ${CONTROLLER_HOME}/sbin/status-oms.sh to view the status of OMS.

Step 2 If floatip, okerberos, and oldap are abnormal, see ALM-12002, ALM-12004, and ALM-12005 respectively to resolve the problems.

Step 3 If other resources are abnormal, you are advised to view the logs of the faulty modules.

- If the controller resource is abnormal, view the /var/log/Bigdata/controller/controller.log log file of the faulty node.
- If the cep resource is abnormal, view the /var/log/Bigdata/omm/oms/cep/cep.log log file of the faulty node.
- If the aos resource is abnormal, view the /var/log/Bigdata/controller/aos/aos.log log file of the faulty node.
- If the feed_watchdog resource is abnormal, view the /var/log/Bigdata/watchdog/watchdog.log log file of the faulty node.
- If the httpd resource is abnormal, view the /var/log/Bigdata/httpd/error_log log file of the faulty node.
- If the fms resource is abnormal, view the /var/log/Bigdata/omm/oms/fms/fms.log log file of the faulty node.
- If the pms resource is abnormal, view the /var/log/Bigdata/omm/oms/pms/pms.log log file of the faulty node.
- If the iam resource is abnormal, view the /var/log/Bigdata/omm/oms/iam/iam.log log file of the faulty node.
- If the gaussDB resource is abnormal, view the /var/log/Bigdata/omm/oms/db/omm_gaussdba.log log file of the faulty node.
- If the ntp resource is abnormal, view the /var/log/Bigdata/omm/oms/ha/scriptlog/ha_ntp.log log file of the faulty node.
- If the tomcat resource is abnormal, view the /var/log/Bigdata/tomcat/catalina.log log file of the faulty node.

Step 4 If the problem cannot be resolved by viewing logs, contact public cloud maintenance personnel and send the collected fault logs.

----End

OMA Running Status

Indicator name: OMA Status

Indicator description: This indicator is used to check the running status of the OMA. The values include running and not running. If the value is not running, the OMA is unhealthy.

Recovery guidance:

Step 1 Log in to the unhealthy node, and run su - omm to switch to user omm.

Step 2 Run ${OMA_PATH}/restart_oma_app to start the OMA manually, and perform the check again. If the check result is still unhealthy, go to Step 3.

Step 3 If the problem cannot be resolved by manually starting the OMA, you are advised to view and analyze the OMA log file /var/log/Bigdata/omm/oma/omm_agent.log.

Step 4 If the problem cannot be resolved by viewing logs, contact public cloud maintenance personnel and send the collected fault logs.

----End

SSH Trust Relationship Between Each Node and the Active Management Node

Indicator name: Authentication (the authentication between the OMS node and each cluster node)

Indicator description: This indicator is used to check whether the SSH trust relationship is normal. If you can use SSH to log in to other nodes from the active management node as user omm without entering a password, the SSH trust relationship is healthy; otherwise, it is unhealthy. If you can use SSH to log in to other nodes from the active management node but cannot use SSH to log in to the active management node from other nodes, the SSH trust relationship is also unhealthy.

Recovery guidance:

Step 1 If this indicator is abnormal, the SSH trust relationship between a node and the active management node is abnormal. In this case, check whether the owner of the /home/omm directory is user omm. If other users have permission on the directory, the SSH trust relationship may be abnormal. You are advised to run chown omm:wheel to modify the ownership and perform the check again. If the permission on the /home/omm directory is normal, go to Step 2.
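A sketch of the ownership check and fix, followed by a trust verification (the target IP address is illustrative):

ls -ld /home/omm               # verify that the owner is omm
chown omm:wheel /home/omm      # fix the ownership if needed
su - omm
ssh 192.168.0.12 hostname      # should print the host name without prompting for a password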

Step 2 If the SSH trust relationship is abnormal, the heartbeat between the Controller and the NodeAgent will be abnormal. As a result, an alarm indicating a node failure will be generated. In this case, see ALM-12006 to handle the alarm.

----End

Process Running Time

Indicator name: NodeAgent Runtime, Controller Runtime, and Tomcat Runtime

Indicator description: These indicators are used to check the running time of the NodeAgent, Controller, and Tomcat processes. If the running time is less than half an hour (1800s), the process may have been restarted. You are advised to check again half an hour later. If the running time is still less than half an hour after multiple checks, the process is abnormal.

Recovery guidance:

Step 1 Log in to the unhealthy node, and run su - omm to switch to user omm.

Step 2 Run the following command to view the PID by process name:

ps -ef | grep NodeAgent

Step 3 Run the following command to view the process start time by PID:

ps -p pid -o lstart
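A sketch of Steps 2 and 3 combined (the PID and timestamps are illustrative):

ps -ef | grep NodeAgent | grep -v grep
omm  12345      1  0 Sep05 ?  00:10:11 ...
ps -p 12345 -o lstart
                 STARTED
Wed Sep  5 09:30:12 2018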

Step 4 Check whether the process start time is normal. If the process repeatedly restarts, go to Step 5.

Step 5 View the related logs and analyze restart causes.

- If the running time of NodeAgent is abnormal, check the /var/log/Bigdata/nodeagent/agentlog/agent.log log file.
- If the running time of Controller is abnormal, check the /var/log/Bigdata/controller/controller.log log file.
- If the running time of Tomcat is abnormal, check the /var/log/Bigdata/tomcat/web.log log file.

Step 6 If the problem cannot be resolved by viewing logs, contact public cloud maintenance personnel and send the collected fault logs.

----End


Account and Password Expiry Check

Indicator name: Account and Password Expiry Check

Indicator description: This indicator is used to check the two OS users omm and ommdba of the MRS system. For each OS user, this indicator checks the expiration time of the account and password. If the validity period of the account or password is less than or equal to 15 days, the check result is unhealthy.

Recovery guidance: If the account or password validity period is less than or equal to 15 days, you are advised to contact public cloud maintenance personnel to resolve the problem.

6.10.17 Spark Health Check

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of Spark is in the normal state. If the service status is abnormal, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help. For details, see ALM-28001.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists in the service. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.18 Storm Health Check

Number of Available Supervisors

Indicator name: Number of Available Supervisors

Indicator description: Check the number of Supervisors in the cluster. If the number of Supervisors in the cluster is less than one, the check result is unhealthy.

Recovery guidance: If the indicator is abnormal, go to the Storm service instance page, click the host name of the unavailable Supervisor instance, and check the host health status in the summary area. If the status is good, see ALM-12007 Process Fault to handle the alarm. If the status is not good, see ALM-12006 Node Fault to handle the alarm.

Number of Free Slots

Indicator name: Number of Free Slots

Indicator description: Check the number of idle slots in the cluster. If the number of idle slots in the cluster is less than one, the check result is unhealthy.

Recovery guidance: If the indicator is abnormal, go to the Storm service instance page and check the Supervisor instance health status. If the status is good, see the capacity expansion guide to expand capacity for the Storm service. If the status is not good, see ALM-12007 Process Fault to handle the alarm.

Service Health Status

Indicator name: Service Status

Indicator description: Check whether the status of the Storm service is normal. If the status is abnormal, the check result is unhealthy.

Recovery guidance: If the indicator is abnormal, see ALM-26051 Storm Service Unavailable to handle the alarm.

Alarm Information

Indicator name: Alarm information

Indicator description: Check whether an uncleared alarm exists in the service. If an uncleared alarm exists on the host, the system is unhealthy.

Recovery guidance: If the indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.19 Yarn Health Check

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of Yarn is in the normal state. If the number of NodeManager nodes cannot be obtained, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help and ensure that the network is normal.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists in the service. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.10.20 ZooKeeper Health Check

Average Latency of Request Processing by ZooKeeper

Indicator name: Average Latency

Indicator description: This indicator is used to check the average latency for the ZooKeeper service to process a request. If the average latency is greater than 300 ms, the indicator is unhealthy.


Recovery guidance: If this indicator is abnormal, check whether the cluster network speed is normal and whether the memory or CPU usage is too high.

Usage of ZooKeeper Connections

Indicator name: ZooKeeper Connections Usage

Indicator description: This indicator is used to check whether the memory usage of ZooKeeper exceeds 80%. If the memory usage exceeds the threshold, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to increase the memory for the ZooKeeper service. You can do so by increasing the value of -Xmx in the GC_OPTS parameter of the ZooKeeper service. After the modification, restart the ZooKeeper service.
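For illustration only, a hypothetical GC_OPTS fragment; the actual option string on your cluster may contain additional JVM flags, and only the -Xmx value should be increased for this purpose:

GC_OPTS = -Xms2G -Xmx4G   # raise -Xmx, save, then restart the ZooKeeper service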

Service Health Status

Indicator name: Service Status

Indicator description: This indicator is used to check whether the service status of ZooKeeper is in the normal state. If the service status is abnormal, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to check whether the status of the KrbServer and LdapServer services is Bad. If the service status is Bad, rectify the fault. Then log in to the ZooKeeper client to check whether data can be written into ZooKeeper, and find the causes of any ZooKeeper data write failure according to the error message. Finally, rectify the fault according to ALM-13000.

Alarm Check

Indicator name: Alarm information

Indicator description: This indicator is used to check whether an uncleared alarm exists in the service. If an uncleared alarm exists, the indicator is unhealthy.

Recovery guidance: If this indicator is abnormal, you are advised to rectify the fault according to the alarm help.

6.11 Static Service Pool Management

6.11.1 Viewing the Status of a Static Service Pool

Scenario

The big data management platform uses static service resource pools to manage and isolate service resources that are not running on Yarn. The platform dynamically manages the CPU, I/O, and memory capacity that can be used by HBase, HDFS, and Yarn on the deployment nodes. The system supports time-based automatic policy adjustment for static service resource pools, which enables a cluster to automatically adjust the parameters at different periods and thus use resources more efficiently.

On MRS Manager, users can view the monitoring indicators of the resources used by each service in static service pools. The following indicators are included:


- Overall CPU usage of a service
- Overall disk I/O read rate of a service
- Overall disk I/O write rate of a service
- Overall memory used by a service

Procedure

Step 1 On MRS Manager, click System. In the Resource area, click Configure Static Service Pool.

Step 2 Click Status.

Step 3 View the system resource adjustment base.

- System Resource Adjustment Base specifies the maximum amount of resources that can be used by services on each node in the cluster. If the node has only one service, this service exclusively uses the available resources on the node. If the node has multiple services, they share the available resources.
- CPU(%) specifies the maximum number of CPUs that can be used by services on the node.
- Memory(%) specifies the maximum memory that can be used by services on the node.

Step 4 View the usage of cluster service resources.

In the Real-Time Statistics area, select All Services. The resource usage of all services in the service pool is displayed in Real-Time Statistics.

NOTE

Effective Configuration Group specifies the resource control configuration group currently used by cluster services. By default, the default configuration group is used at all periods in a day. This configuration group specifies that cluster services can use all CPUs and 70% of the memory of a node.

Step 5 View the resource usage status of a single service.

In the Real-Time Statistics area, select a specified service. The resource usage of the service in the service pool is displayed in Real-Time Statistics.

Step 6 Set an interval for automatic page refreshing.

The following parameters are supported:

- Refresh every 30 seconds: refreshes the page once every 30 seconds.
- Refresh every 60 seconds: refreshes the page once every 60 seconds.
- Stop refreshing: stops page refreshing.

----End

6.11.2 Configuring a Static Service Pool

Scenario

Users can adjust the resource base on MRS Manager and customize a resource configuration group to control the node resources used by cluster services or to specify different node CPUs for cluster services at different periods.


Prerequisites

- After a static service pool is configured, the HDFS and Yarn services need to be restarted. The services are unavailable during the restart.
- After a static service pool is configured, the maximum amount of resources used by the services and their role instances cannot exceed the threshold.

Procedure

Step 1 Modify the resource adjustment base.

1. On MRS Manager, click System. In the Resource area, click Configure Static Service Pool.

2. Click Configuration. The management page of the service pool configuration group is displayed.

3. In System Resource Adjustment Base, modify the CPU(%) and Memory(%) parameters. You can restrict the maximum number of physical CPUs and memory resources that can be used by the HBase, HDFS, and Yarn services. If multiple services are deployed on the same node, the maximum percentage of physical resources used by all services cannot exceed the value of this parameter.

4. Click OK to complete the modification.

To modify the parameters again, click the edit icon on the right side of System Resource Adjustment Base.

Step 2 Modify the default configuration group of the service pool.

1. Click default, and set CPU LIMIT(%), CPU SHARE(%), I/O(%), and Memory(%) for the HBase, HDFS, and Yarn services in the Service Pool Configuration table.

NOTE

- The sum of CPU LIMIT(%) used by all services can exceed 100%.
- The sum of CPU SHARE(%) and the sum of I/O(%) used by all services must each be 100%. For example, if CPU resources are allocated to the HDFS, HBase, and Yarn services, the total percentage of the CPU resources allocated to the services must be 100%.
- The sum of Memory(%) used by all services can be greater than, smaller than, or equal to 100%.
- Memory(%) cannot take effect dynamically. This parameter can only be modified in the default configuration group.

2. Click OK to complete the modification. The correct values of the service pool parameters are generated by MRS Manager in Detailed Configuration based on the cluster hardware resources and their distribution.

To modify the parameters again, click the edit icon on the right side of Service Pool Configuration.

3. Click the edit icon on the right side of Detailed Configuration to change the parameter values of the service pool. After you click the name of a specified service in Service Pool Configuration, only the parameters of this service are displayed in Detailed Configuration. The displayed resource usage is not updated when you change the parameter values manually. For parameters that take effect dynamically, their names displayed in a newly added configuration group contain the ID of the configuration group, for example, HBase : RegionServer : dynamic-config1.RES_CPUSET_PERCENTAGE. These parameters function in the same way as those in the default configuration group.

Table 6-21 Static service pool parameters

- RES_CPUSET_PERCENTAGE / dynamic-configX.RES_CPUSET_PERCENTAGE: Specifies the CPU percentage used by a service.
- RES_CPU_SHARE / dynamic-configX.RES_CPU_SHARE: Specifies the CPU share used by a service.
- RES_BLKIO_WEIGHT / dynamic-configX.RES_BLKIO_WEIGHT: Specifies the I/O weight used by a service.
- HBASE_HEAPSIZE: Specifies the maximum JVM memory of RegionServer.
- HADOOP_HEAPSIZE: Specifies the maximum JVM memory of DataNode.
- dfs.datanode.max.locked.memory: Specifies the size of the cached memory block replica of DataNode in the memory.
- yarn.nodemanager.resource.memory-mb: Specifies the memory that can be used by NodeManager on the current node.

Step 3 Add a customized resource configuration group.

1. Determine whether to implement time-based automatic resource configuration adjustment.
   If yes, go to Step 3.2.
   If no, go to Step 4.

2. Click the add button to add a resource configuration group. In Scheduling Time, click the edit icon to open the time policy configuration page. Modify the following parameters and click OK to save the modification.
   – Repeat: If Repeat is selected, the resource configuration group runs periodically according to a schedule. If Repeat is not selected, you need to set a date and time for the resource configuration group to take effect.
   – Repeat On: Daily, Weekly, and Monthly are supported. This parameter takes effect only in Repeat mode.
   – Between: This parameter specifies the start time and end time for the resource configuration group to take effect. Set this parameter to a unique time segment. If the value is the same as the time segment set for an existing configuration group, the settings cannot be saved. This parameter takes effect only in Repeat mode.


NOTE

– The default configuration group takes effect in all undefined time periods.
– A newly added configuration group is a set of configuration items that takes effect dynamically in a specified time range.
– A newly added configuration group can be deleted. A maximum of four configuration groups that take effect dynamically can be added.
– For any type of Repeat On, if the end time is earlier than the start time, the end time on the second day is adopted by default. For example, 22:00 to 6:00 indicates that the scheduling time range is from 22:00 on the current day to 06:00 on the next day.
– If the types of Repeat On for multiple configuration groups are different, the time segments can overlap. Monthly has the highest priority, Weekly the second highest, and Daily the lowest. Therefore, if there are two scheduling configuration groups, one Monthly with a time segment from 04:00 to 07:00 and the other Daily with a time segment from 06:00 to 08:00, the Monthly configuration group takes precedence.
– If the types of Repeat On for multiple configuration groups are the same, the time segments can overlap when the dates are different. For example, if two Weekly scheduling configuration groups exist, their time segments can be set to 04:00 to 07:00 on Monday and 04:00 to 07:00 on Wednesday.

3. Modify the resource configuration of each service in Service Pool Configuration, click OK, and go to Step 4.

You can click the edit icon to modify the parameters again. You can click the edit icon in Detailed Configuration to manually update the parameter values generated by the system based on service requirements.

Step 4 Save the configuration.

Click Save, select Restart the affected services or instances in the Save Configuration window, and click OK.

When Operation succeeded is displayed, click Finish.

----End

6.12 Tenant Management

6.12.1 Introduction

Definition

An MRS cluster provides various resources and services for multiple organizations, departments, or applications to share. The cluster provides tenants as a logical entity to use these resources and services. A mode involving different tenants is called multi-tenant mode. Currently, tenants are supported by analysis clusters only.

Principles

The MRS cluster provides the multi-tenant function. It supports a layered tenant model and allows tenants to be added or deleted dynamically to isolate resources. It dynamically manages and configures tenants' computing and storage resources.

The computing resources indicate tenants' Yarn task queue resources. The task queue quota can be modified, and the task queue usage status and statistics can be viewed.


Storage resources support HDFS storage. Tenants' HDFS storage directories can be added or deleted, and the quotas for file quantity and storage space of the directories can be configured.

As the unified tenant management platform of the MRS cluster, MRS Manager provides a mature multi-tenant management model for enterprises, implementing centralized tenant and service management. Users can create and manage tenants in the cluster.

- Roles, computing resources, and storage resources are automatically created when tenants are created. By default, all rights on the new computing and storage resources are assigned to the tenant roles.
- By default, the permission to view tenant resources, create sub-tenants, and manage sub-tenant resources is assigned to the tenant roles.
- After tenants' computing or storage resources are modified, the related role rights are updated automatically.

MRS Manager supports a maximum of 512 tenants. The tenants created by default in the system include the default tenant. Tenants at the topmost layer, together with the default tenant, are called level-1 tenants.

Resource Pool

Yarn task queues support only the label-based scheduling policy. This policy enables Yarn task queues to be associated with NodeManagers that have specific node labels, so that Yarn tasks run on specified nodes and utilize the intended hardware resources. For example, Yarn tasks requiring a large memory capacity can run on nodes with a large memory capacity by means of label association, preventing poor service performance.

On the MRS cluster, users can logically divide Yarn cluster nodes to combine multiple NodeManagers into a resource pool. Yarn task queues can be associated with specified resource pools by configuring queue capacity policies, ensuring efficient and independent resource utilization in the resource pools.

MRS Manager supports a maximum of 50 resource pools. The system has a Default resource pool.

6.12.2 Creating a Tenant

Scenario

You can create a tenant on MRS Manager to specify the resource usage.

Prerequisites

- A tenant name has been planned. The name must not be the same as that of a role or Yarn queue that exists in the current cluster.
- If a tenant requires storage resources, a storage directory has been planned in advance based on service requirements, and the planned directory does not exist under the HDFS directory.
- The resources that can be allocated to the current tenant have been planned, and the sum of the resource percentages of direct sub-tenants under the parent tenant at every level does not exceed 100%.


Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 Click Create Tenant. On the displayed page, configure the tenant attributes according to the following table.

Table 6-22 Tenant parameters

- Name: Specifies the name of the current tenant. The value consists of 3 to 20 characters, and can contain letters, digits, and underscores (_).
- Tenant Type: The options include Leaf and Non-leaf. If Leaf is selected, the current tenant is a leaf tenant and no sub-tenant can be added. If Non-leaf is selected, sub-tenants can be added to the current tenant.
- Dynamic Resource: Specifies the dynamic computing resources for the current tenant. The system automatically creates a task queue in Yarn, and the queue is given the same name as the tenant. If dynamic resources are not Yarn resources, the system does not automatically create a task queue.
- Default Resource Pool Capacity (%): Specifies the percentage of the computing resources used by the current tenant in the default resource pool.
- Default Resource Pool Max. Capacity (%): Specifies the maximum percentage of the computing resources used by the current tenant in the default resource pool.
- Storage Resource: Specifies the storage resources for the current tenant. The system automatically creates a folder with the same name as the tenant in the /tenant directory. When the tenant is created, the system automatically creates the /tenant directory under the root directory of HDFS. If storage resources are not HDFS, the system does not create a storage directory under the root directory of HDFS.
- Space Quota (MB): Specifies the quota for the HDFS storage space used by the current tenant. The value ranges from 1 to 8796093022208, in MB. This parameter indicates the maximum HDFS storage space that can be used by the tenant, not the actual space used. If the value is greater than the size of the HDFS physical disk, the maximum available space is the full space of the HDFS physical disk.
  NOTE: To ensure data reliability, two more copies of a file are automatically generated when the file is stored in HDFS; that is, three copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if Space Quota is set to 500, the actual space for storing files is about 166 MB (500/3 = 166).
- Storage Path: Specifies the tenant's HDFS storage directory. The system automatically creates a folder with the same name as the tenant in the /tenant directory. For example, the default HDFS storage directory for tenant ta1 is /tenant/ta1. When the tenant is created, the system automatically creates the /tenant directory under the root directory of HDFS. The storage path is customizable.
- Service: Specifies other service resources associated with the current tenant. HBase is supported. To configure this parameter, click Associate Services. In the dialog box that is displayed, set Service to HBase. If Association Mode is set to Exclusive, service resources are occupied exclusively; if Share is selected, service resources are shared.
- Description: Specifies the description of the current tenant.

Step 3 Click OK to save the settings.

It takes a few minutes to save the settings. If Tenant created successfully is displayed in the upper-right corner, the tenant is added successfully.

NOTE

- Roles, computing resources, and storage resources are automatically created when tenants are created.
- The new role has the rights on the computing and storage resources. The role and the rights are controlled by the system automatically and cannot be controlled manually under Manage Role.
- If you want to use the tenant, create a system user and assign the Manager_tenant role and the role corresponding to the tenant to the user. For details, see Creating a User.

----End

Related Tasks

Viewing an added tenant

Step 1 On MRS Manager, click Tenant.

Step 2 In the tenant list on the left, click the name of an added tenant.

The Summary tab is displayed on the right by default.

Step 3 View Basic Information, Resource Quota, and Statistics of the tenant.

If HDFS is in the Stopped state, Available and Usage of Space in Resource Quota are unknown.

----End


6.12.3 Creating a Sub-tenant

Scenario

You can create a sub-tenant on MRS Manager if the resources of the current tenant need to be further allocated.

Prerequisites

- A parent tenant has been added.
- A tenant name has been planned. The name must not be the same as that of a role or Yarn queue that exists in the current cluster.
- If a sub-tenant requires storage resources, a storage directory has been planned in advance based on service requirements, and the planned directory does not exist under the storage directory of the parent tenant.
- The resources that can be allocated to the current tenant have been planned, and the sum of the resource percentages of direct sub-tenants under the parent tenant at each level does not exceed 100%.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 In the tenant list on the left, move the cursor to the tenant node to which a sub-tenant is to be added. Click Create sub-tenant. On the displayed page, configure the sub-tenant attributes according to the following table.

Table 6-23 Sub-tenant parameters

- Parent tenant: Specifies the name of the parent tenant.

- Name: Specifies the name of the current tenant. The value consists of 3 to 20 characters and can contain letters, digits, and underscores (_).

- Tenant Type: The options include Leaf and Non-leaf. If Leaf is selected, the current tenant is a leaf tenant and no sub-tenant can be added. If Non-leaf is selected, sub-tenants can be added to the current tenant.

- Dynamic Resource: Specifies the dynamic computing resources for the current tenant. The system automatically creates a task queue in the Yarn parent tenant queue, and the task queue name is the same as the name of the sub-tenant. If dynamic resources are not Yarn resources, the system does not automatically create a task queue. If the parent tenant does not have dynamic resources, the sub-tenant cannot use dynamic resources.

- Default Resource Pool Capacity (%): Specifies the percentage of the computing resources used by the current tenant. The base value is the total resources of the parent tenant.

- Default Resource Pool Max. Capacity (%): Specifies the maximum percentage of the computing resources used by the current tenant. The base value is the total resources of the parent tenant.

- Storage Resource: Specifies the storage resources for the current tenant. The system automatically creates a folder with the same name as the sub-tenant in the HDFS directory of the parent tenant. If storage resources are not HDFS, the system does not create a storage directory under the HDFS directory. If the parent tenant does not have storage resources, the sub-tenant cannot use storage resources.

- Space Quota (MB): Specifies the quota for HDFS storage space used by the current tenant. The minimum value is 1; the maximum value is the entire space quota of the parent tenant. The unit is MB. This parameter indicates the maximum HDFS storage space that can be used by the tenant, not the actual space used. If the value is greater than the size of the HDFS physical disk, the maximum space available is the full space of the HDFS physical disk. If this quota is greater than the quota of the parent tenant, the actual storage space is still limited by the quota of the parent tenant.
  NOTE: To ensure data reliability, two more copies of a file are automatically generated when the file is stored in HDFS; that is, three copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if Space Quota is set to 500, the actual space for storing files is about 166 MB (500/3 ≈ 166).

- Storage Path: Specifies the tenant's HDFS storage directory. The system automatically creates a folder with the same name as the sub-tenant in the parent tenant directory. For example, if the sub-tenant is ta1s and the parent directory is tenant/ta1, the system sets this parameter for the sub-tenant to tenant/ta1/ta1s by default. The storage path is customizable within the parent directory; the parent directory for the storage path must be the storage directory of the parent tenant.

- Service: Specifies other service resources associated with the current tenant. HBase is supported. To configure this parameter, click Associate Services. In the dialog box that is displayed, set Service to HBase. If Association Mode is set to Exclusive, service resources are occupied exclusively; if Share is selected, service resources are shared.

- Description: Specifies the description of the current tenant.

Step 3 Click OK to save the settings.


It takes a few minutes to save the settings. If Tenant created successfully is displayed in the upper-right corner, the tenant is added successfully.

NOTE

- Roles, computing resources, and storage resources are automatically created when tenants are created.
- The new role has the rights on the computing and storage resources. The role and the rights are controlled by the system automatically and cannot be controlled manually under Manage Role.
- When using this tenant, create a system user and assign the user a related tenant role. For details, see Creating a User.

----End

6.12.4 Deleting a Tenant

Scenario

On MRS Manager, you can delete a tenant that is not required.

Prerequisites

- A tenant has been added.
- You have checked whether the tenant to be deleted has sub-tenants. If it has sub-tenants, delete them first; otherwise, you cannot delete the tenant.
- The role of the tenant to be deleted is not associated with any user or user group. For details about how to cancel the binding between roles and users, see Modifying User Information.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 In the tenant list on the left, move the cursor to the tenant node to be deleted. Click Delete.

The Delete Tenant dialog box is displayed. To save tenant data, select Reserve the data of this tenant. Otherwise, the tenant storage space will be deleted.

Step 3 Click OK.

It takes a few minutes to save the configuration. The tenant is deleted successfully, and the tenant's role and storage space are deleted.

NOTE

- After the tenant is deleted, the tenant's task queue persists in Yarn.
- If you choose not to reserve data when deleting the parent tenant, data of sub-tenants is also deleted if the sub-tenants use storage resources.

----End


6.12.5 Managing a Tenant Directory

Scenario

You can manage the HDFS storage directories used by a specific tenant on MRS Manager. The management operations include adding a tenant directory, modifying the quotas for directory file quantity and storage space, and deleting a directory.

Prerequisites

A tenant with HDFS storage resources has been added.

Procedure

- View a tenant directory.
  a. On MRS Manager, click Tenant.
  b. In the tenant list on the left, click the target tenant.
  c. Click the Resource tab.
  d. View the HDFS Storage table.
     - The Quota column indicates the quotas for the file and directory quantity of the tenant directory.
     - The Space Quota column indicates the storage space size of the tenant directory.

- Add a tenant directory.
  a. On MRS Manager, click Tenant.
  b. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be added.
  c. Click the Resource tab.
  d. In the HDFS Storage table, click Create Directory.
     - In Parent Directory, select a storage directory of a parent tenant. This parameter is valid for sub-tenants only. If the parent tenant has multiple directories, select any one of them.
     - Set Path to a tenant directory path.
       NOTE
       - If the current tenant is not a sub-tenant, the new path is created in the HDFS root directory.
       - If the current tenant is a sub-tenant, the new path is created in the specified parent directory.
       A complete HDFS storage path contains a maximum of 1023 characters. An HDFS directory name can contain digits, letters, spaces, and underscores (_). The name cannot start or end with a space.
     - Set Quota to the quotas for file and directory quantity. Quota is optional; its value ranges from 1 to 9223372036854775806.
     - Set Space Quota to the storage space size of the tenant directory. The value of Space Quota ranges from 1 to 8796093022208.


       NOTE
       To ensure data reliability, two more copies of a file are automatically generated when the file is stored in HDFS; that is, three copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if Space Quota is set to 500, the actual space for storing files is about 166 MB (500/3 ≈ 166). A CLI sketch of the underlying quota commands follows this procedure.
  e. Click OK. The system creates the tenant directory in the HDFS root directory.

- Modify tenant directory attributes.
  a. On MRS Manager, click Tenant.
  b. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be modified.
  c. Click the Resource tab.
  d. In the HDFS Storage table, click Modify in the Operation column of the specified tenant directory.
     - Set Quota to the quotas for file and directory quantity. Quota is optional; its value ranges from 1 to 9223372036854775806.
     - Set Space Quota to the storage space size of the tenant directory. The value of Space Quota ranges from 1 to 8796093022208.
       NOTE
       To ensure data reliability, two more copies of a file are automatically generated when the file is stored in HDFS; that is, three copies of the same file are stored by default. The HDFS storage space indicates the total disk space occupied by all these copies. For example, if Space Quota is set to 500, the actual space for storing files is about 166 MB (500/3 ≈ 166).
  e. Click OK.

- Delete a tenant directory.
  a. On MRS Manager, click Tenant.
  b. In the tenant list on the left, click the tenant whose HDFS storage directory needs to be deleted.
  c. Click the Resource tab.
  d. In the HDFS Storage table, click Delete in the Operation column of the specified tenant directory.
     The default HDFS storage directory configured during tenant creation cannot be deleted. Only new HDFS storage directories can be deleted.
  e. Click OK. The tenant directory is deleted.
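The quotas that Manager maintains for tenant directories correspond to the standard HDFS name quota and space quota. A minimal sketch using the stock HDFS CLI (the tenant path is hypothetical, and on MRS these values should be changed only through MRS Manager):

# Limit the directory to 1000000 files and directories, and to 500 MB of raw (replicated) capacity.
hdfs dfsadmin -setQuota 1000000 /tenant/ta1
hdfs dfsadmin -setSpaceQuota 500m /tenant/ta1
# Display the quotas and the current usage of the directory.
hadoop fs -count -q -h /tenant/ta1

Because the space quota is checked against replicated bytes, a 500 MB quota holds only about 166 MB of file data at the default replication factor of 3.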

6.12.6 Recovering Tenant Data

Scenario

Tenant data is stored on MRS Manager and in cluster components by default. When components are recovered from faults or reinstalled, some tenant configuration data may be abnormal. In this case, you can manually recover the tenant data.

MapReduce ServiceUser Guide 6 MRS Manager Operation Guide

Issue 01 (2018-09-06) 386

Page 397: User Guide · MapReduce Service

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 In the tenant list on the left, click a tenant node.

Step 3 Check the status of the tenant data.

1. In Summary, check the color of the circle on the left of Basic Information. Green indicates that the tenant is available, and gray indicates that the tenant is unavailable.
2. Click Resource, and check the color of the circle on the left of Yarn or HDFS Storage. Green indicates that the resource is available, and gray indicates that the resource is unavailable.
3. Click Service Association, and check the Status column of the associated service table. Good indicates that the component can provide services for the associated tenant, while Bad indicates that the component cannot.

4. If any of the preceding check items is abnormal, go to Step 4 to recover tenant data.

Step 4 Click Restore Tenant Data.

Step 5 In the Restore Tenant Data window, select one or more components whose data needs to be recovered, and click OK. The system automatically recovers the tenant data.

----End

6.12.7 Creating a Resource Pool

Scenario

On the MRS cluster, users can logically divide Yarn cluster nodes to combine multiple NodeManagers into a Yarn resource pool. Each NodeManager belongs to only one resource pool. The system contains a Default resource pool by default, and all NodeManagers that are not added to customized resource pools belong to it.

You can create a customized resource pool on MRS Manager and add hosts that have not been added to other customized resource pools to it.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 Click the Resource Pool tab.

Step 3 Click Create Resource Pool.

Step 4 In Create Resource Pool, set the attributes of the resource pool.

- Name: Enter a name for the resource pool. The name cannot be Default. It contains 1 to 20 characters and can consist of digits, letters, and underscores (_), but must not start with an underscore.
- Available Hosts: In the host list on the left, select the name of a specified host and click the add button to add the selected host to the resource pool. Only hosts in the cluster can be selected. The host list of a resource pool can be left blank.


Step 5 Click OK to save the settings.

Step 6 After the resource pool is created, users can view the Name, Members, Association Mode, vCore, and Memory in the resource pool list. Hosts that have been added to the customized resource pool are no longer members of the Default resource pool.

----End

6.12.8 Modifying a Resource Pool

Scenario

You can modify members of an existing resource pool on MRS Manager.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 Click the Resource Pool tab.

Step 3 Locate the row that contains the specified resource pool, and click Modify in the Operation column.

Step 4 In Modify Resource Pool, modify Added Hosts.

- Adding a host: Select the name of a specified host in the host list on the left and click the add button to add it to the resource pool.
- Deleting a host: Select the name of a specified host in the host list on the right and click the remove button to delete it from the resource pool. The host list of a resource pool can be left blank.

Step 5 Click OK to save the settings.

----End

6.12.9 Deleting a Resource Pool

Scenario

You can delete an existing resource pool on MRS Manager.

Prerequisites

- No queue in any cluster is using the resource pool to be deleted as the default resource pool. For details, see Configuring a Queue.
- Resource distribution policies of all queues have been cleared from the resource pool to be deleted. For details, see Clearing the Configuration of a Queue.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 Click the Resource Pool tab.


Step 3 Locate the row that contains the specified resource pool, and click Delete in the Operation column.

In the dialog box that is displayed, click OK.

----End

6.12.10 Configuring a Queue

Scenario

On MRS Manager, you can modify queue configurations for a specific tenant.

Prerequisites

A tenant associated with Yarn and allocated with dynamic resources has been added.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 Click the Dynamic Resource Plan tab.

Step 3 Click the Queue Configuration tab.

Step 4 In the tenant queue table, click Modify in the Operation column of the specified tenant queue.

NOTE

In the tenant list on the left of the Tenant Management tab, click the target tenant. In the displayed window, choose Resource. On the displayed page, click the modify button to open the queue configuration modification page.

Table 6-24 Queue configuration parameters

- Maximum Applications: Specifies the maximum number of applications. The value ranges from 1 to 2147483647.

- Maximum AM Resource Percent: Specifies the maximum percentage of resources that can be used to run the ApplicationMaster in a cluster. The value ranges from 0 to 1.

- Minimum User Limit Percent (%): Specifies the minimum percentage of user resource usage. The value ranges from 0 to 100.

- User Limit Factor: Specifies the limit factor of the maximum user resource usage. The maximum percentage of user resource usage is calculated by multiplying the limit factor by the percentage of the tenant's actual resource usage in the cluster. The minimum value is 0.

- State: Specifies the current status of a resource plan. The values are Running and Stopped.


- Default Resource Pool (Default Node Label Expression): Specifies the resource pool used by a queue. The default value is Default. If you want to change the resource pool, configure the queue capacity first. For details, see Configuring the Queue Capacity Policy of a Resource Pool.

A sketch mapping these parameters to open-source Capacity Scheduler properties follows this section.

----End
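For readers familiar with the open-source Capacity Scheduler, the parameters in Table 6-24 likely correspond to the standard capacity-scheduler.xml properties sketched below for a hypothetical queue ta1; this mapping is stated as an assumption for orientation only, and on MRS the values must be changed through MRS Manager rather than by editing the file:

yarn.scheduler.capacity.root.ta1.maximum-applications            (Maximum Applications)
yarn.scheduler.capacity.root.ta1.maximum-am-resource-percent     (Maximum AM Resource Percent)
yarn.scheduler.capacity.root.ta1.minimum-user-limit-percent      (Minimum User Limit Percent)
yarn.scheduler.capacity.root.ta1.user-limit-factor               (User Limit Factor)
yarn.scheduler.capacity.root.ta1.state                           (State)
yarn.scheduler.capacity.root.ta1.default-node-label-expression   (Default Resource Pool)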

6.12.11 Configuring the Queue Capacity Policy of a Resource Pool

Scenario

After a resource pool is added, the capacity policies of available resources need to be configured for Yarn task queues so that tasks in the resource pool run properly. Each queue can be configured with the queue capacity policy of only one resource pool. Users can view the queues in any resource pool and configure queue capacity policies. After the queue policies are configured, Yarn task queues and resource pools are associated.

Prerequisites

- A resource pool has been added.
- The task queues are not associated with other resource pools. By default, all task queues are associated with the Default resource pool.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 Click the Dynamic Resource Plan tab.

Step 3 In Resource Pool, select a specified resource pool.

Available Resource Quota: indicates that all resources in each resource pool are available for queues by default.

Step 4 Locate the specified queue in the Resource Allocation table, and click Modify in the Operation column.

Step 5 In Modify Resource Allocation, configure the resource capacity policy of the task queue in the resource pool.

- Capacity (%): specifies the percentage of the current tenant's computing resource usage.
- Maximum Capacity (%): specifies the percentage of the current tenant's maximum computing resource usage.

A sketch of the corresponding open-source properties follows this procedure.

Step 6 Click OK to save the settings.

----End
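In open-source terms, associating a queue's capacity with a resource pool likely maps to the per-node-label capacities of the Capacity Scheduler. A sketch under that assumption, with a hypothetical queue ta1 and pool label pool1 (on MRS, always set these values through MRS Manager):

yarn.scheduler.capacity.root.ta1.accessible-node-labels.pool1.capacity = 30
yarn.scheduler.capacity.root.ta1.accessible-node-labels.pool1.maximum-capacity = 60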


6.12.12 Clearing the Configuration of a Queue

Scenario

Users can clear the configuration of a queue on MRS Manager when the queue no longer needs resources from a resource pool or when a resource pool needs to be disassociated from the queue. Clearing the configuration of a queue means canceling the resource capacity policy of the queue.

Prerequisites

If a queue needs to be unbound from a resource pool, this resource pool cannot serve as the default resource pool of the queue. Therefore, you must first change the default resource pool of the queue to another one. For details, see Configuring a Queue.

Procedure

Step 1 On MRS Manager, click Tenant.

Step 2 Click the Dynamic Resource Plan tab.

Step 3 In Resource Pool, select a specified resource pool.

Step 4 Locate the specified queue in the Resource Allocation table, and click Clear in the Operation column.

In Clear Queue Configuration, click OK to clear the queue configuration in the current resource pool.

NOTE

If no resource capacity policy is configured for a queue, the clearance function is unavailable for the queue by default.

----End

6.13 Backup and Restoration

6.13.1 Introduction

Overview

MRS Manager provides backup and recovery for user data and system data. The backup function is provided on a per-component basis to back up Manager data (including OMS data and LdapServer data), Hive user data, component metadata saved in DBService, and HDFS metadata.

Backup and recovery tasks are performed in the following scenarios:

- Routine backup is performed to ensure the data security of the system and components.
- If the system is faulty, backup data can be used to restore the system.
- If the active cluster is completely faulty, an image cluster identical to the active cluster needs to be created, and backup data can be used to perform restoration operations.


Table 6-25 Backing up metadata

- OMS: Backs up database data (excluding alarm data) and configuration data in the cluster management system.
- LdapServer: Backs up user information, including the username, password, key, password policy, and group information.
- DBService: Backs up metadata of the component (Hive) managed by DBService.
- NameNode: Backs up HDFS metadata.

Table 6-26 Backing up service data of specific components

- HBase: Backs up table-level user data.
- HDFS: Backs up the directories or files that correspond to user services.
- Hive: Backs up table-level user data.

Note that some components do not provide the data backup and recovery functions:

- ZooKeeper data is backed up on ZooKeeper nodes.
- MapReduce and Yarn data is stored in HDFS; therefore, MapReduce and Yarn depend on HDFS to provide the backup and recovery functions.

Principles

Task

Before backup or recovery, you need to create a backup or recovery task and set task parameters, such as the task name, backup data source, and type of directories for saving backup files. When Manager is used to recover the data of HDFS, Hive, and NameNode, the cluster cannot be accessed.

Each backup task can back up different data sources and generates an independent backup file for each data source. All the backup files generated in one task form a backup file set, which can be used in recovery tasks. Backup files can be stored on Linux local disks, in HDFS of the local cluster, and in HDFS of the standby cluster. The backup task provides both full and incremental backup policies. HDFS and Hive backup tasks support the incremental backup policy, while OMS, LdapServer, DBService, and NameNode backup tasks support only the full backup policy.


NOTE

The rules for task execution are as follows:

- If a task is being executed, it cannot be executed repeatedly, and other tasks cannot be started at the same time.
- The interval at which a periodic task is automatically executed must be greater than 120s; otherwise, the task is postponed and will be executed in the next period. Manual tasks can be executed at any interval.
- When a periodic task is to be automatically executed, the current time cannot be 120s later than the task start time; otherwise, the task is postponed and will be executed in the next period.
- When a periodic task is locked, it cannot be automatically executed and needs to be manually unlocked.
- Before an OMS, LdapServer, DBService, or NameNode backup task starts, ensure that the LocalBackup partition on the active management node has more than 20 GB of available space; otherwise, the backup task cannot be started.
- When planning backup and recovery tasks, select the data you want to back up or recover according to the service logic, data storage structure, and database or table association. By default, the system creates the periodic backup task default with an execution interval of 24 hours to perform full backup of OMS, LdapServer, DBService, and NameNode data to the Linux local disk.

Snapshot

The system adopts the snapshot technology to quickly back up data. Snapshots include HDFS snapshots.

An HDFS snapshot is a read-only backup of HDFS at a specified time point. It is used for data backup, misoperation protection, and disaster recovery.

The snapshot function can be enabled for any HDFS directory to create the related snapshot file. Before creating a snapshot for a directory, the system automatically enables the snapshot function for that directory. Snapshot creation does not affect HDFS operations. A maximum of 65,536 snapshots can be created for each HDFS directory.

When a snapshot has been created for an HDFS directory, the directory cannot be deleted and the directory name cannot be modified before the snapshot is deleted. Snapshots cannot be created for the upper-layer directories or subdirectories of that directory.
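As an illustration of the mechanism, the open-source HDFS snapshot commands are shown below; the directory path and snapshot name are hypothetical, and MRS backup tasks create and use snapshots automatically:

# Allow snapshots on a directory, take one, and list the snapshots taken so far.
hdfs dfsadmin -allowSnapshot /user/service-data
hdfs dfs -createSnapshot /user/service-data s-20180906
hdfs dfs -ls /user/service-data/.snapshot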

DistCp

Distributed copy (DistCp) is a tool used to replicate large amounts of data within the HDFS of a cluster or between the HDFS of different clusters. In HBase, HDFS, or Hive backup or recovery tasks, if the data is backed up in HDFS of the standby cluster, the system invokes DistCp to perform the operation. The same version of MRS must be installed in the active and standby clusters.

DistCp uses MapReduce to implement data distribution, troubleshooting, recovery, and reporting. DistCp specifies different Map jobs for the source files and directories in the specified list, and each Map job copies the data of the partition that corresponds to a file in the list.

To use DistCp to replicate data between the HDFS of two clusters, configure the cross-cluster trust relationship and enable the cross-cluster replication function for both clusters.
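For illustration, a typical open-source DistCp invocation between two clusters looks like the sketch below; the NameNode addresses and paths are hypothetical, and in MRS the backup task constructs the equivalent call for you:

# Copy a directory tree to the standby cluster, updating changed files and preserving
# replication, block size, user, group, and permission (-prbugp).
hadoop distcp -update -prbugp hdfs://active-nn:8020/user/hive/warehouse hdfs://standby-nn:8020/backup/hive/warehouse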

Local quick recovery

After using DistCp to back up the HDFS and Hive data of the local cluster to HDFS of the standby cluster, HDFS of the local cluster retains the backup data snapshots. Users can create local quick recovery tasks to recover data by using the snapshot files in HDFS of the local cluster.

Specifications

Table 6-27 Backup and recovery feature specifications

- Maximum number of backup or recovery tasks: 100
- Number of concurrently running tasks: 1
- Maximum number of waiting tasks: 199
- Maximum size of backup files on a Linux local disk: 600 GB

Table 6-28 Specifications of the default task

- Backup period: 1 hour (OMS, LdapServer, DBService, and NameNode)
- Maximum number of backups: 2 (all items)
- Maximum size of a backup file: OMS 10 MB; LdapServer 20 MB; DBService 100 MB; NameNode 1.5 GB
- Maximum size of disk space used: OMS 20 MB; LdapServer 40 MB; DBService 200 MB; NameNode 3 GB
- Save path of backup data: Data save path/LocalBackup/ on the active and standby management nodes

NOTE

The administrator must regularly transfer the backup data of the default task to an external cluster based on the enterprise's O&M requirements.

6.13.2 Backing Up Metadata

Scenario

To ensure the security of metadata, either on a routine basis or before and after performing critical metadata operations (such as capacity expansion and reduction, patch installation, upgrades, or migration), metadata must be backed up. The backup data can be used to recover the system if an exception occurs or if the operation does not achieve the expected result, minimizing the adverse impact on services.

Metadata includes OMS data, LdapServer data, DBService data, and NameNode data. The Manager data to be backed up includes OMS data and LdapServer data.

By default, metadata backup is supported by the default task. Users can create a backup task on MRS Manager to back up metadata. Both automatic and manual backup tasks are supported.

Prerequisites

- A standby cluster for backing up data has been created, and the network is connected. The inbound rules of the security group in the peer cluster have been added to the security group of each cluster to allow all access requests of all ECS protocols and ports in the security groups.
- The backup type, period, policy, and other specifications have been planned.
- The Data save path/LocalBackup/ directories on the active and standby management nodes have sufficient space.

Procedure

Step 1 Create a backup task.

1. On MRS Manager, choose System > Back Up Data.
2. Click Create Backup Task.

Step 2 Set backup policies.

1. Set Name to the name of the backup task.
2. Set Mode to the type of the backup task. Periodic indicates that the backup task is executed periodically, and Manual indicates that the backup task is executed manually. To create a periodic backup task, set the following parameters in addition to the preceding ones:
   - Start Time: indicates the time when the task is started for the first time.
   - Period: indicates the task execution interval. The options include By hour and By day.
   - Backup Policy: indicates the volume of data to be backed up each time the task is started. The options include Full backup at the first time and subsequent incremental backup, Full backup every time, and Full backup once every n times. When the last option is selected, n must be specified.

Step 3 Select backup sources.

Set Configuration to OMS and LdapServer.

Step 4 Set backup parameters.

1. Set Path Type of OMS and LdapServer to a backup directory type. The following backup directory types are supported:
   - LocalDir: indicates that backup files are stored on the local disk of the active management node, and the standby management node automatically synchronizes the backup files. The default save path is Data save path/LocalBackup/. If you select this value, you need to set Max. Number of Backup Copies to specify the number of backup files that can be retained in the backup directory.
   - LocalHDFS: indicates that backup files are stored in the HDFS directory of the current cluster. If you select this value, you need to set the following parameters:
     - Target Path: indicates the backup file save path in HDFS. The save path cannot be a hidden HDFS directory, such as a snapshot or recycle bin directory, or a default system directory.
     - Max. Number of Backup Copies: indicates the number of backup file sets that can be retained in the backup directory.
     - Target Instance Name: indicates the name of the NameService instance that corresponds to the backup directory. The default value is hacluster.

2. Click OK to save the settings.

Step 5 Execute the backup task.

In the Operation column of the created task in the backup task list, click More > Run to execute the backup task.

After the backup task is executed, the system automatically creates a subdirectory for each backup task in the backup directory. The subdirectory is used to save the data source backup files. The format of the subdirectory name is Backup task name_Task creation time. The format of the backup file name is Version_Data source_Task execution time.tar.gz.
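As a hypothetical illustration of this naming scheme (the task name, version string, and timestamps below are made up), a backup task named bk_meta could produce:

bk_meta_20180906100000/V100R002C30_OMS_20180906103000.tar.gz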

----End

6.13.3 Recovering Metadata

Scenario

Metadata needs to be recovered in the following scenarios:

- Data is modified or deleted unexpectedly and needs to be restored.
- After a critical operation (such as an upgrade or critical data adjustment) is performed on metadata components, an exception occurs or the operation does not achieve the expected result, and all modules are faulty and become unavailable.
- Data is migrated to a new cluster.

Users can create a recovery task on MRS Manager to recover metadata. Only manual recovery tasks are supported.


NOTICE

- Data recovery can be performed only when the system version is consistent with that of the data backup.
- Before recovering data when the service is running properly, you are advised to manually back up the latest management data. Otherwise, the metadata that is generated after the data backup and before the data recovery will be lost.
- Use the OMS data and LdapServer data that are backed up at the same point in time to recover the data. Otherwise, the service and operation may fail.
- By default, the MRS cluster uses DBService to save Hive metadata.

Impact on the System

- Data generated between the backup time and the restoration time is lost.
- After the data is recovered, the configuration of the components that depend on DBService may expire, and these components need to be restarted.

Prerequisites

- The data in the OMS and LdapServer backup files has been backed up at the same time.
- The status of the OMS resources and the LdapServer instances is normal. If the status is abnormal, data recovery cannot be performed.
- The status of the cluster hosts and services is normal. If the status is abnormal, data recovery cannot be performed.
- The cluster host topologies during data recovery and data backup are the same. If they are different, data recovery cannot be performed and you need to back up data again.
- The services added to the cluster during data recovery and data backup are the same. If they are different, data recovery cannot be performed and you need to back up data again.
- The status of the active and standby DBService instances is normal. If the status is abnormal, data recovery cannot be performed.
- The upper-layer applications that depend on the MRS cluster have been stopped.
- On MRS Manager, all the NameNode role instances whose data is being recovered have been stopped, while other HDFS role instances keep running. After data is recovered, the NameNode role instances need to be restarted and cannot be accessed before the restart.
- You have checked that the NameNode backup files have been saved in the Data save path/LocalBackup/ directory on the active management node.

Procedure

Step 1 Check the location of the backup data.

1. On MRS Manager, choose System > Back Up Data.
2. In the Operation column of a specified task in the task list, click More > View History to view the records of historical backup tasks. In the window that is displayed, select a record and click View in the Backup Path column to view its backup path information. Find the following information:


   - Backup Object: indicates the data source of the backup data.
   - Backup Path: indicates the full path where the backup files are saved.

3. Select the correct item, and manually copy the full path of backup files in Backup Path.

Step 2 Create a recovery task.

1. On MRS Manager, choose System > Restore Data.
2. Click Create Restoration Task.
3. Set Name to the name of the recovery task.

Step 3 Select recovery sources.

In Configuration, select the components whose metadata is to be recovered.

Step 4 Set recovery parameters.

1. Set Path Type to a backup directory type.
2. The settings vary according to the backup directory type:
   - LocalDir: indicates that backup files are stored on the local disk of the active management node. If you select this value, you need to set Source Path to the full path of the backup file, for example, Data path/LocalBackup/backup task name_task creation time/data source_task execution time/version_data source_task execution time.tar.gz.
   - LocalHDFS: indicates that backup files are stored in the HDFS directory of the current cluster. If you select this value, you need to set the following parameters:
     - Source Path: indicates the full path of the backup file in HDFS, for example, backup path/backup task name_task creation time/version_data source_task execution time.tar.gz.
     - Source Instance Name: indicates the name of the NameService instance that corresponds to the backup directory when the recovery task is executed. The default value is hacluster.

Step 5 Execute the recovery task.

In the Operation column of the created task in the recovery task list, click Start to execute the recovery task.

- If the recovery is successful, the progress bar is green.
- If the recovery is successful, the recovery task cannot be executed again.
- If the recovery task fails during the first execution, rectify the fault and click Start to execute the task again.

Step 6 Determine what metadata has been recovered.

- If OMS and LdapServer metadata has been recovered, go to Step 7.
- If DBService data has been recovered, the task is complete.
- If NameNode data has been recovered, choose Service > HDFS > More > Restart Service on MRS Manager to complete the task.

Step 7 Restart Manager for the recovered data to take effect.

1. On MRS Manager, choose LdapServer > More > Restart Service, click OK, and wait for the LdapServer service to restart.


2. Log in to the active management node. For details, see Viewing Active and Standby Nodes.
3. Run the following command to restart OMS:
   sh ${BIGDATA_HOME}/om-0.0.1/sbin/restart-oms.sh
   The command is executed successfully if the following information is displayed:
   start HA successfully.
4. On MRS Manager, choose KrbServer > More > Synchronize Configuration, deselect Restart services or instances whose configurations have expired, click OK, and wait for the KrbServer service configuration to synchronize and the service to restart.
5. On MRS Manager, choose Service > More > Synchronize Configuration, deselect Restart services or instances whose configurations have expired, click OK, and wait for the cluster configuration to synchronize.
6. Choose Service > More > Stop Cluster. After the cluster has been stopped, choose Service > More > Start Cluster, and wait for the cluster to start.

----End

6.13.4 Modifying a Backup Task

Scenario

Modify the parameters of a created backup task on MRS Manager to meet changing service requirements. The parameters of recovery tasks can be viewed but not modified.

Impact on the System

After a backup task is modified, the new parameters take effect the next time the task is executed.

Prerequisites

- A backup task has been created.
- A new backup task policy has been planned based on the actual situation.

Procedure

Step 1 On MRS Manager, choose System > Back Up Data.

Step 2 In the task list, locate a specified task, and click Configure in the Operation column to go to the configuration modification page.

Step 3 On the page that is displayed, modify the following parameters:

- Start Time
- Period
- Target Path
- Max. Number of Backup Copies

NOTE

After the Target Path parameter of a backup task is modified, the task is performed as a full backup the first time it runs after the modification by default.


Step 4 Click OK to save the settings.

----End

6.13.5 Viewing Backup and Recovery Tasks

Scenario

On MRS Manager, view created backup and recovery tasks and check their running status.

Procedure

Step 1 On MRS Manager, click System.

Step 2 Click Back Up Data or Restore Data.

Step 3 In the task list, obtain the previous task execution result in the Task Progress column. Green indicates that the task is executed successfully, and red indicates that the execution failed.

Step 4 In the Operation column of a specified task in the task list, click More > View History to view the task execution records.

In the displayed window, click View in the Details column of a specified record to display log information about the execution.

----End

Related Tasks

- Modifying a backup task: See Modifying a Backup Task.
- Viewing a recovery task: In the task list, locate a specified task and click View task in the Operation column to view the recovery task. The parameters of recovery tasks can only be viewed, not modified.
- Executing a backup or recovery task: In the task list, locate a specified task and click More > Run or Start in the Operation column to start a backup or recovery task that is ready or failed to be executed. Executed recovery tasks cannot be executed repeatedly.
- Stopping a backup task: In the task list, locate a specified task and click More > Stop in the Operation column to stop a backup task that is running.
- Deleting a backup or recovery task: In the task list, locate a specified task and click More > Delete in the Operation column to delete a backup or recovery task. Backup data is reserved by default after a task is deleted.
- Suspending a backup task: In the task list, locate a specified task and click More > Suspend in the Operation column to suspend a backup task. Only periodic backup tasks can be suspended; suspended backup tasks are no longer executed automatically. If you suspend a backup task that is being executed, the task execution stops. To cancel the suspension of a task, click More > Resume.


6.14 Security Management

6.14.1 List of Default Users

User Classification

The MRS cluster provides the following three types of users.

NOTE

Users are required to periodically change their passwords. It is not recommended to use the default passwords.

- System user: A user used to run OMS system processes.
- Internal system user: An internal user provided by the MRS cluster to implement communication between processes, save user group information, and associate user rights.
- Database user:
  - Used to manage the OMS database and access data.
  - Used to run the databases of service components (Hive, Loader, and DBService).

System Users

NOTE

- User ldap of the OS is required in the MRS cluster. Do not delete this account; otherwise, the cluster may not work properly. Password management policies are maintained by the users.
- Reset the password when you change the passwords of users ommdba and omm for the first time. Change the passwords regularly after you have retrieved them.

- MRS cluster system user: admin. Initial password: MIG2oAMCAQGhAw@IBAaIDAgwCAQGkgZ8@wgZwwVKAHMAWgAw@IBAKFJMEgABD4gA. Default user of MRS Manager; the user is used to record the cluster audit logs.

- MRS cluster node OS user: ommdba. Initial password: randomly generated by the system. User who creates the MRS cluster system database. This user is an OS user generated on the management nodes and does not require a unified password.

- MRS cluster node OS user: omm. Initial password: randomly generated by the system. Internal running user of the MRS cluster system. This user is an OS user generated on all nodes and does not require a unified password.

- MRS cluster node OS user: linux. Initial password: cloud.1234. User used to log in to a node in the MRS cluster. This user is an OS user generated on all nodes. NOTE: This user is applicable only to MRS clusters of versions earlier than 1.6.2.

- MRS cluster node OS user: root. Initial password: set by the user. User used to log in to a node in the MRS cluster. This user is an OS user generated on all nodes. NOTE: This user is applicable only to MRS clusters of 1.6.2 or later.

- User for running MRS cluster jobs: yarn_user. Initial password: randomly generated by the system. Internal user used to run the MRS cluster jobs. This user is generated on Core nodes.

Internal System Users

NOTE

Do not delete the following internal system users. Otherwise, the cluster or services may not work properly.

- Kerberos administrator: kadmin/admin. Initial password: Admin@123. Account used to add, delete, modify, and query users on Kerberos.

- OMS Kerberos administrator: kadmin/admin. Initial password: Admin@123. Account used to add, delete, modify, and query users on OMS Kerberos.

- LDAP administrator: cn=root,dc=hadoop,dc=com. Initial password: LdapChangeMe@123. Account used to add, delete, modify, and query user information on LDAP.

- OMS LDAP administrator: cn=root,dc=hadoop,dc=com. Initial password: LdapChangeMe@123. Account used to add, delete, modify, and query user information on OMS LDAP.

- LDAP user: cn=pg_search_dn,ou=Users,dc=hadoop,dc=com. Initial password: pg_search_dn@123. User used to query information about users and user groups on LDAP.

- OMS LDAP user: cn=pg_search_dn,ou=Users,dc=hadoop,dc=com. Initial password: pg_search_dn@123. User used to query information about users and user groups on OMS LDAP.

- LDAP administrator account: cn=krbkdc,ou=Users,dc=hadoop,dc=com. Initial password: LdapChangeMe@123. Account used to query Kerberos component authentication account information.

- LDAP administrator account: cn=krbadmin,ou=Users,dc=hadoop,dc=com. Initial password: LdapChangeMe@123. Account used to add, delete, or query Kerberos component authentication account information.

- Component running user: oms/manager. Initial password: randomly generated by the system. User used for communication between Master nodes and Core nodes.

- Component running users: check_ker_M, K/M, kadmin/changepw, kadmin/history, and krbtgt/HADOOP.COM. Initial passwords: randomly generated by the system. Kerberos internal functional users. These users cannot be deleted, and their passwords cannot be changed. These internal accounts cannot be used on nodes where the Kerberos service is not installed.

User Group Information

Default user groups:

- supergroup: Primary group of user admin. The primary group does not have additional permissions in a cluster where Kerberos authentication is disabled.
- check_sec_ldap: Used to test whether the active LDAP works properly. This user group is generated randomly in a test and automatically deleted after the test is complete. This is an internal system user group used only between components.
- Manager_tenant_187: Tenant system user group. This is an internal system user group used only between components.
- System_administrator_186: MRS cluster system administrator group. This is an internal system user group used only between components.
- Manager_viewer_183: MRS Manager system viewer group. This is an internal system user group used only between components.
- Manager_operator_182: MRS Manager system operator group. This is an internal system user group used only between components.
- Manager_auditor_181: MRS Manager system auditor group. This is an internal system user group used only between components.
- Manager_administrator_180: MRS Manager system administrator group. This is an internal system user group used only between components.
- compcommon: MRS cluster internal group for accessing public cluster resources. All system users and system running users are added to this user group by default.
- default_1000: Group created for tenants. This is an internal system user group used only between components.
- kafka: Kafka common user group. A user added to this group can access a topic only when a user in the kafkaadmin group grants the user the read and write permission of the topic.
- kafkasuperuser: Users added to this group have the read and write permission on all topics.
- kafkaadmin: Kafka administrator group. Users added to this group have the rights to create, delete, authorize, read, and write all topics.
- storm: Users added to this group can submit topologies and manage their own topologies.
- stormadmin: Users added to this group have the storm administrator rights and can submit topologies and manage all topologies.

OS user groups:

- wheel: Primary group of the MRS internal running user omm.
- ficommon: MRS cluster common group that corresponds to compcommon, used to access public cluster resource files stored in the OS.

Database Users

MRS cluster system database users include OMS database users and DBService database users.

NOTE

Do not delete the following database users. Otherwise, the cluster or services may not work properly.

- OMS database user ommdba. Initial password: dbChangeMe@123456. OMS database administrator who performs maintenance operations, such as creating, starting, and stopping applications.
- OMS database user omm. Initial password: ChangeMe@123456. User for accessing OMS database data.
- DBService database user omm. Initial password: dbserverAdmin@123. Administrator of the GaussDB database in the DBService component.
- DBService database user hive. Initial password: HiveUser@. User for Hive to connect to the DBService database.
- DBService database user hue. Initial password: HueUser@123. User for Hue to connect to the DBService database.
- DBService database user sqoop. Initial password: SqoopUser@. User for Loader to connect to the DBService database.

6.14.2 Changing the Password for an OS User

Scenario

Periodically change the login passwords of the OS users omm and ommdba of the MRS cluster nodes to improve the system O&M security.

The passwords of users omm and ommdba of the nodes can be different.


Procedure

Step 1 Log in to the Master1 node, and then log in to other nodes whose OS user passwords need to be modified.

Step 2 Run the following command to switch to user root:

sudo su - root

Step 3 Run the following command to change the password of user omm or ommdba:

passwd omm

or

passwd ommdba

For example, if you run passwd omm, the system displays the following information:

Changing password for user omm.
New password:

Enter a new password. The password policy for an OS user varies according to the OS in use.

Retype new password:
passwd: all authentication tokens updated successfully.

----End

6.14.3 Changing the Password for User admin

Scenario

Periodically change the password for user admin to improve the system O&M security.

Prerequisites

The client has been updated on the active management node.

Procedure

Step 1 Log in to the active management node.

Step 2 Run the following command to switch the user:

sudo su - omm

Step 3 Run the following command to go to the client directory:

cd /opt/client

Step 4 Run the following command to configure environment variables:

source bigdata_env

Step 5 Run the following command to change the password for user admin. This operation takes effect in the entire cluster.

kpasswd admin

Enter the old password and then enter a new password twice.

For the MRS 1.6.2 or later cluster, the password complexity requirements are as follows:


- The password must contain 8 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse username.

For the MRS 1.5.1 cluster, the password complexity requirements are as follows:

- The password must contain 6 to 32 characters.
- The password must contain at least two types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse username.

NOTE

For MRS clusters of other versions, the password complexity requirements are as follows:
- The password must contain at least eight characters.
- The password must contain at least four types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse username.
- The password cannot be the same as the previous password.

----End

6.14.4 Changing the Password for the Kerberos Administrator

Scenario

Periodically change the password for the Kerberos administrator kadmin of the MRS cluster to improve the system O&M security.

If the user password is changed, the OMS Kerberos administrator password is changed as well.

Prerequisites

A client has been prepared on the Master1 node.

Procedure

Step 1 Log in to the Master1 node.

Step 2 Run the following command to switch the user:

sudo su - omm

Step 3 Run the following command to go to client directory /opt/client.

cd /opt/client

Step 4 Run the following command to configure the environment variables:

source bigdata_env

Step 5 Run the following command to change the password for kadmin/admin. The password change takes effect on all servers.


kpasswd kadmin/admin

For the MRS 1.6.2 or later cluster, the password complexity requirements are as follows:

- The password must contain 8 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse username.

For the MRS 1.5.1 cluster, the password complexity requirements are as follows:

- The password must contain 6 to 32 characters.
- The password must contain at least two types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse username.

NOTE

For MRS clusters of other versions, the password complexity requirements are as follows:
- The password must contain at least eight characters.
- The password must contain at least four types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse username.
- The password cannot be the same as the previous password.

----End

6.14.5 Changing the Password for the OMS Kerberos Administrator

Scenario

Periodically change the password for the OMS Kerberos administrator kadmin of the MRS cluster to improve the system O&M security.

If the user password is changed, the Kerberos administrator password is changed as well.

Prerequisites

A client has been prepared on the Master1 node.

Procedure

Step 1 Log in to the Master1 node.

Step 2 Run the following command to switch the user:

sudo su - omm

Step 3 Run the following command to go to the related directory:

cd ${BIGDATA_HOME}/om-0.0.1/meta-0.0.1-SNAPSHOT/kerberos/scripts

Step 4 Run the following command to configure the environment variables:

source component_env

Step 5 Run the following command to change the password for kadmin/admin. The password change takes effect on all servers.

kpasswd kadmin/admin

For the MRS 1.6.2 or later cluster, the password complexity requirements are as follows:

- The password must contain 8 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse of the username.

For the MRS 1.5.1 cluster, the password complexity requirements are as follows:

- The password must contain 6 to 32 characters.
- The password must contain at least two types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse of the username.

NOTE

For MRS clusters of other versions, the password complexity requirements are as follows:

- The password must contain at least eight characters.
- The password must contain at least four types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the previous password.

----End

6.14.6 Changing the Password for the LDAP Administrator and the LDAP User (including OMS LDAP)

Scenario

Periodically change the passwords for the LDAP administrator rootdn:cn=root,dc=hadoop,dc=com and the LDAP user pg_search_dn:cn=pg_search_dn,ou=Users,dc=hadoop,dc=com of the MRS cluster to improve system O&M security.

If the LDAP administrator password is changed, the OMS LDAP administrator and user password is changed as well.

Impact on the System

All services need to be restarted for the new password to take effect. Services are unavailable during the restart.

Procedure

Step 1 On MRS Manager, choose Service > LdapServer > More.

Step 2 Click Change Password.

Step 3 In the Change Password dialog box, select the user whose password you want to change from User Information.

Step 4 In the Change Password dialog box, enter the old password in Old Password and the new password in New Password and Confirm Password.

The password complexity requirements are as follows:

- The password must contain 16 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, and special characters, which can only be `~!@#$%^&*()-_=+\|[{}];:'",<.>/?
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the previous password.

Step 5 Select I have read the information and understand the impact, and click OK to confirm the password change and restart the service.

----End

6.14.7 Changing the Password for a Component Running User

Scenario

Periodically change the password for each component running user of the MRS cluster to improve system O&M security.

If the initial password is randomly generated by the system, reset the initial password.

Impact on the System

The initial password of a component running user is randomly generated by the system and needs to be changed. After the password is changed, the MRS cluster needs to be restarted, during which services are temporarily interrupted.

Prerequisites

A client has been prepared on the Master1 node.

Procedure

Step 1 Log in to the Master1 node.

Step 2 Run the following command to switch the user:

sudo su - omm

Step 3 Run the following command to go to the client directory, for example, /opt/client.

cd /opt/client

Step 4 Run the following command to configure the environment variables:

source bigdata_env

Step 5 Run the following command to log in to the console as kadmin/admin:

kadmin -p kadmin/admin

Step 6 Run the following command to change the password of an internal system user. The password change takes effect on all servers.

cpw <component running user>

For example: cpw oms/manager
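A representative session is shown below (MIT Kerberos kadmin prompts; the exact wording can vary with the Kerberos version, and HADOOP.COM is the default realm name):

kadmin: cpw oms/manager
Enter password for principal "oms/manager@HADOOP.COM":
Re-enter password for principal "oms/manager@HADOOP.COM":
Password for "oms/manager@HADOOP.COM" changed.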

For the MRS 1.6.2 or later cluster, the password complexity requirements are as follows:

- The password must contain 8 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse of the username.

For the MRS 1.5.1 cluster, the password complexity requirements are as follows:

- The password must contain 6 to 32 characters.
- The password must contain at least two types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse of the username.

NOTE

For MRS clusters of other versions, the password complexity requirements are as follows:

- The password must contain at least eight characters.
- The password must contain at least four types of the following: lowercase letters, uppercase letters, digits, spaces, and special characters, which can only be ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the previous password.

----End

6.14.8 Changing the Password for the OMS Database Administrator

Scenario

Periodically change the password for the OMS database administrator to ensure system O&M security.

Procedure

Step 1 Log in to the active management node.

NOTE

The password of user ommdba cannot be changed on the standby management node; otherwise, the cluster cannot work properly. Change the password of user ommdba on the active management node only.

Step 2 Run the following command to switch the user:

sudo su - root

su - omm

Step 3 Run the following command to go to the related directory:

cd $OMS_RUN_PATH/tools

Step 4 Run the following command to change the password for user ommdba:

mod_db_passwd ommdba

Step 5 Enter the old password of user ommdba and enter a new password twice. The password change takes effect on all servers.

The password complexity requirements are as follows:

- The password must contain 16 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, and special characters, which can only be ~`!@#$%^&*()-+_=\|[{}];:",<.>/?
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the last 20 historical passwords.

If the following information is displayed, the password is changed successfully.

Congratulations, update [ommdba] password successfully.

----End

6.14.9 Changing the Password for the Data Access User of the OMS Database

Scenario

Periodically change the password for the OMS data access user to ensure system O&M security.

Impact on the System

The OMS service needs to be restarted for the new password to take effect. The service is unavailable during the restart.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Change OMS Database Password.

Step 3 Locate the row that contains user omm and click Change password in Operation to change the password for the OMS database user.

The password complexity requirements are as follows:

- The password must contain 8 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, and special characters, which can only be ~`!@#$%^&*()-+_=\|[{}];:",<.>/?
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the last 20 historical passwords.

Step 4 Click OK. After Operation succeeded is displayed, click Finish.

Step 5 Locate the row that contains user omm and click Restart the OMS service in Operation to restart the OMS database.

NOTE

If you do not restart the OMS database after changing the password, the status of user omm changes to Waiting to restart. In this state, you cannot change its password again until the OMS database is restarted.

Step 6 In the dialog box that is displayed, select I have read the information and understand the impact, click OK, and then restart the OMS service.

----End

6.14.10 Changing the Password for a Component Database User

Scenario

Periodically change the password for each component database user to improve system O&M security.

Impact on the System

The services need to be restarted for the new password to take effect. Services are unavailable during the restart.

Procedure

Step 1 On MRS Manager, click Service, and then click the name of the service whose database user is to be modified.

Step 2 Determine the component database user whose password is to be changed.

- To change the password for the DBService database user, go to Step 3.
- To change the password for the Hive, Hue, or Loader database user, stop the service first, and then go to Step 3. Click Stop Service to stop the service.

Step 3 Choose More > Change Password.

Step 4 In the displayed window, enter the old and new passwords as prompted.

The password complexity requirements are as follows:

- The password for a DBService database user must contain 16 to 32 characters; the password for a Hive, Hue, or Loader database user must contain 8 to 32 characters.
- The password must contain at least three types of the following: lowercase letters, uppercase letters, digits, and special characters, which can only be ~`!@#$%^&*()-+_=\|[{}];:",<.>/?
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the last 20 historical passwords.

Step 5 Click OK. The system automatically restarts the service. After Operation succeeded is displayed, click Finish.

----End

6.14.11 Replacing HA Certificates

Scenario

HA certificates are used to encrypt the communication between active/standby processes and high availability processes to ensure security. Replace the HA certificates on the active and standby management nodes on MRS Manager to ensure product security.

The certificate file and key file can be generated by users.

Impact on the System

The MRS Manager system must be restarted during the replacement and cannot be accessed or provide services during this period.

Prerequisites

- You have obtained the root-ca.crt root file and the root-ca.pem key file of the certificate to be replaced.
- You have prepared a password, for example, Userpwd@123, for accessing the key file. The password must meet the following complexity requirements; otherwise, security risks may be incurred:
  – The password must contain at least eight characters.
  – The password must contain at least four types of the following: uppercase letters, lowercase letters, digits, and special characters ~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=.

Procedure

Step 1 Log in to the active management node.

Step 2 Run the following commands to switch the user:

sudo su - root

su - omm

Step 3 Run the following command to generate root-ca.crt and root-ca.pem in the ${OMS_RUN_PATH}/workspace0/ha/local/cert directory:

sh ${OMS_RUN_PATH}/workspace/ha/module/hacom/script/gen-cert.sh --root-ca --country=country --state=state --city=city --company=company --organize=organize --common-name=commonname --email=Administrator email address --password=password

For example, run the following command to generate the files:

sh ${OMS_RUN_PATH}/workspace/ha/module/hacom/script/gen-cert.sh --root-ca --country=CN --state=gd --city=sz --company=hw --organize=IT --common-name=HADOOP.COM [email protected] --password=Userpwd@123

If the following information is displayed, the command is executed successfully:

Generate root-ca pair success.
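Optionally, before distributing the files, you can inspect the generated root certificate with the standard openssl tool (assumed to be available on the node) to confirm its subject and validity period:

openssl x509 -in ${OMS_RUN_PATH}/workspace0/ha/local/cert/root-ca.crt -noout -subject -dates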

Step 4 On the active management node, run the following command as user omm to copy root-ca.crt and root-ca.pem to the ${BIGDATA_HOME}/om-0.0.1/security/certHA directory:

cp -arp ${OMS_RUN_PATH}/workspace0/ha/local/cert/root-ca.* ${BIGDATA_HOME}/om-0.0.1/security/certHA

Step 5 Copy root-ca.crt and root-ca.pem generated on the active management node to ${BIGDATA_HOME}/om-0.0.1/security/certHA on the standby management node as user omm.
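For example, you can perform the copy with scp; <standby_node_IP> below is a placeholder that you must replace with the actual IP address of the standby management node:

scp -p ${BIGDATA_HOME}/om-0.0.1/security/certHA/root-ca.* omm@<standby_node_IP>:${BIGDATA_HOME}/om-0.0.1/security/certHA/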

Step 6 Run the following command to generate an HA certificate and perform automatic replacement:

sh ${BIGDATA_HOME}/om-0.0.1/sbin/replacehaSSLCert.sh

Enter password as prompted and press Enter.

Please input ha ssl cert password:

If the following information is displayed, the HA certificate is replaced successfully:

[INFO] Succeed to replace ha ssl cert.

Step 7 Run the following command to restart OMS.

sh ${BIGDATA_HOME}/om-0.0.1/sbin/restart-oms.sh

The following information is displayed:

start HA successfully.

Step 8 Log in to the standby management node and switch to user omm. Repeat Step 6 to Step 7.

Run the sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh command to check whether HAAllResOK of the management node is Normal. Access MRS Manager again. If MRS Manager can be accessed, the operation is successful.

----End

6.14.12 Updating the Key of a Cluster

Scenario

When a cluster is created, the system automatically generates an encryption key to store the security information in the cluster (such as all database user passwords and key file access passwords) in encryption mode. After a cluster is successfully installed, it is advised to regularly update the encryption key based on the following procedure.

Impact on the System

- After a cluster key is updated, a new key is randomly generated in the cluster. This key is used to encrypt and decrypt the newly stored data. The old key is not deleted; it is used to decrypt the old encrypted data. After security information is modified, for example, a database user password is changed, the new password is encrypted using the new key.
- When the key is updated in a cluster, the cluster must be stopped and cannot be accessed.

Prerequisites

You have stopped the upper-layer service applications that depend on the cluster.

Procedure

Step 1 On MRS Manager, choose Service > More > Stop Cluster.

Select I have read the information and understand the impact in the displayed window, and click OK. After Operation succeeded is displayed, click Finish. The cluster is stopped.

Step 2 Log in to the active management node.

Step 3 Run the following command to switch the user:

sudo su - omm

Step 4 Run the following command to disable logout upon timeout:

TMOUT=0

Step 5 Run the following command to switch the directory:

cd ${BIGDATA_HOME}/om-0.0.1/tools

Step 6 Run the following command to update the cluster key:

sh updateRootKey.sh

Enter y as prompted.

The root key update is a critical operation.Do you want to continue?(y/n):

If the following information is displayed, the key is updated successfully.

...Step 4-1: The key save path is obtained successfully....Step 4-4: The root key is sent successfully.

Step 7 On MRS Manager, choose Service > More > Start Cluster.

In the confirmation dialog box, click OK to start the cluster. After Operation succeeded is displayed, click Finish. The cluster is started.

----End

6.15 Patch Operation Guide

6.15.1 Patch Operation Guide for Versions Earlier than MRS 1.7.0

If you obtain patch information from the following sources, install the patch based on actual requirements:

- You obtain information about a patch released by MRS from a message pushed by the message center service.
- You obtain information about a patch by accessing the cluster and viewing the patch information.

Preparing for Patch Installation

- Follow instructions in Performing a Health Check to check the cluster status. If the cluster health status is normal, install the patch.
- The administrator has uploaded the cluster patch package to the server. For details, see Uploading the Patch Package.
- Confirm the target patch to be installed according to the patch information in the patch content.

Uploading the Patch Package

Step 1 Access MRS Manager. For details, see Accessing MRS Manager.

Step 2 Choose System > Manage Patch. The Manage Patch page is displayed.

Step 3 Click Upload Patch and set the following parameters.

- Patch File Path: folder created in the OBS bucket where the patch package is stored, for example, MRS_1.6.2/MRS_1_6_2_11.tar.gz
- Bucket: name of the OBS bucket where the patch package is stored, for example, mrs_patch

NOTE

You can obtain the bucket name and the patch file path on the Patch Information tab page. The value of Patch Path is in the following format: [Bucket name]/[Patch file path].
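For example, with the sample values above, Patch Path would read mrs_patch/MRS_1.6.2/MRS_1_6_2_11.tar.gz.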

- AK: For details, see My Credential > User Guide > How Do I Manage Access Keys?.
- SK: For details, see My Credential > User Guide > How Do I Manage Access Keys?.

Step 4 Click OK to upload the patch.

----End

Installing a Patch

Step 1 Access MRS Manager. For details, see Accessing MRS Manager.

Step 2 Choose System > Manage Patch. The Manage Patch page is displayed.

Step 3 In the Operation column, click Install.

Step 4 In the displayed dialog box, click OK to install the patch.

Step 5 After the patch is installed, you can view the installation status in the progress bar. If the installation fails, contact the administrator.

NOTE

For the isolated host nodes in the cluster, follow instructions in Restoring Patches for the Isolated Hosts to restore the patch.

----End

Uninstalling a Patch

Step 1 Access MRS Manager. For details, see Accessing MRS Manager.

Step 2 Choose System > Manage Patch. The Manage Patch page is displayed.

Step 3 In the Operation column, click Uninstall.

NOTE

For the isolated host nodes in the cluster, follow instructions in Restoring Patches for the Isolated Hosts to restore the patch.

----End

6.15.2 Patch Operation Guide for MRS 1.7.0 or Later

If you obtain patch information from the following sources, install the patch based on actual requirements:

- You obtain information about a patch released by MRS from a message pushed by the message center service.
- You obtain information about a patch by accessing the cluster and viewing the patch information.

Preparing for Patch Installation

- Follow instructions in Performing a Health Check to check the cluster status. If the cluster health status is normal, install the patch.
- Confirm the target patch to be installed according to the patch information in the patch content.

Installing a Patch

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster and click the name of the cluster to be queried to enter the page displaying the cluster's basic information.

Step 3 On the Patch Information page, click Install in the Operation column to install the target patch.

NOTE

For the isolated host nodes in the cluster, follow instructions in Restoring Patches for the Isolated Hosts to restore the patch.

----End

Uninstalling a Patch

Step 1 Log in to the MRS management console.

Step 2 Choose Cluster > Active Cluster and click the name of the cluster to be queried to enter the page displaying the cluster's basic information.

Step 3 On the Patch Information page, click Uninstall in the Operation column to uninstall the target patch.

NOTE

For the isolated host nodes in the cluster, follow instructions in Restoring Patches for the Isolated Hosts to restore the patch.

----End

6.16 Restoring Patches for the Isolated Hosts

If some hosts are isolated in a cluster, perform the following operations to restore patches for these isolated hosts after patch installation on the other hosts in the cluster. After patch restoration, the versions of the isolated host nodes are consistent with those of the nodes that are not isolated.

Step 1 Access MRS Manager. For details, see Accessing MRS Manager or Accessing MRS Manager Supporting Kerberos Authentication.

Step 2 Choose System > Manage Patch. The Manage Patch page is displayed.

Step 3 In the Operation column, click View Details.

Step 4 On the patch details page, select host nodes whose Status is Isolated.

Step 5 Click Select and Restore to restore the isolated host nodes.

Figure 6-2 Restoring patches for the isolated hosts

----End

7 Management of Clusters with Kerberos Authentication Enabled

7.1 Users and Permissions of Clusters with Kerberos Authentication Enabled

Overview

- MRS Cluster Users
  Security accounts of MRS Manager, including usernames and passwords. These accounts are used to access resources in MRS clusters. Each MRS cluster in which Kerberos authentication is enabled can have multiple users.
- MRS Cluster Roles
  Before using resources in an MRS cluster, users must obtain the access permission. The access permission is defined by MRS cluster objects. A cluster role is a set of one or more permissions. For example, the permission to access a directory in HDFS needs to be configured in the specified directory and saved in a role.

MRS Manager provides the user permission management function for MRS clusters, facilitating permission and user management.

- Permission management: adopts the role-based access control (RBAC) mode. In this mode, permissions are granted by role, forming a permission set. After one or more roles are allocated to a user, the user can obtain the permissions of the roles.
- User management: uses MRS Manager to uniformly manage users, adopts the Kerberos protocol for user identity verification, and employs the Lightweight Directory Access Protocol (LDAP) to store user information.

Permission Management

Permissions provided by MRS clusters include the O&M permissions of MRS Manager and components (such as HDFS, HBase, Hive, and Yarn). In actual application, permissions must be assigned to each user based on service scenarios. To facilitate permission management, MRS Manager introduces the role function to allow administrators to select and assign specified permissions. Permissions are centrally viewed and managed in permission sets, enhancing user experience.

A role is a logical entity that contains one or more permissions. Permissions are assigned to roles, and users can be granted the permissions by obtaining the roles.

A role can have multiple permissions, and a user can be bound to multiple roles.

- Role 1 is assigned operation permissions A and B. After role 1 is allocated to users a and b, users a and b obtain operation permissions A and B.
- Role 2 is assigned operation permission C. After role 2 is allocated to users c and d, users c and d obtain operation permission C.
- Role 3 is assigned operation permissions D and F. After role 3 is allocated to user a, user a obtains operation permissions D and F.

For example, if an MRS user is bound to the administrator role, the user is an administrator of the MRS cluster.

Table 7-1 lists the roles that are created by default on MRS Manager.

Table 7-1 Default roles and description

- default: tenant role.
- Manager_administrator: Manager administrator. This role has the permission to manage MRS Manager.
- Manager_auditor: Manager auditor. This role has the permission to view and manage auditing information.
- Manager_operator: Manager operator. This role has all permissions except tenant, configuration, and cluster management permissions.
- Manager_viewer: Manager viewer. This role has the permission to view the information about systems, services, hosts, alarms, and audit logs.
- System_administrator: System administrator. This role has the permissions of Manager administrators and all service administrators.
- Manager_tenant: Manager tenant viewer. This role has the permission to view information on the Tenant page on MRS Manager.

When creating a role on MRS Manager, you can perform permission management for MRS Manager and components, as described in Table 7-2.

Table 7-2 Manager and component permission management

- Manager: Manager access and login permission.
- HBase: HBase administrator permission and the permission for accessing HBase tables and column families.
- HDFS: HDFS directory and file permission.
- Hive:
  – Hive Admin Privilege: Hive administrator permission.
  – Hive Read Write Privileges: Hive data table management permission, which is the operation permission to set and manage the data of created tables.
- Hue: storage policy administrator rights.
- Yarn:
  – Cluster Admin Operations: Yarn administrator permission.
  – Scheduler Queue: queue resource management permission.

User Management

MRS clusters that support Kerberos authentication use the Kerberos protocol and LDAP for user management.

- Kerberos verifies the identity of a user when the user logs in to MRS Manager or uses a component client. Identity verification is not required for clusters with Kerberos authentication disabled.
- LDAP is used to store user information, including user records, user group information, and permission information.

MRS clusters can automatically update Kerberos and LDAP user data when users are created or modified on MRS Manager. They can also automatically perform user identity verification and authentication and obtain user information when a user logs in to MRS Manager or uses a component client. This ensures the security of user management and simplifies user management tasks. MRS Manager also provides the user group function for managing one or more users by type:

- A user group is a set of users. Users in the system can exist independently or in a user group.
- After a user is added to a user group to which roles are allocated, the role permissions of the user group are assigned to the user.

The following table lists the user groups that are created by default on MRS Manager.

Table 7-3 Default user groups and description

- hadoop: Users added to this user group have the permission to submit tasks to all Yarn queues.
- hbase: Common user group. Users added to this user group will not have any additional permission.
- hive: Users added to this user group can use Hive.
- supergroup: Users added to this user group have the administrator rights of HBase, HDFS, and Yarn and can use Hive.
- flume: Common user group. Users added to this user group will not have any additional permission.
- kafka: Kafka common user group. A user added to this user group can access a topic only when a user in the kafkaadmin group grants the read and write permission of the topic to the user.
- kafkasuperuser: Users added to this user group have the read and write permission of all topics.
- kafkaadmin: Kafka administrator group. Users added to this user group have the rights to create, delete, authorize, read, and write all topics.
- storm: Users added to this user group can submit topologies and manage their own topologies.
- stormadmin: Users added to this user group have the Storm administrator rights and can submit topologies and manage all topologies.

User admin is created by default for MRS clusters with Kerberos authentication enabled and is used by administrators to maintain the clusters.

Process Overview

In practice, administrators must understand the service scenarios of MRS clusters and plan user permissions. Then, create roles and assign permissions to the roles on MRS Manager to meet service requirements. Administrators can create user groups on MRS Manager to manage users in one or more service scenarios of the same type.

NOTE

If a role has the permission of HDFS, HBase, Hive, or Yarn, the role can use the corresponding functions of the component. To use MRS Manager, the corresponding Manager permission must be added to the role.

Figure 7-1 Process of creating a user

7.2 Default Users of Clusters with Kerberos Authentication Enabled

User Classification

The MRS cluster provides the following three types of users. Users are required to periodically change their passwords. It is not recommended to use the default passwords.

- System user:
  – A user created on MRS Manager for MRS cluster O&M and service scenarios. There are two types of such users:
    Human-machine user: used for MRS Manager O&M scenarios and component client operation scenarios.
    Machine-machine user: used for MRS cluster application development scenarios.
  – A user used to run OMS system processes.
- Internal system user: an internal user provided by the MRS cluster and used to implement communication between processes, save user group information, and associate user rights.
- Database user:
  – A user used to manage the OMS database and access data.
  – A user used to run the databases of service components (Hive, Hue, Loader, and DBService).

System Users

NOTE

- User ldap of the OS is required in the MRS cluster. Do not delete this account; otherwise, the cluster may not work properly. Password management policies are maintained by the users.
- Reset the password when you change the passwords of user ommdba and user omm for the first time. Change the passwords regularly after you have retrieved them.

- admin (MRS cluster system user; initial password specified by the user when the cluster is created): administrator of MRS Manager. This user also has the following rights:
  – Common HDFS and ZooKeeper user rights.
  – Rights to submit and query MapReduce and Yarn tasks, to manage Yarn queues, and to access the Yarn WebUI.
  – Rights to submit, query, activate, deactivate, reassign, and delete topologies, and to operate all topologies of the Storm service.
  – Rights to create, delete, authenticate, reassign, consume, write, and query topics of the Kafka service.
- ommdba (MRS cluster node OS user; initial password randomly generated by the system): user who creates the MRS cluster system database. This user is an OS user generated on the management nodes and does not require a unified password.
- omm (MRS cluster node OS user; initial password randomly generated by the system): internal running user of the MRS cluster system. This user is an OS user generated on all nodes and does not require a unified password.
- yarn_user (user for running MRS cluster jobs; initial password randomly generated by the system): internal user used to run MRS cluster jobs. This user is generated on Core nodes.

Internal System Users

NOTE

Do not delete the following internal system users. Otherwise, the cluster or services may not work properly.

- Kerberos administrator: kadmin/admin (initial password: KAdmin@123). Account that is used to add, delete, modify, and query users on Kerberos.
- OMS Kerberos administrator: kadmin/admin (initial password: KAdmin@123). Account that is used to add, delete, modify, and query users on OMS Kerberos.
- LDAP administrator: cn=root,dc=hadoop,dc=com (initial password: LdapChangeMe@123). Account that is used to add, delete, modify, and query the user information on LDAP.
- OMS LDAP administrator: cn=root,dc=hadoop,dc=com (initial password: LdapChangeMe@123). Account that is used to add, delete, modify, and query the user information on OMS LDAP.
- LDAP user: cn=pg_search_dn,ou=Users,dc=hadoop,dc=com (initial password: pg_search_dn@123). User that is used to query information about users and user groups on LDAP.
- OMS LDAP user: cn=pg_search_dn,ou=Users,dc=hadoop,dc=com (initial password: pg_search_dn@123). User that is used to query information about users and user groups on OMS LDAP.
- LDAP administrator accounts:
  – cn=krbkdc,ou=Users,dc=hadoop,dc=com (initial password: LdapChangeMe@123). Account that is used to query Kerberos component authentication account information.
  – cn=krbadmin,ou=Users,dc=hadoop,dc=com (initial password: LdapChangeMe@123). Account that is used to add, delete, or query Kerberos component authentication account information.
- User for querying the MRS cluster: executor (initial password randomly generated by the system). User that is used to query clusters with Kerberos authentication enabled on the MRS management console.

Component running users (unless an initial password is listed below, the initial passwords of these users are randomly generated by the system):

- hdfs (initial password: Hdfs@123): HDFS system administrator who has the following permissions:
  1. File system operation permissions:
     – Views, modifies, and creates files.
     – Views and creates directories.
     – Views and modifies the groups where files belong.
     – Views and sets disk quotas of users.
  2. HDFS management operation permissions:
     – Views the WebUI status.
     – Views and sets the active and standby HDFS status.
     – Enters and exits HDFS safe mode.
     – Checks HDFS.
- hbase (initial password: Hbase@123): HBase system administrator who has the following permissions:
  – Cluster management permission: enables and disables tables, and triggers MajorCompact and Access Control List (ACL) operations.
  – Grants and reclaims permissions, and shuts down the cluster.
  – Table management permission: creates, modifies, and deletes tables.
  – Data management permission: reads and writes table-, column family-, and column-level data.
  – Accesses the HBase WebUI.
- mapred (initial password: Mapred@123): MapReduce system administrator who has the following permissions:
  – Submits, stops, and views MapReduce tasks.
  – Modifies the Yarn configuration parameters.
  – Accesses the Yarn and MapReduce WebUIs.
- spark (initial password: Spark@123): Spark system administrator who has the following permissions:
  – Accesses the Spark WebUI.
  – Submits Spark tasks.
- oms/manager: Controller and NodeAgent authentication user who has the permission of supergroup.
- backup/manager: user who runs backup and recovery tasks and has the permission of supergroup.
- hdfs/hadoop.hadoop.com: HDFS system startup user who has the same file system and HDFS management operation permissions as user hdfs.
- mapred/hadoop.hadoop.com: MapReduce system startup user who can submit, stop, and view MapReduce tasks and modify the Yarn configuration parameters.
- mr_zk/hadoop.hadoop.com: user used for MapReduce to access ZooKeeper.
- hbase/hadoop.hadoop.com: user used for the authentication between internal components during the HBase system startup.
- hbase/zkclient.hadoop.com: user used for HBase to perform ZooKeeper authentication in a cluster in security mode.
- thrift/hadoop.hadoop.com: ThriftServer system startup user.
- thrift/<hostname>: user used for the ThriftServer system to access HBase. This user has the permission to read, write, execute, create, and manage all HBase NameSpaces and tables. <hostname> specifies the host name of the node where ThriftServer is installed.
- hive/hadoop.hadoop.com: user used for the authentication between internal components during the Hive system startup. The user has the following permissions:
  1. Hive administrator permissions: creates, deletes, and modifies databases; creates, queries, modifies, and deletes tables; queries, inserts, and loads data.
  2. HDFS file operation permissions: views, modifies, and creates files; views and creates directories; views and modifies the groups where files belong.
  3. Submits and stops MapReduce jobs.
- spark/hadoop.hadoop.com: Spark system startup user.
- spark_zk/hadoop.hadoop.com: user used for Spark to access ZooKeeper.
- zookeeper/hadoop.hadoop.com: ZooKeeper system startup user.
- zkcli/hadoop.hadoop.com: ZooKeeper server login user.
- kafka/hadoop.hadoop.com: user used for security authentication for Kafka.
- storm/hadoop.hadoop.com: Storm system startup user.
- storm_zk/hadoop.hadoop.com: user for the Worker process to access ZooKeeper.
- loader/hadoop.hadoop.com: user for Loader system startup and Kerberos authentication.
- HTTP/<hostname>: used to connect to the HTTP interface of each component. <hostname> indicates the name of the node in the cluster.
- flume: user for Flume system startup and HDFS and Hive access. The user has the read and write permission of the HDFS directory /flume.
- check_ker_M, K/M, kadmin/changepw, kadmin/history, krbtgt/HADOOP.COM: Kerberos internal functional users. These users cannot be deleted, and their passwords cannot be changed. These internal accounts cannot be used on the nodes where the Kerberos service is not installed.

User Group Information

Default user groups and description:

- hadoop: Users added to this user group have the permission to submit tasks to all Yarn queues.
- hbase: Common user group. Users added to this user group will not have any additional rights.
- hive: Users added to this user group can use Hive.
- spark: Common user group. Users added to this user group will not have any additional rights.
- supergroup: Users added to this user group have the administrator rights of HBase, HDFS, and Yarn and can use Hive.
- check_sec_ldap: Used to test whether the active LDAP works properly. This user group is generated randomly in a test and automatically deleted after the test is complete. This is an internal system user group used only between components.
- Manager_tenant_187: Tenant system user group. This is an internal system user group used only between components.
- System_administrator_186: MRS cluster system administrator group. This is an internal system user group used only between components.
- Manager_viewer_183: MRS Manager system viewer group. This is an internal system user group used only between components.
- Manager_operator_182: MRS Manager system operator group. This is an internal system user group used only between components.
- Manager_auditor_181: MRS Manager system auditor group. This is an internal system user group used only between components.
- Manager_administrator_180: MRS Manager system administrator group. This is an internal system user group used only between components.
- compcommon: MRS cluster internal group for accessing public cluster resources. All system users and system running users are added to this user group by default.
- default_1000: Group created for tenants. This is an internal system user group used only between components.
- kafka: Kafka common user group. A user added to this user group can access a topic only when a user in the kafkaadmin group grants the read and write permission of the topic to the user.
- kafkasuperuser: Users added to this user group have the read and write permission of all topics.
- kafkaadmin: Kafka administrator group. Users added to this user group have the rights to create, delete, authorize, read, and write all topics.
- storm: Users added to this user group can submit topologies and manage their own topologies.
- stormadmin: Users added to this user group have the Storm administrator rights and can submit topologies and manage all topologies.

OS user groups and description:

- wheel: Primary group of the MRS internal running user omm.
- ficommon: MRS cluster common group, corresponding to compcommon, for accessing public cluster resource files stored in the OS.

Database Users

MRS cluster system database users include OMS database users and DBService database users.

NOTE

Do not delete the following database users. Otherwise, the cluster or services may not work properly.

OMS database users:

- ommdba (initial password: dbChangeMe@123456): OMS database administrator who performs maintenance operations, such as creating, starting, and stopping applications.
- omm (initial password: ChangeMe@123456): user used for accessing OMS database data.

DBService database users:

- omm (initial password: dbserverAdmin@123): administrator of the GaussDB database in the DBService component.
- hive (initial password: HiveUser@): user used for Hive to connect to the DBService database.
- hue (initial password: HueUser@123): user used for Hue to connect to the DBService database.
- sqoop (initial password: SqoopUser@): user used for Loader to connect to the DBService database.

7.3 Creating a Role

Scenario

This section describes how to create a role on MRS Manager and authorize and manage Manager and components.

Up to 1000 roles can be created on MRS Manager.

Prerequisites

You have learned service requirements.

Procedure

Step 1 On MRS Manager, choose System > Manage Role.

Step 2 Click Create Role and fill in Role Name and Description.

Role Name is mandatory and can contain 3 to 30 characters, including digits, letters, and underscores (_). Description is optional.

Step 3 In Permission, set role permission.

1. Click Service Name and select a name in View Name.

2. Select one or more permissions.

NOTE

- The Permission parameter is optional.
- If you select View Name to set component permissions, you can enter a resource name in the Search box in the upper-right corner and perform a search. The search result is displayed.
- The search scope covers only directories with current permissions. You cannot search subdirectories. Search by keywords supports fuzzy match and is case-insensitive. Results of the next page can be searched.

Table 7-4 Manager permission description

- Alarm: authorizes the Manager alarm function. You can select View to view alarms and Management to manage alarms.
- Audit: authorizes the Manager audit log function. You can select View to view audit logs and Management to manage audit logs.
- Dashboard: authorizes the Manager overview function. You can select View to view the cluster overview.
- Hosts: authorizes the node management function. You can select View to view node information and Management to manage nodes.
- Services: authorizes the service management function. You can select View to view service information and Management to manage services.
- System_cluster_management: authorizes the MRS cluster management function. You can select Management to use the MRS patch management function.
- System_configuration: authorizes the MRS cluster configuration function. You can select Management to configure MRS clusters on Manager.
- System_task: authorizes the MRS cluster task function. You can select Management to manage periodic tasks of MRS clusters on Manager.
- Tenant: authorizes the Manager multi-tenant management function. You can select Management to view the Manager tenant management page.

Table 7-5 HBase permission description

- SUPER_USER_GROUP: grants you HBase administrator rights.
- Global: HBase resource type, indicating the whole HBase.
- Namespace: HBase resource type, indicating a namespace, which is used to store HBase tables. It has the following permissions:
  – Admin: permission to manage the namespace.
  – Create: permission to create HBase tables in the namespace.
  – Read: permission to access the namespace.
  – Write: permission to write data to the namespace.
  – Execute: permission to execute the coprocessor (Endpoint).
- Table: HBase resource type, indicating a data table, which is used to store data. It has the following permissions:
  – Admin: permission to manage a data table.
  – Create: permission to create column families and columns in a data table.
  – Read: permission to read a data table.
  – Write: permission to write data to a data table.
  – Execute: permission to execute the coprocessor (Endpoint).
- ColumnFamily: HBase resource type, indicating a column family, which is used to store data. It has the following permissions:
  – Create: permission to create columns in a column family.
  – Read: permission to read a column family.
  – Write: permission to write data to a column family.
- Qualifier: HBase resource type, indicating a column, which is used to store data. It has the following permissions:
  – Read: permission to read a column.
  – Write: permission to write data to a column.

Permissions of an HBase resource type of each level are shared by resource types of sub-levels by default. However, the Recursive option is not selected by default. For example, if Read and Write permissions are added to the default namespace, they are automatically added to the tables, column families, and columns in the namespace. If you manually set a child resource after setting the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource.

Table 7-6 HDFS permission description

- Folder: HDFS resource type, indicating an HDFS directory, which is used to store files or subdirectories. It has the following permissions:
  – Read: permission to access the HDFS directory.
  – Write: permission to write data to the HDFS directory.
  – Execute: permission to perform an operation. It must be selected when you add the access or write permission.
- Files: HDFS resource type, indicating a file in HDFS. It has the following permissions:
  – Read: permission to access the file.
  – Write: permission to write data to the file.
  – Execute: permission to perform an operation. It must be selected when you add the access or write permission.

Permissions of an HDFS directory of each level are not shared by resource types of sub-levels by default. For example, if Read and Execute permissions are added to the tmp directory, you must select Recursive at the same time to add permissions to subdirectories.
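These Folder and Files permissions correspond to the standard HDFS mode bits, so you can verify the effective permissions from a cluster client. A minimal sketch, assuming a client installed at /opt/client and an already authenticated user; /tmp is just an example path:

source /opt/client/bigdata_env
hdfs dfs -ls -d /tmp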

Table 7-7 Hive permission description

- Hive Admin Privilege: grants you Hive administrator rights.
- Database: Hive resource type, indicating a Hive database, which is used to store Hive tables. It has the following permissions:
  – Select: permission to query the Hive database.
  – Delete: permission to perform the deletion operation in the Hive database.
  – Insert: permission to perform the insertion operation in the Hive database.
  – Create: permission to perform the creation operation in the Hive database.
- Table: Hive resource type, indicating a Hive table, which is used to store data. It has the following permissions:
  – Select: permission to query the Hive table.
  – Delete: permission to perform the deletion operation in the Hive table.
  – Update: grants users the Update permission of the Hive table.
  – Insert: permission to perform the insertion operation in the Hive table.
  – Grant of Select: permission to grant the Select permission to other users using Hive statements.
  – Grant of Delete: permission to grant the Delete permission to other users using Hive statements.
  – Grant of Update: permission to grant the Update permission to other users using Hive statements.
  – Grant of Insert: permission to grant the Insert permission to other users using Hive statements.

Permissions of a Hive resource type of each level are shared by resource types of sub-levels by default. However, the Recursive option is not selected by default. For example, if Select and Insert permissions are added to the default database, they are automatically added to the tables and columns in the database. If you manually set a child resource after setting the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource.

Table 7-8 Yarn permission description

- Cluster Admin Operations: grants you Yarn administrator rights.
- root: root queue of Yarn. It has the following permissions:
  – Submit: permission to submit jobs in the queue.
  – Admin: permission to manage permissions of the current queue.
- Parent Queue: Yarn resource type, indicating a parent queue containing sub-queues. A root queue is a type of parent queue. It has the following permissions:
  – Submit: permission to submit jobs in the queue.
  – Admin: permission to manage permissions of the current queue.
- Leaf Queue: Yarn resource type, indicating a leaf queue. It has the following permissions:
  – Submit: permission to submit jobs in the queue.
  – Admin: permission to manage permissions of the current queue.

Permissions of a Yarn resource type of each level are shared by resource types of sub-levels by default. However, the Recursive option is not selected by default. For example, if the Submit permission is added to the root queue, it is automatically added to the sub-queues. Permissions inherited by sub-queues are not displayed as selected in the Permission table. If you manually set a child resource after setting the parent resource, the permission of the child resource is the union of the permissions of the parent resource and the current child resource.

Table 7-9 Hue permission description

- Storage Policy Admin: grants you storage policy administrator rights.

Step 4 Click OK. Return to Manage Role.

----End

Related Tasks

Modifying a role

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage Role.

Step 3 In the row of the role to be modified, click Modify to modify role information.

NOTE

If you change permissions assigned to the role, it takes 3 minutes for the new configurations to take effect.

Step 4 Click OK. The modification is complete.

----End

Deleting a role

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage Role.

Step 3 In the row of the role to be deleted, click Delete.

Step 4 Click OK. The role is deleted.

----End

7.4 Creating a User Group

Scenario

This section describes how to create user groups and specify their operation permissions on MRS Manager. Management of single or multiple users can be unified in user groups. After being added to a user group, users obtain the operation permissions owned by the user group.

Up to 100 user groups can be created on MRS Manager.

Prerequisites

Administrators have learned service requirements and created roles required by service scenarios.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User Group.

Step 3 Above the user group list, click Create User Group.

Step 4 Set Group Name and Description.

Group Name is mandatory and can contain 3 to 20 characters, including digits, letters, and underscores (_). Description is optional.

Step 5 In Role, click Select and Add Role to select and add specified roles.

If you do not add the roles, the user group you are creating does not have the permission to use MRS clusters.

Step 6 Click OK. The user group is created.

----End

Related Tasks

Modifying a user group

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User Group.

Step 3 In the row of a user group to be modified, click Modify.

NOTE

If you change role permissions assigned to the user group, it takes 3 minutes for the new configurations to take effect.

Step 4 Click OK. The modification is complete.

----End

Deleting a user group

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User Group.

Step 3 In the row of the user group to be deleted, click Delete.

Step 4 Click OK. The user group is deleted.

----End

7.5 Creating a User

Scenario

This section describes how to create users on MRS Manager based on site requirements and specify their operation permissions to meet service requirements.

Up to 1000 users can be created on MRS Manager.

Prerequisites

Administrators have learned the service requirements and created the roles and role groups required by the service scenarios.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User.

Step 3 Above the user list, click Create User.

Step 4 Configure parameters as prompted and enter a username in Username.

NOTE

- If a username exists, you cannot create another username that differs from the existing one only in case. For example, if User1 has been created, you cannot create user1.
- When you use the user you created, enter the username exactly; it is case-sensitive.
- Username is mandatory and must contain 3 to 20 characters, including digits, letters, and underscores (_).
- root, omm, and ommdba are reserved system users. Select another username.

Step 5 Set User Type to either Human-machine or Machine-machine.
- Human-machine users: used for O&M on MRS Manager and operations on component clients. If you select this user type, enter a password in Password and confirm it in Confirm Password.
- Machine-machine users: used for MRS application development. If you select this user type, you do not need to enter a password, because the password is randomly generated.

Step 6 In User Group, click Select and Join User Group to select user groups and add the user to them.


NOTE

- If roles have been added to a user group, its users are granted the permissions of those roles.
- To grant a new user Hive permissions, add the user to the Hive group.
- To manage tenant resources, assign the Manager_tenant role and the role corresponding to the tenant to the user group.

Step 7 In Primary Group, select a group as the primary group for the user to create directories and files. The drop-down list contains all groups selected in User Group.

Step 8 In Assign Rights by Role, click Select and Add Role to add roles for the user based on onsite service requirements.

NOTE

- When you create a user, if the permissions of the user groups granted to the user cannot meet service requirements, you can assign other created roles to the user. It takes about 3 minutes for role permissions granted to a new user to take effect.
- Adding a role when creating a user lets you specify the user's rights directly.
- A new user can access the web UIs of HDFS, HBase, Yarn, Spark, and Hue even when no roles are assigned to the user.

Step 9 In Description, provide a description based on onsite service requirements.

Description is optional.

Step 10 Click OK. The user is created.

When a new user is used in the MRS cluster for the first time, for example, to log in to MRS Manager or to use the cluster client, the password must be changed. For details, see section "Changing the Password of an Operation User".

----End

7.6 Modifying User Information

Scenario

This section describes how to modify user information on MRS Manager, including the user group, primary group, roles, and description.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User.

Step 3 In the row of a user to be modified, click Modify.

NOTE

If you change the user's user groups or assign role permissions to the user, it takes at most 3 minutes for the new configuration to take effect.

Step 4 Click OK. The modification is complete.

----End


7.7 Locking a User

Scenario

This section describes how to lock users in MRS clusters. A locked user can neither log in to MRS Manager nor perform security authentication in the cluster.

A locked user can be unlocked manually by an administrator or automatically when the lock duration expires. You can lock a user in either of the following ways:

- Automatic lock: Set Number of Password Retries in Configure Password Policy. If the number of failed login attempts exceeds this value, the user is automatically locked. For details, see Modifying a Password Policy.
- Manual lock: The administrator manually locks a user.

The following describes how to manually lock a user. Machine-machine users cannot be locked.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User.

Step 3 In the row of a user you want to lock, click Lock User.

Step 4 In the window that is displayed, click OK to lock the user.

----End

7.8 Unlocking a User

Scenario

If a user's failed login attempts exceed the value of Number of Password Retries and the user is locked, the administrator can unlock the user on MRS Manager.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User.

Step 3 In the row of a user you want to unlock, choose Unlock User.

Step 4 In the window that is displayed, click OK to unlock the user.

----End


7.9 Deleting a User

Scenario

If an MRS cluster user is not required, the administrator can delete the user on MRS Manager.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User.

Step 3 In the row of the user to be deleted, choose More > Delete.

Step 4 Click OK.

----End

7.10 Changing the Password of an Operation User

Scenario

Passwords of Human-machine system users must be changed regularly to ensure MRS cluster security. This section describes how to change your password on MRS Manager.

Impact on the System

If you have downloaded a user authentication file, download it again and obtain the keytab file after changing the password of the MRS cluster user.

Prerequisites
- You have obtained the current password policy from the administrator.
- You have obtained the URL for accessing MRS Manager from the administrator.

Procedure

Step 1 On MRS Manager, move the mouse cursor to the icon in the upper-right corner.

On the menu that is displayed, select Change Password.

Step 2 Fill in Old Password, New Password, and Confirm Password. Click OK.

For clusters of MRS 1.6.2 or later, the password complexity requirements are as follows:
- The password must contain 8 to 32 characters.
- The password must contain at least three of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.


For the MRS 1.5.1 cluster, the password complexity requirements are as follows:
- The password must contain 6 to 32 characters.
- The password must contain at least two of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.

NOTE

For MRS clusters of other versions, the password complexity requirements are as follows:
- The password must contain at least eight characters.
- The password must contain at least four of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the previous password.

----End

7.11 Initializing the Password of a System User

Scenario

This section describes how to initialize a password on MRS Manager if a user forgets the password or the password of a public account needs to be changed regularly. After password initialization, the user must change the password upon the first login.

Impact on the System

If you have downloaded a user authentication file, download it again and obtain the keytab file after initializing the password of the MRS cluster user.

Initializing the Password of a Human-machine User

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User.

Step 3 In the row of the user whose password is to be initialized, choose More > Initialize Password and change the password as prompted.

In the window that is displayed, enter the password of the current administrator account and click OK. Then, in Initialize Password, click OK.

For clusters of MRS 1.6.2 or later, the password complexity requirements are as follows:
- The password must contain 8 to 32 characters.
- The password must contain at least three of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.

For the MRS 1.5.1 cluster, the password complexity requirements are as follows:
- The password must contain 6 to 32 characters.


- The password must contain at least two of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.

NOTE

For MRS clusters of other versions, the password complexity requirements are as follows:
- The password must contain at least eight characters.
- The password must contain at least four of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the previous password.

----End

Initializing the Password of a Machine-machine User

Step 1 Prepare a client based on service conditions and log in to the node with the client installed.

Step 2 Run the following command to switch the user:

sudo su - omm

Step 3 Run the following command to switch to the client directory, for example, /opt/client:

cd /opt/client

Step 4 Run the following command to configure the environment variables:

source bigdata_env

Step 5 Run the following command to log in to the console as user kadmin/admin:

kadmin -p kadmin/admin

Step 6 Run the following command to reset the password of a component running user. This operation takes effect on all servers.

cpw Component running user name

For example, cpw oms/manager.
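Putting the preceding steps together, a minimal session looks as follows; /opt/client and oms/manager are the example path and principal used above:

sudo su - omm                 # switch to the cluster OS user
cd /opt/client                # client installation directory (example path)
source bigdata_env            # load the client environment variables
kadmin -p kadmin/admin        # log in to the Kerberos console; enter the kadmin/admin password when prompted
# At the kadmin prompt, reset the component running user's password:
#   cpw oms/manager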

For the MRS 1.5.1 cluster, the password complexity requirements are as follows:
- The password must contain 6 to 32 characters.
- The password must contain at least two of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.


NOTE

For MRS clusters of other versions, the password complexity requirements are as follows:
- The password must contain at least eight characters.
- The password must contain at least four of the following character types: lowercase letters, uppercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=).
- The password cannot be the same as the username or the reverse of the username.
- The password cannot be the same as the previous password.

----End

7.12 Downloading a User Authentication File

Scenario

When a user develops big data applications and runs them in an MRS cluster that supports Kerberos authentication, the user needs to prepare a Machine-machine user authentication file for accessing the MRS cluster. The keytab file in the authentication file can be used for user authentication.

This section describes how to download a Machine-machine user authentication file and export the keytab file on MRS Manager.

NOTE

Before you download a Human-machine user authentication file, change the password for the user on MRS Manager to invalidate the initial password set by the administrator. Otherwise, the exported keytab file cannot be used. For details, see Changing the Password of an Operation User.

Procedure

Step 1 On MRS Manager, click System.

Step 2 In the Permission area, click Manage User.

Step 3 In the row of the user for whom you want to export the keytab file, choose More > Download authentication credential to download the authentication file. After the file is automatically generated, save it to a specified path and keep it secure.

Step 4 Open the authentication file with a decompression program.
- user.keytab is the user keytab file, used for user authentication.
- krb5.conf is the configuration file of the authentication server. The application connects to the authentication server according to this file when authenticating users.
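For reference, the two extracted files can be used with the standard MIT Kerberos client tools; a minimal sketch, assuming the files were extracted to /opt/auth and a hypothetical Machine-machine user named developuser:

export KRB5_CONFIG=/opt/auth/krb5.conf        # point the Kerberos tools at the downloaded server configuration
kinit -kt /opt/auth/user.keytab developuser   # authenticate with the keytab instead of a password
klist                                         # verify that a ticket-granting ticket was obtained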

----End


7.13 Modifying a Password Policy

Scenario

This section describes how to set password and user login security rules as well as user lock rules. Password policies set on MRS Manager take effect for Human-machine users only, because the passwords of Machine-machine users are randomly generated.

NOTICE

Password policies involve user management security, so modify them based on service security requirements. Otherwise, security risks may be introduced.

Procedure

Step 1 On MRS Manager, click System.

Step 2 Click Configure Password Policy.

Step 3 Modify password policies as prompted. For parameter details, see Table 7-10.

Table 7-10 Password policy parameter description

Minimum Password Length: The minimum number of characters a password must contain. The value ranges from 6 to 32. The default value is 6.

Number of Character Types: The minimum number of character types a password must contain. The character types are uppercase letters, lowercase letters, digits, spaces, and special characters (~`!?,.:;-_'(){}[]/<>@#$%^&*+|\=). The value can be 4 or 5. The default value is 2, which means a password must contain at least two of these character types.

Password Validity Period (days): The validity period of a password, in days. The value ranges from 0 to 90; 0 means the password is permanently valid. The default value is 90.

Password Expiration Notification Days: Used to notify the user of password expiration in advance. Once this value is set, if the difference between the password expiration time and the cluster time is smaller than this value, the user receives password expiration notifications, and a message asking the user to change the password is displayed when the user logs in to MRS Manager. The value ranges from 0 to X, where X is half the password validity period rounded down (for example, 45 when the validity period is 90 days). The value 0 indicates that no notification is sent. The default value is 5.

Interval of Resetting Authentication Failure Count (min): The interval, in minutes, for which incorrect password attempts are retained. The value ranges from 0 to 1440; 0 means incorrect attempts are retained permanently, and 1440 means they are retained for one day. The default value is 5.

Number of Password Retries: The number of consecutive wrong passwords allowed before the system locks the user. The value ranges from 3 to 30. The default value is 5.

Account Lock Duration (min): The period, in minutes, for which a user is locked once the lockout conditions are met. The value ranges from 5 to 120. The default value is 5.

----End

7.14 Configuring Cross-Cluster Mutual Trust Relationships

Scenario

If two clusters, both with Kerberos authentication enabled, need to access each other's resources, the administrator must configure mutual trust relationships between the clusters.

If no trust relationship is configured, the resources of a cluster are available only to the users in that cluster. MRS automatically assigns each cluster a unique domain name that defines the scope of resources for its users.


Impact on the System
- After cross-cluster mutual trust is configured, the resources of a cluster become available to users in the other cluster. Check user permissions in the clusters regularly based on service and security requirements.
- After cross-cluster mutual trust is configured, KrbServer must be restarted, and the cluster is unavailable during the restart.
- After cross-cluster mutual trust is configured, the internal users krbtgt/Local cluster domain name@External cluster domain name and krbtgt/External cluster domain name@Local cluster domain name are added to both clusters. These internal users cannot be deleted. Their default password is Admin@123.

Prerequisites
- Kerberos authentication is enabled for both clusters. For example, two analysis clusters with Kerberos authentication enabled have been created.
- Both clusters are in the same VPC and subnet.

Procedure

Step 1 On the MRS management console, query all security groups of the two clusters.
- If the security groups of the two clusters are the same, go to Step 3.
- If the security groups of the two clusters are different, go to Step 2.

Step 2 On the VPC management console, add rules for each security group.

Set Protocol to ANY, Transfer Direction to Inbound, and Source to Security Group. The source is the security group of the peer cluster.

Step 3 Log in to MRS Manager of each cluster. Click Service and check whether the Health Status of all components is Good.
- If yes, go to Step 4.
- If no, contact technical support personnel for troubleshooting.

Step 4 Query configuration information.

1. On MRS Manager of the two clusters, choose Service > KrbServer > Instance. QueryOM IP Address of the two KerberosServer hosts.

2. Click Service Configuration. Set Type to All. Choose KerberosServer > Port in thenavigation tree on the left. Query the value of kdc_ports. The default value is 21732.

3. Click Realm and query the value of default_realm.

Step 5 On MRS Manager of either cluster, modify the peer_realms parameter.

Table 7-11 Parameter description

realm_name: The default_realm of the peer cluster, obtained in Step 4.

ip_port: The KDC address of the peer cluster, in the format IP address of a KerberosServer node in the peer cluster:kdc_port. The addresses of the two KerberosServer nodes are separated by a comma. For example, if the IP addresses of the KerberosServer nodes are 10.0.0.1 and 10.0.0.2, the value of this parameter is 10.0.0.1:21732,10.0.0.2:21732.

NOTE

- To deploy trust relationships with multiple clusters, click the add button to add items and set the relevant parameters. To delete an item, click the delete button.
- A cluster can have trust relationships with a maximum of 16 clusters. By default, no trust relationship exists between the different clusters trusted by a local cluster.
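For example, if the peer cluster's default_realm is HADOOP1.COM and its two KerberosServer nodes are 10.0.0.1 and 10.0.0.2 listening on the default kdc_port 21732 (all values hypothetical), set realm_name to HADOOP1.COM and ip_port to 10.0.0.1:21732,10.0.0.2:21732.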

Step 6 Click Save Configuration. In the dialog box that is displayed, select Restart the affected services or instances and click OK. If you do not select this option, manually restart the affected services or instances.

After Operation succeeded is displayed, click Finish.

Step 7 Exit MRS Manager and log in to it again. If the login is successful, the configuration is valid.

Step 8 Log in to MRS Manager of the other cluster and repeat Step 5 to Step 7.

----End

Follow-up Operations

After you configure the cross-cluster mutual trust relationship, the service configurations are modified and the service is restarted on MRS Manager. You need to prepare the client configuration file and update the client again.

Scenario 1:

If cluster A and cluster B (the peer and mutually trusted clusters) are of the same type, for example, both analysis clusters or both streaming clusters, follow the instructions in Updating the Client to update the client configuration files of both clusters:
- Update the client configuration file of cluster A.
- Update the client configuration file of cluster B.

Scenario 2:

If cluster A and cluster B (the peer and mutually trusted clusters) are of different types, perform the following operations to update their clients:
- Update the client configuration file of cluster A to cluster B.
- Update the client configuration file of cluster B to cluster A.
- Update the client configuration file of cluster A.


- Update the client configuration file of cluster B.

Step 1 Log in to MRS Manager of cluster A.

Step 2 Click Service and then Download Client.

Step 3 In Client Type, select Only configuration files.

Step 4 In Download Path, select Remote host.

Step 5 Set Host IP Address to the EIP of cluster B, Host Port to 22, and Save Path to /home/linux.
- If the default SSH port 22 of cluster B has been changed, set Host Port to the new port.
- The value of Save Path contains a maximum of 256 characters.

Step 6 For clusters of versions earlier than MRS 1.6.2, set Login User to linux. For clusters of MRS 1.6.2 or later, set Login User to root.

If another user is used, ensure that the user has read, write, and execute permissions on the save path.

Step 7 In SSH Private Key, select the key file that is used when cluster B is created and upload it.

Step 8 Click OK to generate a client file.

If the following information is displayed, the client file is successfully saved. Then click Close.

Client files downloaded to the remote host successfully.

Step 9 Log in to an ECS in cluster B using VNC. For details, see Getting Started > Logging In to an ECS > Logging In to a Linux ECS Using VNC in the Elastic Cloud Server User Guide.
- For clusters of versions earlier than MRS 1.6.2, all images support Cloud-Init. The preset username of Cloud-Init is linux and the default password is cloud.1234. If you have changed the password, log in to the ECS by following the instructions in FAQs > Login FAQs > How Do I Log In to an ECS Once All Images Support Cloud-Init? in the Elastic Cloud Server User Guide. You are advised to change the password upon the initial login.
- For clusters of MRS 1.6.2 or later, all images support Cloud-Init. The preset username for Cloud-Init is root and the password is the one you set during cluster creation. For details, see FAQs > Login FAQs > How Do I Log In to an ECS Once All Images Support Cloud-Init?.

Step 10 Run the following commands to switch to user root and copy the installation package to the /opt directory:

sudo su - root

cp /home/linux/MRS_Services_Client.tar /opt

Step 11 Run the following command to go to the client directory:

cd /opt/client

Step 12 Run the following command to update the client configuration of cluster A on cluster B:

sh refreshConfig.sh Client installation directory Complete path of the client configuration file package

The following provides an example.


sh refreshConfig.sh /opt/client /opt/MRS_Services_Client.tar

If the following information is displayed, client configurations are successfully updated.

ReFresh components client config is complete.
Succeed to refresh components client config.

Step 13 Repeat Step 1 to Step 12 to update the client configuration file of cluster B to cluster A.

Step 14 Follow the instructions in Updating the Client to update the client configuration files of the local clusters:
- Update the client configuration file of cluster A.
- Update the client configuration file of cluster B.

----End

7.15 Configuring Users to Access Resources of a Trusted Cluster

Scenario

After cross-cluster mutual trust is configured, permissions must be configured for users in the local cluster so that these users can access the same resources in the peer cluster as the peer cluster's own users.

Prerequisites

The mutual trust relationship has been configured between two clusters (clusters A and B). The clients of both clusters have been updated.

Procedure

Step 1 Log in to MRS Manager of cluster A and choose System > Manage User. Check whether cluster A has accounts that are the same as those of cluster B.
- If yes, go to Step 2.
- If no, go to Step 3.

Step 2 Click the expand icon on the left of the username to show detailed user information. Check the user groups and roles of the accounts to ensure that they have the same permissions as the accounts of cluster B.

For example, user admin of cluster A has the permission to access and create files in the /tmp directory of cluster A. Then go to Step 4.

Step 3 Create the accounts in cluster A and bind them to the user groups and roles required by the services. Then go to Step 4.

Step 4 Choose Service > HDFS > Instance. Query OM IP Address of NameNode(Active).

Step 5 Log in to the client of cluster B.

For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see Client Management.


Step 6 Run the following command to access the /tmp directory of cluster A:

hdfs dfs -ls hdfs://192.168.6.159:9820/tmp

In the preceding command, 192.168.6.159 is the IP address of the active NameNode of cluster A, and 9820 is the default port for communication between the client and the NameNode.

NOTE

For MRS 1.6.2 or earlier, the default port is 25000. For details, see List of Open Source Component Ports.

Step 7 Run the following command to create a file in the /tmp directory of cluster A:

hdfs dfs -touchz hdfs://192.168.6.159:9820/tmp/mrstest.txt

If you can query the mrstest.txt file in the /tmp directory of cluster A, the cross-cluster mutual trust is configured successfully.
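To double-check the result, you can list the newly created file, reusing the NameNode address from this example:

hdfs dfs -ls hdfs://192.168.6.159:9820/tmp/mrstest.txt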

----End


8 Using MRS

8.1 Accessing the UI of the Open Source Component

8.1.1 List of Open Source Component Ports

Table 8-1 Common ports

Each row lists: Sub-system | Parameter | Default port (MRS 1.6.2 or earlier) | Default port (MRS 1.7.0 or later) | Port description

NOTE

The port value ranges given in this table are only suggestions specified in the product; they are not limited in code. Most of these ports are enabled by default during installation and remain enabled after security hardening.

HBase | hbase.master.port | 21300 | 16000 | HMaster RPC port. Used by the HBase client to connect to HMaster.


HBase | hbase.master.info.port | 21301 | 16010 | HMaster HTTPS port. Used by a remote web client to connect to the HMaster UI.

HBase | hbase.regionserver.port | 21302 | 16020 | RegionServer RPC port. Used by the HBase client to connect to RegionServer.

HBase | hbase.regionserver.info.port | 21303 | 16030 | RegionServer HTTPS port. Used by a remote web client to connect to the RegionServer UI.


HBase | hbase.thrift.info.port | 21304 | 9095 | ThriftServer monitoring port of ThriftServer. Used to monitor client connections.

HBase | hbase.regionserver.thrift.port | 21305 | 9090 | ThriftServer monitoring port of RegionServer. Used to monitor connections between the client and RegionServer.

HBase | hbase.rest.info.port | 21308 | 8085 | Port of the native web UI of the RegionServer RESTServer.

HBase | - | 21309 | 21309 | REST port of the RegionServer RESTServer.


HDFS | dfs.namenode.rpc.port | 25000 | 9820 (the open source default is 8020 for versions earlier than 3.0.0) | NameNode RPC port. Used for communication between the HDFS client and NameNode, and for the connection between DataNode and NameNode.

HDFS | dfs.namenode.http.port | 25002 | 9870 (the open source default is 50070 for versions earlier than 3.0.0) | HDFS HTTP port (NameNode). Used for point-to-point NameNode checkpoint operations, and for a remote web client to connect to the NameNode UI.


HDFS | dfs.namenode.https.port | 25003 | 9871 (the open source default is 50470 for versions earlier than 3.0.0) | HDFS HTTPS port (NameNode). Used for point-to-point NameNode checkpoint operations, and for a remote web client to connect to the NameNode UI.

HDFS | dfs.datanode.ipc.port | 25008 | 9867 (the open source default is 50020 for versions earlier than 3.0.0) | DataNode IPC server port. Used by the client to connect to DataNode to perform RPC operations.


HDFS | dfs.datanode.port | 25009 | 9866 (the open source default is 50010 for versions earlier than 3.0.0) | DataNode data transfer port. Used for data transfer between the HDFS client and DataNode, and for point-to-point data transfer between DataNodes.

HDFS | dfs.datanode.http.port | 25010 | 9864 (the open source default is 50075 for versions earlier than 3.0.0) | DataNode HTTP port. Used by a remote web client to connect to the DataNode UI in security mode.


HDFS | dfs.datanode.https.port | 25011 | 9865 (the open source default is 50475 for versions earlier than 3.0.0) | DataNode HTTPS port. Used by a remote web client to connect to the DataNode UI in security mode.

HDFS | dfs.journalnode.rpc.port | 25012 | 8485 | JournalNode RPC port. Used for client communication to access various types of information.

HDFS | dfs.journalnode.http.port | 25013 | 8480 | JournalNode HTTP port. Used by a remote web client to connect to JournalNode in security mode.


HDFS | dfs.journalnode.https.port | 25014 | 8481 | JournalNode HTTPS port. Used by a remote web client to connect to JournalNode in security mode.

HDFS | HTTPFS_HTTP_PORT | 25018 | 14000 | HttpFS HTTP server listening port. Used by a remote REST API to connect to HttpFS.

HDFS | HTTPFS_ADMIN_PORT | 25020 | 14001 | HttpFS ADMIN server listening port. Used by a remote REST API to connect to HttpFS.


HDFS | dfs.datanode.http.address.ext | 25016 | 25016 | DataNode HTTP address extension port. Used by a remote web client to connect to the DataNode UI in security mode.

HDFS | HTTPFS_HTTPS_PORT | 25019 | 25019 | HttpFS HTTPS server listening port. Used by a remote REST API to connect to HttpFS.

Hive | templeton.port | 21055 | 50111 | Port for WebHCat to provide REST services. Used for communication between WebHCat clients and the WebHCat server.


Hive | hive.server2.thrift.port | 21066 | 10000 | Port for HiveServer to provide Thrift services. Used for communication between HiveServer clients and HiveServer.

Hive | hive.metastore.port | 21088 | 9083 | Port for MetaStore to provide Thrift services. Used for communication between the MetaStore client and MetaStore, that is, between HiveServer and MetaStore.

Hue | HTTP_PORT | 21200 | 8888 | Port for Hue to provide HTTPS services. Used to provide web services over HTTPS. This parameter can be modified.

Kafka | port | 21005 | 9092 | Port for Broker to receive and retrieve data.

Kafka | ssl.port | 21008 | 9093 | SSL port for Broker to receive and retrieve data.

Kafka | sasl.port | 21007 | 21007 | Port for Broker to provide SASL security authentication and secure Kafka services.


Kafka | sasl-ssl.port | 21009 | 21009 | Port for Broker to provide SASL security authentication and SSL communication, that is, security authentication plus encrypted communication.

Loader | LOADER_HTTPS_PORT | 21351 | 21351 | Port providing REST APIs for configuring and running Loader jobs.

Manager | - | 8080 | 8080 | Port provided by WebService for user access. Used to access the web UI.

Manager | - | 28443 | 28443 | Port provided by WebService for user access. Used to access the web UI.


MapReduce | mapreduce.jobhistory.webapp.port | 26012 | 19888 | Web HTTP port of the JobHistory server. Used to view the JobHistory server web page.

MapReduce | mapreduce.jobhistory.port | 26013 | 10020 | JobHistory server port. Used by the MapReduce client to restore task data, and by the Job client to obtain task reports.


MapReduce | mapreduce.jobhistory.webapp.https.port | 26014 | 19890 | Web HTTPS port of the JobHistory server. Used to view the JobHistory server web page.

Spark2.1.0 | hive.server2.thrift.port | 22550 | 22550 | JDBC Thrift port. Used for socket communication between the Spark2.1.0 CLI/JDBC client and the Spark2.1.0 CLI/JDBC server. If hive.server2.thrift.port is occupied, a port occupation exception is thrown.


Spark2.1.0 | spark.ui.port | 22950 | 4040 | JDBC web UI port. Used for HTTPS/HTTP communication between web requests and the JDBCServer web UI server. The system obtains the port from this parameter and checks its validity; if the port is invalid, the port number is increased by 1 repeatedly, up to 16 times, until a valid port is obtained. The number of retries can be configured using spark.port.maxRetries.

Spark2.1.0 | spark.history.ui.port | 22500 | 18080 | JobHistory web UI port. Used for HTTPS/HTTP communication between web requests and the Spark2.1.0 History Server. The same port validity check and retry behavior as spark.ui.port applies, configurable using spark.port.maxRetries.

Storm | nimbus.thrift.port | 29200 | 6627 | Port for Nimbus to provide Thrift services.

Storm | supervisor.slots.ports | 29200-29499 | 6700, 6701, 6702, 6703 | Ports for receiving requests forwarded from other servers.


Storm | logviewer.port | 29248 | 8000 | Port for Logviewer to provide HTTPS services.

Storm | ui.port | 29280 | 29280 | Port for the Storm UI to provide HTTP services.

Storm | ui.port | 29243 | 29243 | Port for the Storm UI to provide HTTPS services.

YARN | yarn.resourcemanager.webapp.port | 26000 | 8088 | Web HTTP port of the ResourceManager service.

YARN | yarn.resourcemanager.webapp.https.port | 26001 | 8090 | Web HTTPS port of the ResourceManager service. Used to access the ResourceManager web application in security mode.

YARN | yarn.nodemanager.webapp.address | 26006 | 8042 | NodeManager web HTTP port.

YARN | yarn.nodemanager.webapp.https.port | 26010 | 8044 | NodeManager web HTTPS port. Used to access NodeManager web applications in security mode.


ZooKeeper | clientPort | 24002 | 2181 | ZooKeeper client port. Used by the ZooKeeper client to connect to the ZooKeeper server.

Kerberos | kdc_ports | 21732 | 21732 | Kerberos server port. Used for Kerberos authentication; this parameter may also be used when configuring cross-cluster mutual trust relationships.
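As a quick connectivity check from a node inside the cluster network, you can probe one of these web ports with curl; a sketch assuming an MRS 1.7.0 or later cluster with Kerberos authentication disabled and a hypothetical active ResourceManager IP of 192.168.0.10:

curl -s http://192.168.0.10:8088/cluster | head   # ResourceManager web UI on its default HTTP port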

8.1.2 Overview

Scenario

By default, websites of different components are created and hosted on the Master or Core nodes in the MRS cluster. You can view component information on these websites. For security reasons, the websites can be accessed only from within the cluster's network and are not published on the Internet. Common users can access the websites by creating an ECS with a graphical user interface (GUI) in that network.

If you do not want to create an extra ECS, you can turn to technical experts or development engineers, who can use the dynamic port forwarding function of an SSH channel to give you access to the websites.


Websites

Table 8-2 Clusters with Kerberos authentication disabled

All Types:
- MRS Manager: https://Floating IP address of MRS Manager:28443/web
  NOTE: Remotely log in to the Master2 node and run the ifconfig command. In the command output, eth0:wsom is the floating IP address of MRS Manager. Record the actual value of inet.

Analysis cluster:
- HDFS NameNode:
  - MRS 1.6.3 or earlier: http://IP address of the active NameNode role instance:25002/dfshealth.html#tab-overview
  - MRS 1.7.0 or later: http://IP address of the active NameNode role instance:9870/dfshealth.html#tab-overview
- HBase HMaster:
  - MRS 1.6.3 or earlier: https://IP address of the active HMaster role instance:21301/master-status
  - MRS 1.7.0 or later: https://IP address of the active HMaster role instance:16010/master-status
- MapReduce JobHistoryServer:
  - MRS 1.6.3 or earlier: http://IP address of the JobHistoryServer role instance:26012/jobhistory
  - MRS 1.7.0 or later: http://IP address of the JobHistoryServer role instance:19888/jobhistory
- YARN ResourceManager:
  - MRS 1.6.3 or earlier: http://IP address of the active ResourceManager role instance:26000/cluster
  - MRS 1.7.0 or later: http://IP address of the active ResourceManager role instance:8088/cluster
- Spark JobHistory:
  - MRS 1.6.3 or earlier: http://IP address of the JobHistory role instance:22500/
  - MRS 1.7.0 or later: http://IP address of the JobHistory role instance:18080/
  - MRS 1.3.0: http://IP address of the JobHistory role instance:23020/
- Hue:
  - MRS 1.6.3 or earlier: https://Floating IP address of Hue:21200
  - MRS 1.7.0 or later: https://Floating IP address of Hue:8888
  The Loader page is a graphical data migration management tool based on the open source Sqoop web UI and is hosted on the Hue web UI.
  NOTE: Remotely log in to the Master2 node and run the ifconfig command. In the command output, eth0:FI_HUE is the floating IP address of Hue. Record the actual value of inet. If the floating IP address of Hue cannot be queried on the Master2 node, query and record it on the Master1 node. If there is only one Master node, log in to that node to query and record it.

Stream processing cluster:
- Storm: http://IP address of any UI role instance:29280/index.html
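For example, on an MRS 1.7.0 or later analysis cluster whose active NameNode role instance has the hypothetical IP address 192.168.0.25, the NameNode website address would be http://192.168.0.25:9870/dfshealth.html#tab-overview.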

Table 8-3 Clusters with Kerberos authentication enabled

All Types:
- MRS Manager: https://Floating IP address of MRS Manager:28443/web

Analysis cluster:
- HDFS NameNode: choose Service > HDFS > NameNode WebUI > NameNode (Active)
- HBase HMaster: choose Service > HBase > HMaster WebUI > HMaster (Active)
- MapReduce JobHistoryServer: choose Service > MapReduce > JobHistoryServer WebUI > JobHistoryServer
- YARN ResourceManager: choose Service > Yarn > ResourceManager WebUI > ResourceManager (Active)
- Spark JobHistory: choose Service > Spark > Spark WebUI > JobHistory
- Hue: choose Service > Hue > Hue WebUI > Hue (Active). The Loader page is a graphical data migration management tool based on the open source Sqoop web UI and is hosted on the Hue web UI.

Stream processing cluster:
- Storm: choose Service > Storm > WebUI > UI

8.1.3 Creating an SSH Channel to Connect an MRS Cluster and Configuring the Browser

Scenario

Users and the MRS cluster are in different networks. Therefore, an SSH channel needs to be created to send users' requests for accessing websites to the MRS cluster and dynamically forward them to the target websites.

Prerequisites
- You have prepared an SSH client for creating the SSH channel, for example, the Git open source SSH client, and have downloaded and installed it.
- You have created a cluster and prepared a key file in the .pem format or obtained the password specified during cluster creation.
- You can access the Internet from the local PC.

Procedure

Step 1 Log in to the MRS management console and choose Cluster > Active Cluster.

Step 2 Click the specified MRS cluster name.

Record Default Security Group of the Master node.

Step 3 Add an inbound rule to the security group of the Master node to allow data from the specified sources to access port 22.

MapReduce ServiceUser Guide 8 Using MRS

Issue 01 (2018-09-06) 473

Page 484: User Guide · MapReduce Service

For details, see Virtual Private Cloud > User Guide > Security > Security Group > Adding a Security Group Rule.

Step 4 Bind an elastic IP address to the Master2 node.

For details, see Virtual Private Cloud > User Guide > Network Components > EIP > Assigning an EIP and Binding It to an ECS.

Step 5 Locally start Git Bash and run the following command to log in to the Master2 node:

- For clusters of versions earlier than MRS 1.6.2: ssh -i Path of the key file linux@Elastic IP address
- For clusters of MRS 1.6.2 or later: ssh root@Elastic IP address or ssh -i Path of the key file root@Elastic IP address

Step 6 Run the following commands to view data forwarding configurations:

cat /etc/sysctl.conf | grep net.ipv4.ip_forward

- If net.ipv4.ip_forward=1 is displayed, the forwarding function has been configured. Go to Step 8.
- If net.ipv4.ip_forward=0 is displayed, the forwarding function has not been configured. Go to Step 7.
- If the net.ipv4.ip_forward parameter cannot be queried, it has not been configured. Run the following command, and then go to Step 8:

echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf

Step 7 Modify forwarding configurations on the node.

1. Run the following command to switch to user root:

sudo su - root

2. Run the following commands to modify the forwarding configuration:

echo 1 > /proc/sys/net/ipv4/ip_forward
sed -i "s/net.ipv4.ip_forward=0/net.ipv4.ip_forward = 1/g" /etc/sysctl.conf
sysctl -w net.ipv4.ip_forward=1

3. Run the following command to modify the sshd configuration file:

vi /etc/ssh/sshd_config

Press i to enter edit mode. Locate AllowTcpForwarding and GatewayPorts, delete the comment tags, and modify them as follows. Then save the changes and exit.

AllowTcpForwarding yes
GatewayPorts yes

4. Run the following command to restart the sshd service:

service sshd restart

Step 8 Run the following command to view the floating IP address:

ifconfig

In the command output, eth0:FI_HUE indicates the floating IP address of Hue, and eth0:wsom indicates the floating IP address of MRS Manager. Record the value of inet.

Run the exit command to exit.

MapReduce ServiceUser Guide 8 Using MRS

Issue 01 (2018-09-06) 474

Page 485: User Guide · MapReduce Service

NOTE

If the floating IP address of Hue cannot be queried on the Master2 node, query it on the Master1 node and record it.

Step 9 Run the following command to create an SSH channel supporting dynamic port forwarding:

- For clusters of versions earlier than MRS 1.6.2: ssh -i Path of the key file -v -ND Local port linux@Elastic IP address
- For clusters of MRS 1.6.2 or later: ssh -i Path of the key file -v -ND Local port root@Elastic IP address, or ssh -v -ND Local port root@Elastic IP address followed by the password you set when creating the cluster.

In the command, set Local port to an unoccupied local port on your PC. Port 8157 is recommended.

The -D option enables dynamic port forwarding: by default it starts a SOCKS proxy process that listens on the specified local port, and data received on that port is forwarded to the Master2 node through the SSH channel.
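For example, with a hypothetical elastic IP address of 203.0.113.10, a key file mrs-key.pem, and the recommended local port 8157, the command for an MRS 1.6.2 or later cluster would be:

ssh -i /path/to/mrs-key.pem -v -ND 8157 root@203.0.113.10   # starts a SOCKS proxy listening on local port 8157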

Step 10 Run the following command to configure the browser proxy.

1. Go to the Google Chrome client installation directory on the local PC.

2. Press Shift, right-click the blank area, choose Open Command Window Here to enter the command line mode, and enter the following command:

chrome --proxy-server="socks5://localhost:8157" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=c:/tmppath --proxy-bypass-list="*google*com,*gstatic.com,*gvt*.com,*:80"

NOTE

In the preceding command, 8157 is the local proxy port configured in Step 9.

Step 11 In the address bar of the browser, enter the address for accessing MRS Manager.

Address format: https://Floating IP address of MRS Manager:28443/web

The username and password of the MRS cluster need to be entered when accessing clusters with Kerberos authentication enabled, for example, as user admin. They are not required when accessing clusters with Kerberos authentication disabled.

When accessing MRS Manager for the first time, you must add its address to the trusted site list.

Step 12 Prepare the website access address.

1. Obtain the website address format and the role instance according to Websites.

2. Click Services.

3. Click the specified service name, for example, HDFS.

4. Click Instance and view Service IP of NameNode(Active).

Step 13 In the address bar of the browser, enter the website address to access it.

Step 14 When logging out of the website, terminate and close the SSH channel.

----End


8.2 Using Hadoop from Scratch

This section describes how to use Hadoop to submit a wordcount job. Wordcount, a typical Hadoop job, counts the words in text files.

Prerequisites

You have administrator rights on MRS Manager.

Procedure

Step 1 Prepare the wordcount program.

The open source Hadoop example program contains the wordcount program. You can download the Hadoop example program at https://dist.apache.org/repos/dist/release/hadoop/common/.

For example, select Hadoop version hadoop-2.7.x. Download hadoop-2.7.x.tar.gz, decompress it, and obtain hadoop-mapreduce-examples-2.7.x.jar from the hadoop-2.7.x\share\hadoop\mapreduce directory. The hadoop-mapreduce-examples-2.7.x.jar example program contains the wordcount program.

NOTE

hadoop-2.7.x indicates the Hadoop version.
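If you prefer the command line to the console-based submission described in Step 5, the same jar can also be run from a cluster client on which the Hadoop environment variables have been sourced. This is a sketch with illustrative HDFS paths; the output directory must not exist before the job is run:

yarn jar hadoop-mapreduce-examples-2.7.x.jar wordcount /user/test/input /user/test/output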

Step 2 Prepare data files.

There is no format requirement for data files. Prepare one or more TXT files. The following is an example of a TXT file:

qwsdfhoedfrffrofhuncckgktpmhutopmmajjpsffjfjorgjgtyiuyjmhombmbogohoyhmjhheyeombdhuaqqiquyebchdhmamdhdemmjdoeyhjwedcrfvtgbmojiyhhqssddddddfkfkjhhjkehdeiyrudjhfhfhffooqweopuyyyy
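If you do not have suitable text at hand, a generic shell loop (illustrative only, not an MRS tool) can generate a larger sample file:

for i in $(seq 1 1000); do echo "hello world hadoop mapreduce"; done > wordcount-input.txt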

Step 3 Upload data to OBS.

1. Log in to the OBS console.
2. Click Create Bucket to create a bucket and name it. The name must be unique; otherwise, the bucket cannot be created. Here, the name wordcount will be used as an example.
3. In the wordcount bucket, click Create Folder to create the program, input, output, and log folders.
– program: stores user programs.
– input: stores user data files.
– output: stores job output files.
– log: stores job output log files.
4. Go to the program folder, click to select the program package downloaded in Step 1, and click Upload.
5. Go to the input folder and upload the data file prepared in Step 2.


Step 4 Log in to the MRS management console. In the navigation tree on the left, choose Cluster > Active Cluster and click the cluster named mrs_20160907. The mrs_20160907 cluster was created in section Creating a Cluster.

Step 5 Submit a wordcount job.

1. Select Job Management. On the Job tab page, click Create to go to the Create Job page.
Only when the mrs_20160907 cluster is in the running state can jobs be submitted.
Table 8-4 describes the parameters for job configuration. The following is a job configuration example:
– Type: Select MapReduce.
– Name: For example, mr_01.
– Program Path: Set the path to the address that stores the program on OBS. Replace the bucket name and program name with the names of the bucket and program that you created in Step 3.3. For example, s3a://wordcount/program/hadoop-mapreduce-examples-2.7.x.jar.
– Parameters: Indicate the main class of the program to be executed, for example, wordcount.
– Import From: Set the path to the address that stores the input data files on OBS. Replace the bucket name and input name with the names of the bucket and file folder that you created in Step 3.3. For example, s3a://wordcount/input.
– Export To: Set the path to the address that stores the job output files on OBS. Replace the bucket name and output name with the names of the bucket and file folder that you created in Step 3.3. For example, s3a://wordcount/output.
– Log path: Set the path to the address that stores the job log files on OBS. Replace the bucket name and log name with the names of the bucket and file folder that you created in Step 3.3. For example, s3a://wordcount/log.

A job will be executed immediately after being created successfully.


Table 8-4 Job configuration information

Parameter: Type
Description: Job type. Possible types include:
– MapReduce
– Spark
– Spark Script
– Hive Script
NOTE
To add jobs of the Spark and Hive types, you need to select the Spark and Hive components when creating the cluster, and the cluster must be in the running state. Spark Script jobs support Spark SQL only, while Spark jobs support both Spark Core and Spark SQL.

Parameter: Name
Description: Job name. This parameter consists of 1 to 64 characters, including letters, digits, hyphens (-), or underscores (_). It cannot be null.
NOTE
Identical job names are allowed but not recommended.

Parameter: Program Path
Description: Address of the JAR file of the program for executing the job.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
This parameter cannot be null and must meet the following requirements:
– A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. The address cannot be empty or full of spaces.
– The path varies depending on the file system:
OBS: The path must start with s3a://, for example, s3a://wordcount/program/hadoop-mapreduce-examples-2.7.x.jar.
HDFS: The path must start with /user.
– Spark Script programs must end with .sql, while MapReduce and Spark programs must end with .jar. sql and jar are case-insensitive.

Parameter: Parameters
Description: Key parameters for executing the job. This parameter is assigned by an internal function; MRS is only responsible for inputting it. Separate parameters with spaces.
Format: package name.class name
A maximum of 2047 characters are allowed, but special characters (;|&>',<$) are not allowed. This parameter can be empty.
NOTE
When you enter parameters containing sensitive information, for example, a login password, you can add an at sign (@) before the parameters to encrypt the parameter values and prevent persistence of sensitive information in plaintext. When you view job information on the MRS management console, sensitive information is displayed as asterisks (*).
Example: username=admin @password=admin_123

Parameter: Import From
Description: Address for inputting data.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
– OBS: The path must start with s3a://.
– HDFS: The path must start with /user.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

Parameter: Export To
Description: Address for outputting data.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
– OBS: The path must start with s3a://.
– HDFS: The path must start with /user.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

Parameter: Log path
Description: Address for storing job logs that record the job running status.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
– OBS: The path must start with s3a://.
– HDFS: The path must start with /user.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.


Step 6 View the job execution results.

1. Go to the Job Management tab page. On the Job tab page, check whether the jobs are complete.
The job operation takes a while. After the jobs are complete, refresh the job list.
You cannot execute a successful or failed job again, but you can add or copy the job. After setting the job parameters, you can submit the job again.
2. Log in to the OBS console. Go to the OBS directory and query the job output information. In the wordcount > output directory of OBS, you can query and download the job output files.
3. Log in to the OBS console. Go to the OBS directory and check the detailed job execution results. In the wordcount > log directory of OBS, you can query and download the job execution logs by job ID.
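For reference, the wordcount program writes its results as one or more part-r-NNNNN files in the output directory. Each line contains a word and its count separated by a tab, for example (the words and counts shown here are illustrative and depend on your input file):

hadoop	1
hello	3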

Step 7 Terminate a cluster.

For details, see Terminating a Cluster in the User Guide.

----End

8.3 Using Spark from Scratch

This section describes how to use Spark to submit a sparkPi job. SparkPi, a typical Spark job, is used to calculate the value of pi (π).

Prerequisites

You have administrator rights on MRS Manager.

Procedure

Step 1 Prepare the sparkPi program.

The open source Spark example program contains the sparkPi program. You can download the Spark example program at https://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz.

Decompress the Spark example program to obtain the spark-examples_2.11-2.1.0.jar file in the spark-2.1.0-bin-hadoop2.7/examples/jars directory. The spark-examples_2.11-2.1.0.jar example program contains the sparkPi program.

Step 2 Upload data to OBS.

1. Log in to the OBS console.
2. Click Create Bucket to create a bucket and name it. The name must be unique; otherwise, the bucket cannot be created. Here, the name sparkpi will be used as an example.
3. In the sparkpi bucket, click Create Folder to create the program, output, and log folders.
4. Go to the program folder, click to select the program package downloaded in Step 1, and click Upload.


Step 3 Log in to the MRS management console. In the navigation tree on the left, choose Cluster > Active Cluster and click the cluster named mrs_20160907. The mrs_20160907 cluster was created in section Creating a Cluster.

Step 4 Submit a sparkPi job.

1. Select Job Management. On the Job tab page, click Create to go to the Create Job page.
Only when the mrs_20160907 cluster is in the running state can jobs be submitted.
Table 8-5 describes the parameters for job configuration. The following is a job configuration example:
– Type: Select Spark.
– Name: For example, job_spark.
– Program Path: Set the path to the address that stores the program on OBS. Replace the bucket name and program name with the names of the bucket and program that you created in Step 2.3. For example, s3a://sparkpi/program/spark-examples_2.11-2.1.0.jar.
– Parameters: Indicate the main class of the program to be executed, for example, org.apache.spark.examples.SparkPi 10. (An equivalent command-line invocation is sketched after this list.)
– Export To: Set the path to the address that stores the job output files on OBS. Replace the bucket name and output name with the names of the bucket and file folder that you created in Step 2.3. For example, s3a://sparkpi/output.
– Log path: Set the path to the address that stores the job log files on OBS. Replace the bucket name and log name with the names of the bucket and file folder that you created in Step 2.3. For example, s3a://sparkpi/log.

A job will be executed immediately after being created successfully.
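For comparison, submitting sparkPi from a cluster client instead of the console would use spark-submit. This is a sketch that assumes the client environment variables have been sourced and uses an illustrative local jar path; the trailing 10 is the argument passed to SparkPi (the number of partitions used for the estimation):

spark-submit --master yarn --class org.apache.spark.examples.SparkPi /opt/client/spark-examples_2.11-2.1.0.jar 10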

Table 8-5 Job configuration information

Parameter: Type
Description: Job type. Possible types include:
– MapReduce
– Spark
– Spark Script
– Hive Script
NOTE
To add jobs of the Spark and Hive types, you need to select the Spark and Hive components when creating the cluster, and the cluster must be in the running state. Spark Script jobs support Spark SQL only, while Spark jobs support both Spark Core and Spark SQL.

Parameter: Name
Description: Job name. This parameter consists of 1 to 64 characters, including letters, digits, hyphens (-), or underscores (_). It cannot be null.
NOTE
Identical job names are allowed but not recommended.

Parameter: Program Path
Description: Address of the JAR file of the program for executing the job.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
This parameter cannot be null and must meet the following requirements:
– A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. The address cannot be empty or full of spaces.
– The path varies depending on the file system:
OBS: The path must start with s3a://, for example, s3a://wordcount/program/hadoop-mapreduce-examples-2.7.x.jar.
HDFS: The path must start with /user.
– Spark Script programs must end with .sql, while MapReduce and Spark programs must end with .jar. sql and jar are case-insensitive.

Parameter: Parameters
Description: Key parameters for executing the job. This parameter is assigned by an internal function; MRS is only responsible for inputting it. Separate parameters with spaces.
Format: package name.class name
A maximum of 2047 characters are allowed, but special characters (;|&>',<$) are not allowed. This parameter can be empty.
NOTE
When you enter parameters containing sensitive information, for example, a login password, you can add an at sign (@) before the parameters to encrypt the parameter values and prevent persistence of sensitive information in plaintext. When you view job information on the MRS management console, sensitive information is displayed as asterisks (*).
Example: username=admin @password=admin_123

Parameter: Import From
Description: Address for inputting data.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
– OBS: The path must start with s3a://.
– HDFS: The path must start with /user.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

Parameter: Export To
Description: Address for outputting data.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
– OBS: The path must start with s3a://.
– HDFS: The path must start with /user.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

Parameter: Log path
Description: Address for storing job logs that record the job running status.
NOTE
When configuring this parameter, click OBS or HDFS, specify the file path, and click OK.
The path varies depending on the file system:
– OBS: The path must start with s3a://.
– HDFS: The path must start with /user.
A maximum of 1023 characters are allowed, but special characters (*?<">|\) are not allowed. This parameter can be empty.

Step 5 View the job execution results.

1. Go to the Job Management tab page. On the Job tab page, check whether the jobs are complete.
The job operation takes a while. After the jobs are complete, refresh the job list.
You cannot execute a successful or failed job again, but you can add or copy the job. After setting the job parameters, you can submit the job again.
2. Go to the OBS directory and query the job output information. In the sparkpi > output directory of OBS, you can query and download the job output files.
3. Go to the OBS directory and check the detailed job execution results. In the sparkpi > log directory of OBS, you can query and download the job execution logs by job ID.

Step 6 Terminate a cluster.


For details, see Terminating a Cluster in the User Guide.

----End

8.4 Using Spark SQL from Scratch

To process structured data, Spark provides Spark SQL, which is similar to SQL.

You can create a table named src_data, write a data entry in each row of the src_data table, and store the data in the mrs_20160907 cluster. You can then use SQL statements to query data in the src_data table. Afterward, you can delete the src_data table.

Prerequisites

You have administrator rights on MRS Manager.

You have obtained the AK/SK for writing data from the OBS data source to the Spark SQL table. The method for obtaining the AK/SK is as follows:

1. Log in to the management console.
2. Click the username and choose My Credential from the drop-down list.
3. Click Access Credentials.
4. Click Add Access Key to switch to the Add Access Key page.
5. Enter the login password and the short message verification code, and click OK to download the access key. Keep the access key secure.

Procedure

Step 1 Prepare data sources for Spark SQL analysis.

The following is an example of a text file:

abcd3ghjiefgh658ko1234jjyu97h8kodfg1kk99icxz3

Step 2 Upload data to OBS.

1. Log in to the OBS management console.
2. Click Create Bucket to create a bucket and name it. The name must be unique; otherwise, the bucket cannot be created. Here, the name sparksql will be used as an example.
3. In the sparksql bucket, click Create Folder to create the input folder.
4. Go to the input folder, click to select a local text file, and click Upload.

Step 3 Import the text file in OBS to HDFS.

1. Log in to the MRS management console. In the navigation tree on the left, choose Cluster > Active Cluster and click the cluster named mrs_20160907. The mrs_20160907 cluster was created in section Creating a Cluster.
2. Select the File Management tab page.
3. Click Create Folder and create the userinput file folder.
4. Go to the userinput file folder and click Import Data.
5. Select the OBS and HDFS paths and click OK.
OBS path: s3a://sparksql/input/sparksql-test.txt
HDFS path: /user/userinput
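Optionally, you can confirm the import from a cluster client (assuming the client is installed in /opt/client, as described in section 8.5):

cd /opt/client
source bigdata_env
hdfs dfs -ls /user/userinput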

Step 4 Submit the Spark SQL statement.

1. On the Job Management tab page, select Spark SQL. The Spark SQL job page is displayed.
Only when the mrs_20160907 cluster is in the running state can jobs be submitted.
2. Enter the Spark SQL statement to create a table.
When entering Spark SQL statements, ensure that they contain fewer than 10,000 characters.
The syntax is as follows:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT table_comment] [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] [ROW FORMAT row_format] [STORED AS file_format] [LOCATION hdfs_path];
You can use either of the following two methods to create a table:
– Method 1: Create table src_data and write data in every row.
If the data source is stored in the /user/userinput folder of HDFS:
create external table src_data(line string) row format delimited fields terminated by '\\n' stored as textfile location '/user/userinput';
If the data source is stored in the /sparksql/input folder of OBS:
create external table src_data(line string) row format delimited fields terminated by '\\n' stored as textfile location 's3a://AK:SK@sparksql/input';
For the method of obtaining the AK/SK, see the description in Prerequisites.
– Method 2: Create table src_data1 and load data to the src_data1 table in batches.
create table src_data1 (line string) row format delimited fields terminated by ',';
load data inpath '/user/userinput/sparksql-test.txt' into table src_data1;
NOTE
When method 2 is used, the data from OBS cannot be loaded to the created tables directly.
3. Enter the Spark SQL statement to query a table.
The syntax is as follows:
SELECT col_name FROM table_name;
For example, to query data in the src_data table, enter the following statement:
select * from src_data;
4. Enter the Spark SQL statement to delete a table.
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
For example:
drop table src_data;


5. Click Check to check whether the statements are correct.
6. Click Submit.
After submitting the Spark SQL statements, you can check whether the execution is successful in Last Execution Result and view the detailed execution results in Last Query Result Set.

Step 5 Terminate a cluster.

For details, see Terminating a Cluster in the User Guide.

----End

8.5 Using HBase from Scratch

HBase is a scalable column-based distributed storage system. It features high reliability and high performance.

You can update the client on a Master node in the mrs_20160907 cluster. The client can then be used to create a table, insert data into the table, read and delete data from the table, and modify and delete the table.

Prerequisites

You have administrator rights on MRS Manager.

Background

After an MRS cluster has been successfully created, the original client is stored by default in the /opt/client directory on all nodes in the cluster. Before using the client, download the client file, update the client, and locate the active management node of MRS Manager.

For example, if a user develops an application to manage information about users who use service A in an enterprise, the operation processes of service A using the HBase client are as follows:

- Create a user information table.
- Add diplomas and titles of users to the table.
- Query usernames and addresses by user ID.
- Query information by username.
- Deregister users and delete user data.
- Delete the user information table after service A ends.

Table 8-6 User information

ID          Name  Gender  Age  Address
12005000201 A     Male    19   City A
12005000202 B     Female  23   City B
12005000203 C     Male    26   City C
12005000204 D     Male    18   City D
12005000205 E     Female  21   City E
12005000206 F     Male    32   City F
12005000207 G     Female  29   City G
12005000208 H     Female  30   City H
12005000209 I     Male    26   City I
12005000210 J     Male    25   City J

Procedure

Step 1 Download the client file or the client configuration file.

1. Log in to the MRS management console. In the navigation tree on the left, choose Cluster > Active Cluster and click the cluster named mrs_20160907. The mrs_20160907 cluster was created in section Creating a Cluster.
2. In the Cluster List > mrs_20160907 area, click View to open MRS Manager.
3. Click Services, and then click Download Client.
Set Client Type to All client files or Only configuration files, set Download Path to Server, and click OK to generate the client file or the client configuration file. The generated file is saved in the /tmp/MRS-client directory on the active management node by default.

Step 2 Log in to the active management node.

1. In the Cluster List > mrs_20160907 area, select Node to view the Name parameter. The node that contains master1 in its name is the Master1 node. The node that contains master2 in its name is the Master2 node.
The active and standby management nodes of MRS Manager are installed on Master nodes by default. Because Master1 and Master2 are switched over in active and standby mode, Master1 is not always the active management node of MRS Manager. Run a command on Master1 to check whether Master1 is the active management node of MRS Manager. For details about the command, see Step 2.4.
2. Log in to the Master1 node using a password. For clusters running a version earlier than MRS 1.6.2, log in as user linux; for clusters running MRS 1.6.2 or a later version, log in as user root. For details, see Logging In to an ECS Using VNC in the User Guide.
The Master node supports Cloud-init. The preset username for Cloud-init is linux. The password is randomly generated and is displayed on the VNC login page by default. If you have changed the password, log in to the node using the new password. See "How Do I Log In to an ECS Once All Images Support Cloud-Init?" in the Elastic Cloud Server User Guide (FAQs > Login FAQs > How Do I Log In to an ECS Once All Images Support Cloud-Init?).

3. Run the following commands to switch to user omm:
sudo su - root
su - omm


4. Run the following command to confirm the active and standby management nodes:
sh ${BIGDATA_HOME}/om-0.0.1/sbin/status-oms.sh | grep Actived
For example, the following information is displayed, where node-master2-LJXDj indicates the name of the active management node:
192-168-1-17 node-master2-LJXDj V100R001C01 2016-10-01 06:58:41 active normal Actived

NOTE

If the Master1 node to which you have logged in is the standby management node and you need to log in to the active management node, run the following commands:
ssh IP address of the Master2 node
ssh Name of the active management node
For example, run the following command: ssh node-master2-LJXDj

5. Log in to the active management node, for example, node node-master2-LJXDj, as user root.

Step 3 Run the following command to go to the client directory:

After an MRS cluster is successfully created, the client is installed in the /opt/client directory by default.

cd /opt/client

Step 4 Run the following commands to update the client configuration for the active management node.

Switch to user omm.

sudo su - omm

sh refreshConfig.sh /opt/client Full path of the client configuration file package

For example, run the following command:

sh refreshConfig.sh /opt/client /tmp/MRS-client/MRS_Services_Client.tar

If the following information is displayed, the configuration is updated successfully.

ReFresh components client config is complete.
Succeed to refresh components client config.

Step 5 Use the client on a Master node.

1. On the active management node where the client has been updated, for example, node-master2-LJXDj, run the following command to go to the client directory:
cd /opt/client
2. Run the following command to configure the environment variables:
source bigdata_env
3. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If Kerberos authentication is disabled, skip this step.
kinit MRS cluster user
For example, kinit admin.
4. Run the HBase component client command directly:
hbase shell


Step 6 Run commands on the HBase client to implement service A.

1. Create a user information table according to Table 8-6 and add data to it.
create 'user_info',{NAME => 'i'}
For example, to add data of user 12005000201, run the following commands in sequence (a quick spot-check using get is sketched after this list):
put 'user_info','12005000201','i:name','A'
put 'user_info','12005000201','i:gender','Male'
put 'user_info','12005000201','i:age','19'
put 'user_info','12005000201','i:address','City A'
2. Add degree and title information about the user to the table.
For example, to add degree and title information about user 12005000201, run the following commands:
put 'user_info','12005000201','i:degree','master'
put 'user_info','12005000201','i:pose','manager'
3. Query usernames and addresses by user ID.
For example, to query the username and address of user 12005000201, run the following command:
scan 'user_info',{STARTROW=>'12005000201',STOPROW=>'12005000201',COLUMNS=>['i:name','i:address']}
4. Query information by username.
For example, to query information about user A, run the following command:
scan 'user_info',{FILTER=>"SingleColumnValueFilter('i','name',=,'binary:A')"}
5. Delete user data from the user information table. All user data needs to be deleted.
For example, to delete data of user 12005000201, run the following command:
delete 'user_info','12005000201','i'
6. Run the following commands to delete the user information table:
disable 'user_info'
drop 'user_info'
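As a quick spot-check after the put operations in 1 and 2, and before the table is deleted in 6, you can fetch the whole row with the standard HBase shell get command; all columns of the row, including i:name and i:address, should be listed in the output:

get 'user_info','12005000201'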

Step 7 Terminate a cluster.

For details, see Terminating a Cluster in the User Guide.

----End

8.6 Using HBase

8.6.1 Configuring the HBase Replication Function

Scenario

As a key feature to ensure high availability of the HBase cluster system, HBase cluster replication provides HBase with remote data replication in real time. It provides basic O&M tools, such as tools for replication relationship maintenance, data reconstruction, data verification, and viewing the data synchronization progress. To achieve real-time data replication, you can replicate data from the HBase cluster to another one.

Prerequisites

- The active and standby clusters have been successfully installed and started (the cluster status is Running on the Active Cluster page), and you have the administrator rights of the clusters.
- The network between the active and standby clusters is normal and the required ports can be used properly. For details, see the MapReduce Service Communication Matrix.
- If Kerberos authentication is enabled in the active cluster, cross-cluster mutual trust relationships have been configured for the active and standby clusters. For details, see Configuring Cross-Cluster Mutual Trust Relationships. If Kerberos authentication is disabled in the active cluster, no cross-cluster mutual trust relationship is required.
NOTE
Follow the instructions in Viewing Basic Information About an Active Cluster to access the basic information page of the cluster and check whether Kerberos Authentication is Enabled or Disabled.
- If the active cluster has historical data that needs to be synchronized to the standby cluster, cross-cluster replication has been configured for the active and standby clusters. For details, see Enabling the Cross-Cluster Copy Function.
- The time of the active and standby clusters must be the same, and the NTP service in the active and standby clusters must use the same time source.
- Mapping relationships between host names and service IP addresses have been configured in the /etc/hosts files of the active and standby clusters by appending entries in the form "192.***.***.*** host1" to the hosts files.
- The network bandwidth between the active and standby clusters is determined by service traffic and cannot be smaller than the maximum allowed service traffic.

Restrictions

- Although HBase cluster replication provides the real-time data replication function, the data synchronization progress is determined by several factors, such as the service loads in the active cluster and the health status of processes in the standby cluster. Normally, the standby cluster should not take over services running in the active cluster. In extreme cases, system maintenance personnel and other decision makers determine whether the standby cluster takes over services according to the current data synchronization indicators.
- Currently, the replication function supports only one active cluster and one standby cluster in HBase.
- Typically, do not perform operations on data synchronization tables in the standby cluster, such as modifying table properties or deleting tables. If any misoperation on the standby cluster occurs, data synchronization between the active and standby clusters will fail and data of the corresponding table in the standby cluster will be lost.
- If the replication function of HBase tables in the active cluster is enabled for data synchronization, after modifying the structure of a table in the active cluster, you need to manually modify the structure of the corresponding table in the standby cluster to ensure table structure consistency.


Procedure

Enable the replication function for the active cluster to synchronize data written by Put.

Step 1 Log in to MRS Manager of the active cluster.

Step 2 Choose Service > HBase > Service Configuration, set Type to All, and go to the HBase configuration page.

Step 3 Choose RegionServer > Replication, and check whether hbase.replication is set to true. If the value is false, set hbase.replication to true.

Step 4 (Optional) Set the configuration items listed in Table 8-7. You can set the parameters based on the description or use the default values.

Table 8-7 Optional configuration items

Navigation path: HMaster > Performance
– hbase.master.logcleaner.ttl (default value: 600000): Specifies the retention period of HLog. If it is set to 604800000 (unit: ms), the retention period of HLog is 7 days.
– hbase.master.cleaner.interval (default value: 60000): Specifies the deletion interval of HLog. HLogs that exceed the configured retention period are automatically deleted. You are advised to set it to the maximum value to save more HLogs.

Navigation path: RegionServer > Replication
– replication.source.size.capacity (default value: 16777216): After data in the active cluster is synchronized to the standby cluster, the active cluster reads and sends data in HLog according to this parameter.
– replication.source.nb.capacity (default value: 25000): After data in the active cluster is synchronized to the standby cluster, the active cluster reads and sends data in HLog according to this parameter. This parameter is used together with replication.source.size.capacity.
– replication.source.maxretriesmultiplier (default value: 10): Specifies the maximum number of retries for sending log data.
– replication.source.sleepforretries (default value: 1000): Specifies the sleep period after a log data sending failure (unit: ms).
– hbase.regionserver.replication.handler.count (default value: 6): Specifies the number of RPC handler threads that the standby cluster uses to receive data.
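As a quick check of the units involved: the 7-day retention mentioned for hbase.master.logcleaner.ttl works out to 7 × 24 × 3600 × 1000 = 604800000 ms, which is the value quoted in the table.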


Enable the replication function for the active cluster to synchronize data written by BulkLoad.

Step 5 Determine whether to enable the replication function to synchronize data written by BulkLoad.

NOTE

If you use the BulkLoad data import feature of HBase and need to synchronize data, you need to enable the replication function.

If it needs to be enabled, go to Step 6.

If it does not need to be enabled, go to Step 9.

Step 6 Choose Service > HBase > Service Configuration, set Type to All, and go to the HBase configuration page.

Step 7 Locate hbase.replication.bulkload.enabled and change its value to true to enable the replication function to synchronize data written by BulkLoad.

Step 8 Locate hbase.replication.cluster.id and modify it. It specifies the HBase ID of the active cluster and is used by the standby cluster to connect to the active cluster. The parameter value can contain uppercase letters, lowercase letters, digits, and underscores (_), and cannot exceed 30 characters.

Restart the HBase service and install the client.

Step 9 Click Save Configuration. In the displayed window, select Restart the affected services or instances and click OK to restart the HBase service.

After the system displays Operation succeeded, click Finish. The service is successfully started.

Step 10 In the active and standby clusters, choose Service > HBase > Download Client to download the client. Follow the instructions in Updating the Client to update the client configuration file.

Step 11 Use PuTTY to access the HBase shell of the active cluster as user hbase.

1. On the active management node where the client has been updated, run the following command to switch to the client directory:
cd /opt/client
2. Run the following command to configure the environment variable:
source bigdata_env
3. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user. If it is disabled, skip this step.
kinit hbase
NOTE
After running kinit hbase, the system prompts you to enter a password. The default password of user hbase is Hbase@123.
4. Run the following HBase client command:
hbase shell

Synchronize the table data of the active cluster. (Skip this step if the active cluster has no data.)


Step 12 Check whether the standby cluster has historical data. If it has historical data and its data must be consistent with the data in the active cluster, clear the data in the standby cluster first.

1. On the HBase shell of the standby cluster, run the list command to view the existing tables in the standby cluster.
2. Run the disable 'tableName' and drop 'tableName' commands to delete the data tables in the standby cluster.

Step 13 After the HBase replication function is configured and data synchronization is enabled, check whether tables and data exist in the active cluster and whether the historical data needs to be synchronized to the standby cluster.

You can run the list command to view the existing tables in the active cluster and run the scan 'tableName' command to check whether there is historical data in the tables.

- If tables exist and data needs to be synchronized, go to Step 14.
- If data does not need to be synchronized, no further action is required.

Step 14 When the HBase replication function is configured, historical data in a table cannot be automatically synchronized. You need to replicate the historical data of the active cluster and then manually synchronize it to the standby cluster.

Manual synchronization is single-table synchronization, which is performed using Export, DistCp, and Import.

Perform the following steps to manually synchronize a single table:

1. Export the table data from the active cluster:
hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true Table name Directory that saves the source data
Example: hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true t1 /user/hbase/t1
2. Copy the exported data to the standby cluster:
hadoop distcp Directory that saves the source data in the active cluster hdfs://ActiveNameNodeIP:9820/Directory that saves the source data in the standby cluster
ActiveNameNodeIP indicates the IP address of the active NameNode in the standby cluster.
Example: hadoop distcp /user/hbase/t1 hdfs://192.168.40.2:9820/user/hbase/t1
NOTE
For MRS 1.6.2 or earlier, the default port is 25000. For details, see List of Open Source Component Ports.
3. Import the data to the standby cluster as the HBase table user of the standby cluster:
hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=Directory that saves the output in the standby cluster Table name Directory that saves the source data in the standby cluster
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles Directory that saves the output in the standby cluster Table name
Example:
hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/user/hbase/output_t1 t1 /user/hbase/t1
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hbase/output_t1 t1

Add the replication relationship between the active and standby clusters.

Step 15 Run the following command on the HBase shell to create a replication synchronization relationship for HBase between the active and standby clusters:

add_peer 'Standby cluster ID','ZooKeeper address of the standby cluster',{HDFS_CONFS => true}

- Standby cluster ID indicates an ID for the active cluster to recognize the standby cluster. It is recommended that the ID contain letters and digits.
- ZooKeeper address of the standby cluster includes the service IP address of ZooKeeper, the port for monitoring client connections, and the HBase root directory of the standby cluster on ZooKeeper.
- {HDFS_CONFS => true} indicates that the default HDFS configurations of the active cluster will be synchronized to the standby cluster. This parameter is used for HBase of the standby cluster to access HDFS of the active cluster. If the replication function for synchronizing data written by BulkLoad is disabled, you do not need to use this parameter.

For example, to add a replication relationship between active and standby clusters containing BulkLoad data, run the following command:

add_peer '1','192.168.40.2,192.168.40.3,192.168.40.4:2181:/hbase',{HDFS_CONFS => true}

NOTE

For MRS 1.6.2 or earlier, the default port is 24002. For details, see List of Open Source Component Ports.

NOTE

1. Choose Service > ZooKeeper > Instance to obtain the service IP address of ZooKeeper.
2. Choose Service > ZooKeeper > Service Configuration and set Type to All. Search for clientPort, namely, the port for monitoring client connections.
3. Run the list_peers command to check whether the replication relationship between the active and standby clusters has been added. If the following information is displayed, the relationship is successfully added.
hbase(main):003:0> list_peers
PEER_ID CLUSTER_KEY STATE TABLE_CFS
1 192.168.0.13,192.168.0.177,192.168.0.25:24002:/hbase ENABLED

Specify the data writing status for the active and standby clusters.

Step 16 On the HBase shell of the active cluster, run the following command to retain the data writing status:

set_clusterState_active

If the following information is displayed, the command is successfully executed.

hbase(main):001:0> set_clusterState_active
=> true

Step 17 On the HBase shell of the standby cluster, run the following command to retain the data read-only status:

set_clusterState_standby


If the following information is displayed, the command is successfully executed.

hbase(main):001:0> set_clusterState_standby
=> true

Enable the HBase replication function to synchronize data.

Step 18 Check whether a name space exists in the HBase service instance of the standby cluster and whether the name space has the same name as the name space of the HBase table whose replication function will be enabled.

You can run the list_namespace command on the HBase shell of the standby cluster to query a name space.
- If the same name space exists, go to Step 19.
- If the same name space does not exist, run the create_namespace 'ns1' command on the HBase shell of the standby cluster to create a name space with the same name, and then go to Step 19.

Step 19 On the HBase shell of the active cluster, run the following command to enable the real-time data replication function for the tables in the active cluster. This ensures that modified data in the active cluster can be synchronized to the standby cluster in real time.

You can only synchronize data of one HTable at one time.

enable_table_replication 'Table name'

NOTE

- If a table with the same name as the table for which real-time synchronization is to be enabled does not exist in the standby cluster, the system automatically creates the table.
- If a table with the same name as the table for which real-time synchronization is to be enabled exists in the standby cluster, the structures of the two tables must be consistent.
- If the encryption algorithm SMS4 or AES is configured for Table name, the function for synchronizing data from the active cluster to the standby cluster in real time cannot be enabled for the HBase table.
- If the standby cluster is offline, or has a table with the same name but a different structure, the replication function will fail to be enabled.
If the standby cluster is offline, start it first.
If the standby cluster has a table with the same name but a different structure, modify the table structure to make it the same as the table structure of the active cluster by running the alter command on the HBase shell of the standby cluster as prompted.

Step 20 On the HBase shell of the active cluster, run the following command to enable the real-time replication function for the active cluster to synchronize the HBase permission table:

enable_table_replication 'hbase:acl'

NOTE

If the standby cluster needs to read data when the permission of the HBase source data table in the active cluster is modified, modify the role permission of the standby cluster accordingly.

Check data synchronization status for the active and standby clusters.

Step 21 Run the following command on the HBase client to check the synchronized data of the active and standby clusters. After the replication function is enabled for data synchronization, you can also run this command to check whether newly synchronized data is consistent:

hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime=Start time --endtime=End time Column family name Standby cluster ID Table name


NOTE

- The start time must be earlier than the end time.
- The values of starttime and endtime must be in the timestamp format. You need to run date -d "2015-09-30 00:00:00" +%s to change a common time format to a timestamp format. The command returns a 10-digit number (accurate to the second), whereas HBase expects a 13-digit number (accurate to the millisecond). Therefore, you need to append three zeros (000) to the command output.
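As a worked example of the conversion: on a machine in UTC+8, date -d "2015-09-30 00:00:00" +%s returns 1443542400 (the exact value depends on the local time zone); appending 000 gives 1443542400000. Using illustrative values for the column family, standby cluster ID, and table name, a verification covering the following 24 hours would then be:

hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime=1443542400000 --endtime=1443628800000 i 1 user_info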

Switch over active and standby clusters.

NOTE

1. If the standby cluster needs to be switched over to an active one, configure the active-standby relationship again by referring to Step 1 to Step 11 and Step 15 to Step 20. Do not perform Step 12 to Step 14 to synchronize table data between the active and standby clusters.

----End

Related Commands

Table 8-8 HBase replication

Operation: Set up an active-standby relationship.
Command: add_peer 'Standby cluster ID','Standby cluster address'
Examples:
add_peer '1','zk1,zk2,zk3:2181:/hbase'
add_peer '1','zk1,zk2,zk3:2181:/hbase1'
add_peer '1','zk1,zk2,zk3:2181:/hbase1',{HDFS_CONFS => true}
24002 is the ZooKeeper port in earlier clusters; the ZooKeeper port needs to be changed to 2181 for MRS 1.7 or later.
Description: Set up a relationship between an active cluster and a standby cluster. If the replication function for synchronizing data written by BulkLoad is enabled, run: add_peer 'Standby cluster ID','Standby cluster address',{HDFS_CONFS => true}

Operation: Remove an active-standby relationship.
Command: remove_peer 'Standby cluster ID'
Example: remove_peer '1'
Description: Remove the standby cluster information from the active cluster.

Operation: Query an active-standby relationship.
Command: list_peers
Description: Query the standby cluster information (mainly ZooKeeper information) in the active cluster.

Operation: Enable the real-time user table synchronization function.
Command: enable_table_replication 'Table name'
Example: enable_table_replication 't1'
Description: In the active cluster, synchronize existing user tables to the standby cluster.

Operation: Disable the real-time user table synchronization function.
Command: disable_table_replication 'Table name'
Example: disable_table_replication 't1'
Description: In the active cluster, stop synchronizing existing user tables to the standby cluster.

Operation: Verify data in the active and standby clusters.
Command: bin/hbase org.apache.hadoop.hbase.mapreduce.replication.VerifyReplication --starttime=Start time --endtime=End time Column family name Standby cluster ID Table name
Description: Verify whether data of the specified table is the same between the active cluster and the standby cluster. The parameters in this command are as follows:
– Start time: If the start time is not specified, the default value 0 is used.
– End time: If the end time is not specified, the time when the current operation is submitted is used by default.
– Table name: If a table name is not entered, all user tables for which the real-time synchronization function is enabled are verified by default.

Operation: Switch the data writing status.
Commands: set_clusterState_active and set_clusterState_standby
Description: Specify whether data can be written to the HBase tables of the cluster.

Operation: Add or update the active cluster HDFS configurations saved in the peer cluster.
Command: set_replication_hdfs_confs 'PeerId', {'key1' => 'value1', 'key2' => 'value2'}
Description: Used when the replication function is enabled for data including BulkLoad data. When HDFS parameters are modified in the active cluster, the modification cannot be automatically synchronized to the standby cluster, so you need to run this command manually to synchronize the changes. The affected parameters are as follows:
– fs.defaultFS
– dfs.client.failover.proxy.provider.hacluster
– dfs.client.failover.connection.retries.on.timeouts
– dfs.client.failover.connection.retries
For example, if the value of fs.defaultFS is changed to hdfs://hacluster_sale, run the following command to synchronize the HDFS configurations to the standby cluster whose ID is 1:
set_replication_hdfs_confs '1',{'fs.defaultFS' => 'hdfs://hacluster_sale'}

8.6.2 Enabling the Cross-Cluster Copy Function

Enable the cross-cluster copy function.

Scenario

DistCp is used to back up data stored on HDFS from one cluster to another cluster. DistCp depends on the cross-cluster copy function, which is disabled by default and needs to be enabled in both clusters.

This section describes how to modify parameters on MRS Manager to enable the cross-cluster copy function.

Impact on the System

Yarn needs to be restarted to enable the cross-cluster copy function and cannot be accessed during the restart.


Prerequisites

The hadoop.rpc.protection parameter of the two clusters must be set to the same data transmission mode, which can be privacy (encryption enabled) or authentication (encryption disabled).

NOTE

Choose Service > Yarn > Service Configuration and set Type to All. Search for hadoop.rpc.protection.

Procedure

Step 1 Log in to MRS Manager of a cluster.

Step 2 Choose Service > Yarn > Service Configuration and set Type to All.

Step 3 In the navigation tree, choose Yarn > Distcp.

Step 4 Set dfs.namenode.rpc-address.haclusterX.remotenn1 to the service IP address and RPC port number of one NameNode instance of the peer cluster, and set dfs.namenode.rpc-address.haclusterX.remotenn2 to the service IP address and RPC port number of the other NameNode instance of the peer cluster.

NOTE

Choose Service > HDFS > Instance to obtain the service IP address of the NameNode instance.

dfs.namenode.rpc-address.haclusterX.remotenn1 and dfs.namenode.rpc-address.haclusterX.remotenn2 do not distinguish between active and standby NameNode instances. The default NameNode RPC port number is 9820 and cannot be modified on MRS Manager.

Examples of the modified parameter values: 10.1.1.1:9820 and 10.1.1.2:9820.

NOTE

For MRS 1.6.2 or earlier, the default NameNode RPC port is 25000. For details, see List of Open Source Component Ports.

Step 5 Click Save Configuration, select Restart the role instance, and click OK to restart the Yarn service.

After the system displays Operation succeeded, click Finish. The Yarn service is successfully started.

Step 6 Log in to MRS Manager of the other cluster and repeat the preceding steps.

----End

8.6.3 Using the ReplicationSyncUp Tool

Prerequisites

1. Only MRS 1.7 or later supports the ReplicationSyncUp tool.
2. The active and standby clusters have been installed and started.
3. Time is consistent between the active and standby clusters, and the Network Time Protocol (NTP) service on the active and standby clusters uses the same time source.
4. When the HBase service of the active cluster is stopped, the ZooKeeper and HDFS services must be started and running.
5. ReplicationSyncUp must be run by the system user who starts the HBase process.
6. In security mode, ensure that the HBase system user of the standby cluster has the read permission on HDFS of the active cluster. This is because the tool updates the ZooKeeper nodes and HDFS files of the HBase system.
7. When HBase of the active cluster is faulty, the ZooKeeper, file system, and network of the active cluster must still be available.

Scenario

The replication mechanism can use WAL to synchronize the state of a cluster with the state of another cluster. After HBase replication is enabled, if the active cluster is faulty, ReplicationSyncUp synchronizes incremental data from the active cluster to the standby cluster using the information from the ZooKeeper node. After data synchronization is complete, the standby cluster can be used as the active cluster.

Parameter settings

hbase.replication.bulkload.enabled (default value: false)
Whether to enable the bulkload data replication function. The parameter value type is Boolean. To enable bulkload data replication, set this parameter to true in the active cluster.

hbase.replication.cluster.id (default value: none)
ID of the source HBase cluster. After bulkload data replication is enabled, this parameter is mandatory and must be defined in the source cluster. The parameter value type is String.

Tool Usage

Run the following command on the client of the active cluster:

hbase org.apache.hadoop.hbase.replication.regionserver.ReplicationSyncUp -Dreplication.sleep.before.failover=1

NOTE

replication.sleep.before.failover indicates the sleep time required for replication of the remaining data when RegionServer fails to start. You are advised to set this parameter to 1 second to quickly trigger replication.

Precautions

1. When the active cluster is stopped, this tool obtains the WAL processing progress and the WAL processing queues from the ZooKeeper node (RS znode) and copies the queues that have not yet been copied to the standby cluster.
2. The RegionServer of each active cluster has its own znode under the replication node of ZooKeeper in the standby cluster. It contains one znode for each peer cluster.
3. If a RegionServer is faulty, each RegionServer in the active cluster receives a notification through the watcher and attempts to lock the znode of the faulty RegionServer, including its queues. The RegionServer that succeeds transfers all the queues to the znode of its own queue. After the queues are transferred, they are deleted from the old location.
4. When the active cluster is stopped, ReplicationSyncUp synchronizes data between the active and standby clusters using the information from the ZooKeeper node. In addition, WALs of the RS znode are moved to the standby cluster.

Restrictions and Limitations

If the standby cluster is stopped or the peer relationship is closed, the tool runs normally, but the peer relationship cannot be replicated.

8.6.4 Using HIndex

8.6.4.1 Introduction to HIndex

Prerequisites

Only MRS 1.7 or later supports the HBase HIndex function.

Scenario

HBase is a distributed storage database based on key-value pairs. Data in tables is sorted in dictionary order by rowkey. If you query data by specifying a rowkey or scan data in a specific rowkey range, HBase can quickly locate the data to be read. In most cases, however, you need to query data whose column value is XXX. HBase provides the filter function to enable you to query data with specific column values: all data is scanned in rowkey order and matched against the specified column value until the required data is found. The filter function therefore scans some unnecessary data to obtain the required data, and as a result it cannot meet the requirements of high-performance, frequent queries.

HBase HIndex is designed to address these issues. HBase HIndex provides HBase with the capability of indexing based on specific column values, making queries faster, as shown in Figure 8-1.

Figure 8-1 HBase HIndex


NOTE

1. Index data does not support a rolling upgrade.
2. Composite index: You must add or delete all columns that participate in a composite index. Otherwise, data may be inconsistent.
3. You should not explicitly configure any split policy on a data table where an index has been created.
4. Mutation operations, such as increment and append, are not supported.
5. Indexes on columns with maxVersions > 1 are not supported.
6. The value size of a column for which an index is added cannot exceed 32 KB.
7. When user data is deleted because the TTL of the column family has expired, the corresponding index data is not deleted immediately. It will be deleted during a major compaction.
8. After an index is created, the TTL of the user column family must not be changed.
8.1 If the TTL of the column family is changed to a larger value after the index is created, delete the index and create it again. Otherwise, some generated index data may be deleted before the deletion of the user data.
8.2 If the TTL of the column family is changed to a smaller value after the index is created, the index data may be deleted after the deletion of the user data.
9. When you back up a table with the index function enabled, the index data is filtered out by HIndexWALCoprocessor by default and only the user data is backed up. You can use an index generation tool to create indexes for user data in the standby cluster. For details, see Using an Index Generation Tool. You can use either of the following methods to back up user data and index data at the same time.
9.1 After the index table is enabled, create an index first and then put data. Manually set REPLICATION_SCOPE => '1' for the index column family (default value: d) on the active cluster.
Example: alter 'indexedTable',NAME => 'd',REPLICATION_SCOPE => 1
9.2 Remove the configuration of HIndexWALCoprocessor from hbase.coprocessor.wal.classes.

Parameter settings

1. Log in to MRS Manager of the cluster.
2. Choose Service > HBase > Service Configuration and set Type to All. The HBase configuration page is displayed.

Navigation path: HMaster > System
Configuration item: hbase.coprocessor.master.classes
Default value: org.apache.hadoop.hbase.hindex.server.master.HIndexMasterCoprocessor
Description: This coprocessor is used to handle Master-level operations after the HIndex function is enabled, for example, creating an index meta table, adding an index, and deleting an index, a table, and index metadata.

Navigation path: RegionServer > RegionServer
Configuration item: hbase.coprocessor.regionserver.classes
Default value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionServerCoprocessor
Description: This coprocessor is used to handle the operations that the Master delivers to RegionServer after the HIndex function is enabled.

Navigation path: RegionServer > RegionServer
Configuration item: hbase.coprocessor.region.classes
Default value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexRegionCoprocessor
Description: This coprocessor is used to operate data in the Region after the HIndex function is enabled.

Navigation path: RegionServer > RegionServer
Configuration item: hbase.coprocessor.wal.classes
Default value: org.apache.hadoop.hbase.hindex.server.regionserver.HIndexWALCoprocessor
Description: This coprocessor is used for replication. It filters index data to prevent index data from being sent to the peer cluster; the peer cluster generates index data by itself.

NOTE

1. The default value is the value that must be configured after the HBase HIndex function is enabled. It is already configured by default in MRS clusters that support the HBase HIndex function.

2. Ensure that the master parameter is configured on HMaster and that the region and regionserver parameters are configured on RegionServer.

Related APIs

The APIs that use HIndex are in the org.apache.hadoop.hbase.hindex.client.HIndexAdmin class. The following table describes the related APIs.

Operation: Add an index.

API: addIndices()
Description: Adds an index to a table without data. Calling this API adds the specified index to the table but skips index data generation. Therefore, after this operation, the index cannot be used for scanning and filtering operations. This API applies to scenarios where you want to add indexes in batches to tables that already contain a large amount of user data; the index data is then built with an external tool such as the TableIndexer tool.

API: addIndicesWithData()
Description: Adds an index to a table with data. This method adds the specified index to the table and creates index data for the existing user data. Alternatively, the method can be invoked to generate an index and then generate index data as the user data is stored. Therefore, after this operation, the index can immediately be used for scanning and filtering operations.

Precautions:
1. An index cannot be modified once it is added. To modify an index, delete the old index and then create a new one.
2. Do not create two indexes with different index names on the same column. Otherwise, storage and processing resources are wasted.
3. Indexes cannot be added to a system table.
4. The append and increment operations are not supported when data is put into the index column.
5. If any fault other than DoNotRetryIOException occurs on the client, try again.
6. An index column family is selected from the following candidates in sequence, based on availability:
6.1. Typically, the default index column family is d. However, if hindex.default.family.name is set, its value is used.
6.2. The symbol #, @, $, or %.
6.3. #0, @0, $0, %0, #1, @1 ... to #255, @255, $255, %255.
6.4. Otherwise, an exception is thrown.
7. You can use the HIndex TableIndexer tool to add indexes without building index data.

Operation: Delete an index.

API: dropIndices()
Description: Deletes an index only. This API deletes the specified index from the table but skips the corresponding index data. After this operation, the index cannot be used for scanning and filtering operations. The cluster automatically deletes the old index data during major compaction. This API applies to scenarios where a table contains a large amount of index data and dropIndicesWithData() is unavailable. In addition, you can use the TableIndexer tool to delete indexes together with their index data.

API: dropIndicesWithData()
Description: Deletes index data. This API deletes the specified index and all index data corresponding to the index in the user table. After this operation, the index is completely deleted from the table and is no longer used for scanning and filtering operations.

Precautions:
1. An index can be deleted when it is in the ACTIVE, INACTIVE, or DROPPING state.
2. If you use dropIndices() to delete an index, ensure that the index data has been deleted (that is, major compaction has been completed) before an index with the same name is added to the table.
3. An index is also deleted when you delete:
3.1. the column family on which the index is created;
3.2. any one of the column families in a combination index.
4. Indexes and index data can be deleted together using the HIndex TableIndexer tool.

Operation: Enable/Disable an index.

API: disableIndices()
Description: Disables all indexes specified by the user so that they are no longer used for scanning and filtering operations.

API: enableIndices()
Description: Enables all indexes specified by the user so that they can be used for scanning and filtering operations.

Precautions:
1. An index can be enabled when it is in the ACTIVE, INACTIVE, or BUILDING state.
2. An index can be disabled when it is in the ACTIVE or INACTIVE state.
3. Before disabling an index, ensure that the index data is consistent with the user data. If no new data is added to the table while the index is disabled, the index data remains consistent with the user data.
4. When enabling an index, you can use the TableIndexer tool to build index data to ensure data consistency.

Operation: View the created indexes.

API: listIndices()
Description: Lists all indexes of a specified table.

Precautions: None
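The sketch below shows how these APIs are typically driven from a Java client. Only the class org.apache.hadoop.hbase.hindex.client.HIndexAdmin and the method names listed above come from this guide; the factory helper (HIndexClient.newHIndexAdmin) and the argument shapes are assumptions for illustration and should be verified against the HIndex client JAR shipped with your cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
// HIndexClient is a hypothetical helper; verify the actual factory in your client JAR.
import org.apache.hadoop.hbase.hindex.client.HIndexAdmin;
import org.apache.hadoop.hbase.hindex.client.HIndexClient;

public class HIndexAdminSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Hypothetical factory call; how HIndexAdmin is instantiated may differ by version.
            HIndexAdmin indexAdmin = HIndexClient.newHIndexAdmin(conn.getAdmin());
            TableName table = TableName.valueOf("test");
            // listIndices() reports each index and its state; an index must be
            // ACTIVE before it can serve scans (see the NOTE later in this section).
            System.out.println(indexAdmin.listIndices(table));
            // Typical lifecycle calls (argument shapes are assumptions):
            // indexAdmin.disableIndices(table, indexNames);  // stop using the indexes
            // indexAdmin.enableIndices(table, indexNames);   // resume using them
            // indexAdmin.dropIndices(table, indexNames);     // drop definitions only
        }
    }
}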

Querying data based on indexes

You can use a filter to query data in a user table with an index. The query result of a user table with a single or combination index is the same as that of a table without an index, but the table with an index provides higher data query performance.

The index usage rules are as follows:

1. Scenario 1: A single index is created for one or more columns.

When this column is used for AND or OR query filtering, an index can improve query performance.

Example: Filter_Condition(IndexCol1) AND/OR Filter_Condition(IndexCol2)


When you use "Index Column AND Non-Index Column" for filtering in the query, the indexcan improve query performance.

Example: Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol2)ANDFilter_Condition(NonIndexCol1)

When you use "Index Column OR Non-Index Column" for filtering in the query but do notuse an index, query performance will not be improved.

Example: Filter_Condition(IndexCol1)AND / OR Filter_Condition(IndexCol2) ORFilter_Condition(NonIndexCol1)

2. Scenario 2: A combination index is created for multiple columns.

When the columns to be queried are all or part of the combination index and are in the same order as the combination index, using the index improves query performance.

For example, create a combination index for C1, C2, and C3. The index takes effect in the following situations:

Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2) AND Filter_Condition(IndexCol3)

Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol2)

Filter_Condition(IndexCol1)

The index does not take effect in the following situations:

Filter_Condition(IndexCol2) AND Filter_Condition(IndexCol3)

Filter_Condition(IndexCol1) AND Filter_Condition(IndexCol3)

Filter_Condition(IndexCol2)

Filter_Condition(IndexCol3)

When you use "Index Column AND Non-Index Column" for filtering in the query, the indexcan improve query performance.

Example:

Filter_Condition(IndexCol1)AND Filter_Condition(NonIndexCol1)

Filter_Condition(IndexCol1)AND Filter_Condition(IndexCol2)ANDFilter_Condition(NonIndexCol1)

When you use "Index Column OR Non-Index Column" for filtering in the query but do notuse an index, query performance will not be improved.

Example:

Filter_Condition(IndexCol1)OR Filter_Condition(NonIndexCol1)

(Filter_Condition(IndexCol1)ANDFilter_Condition(IndexCol2))OR(Filter_Condition(NonIndexCol1))

When multiple columns are used for a query, you can specify a value range for only the last column in the combination index; the other columns must be set to specific values.

For example, create a combination index for C1, C2, and C3. In a range query, only the value range of C3 can be set. The filter criteria are "C1 = XXX, C2 = XXX, and C3 = value range."
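As an illustration of this rule, the following minimal sketch issues such a query through the standard HBase 1.x client API: equality filters on the first two index columns and a range on the last one. The table name test, column family info, and columns c1/c2/c3 are assumptions for the example; the filter classes themselves are standard HBase client APIs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CombinationIndexQuery {
    // Equality filter on one index column; filterIfMissing=true is recommended
    // for index columns (see the NOTE later in this section).
    private static SingleColumnValueFilter eq(String col, String val) {
        SingleColumnValueFilter f = new SingleColumnValueFilter(
                Bytes.toBytes("info"), Bytes.toBytes(col),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes(val));
        f.setFilterIfMissing(true);
        return f;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("test"))) {
            // C1 = v1 AND C2 = v2 AND C3 in [20, 30): only the last index
            // column carries a range, matching the rule above.
            SingleColumnValueFilter c3Low = new SingleColumnValueFilter(
                    Bytes.toBytes("info"), Bytes.toBytes("c3"),
                    CompareFilter.CompareOp.GREATER_OR_EQUAL, Bytes.toBytes("20"));
            c3Low.setFilterIfMissing(true);
            SingleColumnValueFilter c3High = new SingleColumnValueFilter(
                    Bytes.toBytes("info"), Bytes.toBytes("c3"),
                    CompareFilter.CompareOp.LESS, Bytes.toBytes("30"));
            c3High.setFilterIfMissing(true);
            FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                    eq("c1", "v1"), eq("c2", "v2"), c3Low, c3High);
            Scan scan = new Scan();
            scan.setFilter(filters);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}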


Best query policy

Use SingleColumnValueFilter or SingleColumnRangeFilter with a definite value for the column_family:qualifier pair (called col1) in the filter criteria.

If col1 is the first column of an index in the table, that index can be a candidate index for the query. Example:

If there is an index on col1, that index can be used as a candidate index, because col1 is the first and only column of the index. If there is another index on col1 and col2, that index can also be considered a candidate index, because col1 is the first column in its index list. On the other hand, if there is an index on col2 and col1, this index cannot be used as a candidate index, because the first column in its index list is not col1.

When there are multiple candidate indexes, the most suitable index for scanning the data must be selected from them.

Use the following rules to select the best index.

1. A full match is preferred.

Scenario: There are two indexes available, one on col1 and col2 and the other on col1.

In this scenario, the second index is better than the first one, because it scans less index data.

2. If there are multiple candidate multi-column indexes, select the index with fewer index columns.

Scenario: There are two indexes available, one on col1 and col2 and the other on col1, col2, and col3.

In this case, it is better to use the index on col1 and col2, because it scans less index data.

NOTE

1. During a query based on an index, the index state must be ACTIVE. You can invoke the listIndices() API to view the index state.

2. To ensure that correct data is queried based on the index, keep the index data consistent with the user data.

3. Run the following command to perform a complex query on the HBase shell client (assuming that an index has been created for the specified column):

scan 'tablename', {FILTER => "SingleColumnValueFilter(family, qualifier, compareOp, comparator, filterIfMissing, latestVersionOnly)"}

Example: scan 'test', {FILTER => "SingleColumnValueFilter('info', 'age', =, '26', true, true)"}

In the preceding scenario, if you want rows in which the queried column does not exist to be kept in the result, do not create an index on that column: when a SingleColumnValueFilter (SCVF) is used to scan index columns, rows in which the queried column does not exist are filtered out. When an SCVF whose filterIfMissing is false (the default value) scans non-index columns, rows in which the queried column does not exist are also returned in the result. Therefore, to avoid inconsistent query results, you are advised to set filterIfMissing to true when creating an SCVF for an index column.

4. Run the following command in hbase shell to view the index data created for user data:

scan 'tablename', {ATTRIBUTES => {'FETCH_INDEX_DATA' => 'true'}}

8.6.4.2 Loading Index Data in Batches

Prerequisites


Only MRS 1.7 or later supports the HBase function of loading index data in batches.

Scenario

HBase provides the ImportTsv and LoadIncrementalHFiles tools to load user data in batches. HIndexImportTsv is provided to load both user data and index data in batches. HIndexImportTsv inherits all the functions of the HBase batch data loading tool ImportTsv. If a table has not been created before the HIndexImportTsv tool is executed, an index will be created when the table is created, and index data is generated when user data is generated.

Procedure

1. Run the following commands to import data to HDFS:

hdfs dfs -mkdir <inputdir>

hdfs dfs -put <local_data_file> <inputdir>

For example, define the data file data.txt as follows:

12005000201,Zhang San,Male,19,Shenzhen City, Guangdong Province
12005000202,Li Wanting,Female,23,Hangzhou City, Zhejiang Province
12005000203,Wang Ming,Male,26,Ningbo City, Zhejiang Province
12005000204,Li Gang,Male,18,Xiangyang City, Hubei Province
12005000205,Zhao Enru,Female,21,Shangrao City, Jiangxi Province
12005000206,Chen Long,Male,32,Zhuzhou City, Hunan Province
12005000207,Zhou Wei,Female,29,Nanyang City, Henan Province
12005000208,Yang Yiwen,Female,30,Wenzhou City, Zhejiang Province
12005000209,Xu Bing,Male,26,Weinan City, Shanxi Province
12005000210,Xiao Kai,Male,25,Dalian City, Liaoning Province

Run the following commands:

hdfs dfs -mkdir /datadirImport

hdfs dfs -put data.txt /datadirImport

2. Run the following command to create the bulkTable table:

create 'bulkTable', {NAME => 'info', COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'}, {NAME => 'address'}

3. Run the following command to generate an HFile (StoreFile):

hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexImportTsv -Dimporttsv.separator=<separator> -Dimporttsv.bulk.output=</path/for/output> -Dindexspecs.to.add=<indexspecs> -Dimporttsv.columns=<columns> <tablename> <inputdir>

– -Dimporttsv.separator: Indicates a separator, for example, -Dimporttsv.separator=','.
– -Dimporttsv.bulk.output=</path/for/output>: Indicates the output path of the execution result. You need to specify a path that does not exist.
– <columns>: Indicates the mapping of the imported data in the table, for example, -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:gender,info:age,address:city,address:province.
– <tablename>: Indicates the name of the table to be operated.
– <inputdir>: Indicates the directory where data is loaded in batches.
– -Dindexspecs.to.add=<indexspecs>: Indicates the mapping between an index name and a column, for example, -Dindexspecs.to.add='index_bulk=>info:[age->String]'. Figure 8-2 shows the index composition, which can be represented as follows:

indexNameN=>familyN:[columnQualifierN->columnQualifierDataType],[columnQualifierM->columnQualifierDataType];familyM:[columnQualifierO->columnQualifierDataType]#indexNameN=>familyM:[columnQualifierO->columnQualifierDataType]

Figure 8-2 Index

Column qualifiers are separated by commas (,).
Example: "index1 => f1:[c1-> String],[c2-> String]"
Column families are separated by semicolons (;).
Example: "index1 => f1:[c1-> String],[c2-> String]; f2:[c3-> Long]"
Multiple indexes are separated by pound keys (#).
Example: "index1 => f1:[c1-> String],[c2-> String]; f2:[c3-> Long]#index2 => f2:[c3-> Long]"
The following data types are supported by columns: STRING, INTEGER, FLOAT, LONG, DOUBLE, SHORT, BYTE, CHAR.

NOTE

1. Data types can also be specified in lowercase.

For example, run the following command:

hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexImportTsv -Dimporttsv.separator=',' -Dimporttsv.bulk.output=/dataOutput -Dindexspecs.to.add='index_bulk=>info:[age->String]' -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:gender,info:age,address:city,address:province bulkTable /datadirImport/data.txt

Command output:

[root@shap000000406 opt]# hbase org.apache.hadoop.hbase.hindex.mapreduce.HIndexImportTsv -Dimporttsv.separator=',' -Dimporttsv.bulk.output=/dataOutput -Dindexspecs.to.add='index_bulk=>info:[age->String]' -Dimporttsv.columns=HBASE_ROW_KEY,info:name,info:gender,info:age,address:city,address:province bulkTable /datadirImport/data.txt
2018-05-08 21:29:16,059 INFO [main] mapreduce.HFileOutputFormat2: Incremental table bulkTable output configured.
2018-05-08 21:29:16,069 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
2018-05-08 21:29:16,069 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x80007c2cb4fd5b4d
2018-05-08 21:29:16,072 INFO [main] zookeeper.ZooKeeper: Session: 0x80007c2cb4fd5b4d closed
2018-05-08 21:29:16,072 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x80007c2cb4fd5b4d
2018-05-08 21:29:16,379 INFO [main] client.ConfiguredRMFailoverProxyProvider: Failing over to 147
2018-05-08 21:29:17,328 INFO [main] input.FileInputFormat: Total input files to process : 1
2018-05-08 21:29:17,413 INFO [main] mapreduce.JobSubmitter: number of splits:1
2018-05-08 21:29:17,430 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2018-05-08 21:29:17,687 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1525338489458_0002
2018-05-08 21:29:18,100 INFO [main] impl.YarnClientImpl: Submitted application application_1525338489458_0002
2018-05-08 21:29:18,136 INFO [main] mapreduce.Job: The url to track the job: http://shap000000407:8088/proxy/application_1525338489458_0002/
2018-05-08 21:29:18,136 INFO [main] mapreduce.Job: Running job: job_1525338489458_0002
2018-05-08 21:29:28,248 INFO [main] mapreduce.Job: Job job_1525338489458_0002 running in uber mode : false
2018-05-08 21:29:28,249 INFO [main] mapreduce.Job: map 0% reduce 0%
2018-05-08 21:29:38,344 INFO [main] mapreduce.Job: map 100% reduce 0%
2018-05-08 21:29:51,421 INFO [main] mapreduce.Job: map 100% reduce 100%
2018-05-08 21:29:51,428 INFO [main] mapreduce.Job: Job job_1525338489458_0002 completed successfully
2018-05-08 21:29:51,523 INFO [main] mapreduce.Job: Counters: 50

4. Run the following command to import the generated HFile to HBase:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles </path/for/output> <tablename>

For example, run the following command:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /dataOutput bulkTable

Command output:

[root@shap000000406 opt]# hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /dataOutput bulkTable
2018-05-08 21:30:01,398 WARN [main] mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://hacluster/dataOutput/_SUCCESS
2018-05-08 21:30:02,006 INFO [LoadIncrementalHFiles-0] hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
2018-05-08 21:30:02,006 INFO [LoadIncrementalHFiles-2] hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
2018-05-08 21:30:02,006 INFO [LoadIncrementalHFiles-1] hfile.CacheConfig: Created cacheConfig: CacheConfig:disabled
2018-05-08 21:30:02,085 INFO [LoadIncrementalHFiles-2] compress.CodecPool: Got brand-new decompressor [.snappy]
2018-05-08 21:30:02,120 INFO [LoadIncrementalHFiles-0] mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/address/042426c252f74e859858c7877b95e510 first=12005000201 last=12005000210
2018-05-08 21:30:02,120 INFO [LoadIncrementalHFiles-2] mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/info/f3995920ae0247a88182f637aa031c49 first=12005000201 last=12005000210
2018-05-08 21:30:02,128 INFO [LoadIncrementalHFiles-1] mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hacluster/dataOutput/d/c53b252248af42779f29442ab84f86b8 first=\x00index_bulk\x00\x00\x00\x00\x00\x00\x00\x0018\x00\x0012005000204 last=\x00index_bulk\x00\x00\x00\x00\x00\x00\x00\x0032\x00\x0012005000206
2018-05-08 21:30:02,231 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
2018-05-08 21:30:02,231 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x81007c2cf0f55cc5
2018-05-08 21:30:02,235 INFO [main] zookeeper.ZooKeeper: Session: 0x81007c2cf0f55cc5 closed
2018-05-08 21:30:02,235 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x81007c2cf0f55cc5

8.6.4.3 Using an Index Generation Tool

Prerequisites

Only MRS 1.7 or later supports the HBase function of using an index generation tool to generate index data.


Scenario

To quickly create indexes for user data, HBase provides the TableIndexer tool, which allows you to create, add, and delete indexes using MapReduce functions. The application scenarios are as follows:

- You want to add an index for a specified column in a table where a large amount of data already exists. If you use the addIndicesWithData() API to add an index, index data corresponding to the related user data is generated, which is time-consuming. If you use addIndices() to create an index, index data corresponding to the user data is not built. Therefore, to create index data for existing user data, you can use the TableIndexer tool to create the index.

- If the index data is inconsistent with the user data, the tool can be used to rebuild the index data.

– If you temporarily disable an index, put new data into the disabled index column, and then directly enable the index again, the index data and user data may be inconsistent. Therefore, you must rebuild all index data before using it again.

- You can use the TableIndexer tool to completely delete a large amount of existing index data from a user table.

- For user tables that do not have indexes, this tool allows you to add and build indexes at the same time.

Use Method

- Adding a new index to the user table

Run the following command:

hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexspecs.to.add='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]'

The following parameters are required:

– tablename.to.index: Indicates the name of the table for which an index is created.
– indexspecs.to.add: Indicates the mapping between the index name and the columns in the corresponding user table.
– scan.caching (optional): Contains an integer value, indicating the number of cached rows to be transmitted to the scanner during data table scanning.

The parameters in the preceding command are described as follows:

– idx_0, idx_1: Indicate index names.
– cf_0: Indicates the name of a column family.
– q_0: Indicates the name of a column.
– string: Indicates a data type. The value can be STRING, INTEGER, FLOAT, LONG, DOUBLE, SHORT, BYTE, or CHAR.


NOTE

- The pound key (#) is used to separate indexes. The semicolon (;) is used to separate column families. The comma (,) is used to separate column qualifiers.

- The column name and its data type must be enclosed in '[]'.

- Column names and their data types are separated by '->'.

- If the data type of a specific column is not specified, the default data type (string) is used.

- If scan.caching is not configured, the default value 1000 is used.

- The user table must exist.

- The specified index must not already exist in the table.

- If a column family named d exists in the user table, you must use the TableIndexer tool to build index data.

After the preceding command is executed, the specified index is added to the table and is in the INACTIVE state. This behavior is similar to the addIndices() API.

- Creating index data for existing indexes in the user table

The command is as follows:

hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexnames.to.build='idx_0#idx_1'

The following parameters are required:

– tablename.to.index: Indicates the name of the table for which an index is created.
– indexnames.to.build: Indicates the names of the indexes to build.
– scan.caching (optional): Contains an integer value, indicating the number of cached rows to be transmitted to the scanner during data table scanning.

The parameters in the preceding command are described as follows:

– idx_0, idx_1: Indicate index names.

NOTE

- The pound key (#) is used to separate index names.

- If scan.caching is not configured, the default value 1000 is used.

- The user table must exist.

After the preceding command is executed, the specified indexes are set to the ACTIVE state. Users can use them when scanning data.

- Deleting the existing indexes and their data from the user table

The command is as follows:

hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexnames.to.drop='idx_0#idx_1'

The following parameters are required:

– tablename.to.index: Indicates the name of the table for which an index is created.
– indexnames.to.drop: Indicates the names of the indexes that should be deleted together with their data (they must exist in the table).
– scan.caching (optional): Contains an integer value, indicating the number of cached rows to be transmitted to the scanner during data table scanning.

The parameters in the preceding command are described as follows:

– idx_0, idx_1: Indicate index names.


NOTE

- The pound key (#) is used to separate index names.
- If scan.caching is not configured, the default value 1000 is used.
- The user table must exist.

After the preceding command is executed, the specified indexes are deleted from the table.

- Adding new indexes to user tables and building data based on existing data

The command is as follows:

hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=tablename -Dindexspecs.to.add='idx_0=>cf_0:[q_0->string],[q_1];cf_1:[q_2],[q_3]#idx_1=>cf_1:[q_4]' -Dindexnames.to.build='idx_0'

NOTE

- The parameters are the same as the previous ones.
- The user table must exist.
- The indexes specified in indexspecs.to.add must not exist in the table.
- The index names specified in indexnames.to.build must exist in the table or be part of the value of indexspecs.to.add.

After the preceding command is executed, all indexes specified in indexspecs.to.add are added to the table, and index data is built for all indexes specified in indexnames.to.build.

8.6.4.4 Migrating Index Data

Scenarios

The indexes used in MRS 1.7 or later are incompatible with the secondary indexes used by HBase in earlier MRS versions. Therefore, you need to perform the following operations to migrate index data from an earlier version (MRS 1.5 or earlier) to MRS 1.7 or later.

Prerequisites

1. During data migration, the cluster of the old version must be MRS 1.5 or earlier, and the cluster of the new version must be MRS 1.7 or later.

2. Before data migration, you must have the old index data.

3. A cross-cluster mutual trust relationship must be configured and the inter-cluster replication function must be enabled for a security cluster. For a non-security cluster, only the inter-cluster replication function needs to be enabled. For details, see Configuring Cross-Cluster Mutual Trust Relationships and Enabling the Cross-Cluster Copy Function.

Procedure

Migrate user data from the old cluster to the new cluster. To migrate data, you need to manually synchronize the data of the old and new clusters, table by table, using export, distcp, and import.

For example, the old cluster has a user table (t1, index name: idx_t1) and its corresponding index table (t1_idx). Perform the following operations to migrate data.

1. Export table data from the old cluster:

hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true <tableName> <path/for/data>

– <tableName>: Indicates a table name, for example, t1.
– <path/for/data>: Indicates the path for storing source data, for example, /user/hbase/t1.


Example: hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.mapreduce.include.deleted.rows=true t1 /user/hbase/t1

2. Copy the exported data to the new cluster as follows:

hadoop distcp <path/for/data> hdfs://ActiveNameNodeIP:9820/<path/for/newData>

– <path/for/data>: Indicates the path for storing source data in the old cluster, for example, /user/hbase/t1.
– <path/for/newData>: Indicates the path for storing source data in the new cluster, for example, /user/hbase/t1.

ActiveNameNodeIP indicates the IP address of the active NameNode in the new cluster.

Example: hadoop distcp /user/hbase/t1 hdfs://192.168.40.2:9820/user/hbase/t1

NOTE

1. Manually copy the exported data to HDFS of the new cluster, for example, /user/hbase/t1.

2. For MRS 1.6.2 or earlier versions, the default port is 25000. For details, see List of Open Source Component Ports.

3. Use the HBase table user of the new cluster to generate HFiles in the new cluster:

hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=<path/for/hfiles> <tableName> <path/for/newData>

– <path/for/hfiles>: Indicates the path of the HFiles generated in the new cluster, for example, /user/hbase/output_t1.
– <tableName>: Indicates a table name, for example, t1.
– <path/for/newData>: Indicates the path for storing source data in the new cluster, for example, /user/hbase/t1.

Example:

hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/user/hbase/output_t1 t1 /user/hbase/t1

4. Import the generated HFiles to the table in the new cluster. The command is as follows:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <path/for/hfiles> <tableName>

– <path/for/hfiles>: Indicates the path of the HFiles generated in the new cluster, for example, /user/hbase/output_t1.
– <tableName>: Indicates a table name, for example, t1.

Example:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hbase/output_t1 t1

NOTE

1. The preceding shows the process of migrating user data. To migrate the index data of the old cluster, you only need to perform the first three steps and change the corresponding table name to the index table name (for example, t1_idx).

2. Skip Step 4 when migrating index data.

5. Import index data to a table in the new cluster.

a. Add the same index as that of the user table of the previous version to the user table of the new cluster (a column family named d must not exist in the user table). The command is as follows:

hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=<tableName> -Dindexspecs.to.add=<indexspecs>


– -Dtablename.to.index=<tableName>: Indicates a table name, for example, -Dtablename.to.index=t1.

– -Dindexspecs.to.add=<indexspecs>: Indicates the mapping between an index name and a column, for example, -Dindexspecs.to.add='idx_t1=>info:[name->String]'.

Example:

hbase org.apache.hadoop.hbase.hindex.mapreduce.TableIndexer -Dtablename.to.index=t1 -Dindexspecs.to.add='idx_t1=>info:[name->String]'

NOTE

If a column family named d exists in the user table, you must use the TableIndexer tool to build index data.

b. Run the LoadIncrementalHFiles tool to load the index data of the old cluster to a table in the new cluster. The command is as follows:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles </path/for/hfiles> <tableName>

– </path/for/hfiles>: Indicates the path of the index data on HDFS. The path is the index generation path specified in -Dimport.bulk.output, for example, /user/hbase/output_t1_idx.

– <tableName>: Indicates a table name in the new cluster, for example, t1.

Example:

hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hbase/output_t1_idx t1

8.7 Using Hue

8.7.1 Accessing the Hue WebUI

Scenario

After Kerberos authentication is enabled and Hue is installed for an MRS cluster, users can use Hadoop and Hive on the Hue WebUI.

This section describes how to open the Hue WebUI on an MRS cluster supporting Kerberos authentication.

NOTE

To access the Hue WebUI, you are advised to use a browser that is compatible with the Hue WebUI, for example, Google Chrome 50. Internet Explorer may be incompatible with the Hue WebUI.

Impact on the System

Site trust must be added to the browser when you access MRS Manager and the Hue WebUI for the first time. Otherwise, the Hue WebUI cannot be accessed.


Prerequisites

The MRS cluster administrator has assigned the permission for using Hive to the user. For details, see Creating a User. For example, create a Human-machine user hueuser, add the user to the hive group, and assign the user the System_administrator role.

Procedure

Step 1 Access MRS Manager.

For details, see Accessing MRS Manager Supporting Kerberos Authentication.

Step 2 On MRS Manager, choose Service > Hue. In Hue WebUI of Hue Summary, click Hue (Active). The Hue WebUI is opened.

Hue WebUI provides the following functions:

- If Hive is installed in the MRS cluster, you can use Query Editors to execute Hive query statements.

- If Hive is installed in the MRS cluster, you can use Data Browsers to manage Hive tables.

- If HDFS is installed in the MRS cluster, you can use File Browser to view directories and files in HDFS.

- If Yarn is installed in the MRS cluster, you can use Job Browser to view all jobs in the MRS cluster.

NOTE

- After obtaining the URL for accessing the Hue WebUI, you can give the URL to other users who cannot access MRS Manager so that they can access the Hue WebUI.

- If you perform operations only on the Hue WebUI and not on MRS Manager, you must enter the password of the current login user when accessing MRS Manager again.

----End

8.7.2 Using HiveQL Editor on the Hue WebUI

Scenario

After Kerberos authentication is enabled for an MRS cluster, users can use the Hue WebUI to execute HiveQL statements in the cluster.

Prerequisites

The MRS cluster administrator has assigned the permission for using Hive to the user. For details, see Creating a User.

Accessing Query Editors

Step 1 Access the Hue WebUI and choose Query Editors > Hive. The Hive page is displayed.

Hive supports the following functions:

- Executes and manages HiveQL statements.
- Queries HiveQL statements saved by the current user in Saved Queries.


- Queries HiveQL statements executed by the current user in Query History.

Step 2 Click to display all databases in Hive.

----End

Executing HiveQL Statements

Step 1 Access Query Editors.

Step 2 Select a Hive database in Databases. The default database is default.

The system displays all available tables. You can enter a keyword of the table name to search for the desired table.

Step 3 Click the desired table name. All columns in the table are displayed.

Move the cursor to the row of the table and click . Column details are displayed.

Step 4 Enter the query statements in the area for editing HiveQL statements.

Click and choose Explain. The editor checks the syntax and the execution plan of the entered statements. If the statements have syntax errors, the editor reports "Error while compiling statement".

Step 5 Select the engine for executing the HiveQL statements.

- mr: MapReduce computing framework
- spark: Spark computing framework

Step 6 Click to execute the HiveQL statements.

NOTE

- If you want to use the entered HiveQL statements again, click to save them.

- To format the HiveQL statements, click and choose Format.

- To delete the entered HiveQL statements, click and choose Clear.

- To clear the entered statements and start a new query, click and choose New query.

----End

Querying Execution Results

Step 1 View the execution results below the execution area on the Hive page. The Query History tab is displayed by default.

Step 2 Click a result to view the executed statements.

----End

Managing Statements

Step 1 Access Query Editors.


Step 2 Click Saved Queries.

Click a saved statement. The system automatically fills the statement in the editing area.

----End

Modifying Query Editors Settings

Step 1 On the Hive page, click .

Step 2 Click on the right side of Files, and click to specify the save path of the file.

You can click to add a file resource.

Step 3 Click on the right side of Functions. Enter the name of the user-defined function and the function class.

You can click to add a function.

Step 4 Click on the right side of Settings. Enter the Hive parameter name in Key under Settings and the parameter value in Value. The session connects to Hive using the user-defined configuration.

You can click to add a parameter.

----End

8.7.3 Using the Metadata Browser on the Hue WebUI

Scenario

After Kerberos authentication is enabled for an MRS cluster, users can use the Hue WebUI to manage Hive metadata in the cluster.

Prerequisites

The MRS cluster administrator has assigned the permission for using Hive to the user. For details, see Creating a Role.

Accessing the Metadata Browser

Step 1 Access the Hue WebUI.

Step 2 Choose Data Browsers > Metastore Tables to access Metastore Manager.

Metastore Manager supports the following functions:

- Creating a Hive table from a file
- Manually creating a Hive table
- Viewing Hive table metadata

----End


Creating a Hive Table from a File

Step 1 Access Metastore Manager and select a database in Databases.

The default database is default.

Step 2 Click to access the Create a new table from a file page.

Step 3 Select a file.

1. In Table Name, enter a Hive table name.

A Hive table name contains no more than 128 letters, numbers, or underscores (_) and must start with a letter or number.

2. In Description, enter description about the Hive table as required.

3. In Input File or Location, click and select a file in HDFS for creating a Hive table. The file is used to store the new data of the Hive table.

If the file is not stored in HDFS, click Upload a file to upload the file from the local directory to HDFS. Multiple files can be uploaded simultaneously. The files cannot be empty.

4. If you want to import the data in the file to the Hive table, select Import data (selected by default) in Load method.

If you select Create External Table, an external Hive table is created.

NOTE

If you select Create External Table, select a path in Input File or Location.

If you select Leave Empty, an empty Hive table is created.

5. Click Next.

Step 4 Set a delimiter.

1. In Delimiter, select one.

If the delimiter you want to select is not in the list, select Other.. and enter a delimiter.

2. Click Preview to preview data processing.

3. Click Next.

Step 5 Define a column.

1. If you click on the right side of Use first row as column names, the first row of data in the file is used as the column names. If you do not click it, the first row of data is not used as the column names.

2. In Column name, set a name for each column.

A column name contains no more than 128 letters, numbers, or underscores (_) and must start with a letter or number.

NOTE

You can rename columns in batches by clicking on the right side of Bulk edit column names. Enter all column names and separate them by commas (,).

3. In Column Type, select a type for each column.


Step 6 Click Create Table to create the table. Wait for Hue to display information about the Hive table.

----End

Manually Creating a Hive Table

Step 1 Access Metastore Manager and select a database in Databases.

The default database is default.

Step 2 Click to access the Create a new table manually page.

Step 3 Set a table name.

1. In Table Name, enter a Hive table name.

A Hive table name contains no more than 128 letters, numbers, or underscores (_) and must start with a letter or number.

2. In Description, enter a description of the Hive table as required.

3. Click Next.

Step 4 Select a data storage format.

- If data needs to be separated by delimiters, select Delimited and perform Step 5.
- If data needs to be stored in serialization format, select SerDe and perform Step 6.

Step 5 Set a delimiter.

1. In Field terminator, set a column delimiter. If the delimiter you want to select is not in the list, select Other.. and enter a delimiter.

2. In Collection terminator, set a delimiter to separate the data set of columns of the array type in Hive. For example, the type of a column is array, and a value needs to store employee and manager. If the user specifies : as the delimiter, the final value is employee:manager.

3. In Map key terminator, set a delimiter to separate the data set of columns of the map type in Hive. For example, the type of a column is map, and a value needs to store home described as aaa and company described as bbb. If the user defines | as the delimiter, the final value is home|aaa:company|bbb.

4. Click Next and perform Step 7.

Step 6 Set serialization properties.

1. In SerDe Name, enter the class name of the serialization format: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe. Users can extend Hive to support more user-defined serialization classes.

2. In Serde properties, enter the values of the serialization format: "field.delim"="," "colelction.delim"=":" "mapkey.delim"="|".

3. Click Next and perform Step 7.

Step 7 Select a data table format and click Next.

- TextFile: indicates that data is stored in text files.
- SequenceFile: indicates that data is stored in binary files.
- InputFormat: indicates that data in files is used in user-defined input and output formats.


Users can extend Hive to support more user-defined formatting classes.

a. In InputFormat Class, enter the class used by the input data: org.apache.hadoop.hive.ql.io.RCFileInputFormat.

b. In OutputFormat Class, enter the class used by the output data: org.apache.hadoop.hive.ql.io.RCFileOutputFormat.

Step 8 Select a file storage location and click Next.

Use default location is selected by default. If you want to customize a storage location, deselect the default value and specify a file storage location in External location.

Step 9 Set columns of the Hive table.

1. In Column name, set a column name.

A column name contains no more than 128 letters, numbers, or underscores (_) and must start with a letter or number.

2. In Column type, select a column type. Click Add a column to add a new column.

3. Click Add a partition to add partitions for the Hive table, which can improve query efficiency.

Step 10 Click Create Table to create the table. Wait for Hue to display information about the Hive table.

----End

Managing the Hive Table

Step 1 Access Metastore Manager and select a database in Databases. All tables in the database are displayed on the page.

The default database is default.

Step 2 Click a table name in the database to view table details.

The following operations are supported: importing data, browsing data, deleting tables, and viewing the file storage location.

NOTE

When viewing all tables in the database, you can select tables and perform the following operations: viewing tables, browsing data, and deleting tables.

----End

8.7.4 Using File Browser on the Hue WebUI

Scenario

After Kerberos authentication is enabled for an MRS cluster, users can use the Hue WebUI to manage files in HDFS.

Prerequisites

The MRS cluster administrator has granted the user permission to view, create, modify, and delete files in HDFS. For details, see Creating a User.


Accessing File Browser

Step 1 Access the Hue WebUI and click File Browser.

Step 2 You can view the home directory of the current login user.

On the File Browser page, the following information about subdirectories and files in the directory is displayed.

Table 8-9 HDFS file attributes

Name: Name of a directory or file
Size: File size
User: Owner of a directory or file
Group: Group of a directory or file
Permissions: Permission of a directory or file
Date: Time when a directory or file is created

Step 3 In the search box, enter a keyword. The system automatically searches for directories or files in the current directory.

Step 4 Clear the search box. The system displays all directories or files.

----End

Performing Operations

Step 1 On File Browser, select one or more directories or files.

Step 2 Click Actions. On the menu that is displayed, select an operation.

- Rename: renames a directory or file.
- Move: moves a file. In Move to, select a new directory and click Move.
- Copy: copies the selected files or directories.
- Download: downloads the selected files. Directories are not supported.
- Change permissions: changes the permission to access the selected directory or file.
  – You can grant the owner, the group, or other users the Read, Write, and Execute permissions.
  – Sticky indicates that only HDFS administrators, directory owners, and file owners can delete or move files in the directory.
  – Recursive indicates that permission is granted to subdirectories recursively.
- Storage policies: sets the policies for storing files or directories in HDFS.
- Summary: views HDFS storage information about the selected file or directory.

----End


Deleting Directories or Files

Step 1 On File Browser, select one or more directories or files.

Step 2 Click Move to trash. In Confirm Delete, click Yes to move them to the recycle bin.

If you want to directly delete the files without moving them to the recycle bin, click and select Delete forever. In Confirm Delete, click Yes to confirm the operation.

----End

Accessing Other Directories

Step 1 Click the directory name, type the complete path of the directory you want to access, for example, /mr-history/tmp, and press Enter.

The current user must have permission to access other directories.

Step 2 Click Home to go to the home directory.

Step 3 Click History. The history records of directory access are displayed, and the directories can be accessed again.

Step 4 Click Trash to access the recycle bin of the current directory.

Click Empty trash to clean up the recycle bin.

----End

Uploading User Files

Step 1 On File Browser, click Upload.

Step 2 Select an operation.

- Files: uploads user files to the current directory.

- Zip/Tgz/Bz2 file: uploads a compressed file. In the dialog box that is displayed, click Zip, Tgz or Bz2 file to select the compressed file to be uploaded. The system automatically decompresses the file in HDFS. Compressed files in ZIP, TGZ, and BZ2 formats are supported.

----End

Creating a New File or Directory

Step 1 On File Browser, click New.

Step 2 Select an operation.

- File: creates a file. Enter a file name and click Create.

- Directory: creates a directory. Enter a directory name and click Create.

----End


8.7.5 Using Job Browser on the Hue WebUI

Scenario

After Kerberos authentication is enabled for an MRS cluster, users can use the Hue WebUI to query all jobs in the cluster.

Accessing Job Browser

Step 1 Access the Hue WebUI and click Job Browser.

Step 2 View the jobs in the cluster.

NOTE

The number on Job Browser indicates the total number of jobs in the cluster.

Job Browser displays the following job information.

Table 8-10 MRS job attributes

Logs: Log information. If a job has logs, is displayed.
ID: Job ID, which is generated by the system automatically.
Name: Job name.
Application Type: Job type information.
Status: Job status. Possible values are RUNNING, SUCCEEDED, FAILED, and KILLED.
User: User who starts the job.
Maps: Map progress.
Reduces: Reduce progress.
Queue: Yarn queue used for job running.
Priority: Job running priority.
Duration: Job running duration.
Submitted: Time when the job is submitted to the MRS cluster.

NOTE

If the MRS cluster has Spark, the Spark-JDBCServer job is started by default to execute tasks.

----End


Searching for Jobs

Step 1 Enter keywords in Username or Text on Job Browser to search for the desired jobs.

Step 2 Clear the search criteria. The system displays all jobs.

----End

Querying Job Details

Step 1 In the job list on Job Browser, click the row that contains the desired job to view details.

Step 2 On the Metadata tab page, you can view the metadata of the job.

NOTE

You can click to open job running logs.

----End

8.8 Using Kafka

8.8.1 Managing Kafka Topics

Scenario

Users can manage Kafka topics on the MRS cluster client to meet service requirements. For clusters with Kerberos authentication enabled, the management permission is required.

Prerequisites

The client has been updated.

Procedure

Step 1 On MRS Manager, choose Service > ZooKeeper > Instance. Query the IP addresses of the ZooKeeper instances.

Record the IP address of any ZooKeeper instance.

Step 2 Log in to the node where the client is installed.

For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see Client Management.

Step 3 Run the following command to switch the user:

sudo su - omm

Step 4 Run the following command to switch to the client directory, for example, /opt/client/Kafka/kafka/bin.

cd /opt/client/Kafka/kafka/bin

Step 5 Run the following command to configure the environment variables:


source /opt/client/bigdata_env

Step 6 If Kerberos authentication is enabled, run the following command to authenticate the user. IfKerberos authentication is disabled, skip this step.

kinit Kafka username

For example, kinit admin

Step 7 Manage Kafka topics using the following commands:

- Create a topic.

sh kafka-topics.sh --create --topic Topic name --partitions Number of partitions used by the topic --replication-factor Number of replicas of the topic --zookeeper IP address of the node where the ZooKeeper instance is located:clientPort/kafka

- Delete a topic.

sh kafka-topics.sh --delete --topic Topic name --zookeeper IP address of the node where the ZooKeeper instance is located:clientPort/kafka

NOTE

- The number of topic partitions or topic backups cannot exceed the number of Kafka instances.

- By default, the clientPort of ZooKeeper is 2181. For MRS 1.6.2 or earlier, the default ZooKeeper clientPort is 24002. For details, see List of Open Source Component Ports.

- There are three ZooKeeper instances. Use the IP address of any one.

- For details about managing messages in Kafka topics, see Managing Messages in Kafka Topics.

----End

8.8.2 Querying Kafka Topics

Scenario

Users can query existing Kafka topics on MRS Manager.

Procedure

Step 1 Log in to MRS Manager.

Step 2 Choose Service > Kafka > Kafka Topic Monitor.

All topics are displayed in the list by default. Users can view the number of partitions and backups of the topics.

Step 3 Click the desired topic in the list to view its details.

----End

8.8.3 Managing Kafka User Permission

Scenario

For clusters with Kerberos authentication enabled, using Kafka requires the relevant permission. MRS clusters can grant the Kafka use permission to different users.

Table 8-11 lists the default Kafka user groups.


Table 8-11 Default Kafka user groups

kafkaadmin: Kafka administrator group. Users in this group have the permission to create, delete, read, and write all topics, and to authorize other users.

kafkasuperuser: Kafka super user group. Users in this group have the permission to read and write all topics.

kafka: Kafka common user group. Users in this group must be authorized by users in the kafkaadmin group to read and write certain topics.

Prerequisites

- The client has been updated.
- A user in the kafkaadmin group, for example admin, has been prepared.

Procedure

Step 1 On MRS Manager, choose Service > ZooKeeper > Instance. Query the IP addresses of the ZooKeeper instances.

Record the IP address of any ZooKeeper instance.

Step 2 Log in to the node where the client is installed.

For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see Client Management.

Step 3 Run the following command to switch the user:

sudo su - omm

Step 4 Run the following command to switch to the client directory, for example, /opt/client/Kafka/kafka/bin.

cd /opt/client/Kafka/kafka/bin

Step 5 Run the following command to configure the environment variables:

source /opt/client/bigdata_env

Step 6 Run the following command to authenticate the Kafka administrator account.

kinit Administrator account

For example, kinit admin

Step 7 Manage Kafka user permission using the following commands:

- Query the permission list of a topic.

sh kafka-acls.sh --authorizer-properties zookeeper.connect=IP address of the node where the ZooKeeper instance is located:2181/kafka --list --topic Topic name

- Add producer permission to a user.


sh kafka-acls.sh --authorizer-properties zookeeper.connect=IP address of the node where the ZooKeeper instance is located:2181/kafka --add --allow-principal User:Username --producer --topic Topic name

- Remove producer permission of a user.

sh kafka-acls.sh --authorizer-properties zookeeper.connect=IP address of the node where the ZooKeeper instance is located:2181/kafka --remove --allow-principal User:Username --producer --topic Topic name

- Add consumer permission to a user.

sh kafka-acls.sh --authorizer-properties zookeeper.connect=IP address of the node where the ZooKeeper instance is located:2181/kafka --add --allow-principal User:Username --consumer --topic Topic name --group Consumer group name

- Remove consumer permission of a user.

sh kafka-acls.sh --authorizer-properties zookeeper.connect=IP address of the node where the ZooKeeper instance is located:2181/kafka --remove --allow-principal User:Username --consumer --topic Topic name --group Consumer group name

NOTE

You need to enter y twice to confirm the removal of the permission.

For MRS 1.6.2 or earlier, the default ZooKeeper port is 24002. For details, see List of Open Source Component Ports.

----End

8.8.4 Managing Messages in Kafka Topics

Scenario

Users can produce or consume messages in Kafka topics using the MRS cluster client. For clusters with Kerberos authentication enabled, the user must have the permission to perform these operations.

Prerequisites

The client has been updated.

Procedure

Step 1 On MRS Manager, choose Service > Kafka > Instance. Query the IP addresses of the Kafka instances.

Record the IP address of any Kafka instance.

Step 2 Log in to the node where the client is installed.

For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see Client Management.

Step 3 Run the following command to switch the user:

sudo su - omm

Step 4 Run the following command to switch to the client directory, for example, /opt/client/Kafka/kafka/bin.


cd /opt/client/Kafka/kafka/bin

Step 5 Run the following command to configure the environment variables:

source /opt/client/bigdata_env

Step 6 If Kerberos authentication is enabled, run the following command to authenticate the user. If Kerberos authentication is disabled, skip this step.

kinit Kafka username

For example, kinit admin

Step 7 Manage messages in Kafka topics using the following commands:

- Produce messages.

sh kafka-console-producer.sh --broker-list IP address of the node where the Kafka instance is located:9092 --topic Topic name --producer.config /opt/client/Kafka/kafka/config/producer.properties

You can input specified information as the messages produced by the producer and then press Enter to send the messages. To end message producing, press Ctrl+C to exit.

- Consume messages.

sh kafka-console-consumer.sh --topic Topic name --bootstrap-server IP address of the node where the Kafka instance is located:9092 --new-consumer --consumer.config /opt/client/Kafka/kafka/config/consumer.properties

In the configuration file, group.id (indicating the consumer group) is set to example-group1 by default. Users can change the value as required. The value takes effect each time a consumption occurs.

By default, the system reads unprocessed messages in the current consumer group when the command is executed. If a new consumer group is specified in the configuration file and the --from-beginning parameter is added to the command, the system reads all messages that have not been automatically deleted in Kafka.

NOTE

- For the IP address of the node where the Kafka instance is located, use the IP address of any Broker instance.

- If Kerberos authentication is enabled, change the port to 21007.

- For MRS 1.6.2 or earlier, the default non-security port is 21005. For details, see List of Open Source Component Ports.
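For example, assuming a Broker instance at 192.168.0.11 and a topic named test_topic (both illustrative), a producer and a consumer can be started as follows:

sh kafka-console-producer.sh --broker-list 192.168.0.11:9092 --topic test_topic --producer.config /opt/client/Kafka/kafka/config/producer.properties

sh kafka-console-consumer.sh --topic test_topic --bootstrap-server 192.168.0.11:9092 --new-consumer --consumer.config /opt/client/Kafka/kafka/config/consumer.properties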

----End

8.9 Using Storm

8.9.1 Submitting Storm Topologies on the Client

Scenario

Users can submit Storm topologies on the MRS cluster client to continuously process stream data. For clusters with Kerberos authentication enabled, users who submit topologies must be members of the stormadmin or storm group.


Prerequisites

The client has been updated.

Procedure

Step 1 Log in to the node where the client is installed.

For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see Client Management.

Step 2 Run the following command to switch the user:

sudo su - omm

Step 3 Run the following command to switch to the client directory, for example, /opt/client:

cd /opt/client

Step 4 Run the following command to configure the environment variables:

source bigdata_env

Step 5 If Kerberos authentication is enabled, run the following command to authenticate the user. IfKerberos authentication is disabled, skip this step.

kinit Storm username

For example, kinit admin

Step 6 Run the following command to submit the Storm topology:

storm jar Topology package path Class name of the main topology method Topology name

Example:

storm jar /opt/client/Storm/storm-1.0.2/examples/storm-starter/storm-starter-topologies-1.0.2.jar org.apache.storm.starter.WordCountTopology topo1

If the following information is displayed, the topology is submitted successfully.

Finished submitting topology: topo1

NOTE

- To support sampling messages, add the topology.debug and topology.eventlogger.executors parameters. For example:

storm jar /opt/client/Storm/storm-1.0.2/examples/storm-starter/storm-starter-topologies-1.0.2.jar org.apache.storm.starter.WordCountTopology topo1 -c topology.debug=true -c topology.eventlogger.executors=1

- Data processing methods vary with topologies. The topology in the example generates characters randomly and separates character strings. To query the processing status, enable the sampling function and perform operations according to Querying Data Processing Logs of the Topology.

Step 7 Run the following command to query Storm topologies. For clusters with Kerberos authentication enabled, only users in the stormadmin or storm group can query all topologies.

storm list

----End


8.9.2 Accessing the Storm WebUI

Scenario

The Storm WebUI provides a graphical interface for using Storm. Only streaming clusters with Kerberos authentication enabled support this function.

Prerequisites

- The password of user admin has been obtained. The admin password is specified by the user when the MRS cluster is created.
- If a user other than admin is used to access the Storm WebUI, the user must be added to the storm or stormadmin user group.

Procedure

Step 1 Access MRS Manager.

Step 2 Choose Service > Storm. In Storm WebUI of Storm Summary, click any UI link to access the Storm WebUI.

NOTE

When accessing the Storm WebUI for the first time, you must add the address to the trusted site list.

The following information can be queried on the Storm WebUI:

- Storm cluster summary
- Nimbus summary
- Topology summary
- Supervisor summary
- Nimbus configurations

----End

Relevant Tasks

Query topology details.

Step 1 Access the Storm WebUI.

Step 2 In Topology Summary, click the desired topology to view its detailed information, status, Spouts information, Bolts information, and configurations.

----End

8.9.3 Managing Storm Topologies

Scenario

Users can manage Storm topologies on the Storm WebUI. Users in the storm group can manage only the topology tasks submitted by themselves, while users in the stormadmin group can manage all topology tasks.


Procedure

Step 1 Access the Storm WebUI.

Step 2 In the Topology summary area, click the desired topology.

Step 3 Use options in Topology actions to manage the Storm topology.

- Activate the topology.
Click Activate to activate the topology.
- Deactivate the topology.
Click Deactivate to deactivate the topology.
- Re-deploy the topology.
Click Rebalance and specify the wait time (in seconds) of re-deployment. Generally, if the number of nodes in a cluster changes, the topology can be re-deployed to maximize resource usage.
- Delete the topology.
Click Kill and specify the wait time (in seconds) of the deletion.
- Start or stop sampling messages.
Click Debug. In the dialog box that is displayed, specify the percentage of the sampled data volume. For example, if the value is set to 10, 10% of data is sampled. To stop sampling, click Stop Debug.

NOTE

This function is available only when the sampling function is enabled during topology submission. For details about querying data processing information, see Querying Data Processing Logs of the Topology.

- Modify the topology log level.
Click Change Log Level to specify the new log level.

Step 4 Display the topology.

In the Topology Visualization area, click Show Visualization to visualize the topology.

----End

8.9.4 Querying Storm Topology Logs

Scenario

Users can query topology logs to check the execution of a Storm topology in a worker process. To query the data processing logs of a topology, users must enable the Debug function when submitting the topology. Only streaming clusters with Kerberos authentication enabled support this function. In addition, the user who queries topology logs must be the one who submitted the topology or a member of the stormadmin group.

Prerequisites

- The network of the work environment has been configured according to Related Operations.
- The sampling function has been enabled for the topology.


Querying Worker Process Logs

Step 1 Access the Storm WebUI.

Step 2 In the Topology Summary area, click the desired topology to view details.

Step 3 Click the desired Spouts or Bolts task. In the Executors (All time) area, click a port in Port to view detailed logs.

----End

Querying Data Processing Logs of the Topology

Step 1 Access the Storm WebUI.

Step 2 In the Topology Summary area, click the desired topology to view details.

Step 3 Click Debug, specify the data sampling ratio, and click OK.

Step 4 Click the Spouts or Bolts task. In Component summary, click events to view data processing logs.

----End

8.10 Using CarbonData

8.10.1 Getting Started with CarbonData

This section describes the procedure of using Spark CarbonData. All tasks are based on the Spark-beeline environment. The following tasks are included:

1. Connect to Spark.
Before performing any operation on CarbonData, users must connect CarbonData to Spark.

2. Create a CarbonData table.
After CarbonData connects to Spark, users must create a CarbonData table to load and query data.

3. Load data to the CarbonData table.
Users load data from CSV files in HDFS to the CarbonData table.

4. Query data in CarbonData.
After data is loaded to the CarbonData table, users can run query commands such as groupby and where.

Prerequisites

The client has been updated.

Procedure

Step 1 Connect CarbonData to Spark.


1. Log in to the node where the client is installed.
For example, if you have updated the client on the Master2 node, log in to the Master2 node to use the client. For details, see Client Management.

2. Switch the user and configure environment variables.
sudo su - omm
source /opt/client/bigdata_env

3. If Kerberos authentication is enabled, run the following command to authenticate the user. If Kerberos authentication is disabled, skip this step.
kinit Spark username
The user must be added to the hive group.

4. Run the following command to connect to the Spark environment.
spark-beeline

Step 2 Create a CarbonData table.

Run the following command to create a CarbonData table, which is used to load and query data.

CREATE TABLE x1 (imei string, deviceInformationId int, mac string, productdate timestamp, updatetime timestamp, gamePointId double, contractNumber double)

STORED BY 'org.apache.carbondata.format'

TBLPROPERTIES('DICTIONARY_EXCLUDE'='mac','DICTIONARY_INCLUDE'='deviceInformationId');

Command result:
+---------+--+
| result  |
+---------+--+
+---------+--+
No rows selected (1.551 seconds)

Step 3 Load data from CSV files to the CarbonData table.

Currently, only CSV files are supported. The CSV column names specified in the LOAD command must be the same and in the same sequence as the column names in the CarbonData table. The data formats and number of data columns in the CSV files must also be the same as those in the CarbonData table.

The CSV files must be stored on HDFS. Users can upload the files to OBS and import them from OBS to HDFS on the File Management page of the MRS management console. If Kerberos authentication is enabled, prepare the CSV files in the work environment and import them to HDFS using open-source HDFS commands. In addition, assign the Spark user the read and execute permissions on the files in HDFS.

For example, the data.csv file is saved in the tmp directory of HDFS and has the following contents:
x123,111,dd,2017-04-20 08:51:27,2017-04-20 07:56:51,2222,33333

The command for loading data from the file is as follows:

LOAD DATA inpath 'hdfs://hacluster/tmp/data.csv' into table x1 options('DELIMITER'=',','QUOTECHAR'='"','FILEHEADER'='imei,deviceinformationid,mac,productdate,updatetime,gamepointid,contractnumber');


Command result:

+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (3.039 seconds)

Step 4 Query data in the CarbonData table.

- Obtaining the number of records

Run the following command to obtain the number of records in the CarbonData table:
select count(*) from x1;

- Querying with the groupby condition
Run the following command to obtain the deviceinformationid records without repetition in the CarbonData table:
select deviceinformationid,count(distinct deviceinformationid) from x1 group by deviceinformationid;

- Querying with the where condition
Run the following command to obtain specific deviceinformationid records:
select * from x1 where deviceinformationid='111';

NOTE

If the query result has Chinese or other non-English characters, the columns in the query result may not be aligned. This is because characters of different languages occupy different widths.

Step 5 Run the following command to exit the Spark environment.

!quit

----End

8.10.2 About CarbonData Table

Overview

CarbonData tables are similar to tables in a relational database management system (RDBMS). RDBMS tables consist of rows and columns to store data. CarbonData tables have fixed columns and also store structured data. In CarbonData, data is saved in entity files.

Data Types Supported

CarbonData tables support the following data types:

- Int
- String
- BigInt
- Decimal
- Double
- TimeStamp

Table 8-12 describes the details about the data types.


Table 8-12 CarbonData data types

- Int: 4-byte signed integer ranging from -2,147,483,648 to 2,147,483,647.
NOTE: If non-dictionary columns contain Int data, the data is saved as BigInt data in the table.
- String: The maximum character string length is 100,000.
- BigInt: Data is saved using the 64-bit technology. The value ranges from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807.
- Decimal: The default value is (10,0) and the maximum value is (38,38).
NOTE: If a query condition is used, users can add BD after the number to obtain accurate results. For example, select * from carbon_table where num = 1234567890123456.22BD.
- Double: Data is saved using the 64-bit technology. The value ranges from 4.9E-324 to 1.7976931348623157E308.
- TimeStamp: The default format is yyyy-MM-dd HH:mm:ss.

NOTE

Measurement of all Int data is processed and displayed using the BigInt data type.

8.10.3 Creating a CarbonData Table

Scenario

A CarbonData table must be created to load and query data.

Creating a Table with Self-Defined Columns

Users can create a table by specifying its columns and data types. For analysis clusters with Kerberos authentication enabled, if a user wants to create a CarbonData table in a database other than the default database, the Create permission of the database must be added to the role that the user is bound to in Hive role management.

Command example:

CREATE TABLE IF NOT EXISTS productdb.productSalesTable (
productNumber Int,
productName String,
storeCity String,
storeProvince String,
revenue Int)
STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES (
'table_blocksize'='128',
'DICTIONARY_EXCLUDE'='productName',
'DICTIONARY_INCLUDE'='productNumber');

The following table describes the command parameters.

Table 8-13 Parameter description

- productSalesTable: Indicates the table name. The table is used to load data for analysis. The table name consists of letters, digits, and underscores (_).
- productdb: Indicates the database name. The database maintains logical connections with the tables that it stores to identify and manage the tables. The database name consists of letters, digits, and underscores (_).
- productNumber, productName, storeCity, storeProvince, revenue: Indicate the columns in the table. The columns are the service entities for data analysis. A column name (field name) consists of letters, digits, and underscores (_).
- table_blocksize: Indicates the block size of the data files used by the CarbonData table. The value ranges from 1 MB to 2048 MB; the default is 1024 MB.
  - If table_blocksize is too small, a large number of small files will be generated when data is loaded. This may affect the performance of HDFS.
  - If table_blocksize is too large, a large volume of data must be read from a block and the read concurrency is low when data is queried. As a result, the query performance deteriorates.
It is advised to set the block size based on the data volume. For example, set the block size to 256 MB for GB-level data, 512 MB for TB-level data, and 1024 MB for PB-level data.
- DICTIONARY_EXCLUDE: Specifies the columns that do not generate dictionaries. This parameter is optional and applicable to columns of high complexity. By default, the system generates dictionaries for columns of the String type. However, as the number of values in the dictionaries increases, conversion operations by the dictionaries increase and the system performance deteriorates. Generally, if a column has over 50,000 unique data records, it is considered a highly complex column and dictionary generation should be disabled.
NOTE: Non-dictionary columns support only the String and Timestamp data types.
- DICTIONARY_INCLUDE: Specifies the columns that generate dictionaries. This parameter is optional and applicable to columns of low complexity (with fewer than 50,000 unique data records). It improves the performance of queries with the groupby condition.
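As an illustration of the sizing advice above, a table intended for GB-level data might set a 256 MB block size; this is a minimal sketch in which the table name and columns are illustrative:

CREATE TABLE IF NOT EXISTS productdb.salesGBTable (id Int, name String)
STORED BY 'org.apache.carbondata.format'
TBLPROPERTIES('table_blocksize'='256');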

8.10.4 Deleting a CarbonData Table

Scenario

Unused CarbonData tables can be deleted. After a CarbonData table is deleted, its metadata and loaded data are deleted together.

Procedure

Step 1 Run the following command to delete a CarbonData table.

DROP TABLE [IF EXISTS] [db_name.]table_name;

db_name is optional. If db_name is not specified, the table named table_name in the current database is deleted.

For example, run the following command to delete the productSalesTable table in the productdb database:

DROP TABLE productdb.productSalesTable;

Step 2 Run the following command to check whether the table is deleted.

SHOW TABLES;

----End

8.11 Using Flume


8.11.1 Introduction

Process

The process for collecting logs using Flume is as follows:

1. Install the Flume client.

2. Configure the Flume server and client parameters.

3. Collect and query logs using the Flume client.

4. Stop and uninstall the Flume client.

Flume Client

A Flume client consists of the source, channel, and sink. The source sends the data to the channel, and then the sink transmits the data from the channel to the external device.

Table 8-14 Module description

- Source: A source receives or generates data and sends the data to one or more channels. Sources can work in either data-driven or polling mode.
Typical sources include:
  - Syslog and Netcat, which are integrated in the system to receive data
  - Exec and SEQ, which generate event data automatically
  - Avro, which is used for communication between agents
A source must be associated with at least one channel.
- Channel: A channel buffers data between a source and a sink. After the sink transmits the data to the next channel or the destination, the cache is deleted automatically.
The persistency of a channel varies with its type:
  - Memory channel: no persistency
  - File channel: persistency implemented based on write-ahead logging (WAL)
  - JDBC channel: persistency implemented based on the embedded database
Channels support the transaction feature to ensure simple sequential operations. A channel can work with sources and sinks of any quantity.
- Sink: A sink transmits data to the next hop or destination. After the transmission is complete, it deletes the data from the channel.
Typical sinks include:
  - HDFS and Kafka, which store data to the destination
  - Null sink, which automatically consumes the data
  - Avro, which is used for communication between agents
A sink must be associated with at least one channel.


A Flume client can have multiple sources, channels, and sinks. A source can send data to multiple channels, and then multiple sinks send the data out of the client.

Multiple Flume clients can be cascaded. That is, a sink can send data to the source of another client.

Supplementary Information

1. What are the reliability measures of Flume?
- The transaction mechanism is implemented between sources and channels, and between channels and sinks.
- The sink processor supports failover and load balancing (see the failover sketch after this list). The following is an example of the load balancing configuration:
server.sinkgroups=g1
server.sinkgroups.g1.sinks=k1 k2
server.sinkgroups.g1.processor.type=load_balance
server.sinkgroups.g1.processor.backoff=true
server.sinkgroups.g1.processor.selector=random

2. What are the precautions for the aggregation and cascading of multiple Flume clients?
- Use the Avro or Thrift protocol for cascading.
- When the aggregation end contains multiple nodes, evenly distribute the clients to these nodes. Do not connect all the clients to a single node.
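For the failover case, a sink group assigns each sink a priority; this is a minimal sketch based on the failover sink processor in open-source Flume, with illustrative sink names and priority values:

server.sinkgroups=g1
server.sinkgroups.g1.sinks=k1 k2
server.sinkgroups.g1.processor.type=failover
server.sinkgroups.g1.processor.priority.k1=10
server.sinkgroups.g1.processor.priority.k2=5
server.sinkgroups.g1.processor.maxpenalty=10000

The sink with the highest priority handles events until it fails, after which the next-highest-priority sink takes over.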

8.11.2 Installing the Flume Client

Scenario

To use Flume to collect logs, you must install the Flume client on the log host. You can create an ECS to install the client.

Prerequisites

- A streaming cluster with the Flume component has been created.
- The log host is in the same VPC and subnet as the cluster. For details, see Using the Client on Another Node of a VPC.
- You have obtained the username and password for logging in to the log host.

Procedure

Step 1 Create an ECS that meets the requirements in the prerequisites.

Step 2 Log in to MRS Manager. Choose Service > Flume > Download Client.

1. In Client Type, select All client files.
2. In Download Path, select Remote host.
3. Set Host IP Address to the IP address of the ECS, set Host Port to 22, and set Save Path to /home/linux.
- If the default port 22 for logging in to an ECS using SSH has been changed, set Host Port to the new port.
- Save Path contains a maximum of 256 characters.
4. For clusters of versions earlier than MRS 1.6.2, set Login User to linux. For clusters of MRS 1.6.2 or later, set Login User to root.


If other users are used, ensure that they have read, write, and execute permission on the save path.

5. In SSH Private Key, select and upload the private key used for creating the ECS.
6. Click OK to start downloading the client to the ECS.

If the following information is displayed, the client package is successfully saved.
Client files downloaded to the remote host successfully.

Step 3 Click Instance. Query the Business IP Address of any Flume instance and any two MonitorServer instances.

Step 4 Log in to the ECS using VNC. See "Logging In to a Linux ECS Using VNC" in the Elastic Cloud Server User Guide (Getting Started > Logging In to an ECS > Logging In to a Linux ECS Using VNC).

All images support Cloud-Init. For clusters of versions earlier than MRS 1.6.2, the preset username and password for Cloud-Init are linux and cloud.1234, respectively. If you have changed the password, log in to the ECS using the new password. For clusters of MRS 1.6.2 or later, the preset username for Cloud-Init is root and the password is the one you set during cluster creation. See "How Do I Log In to an ECS Once All Images Support Cloud-Init?" in the ECS FAQs. It is recommended that you change the password upon the first login.

Step 5 On the ECS, switch to user root and copy the installation package to the /opt directory.

sudo su - root

cp /home/linux/MRS_Flume_Client.tar /opt

Step 6 Run the following command in the /opt directory to decompress the package and obtain the verification file and the configuration package of the client:

tar -xvf MRS_Flume_Client.tar

Step 7 Run the following command to verify the configuration package of the client:

sha256sum -c MRS_Flume_ClientConfig.tar.sha256

The command output is as follows:

MRS_Flume_ClientConfig.tar: OK

Step 8 Run the following command to decompress MRS_Flume_ClientConfig.tar:

tar -xvf MRS_Flume_ClientConfig.tar

Step 9 Run the following command to install the client running environment to a new directory, for example, /opt/Flumeenv. The directory is automatically generated during installation.

sh /opt/MRS_Flume_ClientConfig/install.sh /opt/Flumeenv

If the following information is displayed, the client running environment is successfully installed:

Components client installation is complete.

Step 10 Run the following command to configure the environment variable:

source /opt/Flumeenv/bigdata_env

Step 11 Run the following commands to decompress the Flume client package:

cd /opt/MRS_Flume_ClientConfig/Flume


tar -xvf FusionInsight-Flume-1.6.0.tar.gz

Step 12 Run the following command to check whether the password of the current user has expired:

chage -l root

If the value of Password expires is earlier than the current time, the password has expired. Run the chage -M -1 root command to validate the password.

Step 13 Run the following command to install the Flume client to a new directory, for example, /opt/FlumeClient. The directory is automatically generated during installation.

sh /opt/MRS_Flume_ClientConfig/Flume/install.sh -d /opt/FlumeClient -f Service IP addresses of the MonitorServer instances -c Path of the Flume configuration file -l /var/log/ -e Service IP address of Flume -n Name of the Flume client

The parameters are described as follows:

- -d: indicates the installation path of the Flume client.
- -f: (optional) indicates the service IP addresses of the two MonitorServer instances, separated by a comma. If the IP addresses are not configured, the Flume client will not send alarm information to MonitorServer, and the client information will not be displayed on MRS Manager.
- -c: (optional) indicates the properties.properties configuration file that the Flume client loads after installation. If this parameter is not specified, the fusioninsight-flume-1.6.0/conf/properties.properties file in the client installation directory is used by default. The configuration file of the client is empty. You can modify properties.properties as required and the Flume client will load it automatically.
- -l: (optional) indicates the log directory. The default value is /var/log/Bigdata.
- -e: (optional) indicates the service IP address of the Flume instance. It is used to receive the monitoring indicators reported by the client.
- -n: (optional) indicates the name of the Flume client.
- IBM JDK does not support -Xloggc. You must change -Xloggc to -Xverbosegclog in flume/conf/flume-env.sh. For a 32-bit JDK, the value of -Xmx must not exceed 3.25 GB.
- In flume/conf/flume-env.sh, the default value of -Xmx is 4 GB. If the client memory is too small, you can change it to 512 MB or even 1 GB.

For example, run sh install.sh -d /opt/FlumeClient.

If the following information is displayed, the client is successfully installed:

install flume client successfully.
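A fuller invocation that also registers the client with MonitorServer, reports monitoring indicators, and names the client might look like this (all IP addresses and the client name are illustrative):

sh /opt/MRS_Flume_ClientConfig/Flume/install.sh -d /opt/FlumeClient -f 192.168.0.20,192.168.0.21 -e 192.168.0.22 -n FlumeClient-1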

----End

8.11.3 Viewing Flume Client Logs

Scenario

This section describes how to locate problems using logs.

Prerequisites

You have correctly installed the Flume client.


Procedure

Step 1 Go to the Flume client log directory (/var/log/Bigdata by default).

Step 2 Run the following command to view the list of log files:

ls -lR flume-client-*

A log file example is shown as follows:

flume-client-1/flume:
total 7672
-rw-------. 1 root root 0 Sep 8 19:43 Flume-audit.log
-rw-------. 1 root root 1562037 Sep 11 06:05 FlumeClient.2017-09-11_04-05-09.[1].log.zip
-rw-------. 1 root root 6127274 Sep 11 14:47 FlumeClient.log
-rw-------. 1 root root 2935 Sep 8 22:20 flume-root-20170908202009-pid72456-gc.log.0.current
-rw-------. 1 root root 2935 Sep 8 22:27 flume-root-20170908202634-pid78789-gc.log.0.current
-rw-------. 1 root root 4382 Sep 8 22:47 flume-root-20170908203137-pid84925-gc.log.0.current
-rw-------. 1 root root 4390 Sep 8 23:46 flume-root-20170908204918-pid103920-gc.log.0.current
-rw-------. 1 root root 3196 Sep 9 10:12 flume-root-20170908215351-pid44372-gc.log.0.current
-rw-------. 1 root root 2935 Sep 9 10:13 flume-root-20170909101233-pid55119-gc.log.0.current
-rw-------. 1 root root 6441 Sep 9 11:10 flume-root-20170909101631-pid59301-gc.log.0.current
-rw-------. 1 root root 0 Sep 9 11:10 flume-root-20170909111009-pid119477-gc.log.0.current
-rw-------. 1 root root 92896 Sep 11 13:24 flume-root-20170909111126-pid120689-gc.log.0.current
-rw-------. 1 root root 5588 Sep 11 14:46 flume-root-20170911132445-pid42259-gc.log.0.current
-rw-------. 1 root root 2576 Sep 11 13:24 prestartDetail.log
-rw-------. 1 root root 3303 Sep 11 13:24 startDetail.log
-rw-------. 1 root root 1253 Sep 11 13:24 stopDetail.log

flume-client-1/monitor:
total 8
-rw-------. 1 root root 141 Sep 8 19:43 flumeMonitorChecker.log
-rw-------. 1 root root 2946 Sep 11 13:24 flumeMonitor.log

FlumeClient.log is the run log of the Flume client.

----End

8.11.4 Stopping or Uninstalling the Flume Client

Scenario

This section describes how to stop and start the Flume client as well as uninstall it when the Flume data collection channel is not required.

Procedure

- Stopping the Flume client

Suppose the installation path of the Flume client is /opt/FlumeClient. Run the following command to stop the Flume client:

cd /opt/FlumeClient/fusioninsight-flume-1.6.0/bin


./flume-manage.sh stop

If the following information is displayed after the command execution, the Flume client is successfully stopped.

Stop Flume PID=120689 successful..

NOTE

The Flume client will be automatically restarted after being stopped. If you do not need automatic restart, run the following command:
./flume-manage.sh stop force
If you want to restart the Flume client, run the following command:
./flume-manage.sh start force

- Uninstalling the Flume client

Suppose the installation path of the Flume client is /opt/FlumeClient. Run the following command to uninstall the Flume client:

cd /opt/FlumeClient/fusioninsight-flume-1.6.0/inst

./uninstall.sh

8.11.5 Using the Encryption Tool of the Flume Client

Scenario

The Flume client provides an encryption tool to encrypt some parameter values in the configuration file.

Prerequisites

You have installed the Flume client.

Procedure

Step 1 Log in to the Flume client node and go to the client installation directory, for example, /opt/FlumeClient.

Step 2 Run the following command to switch the directory:

cd fusioninsight-flume-1.6.0/bin

Step 3 Run the following command to encrypt information:

./genPwFile.sh

Input the information that you want to encrypt twice.

Step 4 Run the following command to query the encrypted information:

cat password.property

NOTE

If the encryption parameter is used for the Flume server, you need to perform encryption on the corresponding Flume server node. The path of the encryption script is /opt/Bigdata/FusionInsight/FusionInsight-Flume-1.6.0/flume/bin/genPwFile.sh. Execute the encryption script as user omm.

----End


8.11.6 Flume Configuration Parameter Description

Scenario

This section describes how to configure the sources, channels, and sinks of Flume, and modify the configuration items of each module.

NOTE

You must input encrypted information for some configurations. For details on how to encrypt information, see Using the Encryption Tool of the Flume Client.

Common Source Configurations

- Avro Source
An Avro source listens on the Avro port, receives data from the external Avro client, and places the data into configured channels. Common configurations are as follows.

Table 8-15 Common configurations of an Avro source

- channels (default: -): Channel connected to the source. Multiple channels can be configured but must be separated by spaces. To define the flow within a single agent, you need to link the sources and sinks via a channel. A source instance can specify multiple channels, but a sink instance can specify only one channel. The format is as follows:
<Agent>.sources.<Source>.channels = <channel1> <channel2> <channel3>...
<Agent>.sinks.<Sink>.channels = <channel1>
- type (default: avro): Type, which is set to avro. The type of each source is fixed.
- bind (default: -): Host name or IP address that the source binds to.
- port (default: -): Bound port.
- ssl (default: false): Indicates whether to use SSL encryption. The value can be true or false.
- truststore-type (default: JKS): Java truststore type. Enter JKS or another supported Java truststore type.
- truststore (default: -): Java truststore file.
- truststore-password (default: -): Java truststore password.
- keystore-type (default: JKS): Keystore type. Enter JKS or another supported Java keystore type.
- keystore (default: -): Keystore file.
- keystore-password (default: -): Keystore password.
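Pulling these parameters together, a minimal Avro source definition might look like this (the agent prefix, source name, channel name, bind address, and port are illustrative):

client.sources = avro_source
client.sources.avro_source.type = avro
client.sources.avro_source.bind = 192.168.0.30
client.sources.avro_source.port = 21154
client.sources.avro_source.channels = static_log_channel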

- Spooling Source
A Spooling source monitors and transmits new files that have been added to directories in quasi-real-time mode. Common configurations are as follows.

Table 8-16 Common configurations of a Spooling source

- channels (default: -): Channel connected to the source. Multiple channels can be configured.
- type (default: spooldir): Type, which is set to spooldir.
- monTime (default: 0, disabled): Thread monitoring threshold. When the update time (seconds) exceeds the threshold, the source is restarted.
- spoolDir (default: -): Monitoring directory.
- fileSuffix (default: .COMPLETED): Suffix added after file transmission is complete.
- deletePolicy (default: never): Source file deletion policy after file transmission is complete. The value can be either never or immediate.
- ignorePattern (default: ^$): Regular expression of a file to be ignored.
- trackerDir (default: .flumespool): Metadata storage directory during transmission.
- batchSize (default: 1000): Source transmission granularity.
- decodeErrorPolicy (default: FAIL): Code error policy. The options are FAIL, REPLACE, and IGNORE. FAIL: throw an exception and make resolution fail. REPLACE: replace unidentified characters with other characters (typically, U+FFFD). IGNORE: directly discard character strings that fail to be resolved.
NOTE: If a code error occurs in the file, set decodeErrorPolicy to REPLACE or IGNORE. Flume will skip the code error and continue to collect subsequent logs.
- deserializer (default: LINE): File parser. The value can be either LINE or BufferedLine. When the value is set to LINE, characters read from the file are transcoded one by one. When the value is set to BufferedLine, one line or multiple lines of characters read from the file are transcoded in batches, which delivers better performance.
- deserializer.maxLineLength (default: 2048): Maximum length for resolution by line. The value ranges from 0 to 2,147,483,647.
- deserializer.maxBatchLine (default: 1): Maximum number of lines for resolution by line. If multiple lines are set, maxLineLength must be set to a corresponding multiplier. For example, if maxBatchLine is set to 2, maxLineLength is set to 4096 (2048 x 2) accordingly.
- selector.type (default: replicating): Selector type. The value can be either replicating or multiplexing. replicating indicates that the same content is sent to every channel. multiplexing indicates that content is selectively sent to some channels according to the distribution rule.
- interceptors (default: -): Interceptor. For details about configuration, see the Flume User Guide.

NOTE

The Spooling source ignores the last line feed character of each event when data is read by line. Therefore, Flume does not calculate the data volume counters used by the last line feed character.

- Kafka Source
A Kafka source consumes data from Kafka topics. Multiple sources can consume data of the same topic, and the sources consume different partitions of the topic. Common configurations are as follows.

Table 8-17 Common configurations of a Kafka source

- channels (default: -): Channel connected to the source. Multiple channels can be configured.
- type (default: org.apache.flume.source.kafka.KafkaSource): Type, which is set to org.apache.flume.source.kafka.KafkaSource.
- monTime (default: 0, disabled): Thread monitoring threshold. When the update time (seconds) exceeds the threshold, the source is restarted.
- nodatatime (default: 0, disabled): Alarm threshold. An alarm is triggered when the duration (seconds) that Kafka does not release data to subscribers exceeds the threshold.
- batchSize (default: 1000): Number of events written into a channel at a time.
- batchDurationMillis (default: 1000): Maximum duration of topic data consumption at a time, in milliseconds.
- keepTopicInHeader (default: false): Indicates whether to save topics in the event header. If topics are saved, topics configured in Kafka sinks become invalid. The value can be true or false.
- keepPartitionInHeader (default: false): Indicates whether to save partition IDs in the event header. If partition IDs are saved, Kafka sinks write data to the corresponding partitions. The value can be true or false.
- kafka.bootstrap.servers (default: -): List of Broker addresses, which are separated by commas.
- kafka.consumer.group.id (default: -): Kafka consumer group ID.
- kafka.topics (default: -): List of subscribed Kafka topics, which are separated by commas.
- kafka.topics.regex (default: -): Subscribed topics that comply with regular expressions. kafka.topics.regex has a higher priority than kafka.topics and will overwrite kafka.topics.
- kafka.security.protocol (default: SASL_PLAINTEXT): Security protocol of Kafka. The value must be set to PLAINTEXT for clusters in which Kerberos authentication is disabled.
- Other Kafka Consumer Properties (default: -): Other Kafka configurations. This parameter can be set to any consumption configuration supported by Kafka, and the .kafka prefix must be added to the configuration.
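As a sketch, a Kafka source subscribing to one topic in a Kerberos-enabled cluster might be configured as follows (the broker address, consumer group ID, topic, and channel name are illustrative):

client.sources.kafka_source.type = org.apache.flume.source.kafka.KafkaSource
client.sources.kafka_source.kafka.bootstrap.servers = 192.168.0.40:21007
client.sources.kafka_source.kafka.consumer.group.id = flume_group
client.sources.kafka_source.kafka.topics = test_topic
client.sources.kafka_source.kafka.security.protocol = SASL_PLAINTEXT
client.sources.kafka_source.channels = static_log_channel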

- Taildir Source
A Taildir source monitors file changes in a directory and automatically reads the file content. In addition, it can transmit data in real time. Common configurations are as follows.

Table 8-18 Common configurations of a Taildir source

- channels (default: -): Channel connected to the source. Multiple channels can be configured.
- type (default: taildir): Type, which is set to taildir.
- filegroups (default: -): Group name of a collection file directory. Group names are separated by spaces.
- filegroups.<filegroupName>.parentDir (default: -): Parent directory. The value must be an absolute path.
- filegroups.<filegroupName>.filePattern (default: -): Relative file path of the file group's parent directory. Directories can be included and regular expressions are supported. It must be used together with parentDir.
- positionFile (default: -): Metadata storage directory during transmission.
- headers.<filegroupName>.<headerKey> (default: -): Key-value of an event when data of a group is being collected.
- byteOffsetHeader (default: false): Indicates whether each event header should contain the location information about the event in the source file. The location information is saved in the byteoffset variable.
- skipToEnd (default: false): Indicates whether Flume can locate the latest location of a file and read the latest data after restart.
- idleTimeout (default: 120000): Idle period during file reading, expressed in milliseconds. If the file data is not changed in this idle period, the source closes the file. If data is written into this file after it is closed, the source opens the file and reads data.
- writePosInterval (default: 3000): Interval for writing metadata to a file, expressed in milliseconds.
- batchSize (default: 1000): Number of events written into a channel in a batch.
- monTime (default: 0, disabled): Thread monitoring threshold. When the update time (seconds) exceeds the threshold, the source is restarted.
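For instance, to tail all .log files under a single directory, a Taildir source might be defined as follows (the directory, file pattern, position file path, and channel name are illustrative):

client.sources.tail_source.type = taildir
client.sources.tail_source.filegroups = f1
client.sources.tail_source.filegroups.f1.parentDir = /var/log/app
client.sources.tail_source.filegroups.f1.filePattern = .*\.log
client.sources.tail_source.positionFile = /opt/flume/taildir_position.json
client.sources.tail_source.channels = static_log_channel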

- HTTP Source
An HTTP source receives data from an external HTTP client and sends the data to the configured channels. Common configurations are as follows.


Table 8-19 Common configurations of an HTTP source

- channels (default: -): Channel connected to the source. Multiple channels can be configured.
- type (default: http): Type, which is set to http.
- bind (default: -): Name or IP address of the bound host.
- port (default: -): Bound port.
- handler (default: org.apache.flume.source.http.JSONHandler): Message parsing method of an HTTP request. The following methods are supported: org.apache.flume.source.http.JSONHandler (JSON) and org.apache.flume.sink.solr.morphline.BlobHandler (BLOB).
- handler.* (default: -): Handler parameters.
- enableSSL (default: false): Indicates whether SSL is enabled in HTTP.
- keystore (default: -): Keystore path after SSL is enabled in HTTP.
- keystorePassword (default: -): Keystore password after SSL is enabled in HTTP.

- OBS Source
An OBS source monitors and transmits new files that have been added to specified buckets in quasi-real-time mode. Common configurations are as follows.

Table 8-20 Common configurations of an OBS source

- channels (default: -): Channel connected to the source. Multiple channels can be configured.
- type (default: http): Type, which is set to org.apache.flume.source.s3.OBSSource.
- bucketName (default: -): OBS bucket name.
- prefix (default: -): Monitored OBS path of the specified bucket. The path cannot start with a slash (/). If this parameter is not set, the root directory of the bucket is monitored by default.
- accessKey (default: -): User AK information.
- secretKey (default: -): User SK information in ciphertext.
- backingDir (default: -): Metadata storage directory during transmission.
- endPoint (default: -): OBS access address. The address must be in the same region as MRS. The value can be either a domain name or an IP address.
- basenameHeader (default: false): Indicates whether to save file names in the event header. false indicates that file names are not saved.
- basenameHeaderKey (default: basename): Name of the field that the event header uses to save a file name, which is also called the key name.
- batchSize (default: 1000): Source transmission granularity.
- decodeErrorPolicy (default: FAIL): Code error policy.
NOTE: If a code error occurs in the file, set decodeErrorPolicy to REPLACE or IGNORE. Flume will skip the code error and continue to collect subsequent logs.
- deserializer (default: LINE): File parser. The value can be either LINE or BufferedLine. When the value is set to LINE, characters read from the file are transcoded one by one. When the value is set to BufferedLine, one line or multiple lines of characters read from the file are transcoded in batches, which delivers better performance.
- deserializer.maxLineLength (default: 2048): Maximum length for resolution by line.
- deserializer.maxBatchLine (default: 1): Maximum number of lines for resolution by line. If multiple lines are set, maxLineLength must be set to a corresponding multiplier.
- selector.type (default: replicating): Selector type. The value can be either replicating or multiplexing.
- interceptors (default: -): Interceptor.

Common Channel Configurations

- Memory Channel
A memory channel uses memory as the cache. Events are stored in memory queues. Common configurations are as follows.

Table 8-21 Common configurations of a memory channel

- type (default: -): Type, which is set to memory.
- capacity (default: 10000): Maximum number of events cached in a channel.
- transactionCapacity (default: 1000): Maximum number of events accessed each time.
- channelfullcount (default: 10): Channel full count. When the count reaches the threshold, an alarm is reported.
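A minimal memory channel definition built from these parameters might be (the channel name is illustrative):

client.channels.mem_channel.type = memory
client.channels.mem_channel.capacity = 10000
client.channels.mem_channel.transactionCapacity = 1000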

- File Channel
A file channel uses local disks as the cache. Events are stored in the folder specified by dataDirs. Common configurations are as follows.

Table 8-22 Common configurations of a file channel

- type (default: -): Type, which is set to file.
- checkpointDir (default: ${BIGDATA_DATA_HOME}/flume/checkpoint): Checkpoint storage directory.
- dataDirs (default: ${BIGDATA_DATA_HOME}/flume/data): Data cache directory. Multiple directories can be configured to improve performance. The directories are separated by commas (,).
- maxFileSize (default: 2146435071): Maximum size of a single cache file, in bytes.
- minimumRequiredSpace (default: 524288000): Minimum idle space in the cache, in bytes.
- capacity (default: 1000000): Maximum number of events cached in a channel.
- transactionCapacity (default: 10000): Maximum number of events accessed each time.
- channelfullcount (default: 10): Channel full count. When the count reaches the threshold, an alarm is reported.

- Memory File Channel
A memory file channel uses both memory and local disks as its cache and supports message persistence. It provides similar performance to a memory channel and better performance than a file channel. Common configurations are as follows.

Table 8-23 Common configurations of a memory file channel

- type (default: org.apache.flume.channel.MemoryFileChannel): Type, which is set to org.apache.flume.channel.MemoryFileChannel.
- capacity (default: 50000): Channel cache: maximum number of events cached in a channel.
- transactionCapacity (default: 5000): Transaction cache: maximum number of events processed by a transaction. The parameter value must be greater than the batchSize of the source and sink, and the value of transactionCapacity must be less than or equal to that of capacity.
- subqueueByteCapacity (default: 20971520): Maximum size (bytes) of events that can be stored in a subqueue. A memory file channel uses both queues and subqueues to cache data: events are stored in a subqueue, and subqueues are stored in a queue. subqueueCapacity and subqueueInterval determine the size of events that can be stored in a subqueue: subqueueCapacity specifies the capacity of a subqueue, and subqueueInterval specifies the duration that a subqueue can store events. Events in a subqueue are sent to the destination only after the subqueue reaches the upper limit of subqueueCapacity or subqueueInterval.
NOTE: The value of subqueueByteCapacity must be greater than the number of events specified by batchSize.
- subqueueInterval (default: 2000): Maximum duration (milliseconds) that a subqueue can store events.
- keep-alive (default: 3): Waiting time of the Put and Take threads when the transaction or channel cache is full, in seconds.
- dataDir (default: -): Cache directory for local files.
- byteCapacity (default: 80% of the maximum JVM memory): Channel cache capacity, in bytes.
- compression-type (default: None): Message compression format. The value can be either None or Snappy. When the format is Snappy, event message bodies that are compressed in the Snappy format can be decompressed.
- channelfullcount (default: 10): Channel full count. When the count reaches the threshold, an alarm is reported.

The following is a configuration example of a memory file channel:
server.channels.c1.type = org.apache.flume.channel.MemoryFileChannel
server.channels.c1.dataDir = /opt/flume/mfdata
server.channels.c1.subqueueByteCapacity = 20971520
server.channels.c1.subqueueInterval = 2000
server.channels.c1.capacity = 500000
server.channels.c1.transactionCapacity = 40000

- Kafka Channel
A Kafka channel uses a Kafka cluster as the cache. Because Kafka provides high availability and keeps multiple replicas, data is not immediately lost if Flume or a Kafka Broker crashes before the sinks consume it.

Table 8-24 Common configurations of a Kafka channel

- type (default: -): Type, which is set to org.apache.flume.channel.kafka.KafkaChannel.
- kafka.bootstrap.servers (default: -): List of Brokers in the Kafka cluster.
- kafka.topic (default: flume-channel): Kafka topic used by the channel to cache data.
- kafka.consumer.group.id (default: flume): Kafka consumer group ID.
- parseAsFlumeEvent (default: true): Indicates whether data is parsed into Flume events.
- migrateZookeeperOffsets (default: true): Indicates whether to search for offsets in ZooKeeper and submit them to Kafka when there is no offset in Kafka.
- kafka.consumer.auto.offset.reset (default: latest): Consumes data from the specified location when there is no offset.
- kafka.producer.security.protocol (default: SASL_PLAINTEXT): Kafka producer security protocol.
- kafka.consumer.security.protocol (default: SASL_PLAINTEXT): Kafka consumer security protocol.
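A sketch of a Kafka channel definition built from these parameters (the broker address and channel name are illustrative):

client.channels.kafka_channel.type = org.apache.flume.channel.kafka.KafkaChannel
client.channels.kafka_channel.kafka.bootstrap.servers = 192.168.0.40:21007
client.channels.kafka_channel.kafka.topic = flume-channel
client.channels.kafka_channel.parseAsFlumeEvent = true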

Common Sink Configurations

- HDFS Sink
An HDFS sink writes data into HDFS. Common configurations are as follows.

Table 8-25 Common configurations of an HDFS sink

- channel (default: -): Channel connected to the sink.
- type (default: hdfs): Type, which is set to hdfs.
- monTime (default: 0, disabled): Thread monitoring threshold. When the update time (seconds) exceeds the threshold, the sink is restarted.
- hdfs.path (default: -): HDFS path.
- hdfs.inUseSuffix (default: .tmp): Suffix of the HDFS file being written.
- hdfs.rollInterval (default: 30): Interval for file rolling, in seconds.
- hdfs.rollSize (default: 1024): Size for file rolling, in bytes.
- hdfs.rollCount (default: 10): Number of events for file rolling.
- hdfs.idleTimeout (default: 0): Timeout interval for closing idle files automatically, in seconds.
- hdfs.batchSize (default: 1000): Number of events written into HDFS at a time.
- hdfs.kerberosPrincipal (default: -): Kerberos username for HDFS authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled.
- hdfs.kerberosKeytab (default: -): Kerberos keytab for HDFS authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled.
- hdfs.fileCloseByEndEvent (default: true): Indicates whether the file is closed when the last event is received.
- hdfs.batchCallTimeout (default: -): Timeout control duration (milliseconds) each time events are written into HDFS. If this parameter is not specified, the timeout duration is controlled when each event is written into HDFS. When the value of hdfs.batchSize is greater than 0, configure this parameter to improve the performance of writing data into HDFS.
NOTE: The value of hdfs.batchCallTimeout depends on hdfs.batchSize. A greater hdfs.batchSize requires a larger hdfs.batchCallTimeout. If the value of hdfs.batchCallTimeout is too small, writing events to HDFS may fail.
- serializer.appendNewline (default: true): Indicates whether to add a line feed character (\n) after an event is written to HDFS. If a line feed character is added, the data volume counters used by the line feed character will not be calculated by HDFS sinks.
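Putting a few of these together, an HDFS sink might be configured as follows (the HDFS path, sink name, and channel name are illustrative; the Kerberos parameters would be added only for security clusters):

client.sinks.hdfs_sink.type = hdfs
client.sinks.hdfs_sink.hdfs.path = hdfs://hacluster/flume/data
client.sinks.hdfs_sink.hdfs.batchSize = 1000
client.sinks.hdfs_sink.hdfs.rollInterval = 30
client.sinks.hdfs_sink.channel = static_log_channel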

- Avro Sink
An Avro sink converts events into Avro events and sends them to the monitoring ports of the hosts. Common configurations are as follows.

Table 8-26 Common configurations of an Avro sink

- channel (default: -): Channel connected to the sink.
- type (default: -): Type, which is set to avro.
- hostname (default: -): Name or IP address of the bound host.
- port (default: -): Monitoring port.
- batch-size (default: 1000): Number of events sent in a batch.
- ssl (default: false): Indicates whether to use SSL encryption.
- truststore-type (default: JKS): Java truststore type.
- truststore (default: -): Java truststore file.
- truststore-password (default: -): Java truststore password.
- keystore-type (default: JKS): Keystore type.
- keystore (default: -): Keystore file.
- keystore-password (default: -): Keystore password.

- HBase Sink
An HBase sink writes data into HBase. Common configurations are as follows.

Table 8-27 Common configurations of an HBase sink

- channel (default: -): Channel connected to the sink.
- type (default: -): Type, which is set to hbase.
- table (default: -): HBase table name.
- monTime (default: 0, disabled): Thread monitoring threshold. When the update time (seconds) exceeds the threshold, the sink is restarted.
- columnFamily (default: -): HBase column family.
- batchSize (default: 1000): Number of events written into HBase at a time.
- kerberosPrincipal (default: -): Kerberos username for HBase authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled.
- kerberosKeytab (default: -): Kerberos keytab for HBase authentication. This parameter is not required for a cluster in which Kerberos authentication is disabled.

- Kafka Sink
A Kafka sink writes data into Kafka. Common configurations are as follows.

Table 8-28 Common configurations of a Kafka sink

- channel (default: -): Channel connected to the sink.
- type (default: -): Type, which is set to org.apache.flume.sink.kafka.KafkaSink.
- kafka.bootstrap.servers (default: -): List of Kafka Brokers, which are separated by commas.
- monTime (default: 0, disabled): Thread monitoring threshold. When the update time (seconds) exceeds the threshold, the sink is restarted.
- kafka.topic (default: default-flume-topic): Topic where data is written.
- flumeBatchSize (default: 1000): Number of events written into Kafka at a time.
- kafka.security.protocol (default: SASL_PLAINTEXT): Security protocol of Kafka. The value must be set to PLAINTEXT for clusters in which Kerberos authentication is disabled.
- Other Kafka Producer Properties (default: -): Other Kafka configurations. This parameter can be set to any production configuration supported by Kafka, and the .kafka prefix must be added to the configuration.

- OBS Sink
An OBS sink writes data into OBS. As the OBS sink and the HDFS sink use the same file system interface, their parameter configurations are similar. The following table provides common configurations of an OBS sink.

Table 8-29 Common configurations of an OBS sink

- channel (default: -): Channel connected to the sink.
- type (default: hdfs): Type, which is set to hdfs.
- monTime (default: 0, disabled): Thread monitoring threshold. When the update time (seconds) exceeds the threshold, the sink is restarted.
- hdfs.path (default: -): OBS path in the s3a://AK:SK@Bucket/Path/ format, for example, s3a://AK:SK@obs-nemon-sink/obs-sink/.
- hdfs.inUseSuffix (default: .tmp): Suffix of the OBS file being written.
- hdfs.rollInterval (default: 30): Interval for file rolling, in seconds.
- hdfs.rollSize (default: 1024): Size for file rolling, in bytes.
- hdfs.rollCount (default: 10): Number of events for file rolling.
- hdfs.idleTimeout (default: 0): Timeout interval for closing idle files automatically, in seconds.
- hdfs.batchSize (default: 1000): Number of events written into OBS at a time.
- hdfs.calltimeout (default: 10000): Timeout interval for interaction with OBS, in milliseconds. The timeout interval should be as large as possible, for example, 1000000, because files are copied when some operations (such as OBS renaming) are performed, which requires a long time.
- hdfs.fileCloseByEndEvent (default: true): Indicates whether the file is closed when the last event is received.
- hdfs.batchCallTimeout (default: -): Timeout control duration (milliseconds) each time events are written into OBS. If this parameter is not specified, the timeout duration is controlled when each event is written into OBS. When the value of hdfs.batchSize is greater than 0, configure this parameter to improve the performance of writing data into OBS.
NOTE: The value of hdfs.batchCallTimeout depends on hdfs.batchSize. A greater hdfs.batchSize requires a larger hdfs.batchCallTimeout. If the value of hdfs.batchCallTimeout is too small, writing events to OBS may fail.
- serializer.appendNewline (default: true): Indicates whether to add a line feed character (\n) after an event is written to OBS. If a line feed character is added, the data volume counters used by the line feed character will not be calculated by OBS sinks.

8.11.7 Example: Using Flume to Collect Logs and Import Them to Kafka

Scenario

This section describes how to use Flume to import log information to Kafka.

Prerequisites

- A streaming cluster with Kerberos authentication enabled has been created.
- The Flume client has been installed on the node where logs are generated. For details, see Installing the Flume Client.
- The streaming cluster can properly communicate with the node where logs are generated.

Procedure

NOTE

Start from Step 7 for a non-security cluster.

Step 1 Copy the configuration file of the authentication server from the Master1 node to the conf directory on the Flume client node.


The full path of the configuration file on the Master1 node is /opt/Bigdata/FusionInsight/etc/1_X_KerberosClient/kdc.conf, where X is a random number. The file must be saved by the user who installs the Flume client, for example, user root.

Step 2 Log in to MRS Manager. Choose Service > Flume > Instance. Query the Business IP Address of any node on which the Flume role is deployed.

Step 3 Copy the user authentication file from this node to the conf directory on the Flume client node.

The full path of the file is /opt/Bigdata/FusionInsight/FusionInsight-Flume-1.6.0/flume/conf/flume-keytab. The file must be saved by the user who installs the Flume client, for example, user root.

Step 4 Copy the jaas.conf file from this node to the conf directory on the Flume client node.

The full path of the jaas.conf file is /opt/Bigdata/FusionInsight/etc/1_X_Flume/jaas.conf, where X is a random number. The file must be saved by the user who installs the Flume client, for example, user root.

Step 5 Log in to the Flume client node and go to the client installation directory. Run the following command to edit the file:

vi conf/jaas.conf

Set the keyTab parameter to the full path of the user authentication file on the Flume client. Then save and exit the file.
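After editing, the Kafka login entry in jaas.conf typically looks similar to the following sketch (the principal and keytab path are examples; use the values for your own cluster and client installation directory):

KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/opt/FlumeClient/fusioninsight-flume-1.6.0/conf/flume-keytab"
principal="flume_server/hadoop.hadoop.com@HADOOP.COM"
useTicketCache=false
storeKey=true
debug=false;
};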

Step 6 Run the following command to modify the flume-env.sh configuration file of the Flume client:

vi Flume client installation directory/fusioninsight-flume-1.6.0/conf/flume-env.sh

Add the following information after -XX:+UseCMSCompactAtFullCollection:

-Djava.security.krb5.conf=Flume client installation directory/fusioninsight-flume-1.6.0/conf/kdc.conf -Djava.security.auth.login.config=Flume client installation directory/fusioninsight-flume-1.6.0/conf/jaas.conf -Dzookeeper.server.principal=zookeeper/hadoop.xxx.com -Dzookeeper.request.timeout=120000

Change Flume client installation directory to the actual one and modify zookeeper.server.principal. Then save and exit the file.
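After the edit, the option line in flume-env.sh reads similar to the following sketch (assuming, for illustration, that the client is installed in /opt/FlumeClient; the surrounding GC options are abbreviated):

... -XX:+UseCMSCompactAtFullCollection -Djava.security.krb5.conf=/opt/FlumeClient/fusioninsight-flume-1.6.0/conf/kdc.conf -Djava.security.auth.login.config=/opt/FlumeClient/fusioninsight-flume-1.6.0/conf/jaas.conf -Dzookeeper.server.principal=zookeeper/hadoop.xxx.com -Dzookeeper.request.timeout=120000 ...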

Step 7 Assume that the Flume client is installed in /opt/FlumeClient. Run the following commands to restart the Flume client:

cd /opt/FlumeClient/fusioninsight-flume-1.6.0/bin

./flume-manage.sh restart

Step 8 Run the following command to modify the properties.properties configuration file of the Flume client:

vi Flume client installation directory/fusioninsight-flume-1.6.0/conf/properties.properties

Add the following information to the file:

#########################################################################################
client.sources = static_log_source
client.channels = static_log_channel
client.sinks = kafka_sink


##########################################################################################
#LOG_TO_HDFS_ONLINE_1

client.sources.static_log_source.type = spooldir
client.sources.static_log_source.spoolDir = PATH
client.sources.static_log_source.fileSuffix = .COMPLETED
client.sources.static_log_source.ignorePattern = ^$
client.sources.static_log_source.trackerDir = PATH
client.sources.static_log_source.maxBlobLength = 16384
client.sources.static_log_source.batchSize = 51200
client.sources.static_log_source.inputCharset = UTF-8
client.sources.static_log_source.deserializer = LINE
client.sources.static_log_source.selector.type = replicating
client.sources.static_log_source.fileHeaderKey = file
client.sources.static_log_source.fileHeader = false
client.sources.static_log_source.basenameHeader = true
client.sources.static_log_source.basenameHeaderKey = basename
client.sources.static_log_source.deletePolicy = never

client.channels.static_log_channel.type = file
client.channels.static_log_channel.dataDirs = PATH
client.channels.static_log_channel.checkpointDir = PATH
client.channels.static_log_channel.maxFileSize = 2146435071
client.channels.static_log_channel.capacity = 1000000
client.channels.static_log_channel.transactionCapacity = 612000
client.channels.static_log_channel.minimumRequiredSpace = 524288000

client.sinks.kafka_sink.type = org.apache.flume.sink.kafka.KafkaSink
client.sinks.kafka_sink.kafka.topic = flume_test
client.sinks.kafka_sink.kafka.bootstrap.servers = XXX.XXX.XXX.XXX:21007,XXX.XXX.XXX.XXX:21007,XXX.XXX.XXX.XXX:21007
client.sinks.kafka_sink.flumeBatchSize = 1000
client.sinks.kafka_sink.kafka.producer.type = sync
client.sinks.kafka_sink.kafka.security.protocol = SASL_PLAINTEXT
client.sinks.kafka_sink.kafka.kerberos.domain.name = hadoop.XXX.com
client.sinks.kafka_sink.requiredAcks = 0

client.sources.static_log_source.channels = static_log_channel
client.sinks.kafka_sink.channel = static_log_channel

Modify the following parameters as required. Then save and exit the file.

- spoolDir

- trackerDir

- dataDirs

- checkpointDir

- topic

  If the topic does not exist in Kafka, it will be automatically created by default.

- kafka.bootstrap.servers

  By default, the port for a security cluster is 21007 and that for a non-security cluster is 21005.

- kafka.kerberos.domain.name

  This parameter value is the value of default_realm of Kerberos in the Kafka cluster.

Step 9 The Flume client automatically loads the information in the properties.properties file.

After new log files are generated in the directory specified by spoolDir, Flume sends the logs to Kafka, where they can be consumed by Kafka consumers.
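To verify the import, you can consume the topic from a node with a Kafka client. The following is a sketch only; the client path, broker address, and security properties file are assumptions for illustration, and the exact flags vary with the Kafka client version:

cd /opt/client/Kafka/kafka/bin
./kafka-console-consumer.sh --topic flume_test --bootstrap-server XXX.XXX.XXX.XXX:21007 --new-consumer --consumer.config ../config/consumer.properties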

----End


8.11.8 Example: Using Flume to Collect Logs and Import Them to OBS

Scenario

This section describes how to use Flume to import log information to OBS.

Prerequisites

- A streaming cluster has been created.

- The Flume client has been installed on the node where logs are generated. For details, see Installing the Flume Client.

- The streaming cluster can properly communicate with the node where logs are generated.

- The node where logs are generated can parse the domain name of OBS.

Procedure

Step 1 Create the core-site.xml file and save it to the conf directory of the Flume client.

An example of parameter file content is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value></value>
</property>
</configuration>

The value of fs.s3a.endpoint is an OBS access address. The address must be in the same region as MRS. The parameter value can be either a domain name or an IP address. On MRS Manager, you can choose Service > Flume > Service Configuration, set Type to All, and view the value of s3service.s3-endpoint in S3service.
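For example, a filled-in endpoint property might look as follows (the endpoint value is a placeholder; use the s3service.s3-endpoint value from your own cluster):

<property>
<name>fs.s3a.endpoint</name>
<value>obs.example-region.example.com</value>
</property>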

Step 2 Encrypt SK using the encryption tool of the Flume client. For details, see Using the Encryption Tool of the Flume Client.

Step 3 Run the following command to modify the properties.properties configuration file of the Flume client:

vi conf/fusioninsight-flume-1.6.0/conf/properties.properties

Add the following information to the file:

client.sources = linux
client.channels = flume
client.sinks = obs

client.sources.linux.type = spooldir
client.sources.linux.spoolDir = /tmp/nemon
client.sources.linux.montime =
client.sources.linux.fileSuffix = .COMPLETED
client.sources.linux.deletePolicy = never
client.sources.linux.trackerDir = .flumespool
client.sources.linux.ignorePattern = ^$
client.sources.linux.batchSize = 1000


client.sources.linux.inputCharset = UTF-8
client.sources.linux.selector.type = replicating
client.sources.linux.fileHeader = false
client.sources.linux.fileHeaderKey = file
client.sources.linux.basenameHeader = true
client.sources.linux.basenameHeaderKey = basename
client.sources.linux.deserializer = LINE
client.sources.linux.deserializer.maxBatchLine = 1
client.sources.linux.deserializer.maxLineLength = 2048
client.sources.linux.channels = flume

client.channels.flume.type = memory
client.channels.flume.capacity = 10000
client.channels.flume.transactionCapacity = 1000
client.channels.flume.channelfullcount = 10
client.channels.flume.keep-alive = 3
client.channels.flume.byteCapacity =
client.channels.flume.byteCapacityBufferPercentage = 20

client.sinks.obs.type = hdfs
client.sinks.obs.hdfs.path = s3a://AK:SK@obs-nemon-sink/obs-sink
client.sinks.obs.montime =
client.sinks.obs.hdfs.filePrefix = obs_%{basename}
client.sinks.obs.hdfs.fileSuffix =
client.sinks.obs.hdfs.inUsePrefix =
client.sinks.obs.hdfs.inUseSuffix = .tmp
client.sinks.obs.hdfs.idleTimeout = 0
client.sinks.obs.hdfs.batchSize = 1000
client.sinks.obs.hdfs.codeC =
client.sinks.obs.hdfs.fileType = DataStream
client.sinks.obs.hdfs.maxOpenFiles = 5000
client.sinks.obs.hdfs.writeFormat = Writable
client.sinks.obs.hdfs.callTimeout = 1000000
client.sinks.obs.hdfs.threadsPoolSize = 10
client.sinks.obs.hdfs.rollTimerPoolSize = 1
client.sinks.obs.hdfs.round = false
client.sinks.obs.hdfs.roundUnit = second
client.sinks.obs.hdfs.useLocalTimeStamp = false
client.sinks.obs.hdfs.failcount = 10
client.sinks.obs.hdfs.fileCloseByEndEvent = true
client.sinks.obs.hdfs.rollInterval = 30
client.sinks.obs.hdfs.rollSize = 1024
client.sinks.obs.hdfs.rollCount = 10
client.sinks.obs.hdfs.batchCallTimeout = 0
client.sinks.obs.serializer.appendNewline = true
client.sinks.obs.channel = flume

Modify the following parameters as required. Then save and exit the file.

- spoolDir

- trackerDir

- hdfs.path (AK and SK in the path must be actual values. SK is the encrypted one.)

Step 4 The Flume client automatically loads the information in the properties.properties file.

After new log files are generated in the directory specified by spoolDir, the logs will be sent to OBS.

----End


8.11.9 Example: Using Flume to Read OBS Files and Upload Them to HDFS

Scenario

This section describes how to use Flume to read the specified OBS directory and upload files to HDFS.

Prerequisites

- A streaming cluster has been created.

- The Flume client has been installed on the client node. For details, see Installing the Flume Client.

- The client node can properly communicate with the streaming cluster and HDFS cluster nodes, including master and core nodes.

- The client node can parse the domain name of OBS.

Procedure

NOTE

You do not need to perform Steps 2 to 4 for a non-security cluster.

Step 1 Copy the core-site.xml and hdfs-site.xml files from the HDFS cluster client to the Flume client installation directory/fusioninsight-flume-1.6.0/conf directory of the Flume client node.

Generally, the core-site.xml and hdfs-site.xml files are saved in the HDFS/hadoop/etc/hadoop/ directory under the HDFS client installation directory.

The files must be saved by the user who installs the Flume client, for example, user root.
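For example, you can copy the files with scp (a sketch only; the host name and HDFS client path are assumptions for illustration):

scp /opt/client/HDFS/hadoop/etc/hadoop/core-site.xml root@flume-client-node:/opt/FlumeClient/fusioninsight-flume-1.6.0/conf/
scp /opt/client/HDFS/hadoop/etc/hadoop/hdfs-site.xml root@flume-client-node:/opt/FlumeClient/fusioninsight-flume-1.6.0/conf/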

Step 2 Download a user's authentication credential file from the HDFS cluster.

1. On MRS Manager, click System.
2. In the Permission area, click Manage User.
3. Select the user from the user list and click More to download the user's authentication credential file.
4. Decompress the authentication credential file to obtain the krb5.conf and user.keytab files.

Step 3 Copy the krb5.conf and user.keytab files to the Flume client installation directory/fusioninsight-flume-1.6.0/conf directory of the Flume client node. The files must be saved by the user who installs the Flume client, for example, user root.

Step 4 Modify the flume-env.sh Flume client configuration file.

Run the following command to edit the flume-env.sh file:

vi Flume client installation directory/fusioninsight-flume-1.6.0/conf/flume-env.sh

Add the following information after -XX:+UseCMSCompactAtFullCollection:

-Djava.security.krb5.conf=Flume client installation directory/fusioninsight-flume-1.6.0/conf/krb5.conf


Change Flume client installation directory to the actual one. Then save and exit the configuration file.

Step 5 Add the host entries of the HDFS cluster from its /etc/hosts file to the /etc/hosts file of the Flume client node.
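For example, the added entries might look as follows (the IP addresses and host names are placeholders):

192.168.0.11 node-master1-example
192.168.0.12 node-core1-example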

Step 6 Restart the Flume client.

Suppose the Flume client installation directory is /opt/FlumeClient. Run the following command to restart the Flume client:

cd /opt/FlumeClient/fusioninsight-flume-1.6.0/bin

./flume-manage.sh restart

Step 7 Encrypt SK using the encryption tool of the Flume client. For details, see Using the Encryption Tool of the Flume Client.

Step 8 Run the following command to modify the properties.properties configuration file of the Flume client:

vi Flume client installation directory/fusioninsight-flume-1.6.0/conf/properties.properties

Add the following information to the properties.properties file:

client.sources = obs
client.channels = flume
client.sinks = hdfs

client.sources.obs.type = org.apache.flume.source.s3.OBSSource
client.sources.obs.bucketName = obs-nemon-sink
client.sources.obs.prefix = obs-source/
client.sources.obs.accessKey = AK
client.sources.obs.secretKey = SK
client.sources.obs.backingDir = /tmp/obs/
client.sources.obs.endPoint =
client.sources.obs.basenameHeader = true
client.sources.obs.basenameHeaderKey = basename
client.sources.obs.channels = flume

client.channels.flume.type = memory
client.channels.flume.capacity = 10000
client.channels.flume.transactionCapacity = 1000
client.channels.flume.channelfullcount = 10
client.channels.flume.keep-alive = 3
client.channels.flume.byteCapacity =
client.channels.flume.byteCapacityBufferPercentage = 20

client.sinks.hdfs.type = hdfs
client.sinks.hdfs.hdfs.path = hdfs://hacluster/tmp
client.sinks.hdfs.montime =
client.sinks.hdfs.hdfs.filePrefix = over_%{basename}
client.sinks.hdfs.hdfs.fileSuffix =
client.sinks.hdfs.hdfs.inUsePrefix =
client.sinks.hdfs.hdfs.inUseSuffix = .tmp
client.sinks.hdfs.hdfs.idleTimeout = 0
client.sinks.hdfs.hdfs.batchSize = 1000
client.sinks.hdfs.hdfs.codeC =
client.sinks.hdfs.hdfs.fileType = DataStream
client.sinks.hdfs.hdfs.maxOpenFiles = 5000
client.sinks.hdfs.hdfs.writeFormat = Writable
client.sinks.hdfs.hdfs.callTimeout = 10000
client.sinks.hdfs.hdfs.threadsPoolSize = 10
client.sinks.hdfs.hdfs.rollTimerPoolSize = 1
client.sinks.hdfs.hdfs.kerberosPrincipal = admin
client.sinks.hdfs.hdfs.kerberosKeytab = /opt/FlumeClient/fusioninsight-flume-1.6.0/conf/user.keytab
client.sinks.hdfs.hdfs.round = false
client.sinks.hdfs.hdfs.roundUnit = second
client.sinks.hdfs.hdfs.useLocalTimeStamp = false
client.sinks.hdfs.hdfs.failcount = 10
client.sinks.hdfs.hdfs.fileCloseByEndEvent = true
client.sinks.hdfs.hdfs.rollInterval = 30
client.sinks.hdfs.hdfs.rollSize = 1024
client.sinks.hdfs.hdfs.rollCount = 10
client.sinks.hdfs.hdfs.batchCallTimeout = 0
client.sinks.hdfs.serializer.appendNewline = true
client.sinks.hdfs.channel = flume

Modify the following parameters as required. Then save and exit the file.

- bucketName

- prefix

- backingDir

- endPoint

- accessKey (Enter the actual AK value.)

- secretKey (Enter the actual encrypted SK value.)

- kerberosPrincipal (In a security cluster, set this parameter to the username.)

- kerberosKeytab (In a security cluster, set this parameter to the absolute path of the user's authentication credential file.)

Step 9 The Flume client automatically loads the information in the properties.properties file.

After new log files are generated in the prefix directory under bucketName, the logs will be sent to HDFS.

----End

8.12 Using Loader

8.12.1 Introduction

Process

The process for migrating user data with Loader is as follows:

1. Access the Loader page of the Hue WebUI.

2. Manage Loader links.

3. Create a job and select a data source link and a link for saving data.

4. Run the job to complete data migration.


Loader Page

The Loader page is a graphical data migration management tool based on the open source Sqoop WebUI and is hosted on the Hue WebUI. Perform the following operations to access the Loader page:

1. Access the Hue WebUI. For details, see Accessing the UI of the Open Source Component.

2. Choose Data Browsers > Sqoop.

The job management tab page is displayed by default on the Loader page.

Loader Link

Loader links save data location information. Loader uses links to access data or save data to the specified location. Perform the following operations to access the Loader link management page:

1. Access the Loader page.

2. Click Manage links.

The Loader link management page is displayed.

You can click Manage jobs to return to the job management page.

3. Click New link to go to the configuration page and set parameters to create a Loader link.

Loader Job

Loader jobs are used to manage data migration tasks. Each job consists of a source data link and a destination data link. A job reads data from the source link and saves it to the destination link to complete a data migration task.

8.12.2 Loader Link Configuration

Overview

Loader supports the following links. This section describes configurations of each link.

- obs-connector

- generic-jdbc-connector

- ftp-connector or sftp-connector

- hbase-connector, hdfs-connector, or hive-connector

- voltdb-connector

OBS Link

An OBS link is a data exchange channel between Loader and OBS. Table 8-30 describes the configuration parameters.


Table 8-30 obs-connector configuration

Parameter Description

Name Name of a Loader link

OBS Server Enter the OBS endpoint address. The common format is OBS.Region.DomainName. For example, run the following command to view the OBS endpoint address:
cat /opt/Bigdata/apache-tomcat-7.0.78/webapps/web/WEB-INF/classes/cloud-obs.properties

Port Port for accessing OBS data. The default value is 443.

Access Key AK for a user to access OBS

Security Key SK corresponding to AK

Relational Database Link

A relational database link is a data exchange channel between Loader and a relational database. Table 8-31 describes the configuration parameters.

NOTE

Some parameters are hidden by default. They appear only after you click Show Senior Parameter.

Table 8-31 generic-jdbc-connector configuration

Parameter Description

Name Name of a Loader link

Database type Database types supported by Loader links: ORACLE, MYSQL, and MPPDB

Host Database access address, which can be an IP address or domain name

Port Port for accessing the database

Database Name of the database saving data

Username Username for accessing the database

Password Password of the user. Use the actual password.

Table 8-32 Senior parameter configuration

Parameter Description

Fetch Size A maximum volume of data obtained during each database access


Connection Properties Driver properties exclusive to the database link that are supported by databases of different types, for example, autoReconnect of MYSQL. If you want to define the driver properties, click Add.

Identifier enclose Delimiter for reserving keywords in the database SQL. Delimiters defined in different databases vary.

File Server Link

File server links include FTP and SFTP links and serve as a data exchange channel between Loader and a file server. Table 8-33 describes the configuration parameters.

Table 8-33 ftp-connector or sftp-connector configuration

Parameter Description

Name Name of a Loader link

Hostname/IP Enter the file server access address, which can be a host name or IP address.

Port Port for accessing the file server.
- Use port 21 for FTP.
- Use port 22 for SFTP.

Username Username for logging in to the file server.

Password Password of the user

MRS Cluster Link

MRS cluster links include HBase, HDFS, and Hive links and serve as a data exchange channel between Loader and data.

When configuring an MRS cluster name and link, select a connector, for example, hbase-connector, hdfs-connector, or hive-connector, and save it.

VoltDB Link

A VoltDB link is a data exchange channel between Loader and an in-memory database. Table 8-34 describes the configuration parameters.

NOTE

Some parameters are hidden by default. They appear only after you click Show Senior Parameter.


Table 8-34 voltdb-connector configuration

Parameter Description

Name Name of a Loader link

Database servers Database access address, which can be an IP address or domain name. You can configure multiple database addresses and separate them with commas (,).

Port Port for accessing the database

Username Username for accessing the database

Password Password of the user. Use the actual password.

Table 8-35 Senior parameter configuration

Parameter Description

Connection Properties Delimiter for reserving keywords in the memory database SQL

8.12.3 Managing Loader Links

Scenario

You can create, view, edit, and delete links on the Loader page.

Prerequisites

You have accessed the Loader page. For details, see Loader Page.

Creating a Link

Step 1 On the Loader page, click Manage links.

Step 2 Click New link and configure link parameters.

For details about the parameters, see Loader Link Configuration.

Step 3 Click Save.

If link configurations, for example, the IP address, port, and access user information, are incorrect, the link will fail to be verified and saved. In addition, VPC configurations may affect network connectivity.

NOTE

You can click Test to immediately check whether the link is available.

----End


Viewing a Link

Step 1 On the Loader page, click Manage links.

- If Kerberos authentication is enabled in the cluster, all links created by the current user are displayed by default and other users' links cannot be displayed.

- If Kerberos authentication is disabled in the cluster, all Loader links of the cluster are displayed.

Step 2 In the search box of Sqoop Links, enter a link name to filter the links.

----End

Editing a Link

Step 1 On the Loader page, click Manage links.

Step 2 Click the link name to go to the edit page.

Step 3 Modify the link configuration parameters based on service requirements.

Step 4 Click Test.

If the test is successful, go to Step 5. If the connection fails, repeat Step 3 to correct the parameters.

Step 5 Click Save.

If a Loader job uses the link, editing the link parameters may affect Loader job running.

----End

Deleting a Link

Step 1 On the Loader page, click Manage links.

Step 2 In the line of the link, click Delete.

Step 3 In the dialog box, click Yes, delete it.

If a Loader link is being used by a Loader job, the link cannot be deleted.

----End

8.12.4 Source Link Configurations of Loader Jobs

Overview

When Loader jobs obtain data from different data sources, a link corresponding to the data source type needs to be selected and the link properties need to be configured.


obs-connector

Table 8-36 Data source link properties of obs-connector

Parameter Description

Bucket Name OBS bucket for storing source data

Input directory or file Actual storage form of source data. It can be either all data files in a directory or a single data file contained in the bucket.

File format Loader supports the following file formats of data stored in OBS:
- CSV_FILE: Specifies a text file. When the destination link is a database link, only the text file is supported.
- BINARY_FILE: Specifies binary files excluding text files.

Line Separator Identifier of each line end of source data

Field Separator Identifier of each field end of source data

Encode type Text encoding type of source data. It takes effect for text files only.

File split type The following types are supported:
- File: Files are assigned to map tasks by the total number of files. The calculation formula is: Total number of files/Extractors.
- Size: Files are assigned to map tasks by the total file size. The calculation formula is: Total file size/Extractors. (See the worked example after this table.)
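As a worked example of the formulas above (the numbers are illustrative only): with 10 source files totaling 1 GB and Extractors set to 5, the File split type assigns 10/5 = 2 files to each map task, whereas the Size split type assigns roughly 1 GB/5 = 200 MB of data to each map task.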

generic-jdbc-connector

Table 8-37 Data source link properties of generic-jdbc-connector

Parameter Description

Schema name Name of the database storing source data. You can query and select it on the interface.

Table name Data table storing the source data. You can query and select it on the interface.

Partition column If multiple columns need to be read, use this column to split the result and obtain data.

Where clause Query statement used for accessing the database


ftp-connector or sftp-connector

Table 8-38 Data source link properties of ftp-connector or sftp-connector

Parameter Description

Input directory or file Actual storage form of source data. It can be either all data files in a directory or a single data file contained in the file server.

File format Loader supports the following file formats of data stored in the file server:
- CSV_FILE: Specifies a text file. When the destination link is a database link, only the text file is supported.
- BINARY_FILE: Specifies binary files excluding text files.

Line Separator Identifier of each line end of source data
NOTE: When FTP or SFTP serves as a source link and File format is set to BINARY_FILE, the value of Line Separator in the advanced properties is invalid.

Field Separator Identifier of each field end of source data
NOTE: When FTP or SFTP serves as a source link and File format is set to BINARY_FILE, the value of Field Separator in the advanced properties is invalid.

Encode type Text encoding type of source data. It takes effect for text files only.

File split type The following types are supported:
- File: Files are assigned to map tasks by the total number of files. The calculation formula is: Total number of files/Extractors.
- Size: Files are assigned to map tasks by the total file size. The calculation formula is: Total file size/Extractors.

hbase-connector

Table 8-39 Data source link properties of hbase-connector

Parameter Description

Table name HBase table storing source data


hdfs-connector

Table 8-40 Data source link properties of hdfs-connector

Parameter Description

Input directory or file Actual storage form of source data. It can be either all data files in a directory or a single data file contained in HDFS.

File format Loader supports the following file formats of data stored in HDFS:
- CSV_FILE: Specifies a text file. When the destination link is a database link, only the text file is supported.
- BINARY_FILE: Specifies binary files excluding text files.

Line Separator Identifier of each line end of source data
NOTE: When HDFS serves as a source link and File format is set to BINARY_FILE, the value of Line Separator in the advanced properties is invalid.

Field Separator Identifier of each field end of source data
NOTE: When HDFS serves as a source link and File format is set to BINARY_FILE, the value of Field Separator in the advanced properties is invalid.

File split type The following types are supported:
- File: Files are assigned to map tasks by the total number of files. The calculation formula is: Total number of files/Extractors.
- Size: Files are assigned to map tasks by the total file size. The calculation formula is: Total file size/Extractors.

hive-connector

Table 8-41 Data source link properties of hive-connector

Parameter Description

Database Name of the Hive database storing the data source. You can query and select it on the interface.

Table Name of the Hive table storing the data source. You can query and select it on the interface.


voltdb-connector

Table 8-42 Data source link properties of voltdb-connector

Parameter Description

Partition column If multiple columns need to be read, use this column to split the result and obtain data.

Table Name of the memory database table storing source data. You can query and select it on the interface.

8.12.5 Destination Link Configurations of Loader Jobs

Overview

When Loader jobs save data to different storage locations, a destination link needs to be selected and the link properties need to be configured.

obs-connector

Table 8-43 Destination link properties of obs-connector

Parameter Description

Bucket Name OBS bucket for storing final data

Output directory Directory for storing final data in the bucket. A directory must be specified.

File format Loader supports the following file formats of data stored in OBS:
- CSV_FILE: Specifies a text file. When the destination link is a database link, only the text file is supported.
- BINARY_FILE: Specifies binary files excluding text files.

Line Separator Identifier of each line end of final data

Field Separator Identifier of each field end of final data

Encode type Text encoding type of final data. It takes effect for text files only.

generic-jdbc-connector

Table 8-44 Destination link properties of generic-jdbc-connector

Parameter Description

Schema name Name of the database saving final data


Table name Name of the table saving final data

ftp-connector or sftp-connector

Table 8-45 Destination link properties of ftp-connector or sftp-connector

Parameter Description

Output directory Directory for storing final data in the file server. A directory must be specified.

File format Loader supports the following file formats of data stored in the file server:
- CSV_FILE: Specifies a text file. When the destination link is a database link, only the text file is supported.
- BINARY_FILE: Specifies binary files excluding text files.

Line Separator Identifier of each line end of final data
NOTE: When FTP or SFTP serves as a destination link and File format is set to BINARY_FILE, the value of Line Separator in the advanced properties is invalid.

Field Separator Identifier of each field end of final data
NOTE: When FTP or SFTP serves as a destination link and File format is set to BINARY_FILE, the value of Field Separator in the advanced properties is invalid.

Encode type Text encoding type of final data. It takes effect for text files only.

hbase-connector

Table 8-46 Destination link properties of hbase-connector

Parameter Description

Table name Name of the HBase table saving final data. You can query and select it on the interface.

Method Data can be imported to an HBase table using either BULKLOAD or PUTLIST.


Clear data before import Whether to clear data in the destination HBase table. Options are as follows:
- True: Clean up data in the table.
- False: Do not clean up data in the table. When you select False, an error is reported during job running if data exists in the table.

hdfs-connector

Table 8-47 Destination link properties of hdfs-connector

Parameter Description

Output directory Directory for storing final data in HDFS. A directory must be specified.

File format Loader supports the following file formats of data stored in HDFS:
- CSV_FILE: Specifies a text file. When the destination link is a database link, only the text file is supported.
- BINARY_FILE: Specifies binary files excluding text files.

Compression codec Compression mode used when a file is saved to HDFS. The following modes are supported: NONE, DEFLATE, GZIP, BZIP2, LZ4, and SNAPPY.

Overwrite How to process files in the output directory when files are imported to HDFS. Options are as follows:
- True: Clean up files in the directory and import new files by default.
- False: Do not clean up files. If files exist in the output directory, job running fails.

Line Separator Identifier of each line end of final data
NOTE: When HDFS serves as a destination link and File format is set to BINARY_FILE, the value of Line Separator in the advanced properties is invalid.

Field Separator Identifier of each field end of final data
NOTE: When HDFS serves as a destination link and File format is set to BINARY_FILE, the value of Field Separator in the advanced properties is invalid.


hive-connector

Table 8-48 Destination link properties of hive-connector

Parameter Description

Database Name of the Hive database storing final data. You can query and select it on the interface.

Table Name of the Hive table saving final data. You can query and select it on the interface.

voltdb-connector

Table 8-49 Destination link properties of voltdb-connector

Parameter Description

Table Name of the memory database table storing final data. You can query and select it on the interface.

8.12.6 Managing Loader Jobs

Scenario

You can create, view, edit, and delete jobs on the Loader page.

Prerequisites

You have accessed the Loader page. For details, see Loader Page.

Create a Job

Step 1 On the Loader page, click New job.

Step 2 In Information, set parameters.

1. In Name, enter a job name.
2. In From link and To link, select links accordingly.

After you select a link of a type, data is obtained from the specified source and saved to the destination.

NOTE

If no available link exists, click Add a new link.

Step 3 In From, configure the job of the source link.

For details, see Source Link Configurations of Loader Jobs.

Step 4 In To, configure the job of the destination link.


For details, see Destination Link Configurations of Loader Jobs.

Step 5 Check whether a database link is selected in To link.

Database links include:

- generic-jdbc-connector

- hbase-connector

- hive-connector

- voltdb-connector

If you set To link to a database link, you need to configure a mapping between service data and a field in the database table.

- If you set it to a database link, go to Step 6.

- If you do not set it to a database link, go to Step 7.

Step 6 In Field Mapping, enter a field mapping. Perform Step 7.

Field Mapping specifies a mapping between each column of user data and a field in the database table.

Table 8-50 Field Mapping properties

Parameter Description

Column Num Field sequence of service data

Sample First line of sample values of service data

Column Family When To link is hbase-connector, you can select a column family for storing data.

Destination Field Field for storing data

Type Type of the field selected by the user

Row Key When To link is hbase-connector, you need to select a Destination Field as the row key.

NOTE

If the value of From is a connector of a file type, for example, SFTP, FTP, OBS, or HDFS files, the value of Field Mapping is the first row of data in the file. Ensure that the first row of data is complete. Otherwise, the Loader job will not extract columns that are not mapped.

Step 7 In Task Config, set job running parameters.

Table 8-51 Loader job running properties

Parameter Description

Extractors Number of map tasks


Loaders Number of reduce tasks. This parameter appears only when the destination is HBase or Hive.

Max error records in single split Error record threshold. If the error records of a single map task exceed the threshold, the task automatically stops and the obtained data is not returned.
NOTE: For MYSQL and MPPDB of generic-jdbc-connector, data is read and written in batches by default, and errors are recorded at most once for each batch of data.

Dirty data directory Directory for saving dirty data. If you leave this parameter blank, dirty data will not be saved.

Step 8 Click Save.

----End

Viewing a Job

Step 1 Access the Loader page. The Loader job management page is displayed by default.

- If Kerberos authentication is enabled in the cluster, all jobs created by the current user are displayed by default and other users' jobs cannot be displayed.

- If Kerberos authentication is disabled in the cluster, all Loader jobs of the cluster are displayed.

Step 2 In Sqoop Jobs, enter a job name or link type to filter the job.

Step 3 Click Refresh to obtain the latest job status.

----End

Editing a Job

Step 1 Access the Loader page. The Loader job management page is displayed by default.

Step 2 Click the job name to go to the edit page.

Step 3 Modify the job configuration parameters based on service requirements.

Step 4 Click Save.

NOTE

Basic job operations in the navigation bar on the left are Run, Copy, Delete, Disable, History Record, and Show Job JSON Definition.

----End

Deleting a Job

Step 1 Access the Loader page.


Step 2 In the line of the specified job, click the delete button.

Alternatively, you can select one or more jobs and click Delete jobs in the upper right corner of the job list.

Step 3 In the dialog box, click Yes, delete it.

If the state of a Loader job is Running, the job fails to be deleted.

----End

8.12.7 Preparing a Driver for MySQL Database Link

Scenario

As a batch data migration component, Loader can import data from and export data to relational databases.

Prerequisites

You have prepared service data.

Procedure

Step 1 Download the mysql-connector-java-5.1.21.jar MySQL JDBC driver from the MySQL official website.

Step 2 Upload mysql-connector-java-5.1.21.jar to the /opt/Bigdata/FusionInsight/FusionInsight-Sqoop-1.99.7/FusionInsight-Sqoop-1.99.7/server/jdbc Loader installation directory on the active and standby MRS Master nodes.

Step 3 Change the owner of mysql-connector-java-5.1.21.jar to omm:wheel.

Step 4 Modify the jdbc.properties configuration file.

Set the MYSQL key to mysql-connector-java-5.1.21.jar, for example, MYSQL=mysql-connector-java-5.1.21.jar.

Step 5 Restart Loader.

----End

8.12.8 Example: Using Loader to Import Data from OBS to HDFS

Scenario

If you need to import a large volume of data from the external cluster to the internal cluster, import it from OBS to HDFS.

Prerequisites

- You have prepared service data.

- You have created an analysis cluster.


Procedure

Step 1 Upload service data to your OBS bucket.

Step 2 Obtain AK/SK information and create an OBS link and an HDFS link.

For details, see Loader Link Configuration.

Step 3 Access the Loader page. For details, see Loader Page.

If Kerberos authentication is enabled in the analysis cluster, follow instructions in Accessing the Hue WebUI.

Step 4 Click New Job.

Step 5 In Information, set parameters.

1. In Name, enter a job name, for example, obs2hdfs.
2. In From link, select the OBS link you created.
3. In To link, select the HDFS link you created.

Step 6 In From, set source link parameters.

1. In Bucket Name, enter the name of the bucket storing service data.
2. In Input directory or file, enter the detailed location of the service data in the bucket.

If it is a single file, enter a complete path containing the file name. If it is a directory, enter the complete path of the directory.

3. In File format, enter the type of the service data file.

For details, see Table 8-36.
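As an illustration of Input directory or file (the paths below are placeholders, not values from this guide): enter input/2018-09/log.txt to read a single file, or input/2018-09/ to read all data files in that directory.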

Step 7 In To, set destination link parameters.

1. In Output directory, enter the directory for storing service data in HDFS.

If Kerberos authentication is enabled in the cluster, the current user who accesses Loader needs to have permission to write data to the directory.

2. In File format, enter the type of the service data file. The type must correspond to the type in Step 6.3.

3. In Compression codec, enter a compression algorithm. For example, if you do not compress data, select NONE.

4. In Overwrite, select True.
5. Click Show Senior Parameter and set Line Separator.
6. Set Field Separator.

For details, see Table 8-47.

Step 8 In Task Config, set job running parameters.

1. In Extractors, enter the number of map tasks.
2. In Loaders, enter the number of reduce tasks.

When the destination link is an HDFS link, Loaders is hidden.

3. In Max error records in single split, enter an error record threshold.
4. In Dirty data directory, enter a directory for saving dirty data, for example, /user/sqoop/obs2hdfs-dd.


Step 9 Click Save and execute.

On the Manage jobs page, view the job running result. You can click Refresh to obtain the latest job status.

----End


9 MRS Patch Description

9.1 MRS 1.5.1.4 Patch Description

Basic Information

Table 9-1 Basic information

Patch Version MRS 1.5.1.4

Release Date 2018-08-23


Resolved Issues

Spark issues:
- When the metadata file of a Carbon table is large, the query is slow.
- Carbon fails to be converted into SHORT_INT during data compression in some scenarios.
- When Spark parses zlib, the java.io.IOException: unknown compression method exception is thrown.
- When a certain amount of user data is imported to Carbon, an executor breaks down.
- In Yarn-cluster mode, the Spark program is automatically stopped after exiting a client.
- When a Carbon table has a large number of segments, the execution of the delete statement becomes slow. This function has been optimized.
- Error GSS initiate failed occurs when Spark SQL is executed in a Spark job running for a long time.
- When the select operation is executed in the Carbon table, an error indicating that the carbonindex file cannot be found is reported.
- When the select operation is executed in the Carbon table, a null pointer error is reported because the tablestatues file is empty.
- When the select operation is executed in the Carbon table, a null pointer error is reported because the deletedelta file is empty.
- When the select operation is executed in the Carbon table, duplicate entries exist in the tablestatues file due to concurrent operations and an error is reported indicating that the segment folder cannot be found.

Kafka issues:
- No data is displayed on the Kafka topic monitoring page of MRS Manager.
- The Scala version used by SparkStreaming is different from that used by Kafka. As a result, Spark fails to access Kafka.
- When SparkStreaming accesses Kafka, only one partition can be read.

HBase issues:
- During an HBase health check, an error code caused by a non-HBase problem overlaps with that of HBase. As a result, a false alarm is generated.
- On MRS Manager, some configuration files (hdfs-site.xml, core-site.xml, mapred-site.xml, and yarn-site.xml) of the HBase server fail to be modified. Even though the configuration files are modified in the background, they will be forcibly restored after the services are restarted.
- The dfs.client.read.shortcircuit configuration item of HBase fails to be modified on MRS Manager.


Hadoop issues:
- After resources of the archives type are downloaded during Yarn resource localization, the automatically decompressed directory may be injected.
- Disks are full because local resource files of Yarn NodeManager and history files of Spark JobHistory are not cleared periodically.
- When a user clicks Allocated Memory MB on the native Yarn page, a page response exception occurs.

Other issues:
- After a user logs in to MRS Manager and clicks Tenant, the tenant information cannot be loaded.
- MRS reliability has been improved in MRS capacity expansion scenarios.
- Display of some UIs on MRS Manager has been optimized.
- When a role is created in an MRS cluster in security mode, Hive component permissions cannot be added.

Compatibility with Other Patches The patch can resolve all problems that have been resolved by patches of version 1.5.1.

Impact of Patch Installation

- After installing the patch, you need to restart the service for the patch to take effect. During the restart, the service is unavailable.
- After installing the patch, you need to download and install all clients again, including the original clients of Master nodes and the clients used by other nodes of the VPC (that is, the clients that you set up).
- For details about how to fully update the original client of the active Master node, see Fully Updating the Original Client of the Active Master Node.
- For details about how to fully update the original client of the standby Master node, see Fully Updating the Original Client of the Standby Master Node.
- For details about how to fully install the clients you set up, see Using the Client on Another Node of a VPC.

NOTE

- You are advised to back up the old clients before reinstalling the new ones.
- If you have modified client configurations based on the service scenario, modify them again after reinstalling the clients.


9.2 MRS 1.7.1.2 Patch Description

Basic Information

Table 9-2 Basic information

Patch Version MRS 1.7.1.2

Release Date 2018-07-26

Resolved Issues

MRS Manager issues:
- An error is reported when audit log details are downloaded.
- The computing of the disk usage on the Host tab of MRS Manager is optimized.

Kafka issues:
- KAFKA-5413: Kafka logs fail to be cleared because the offset span of the segment file is too large.
- KAFKA-6529: The client is disconnected accidentally due to a Broker memory leak.
- KAFKA-5417: In a concurrency scenario, the client connection statuses are inconsistent.

HBase issues:
- The region location is repeatedly calculated each time the balance command is executed.

Compatibility with Other Patches The patch can resolve all problems that have been resolved by patches of version 1.7.1.

Precautions

A cluster health check is performed before patch installation, and a false alarm will be triggered in the cluster health check of MRS 1.7.1. As a result, after you submit a patch installation request for the first time, the system reports a cluster exception and the patch installation stops. After confirming that the error is a false alarm, you can submit the patch installation request again. The system then skips the health check and installs the patch.

You can use the following method to confirm the false alarm.

Check and export the health check report. For details, see Viewing and Exporting a Check Report. The health check result shows that a host reports an error as follows: "Installation Directory and Data Directory Check: Files under the directory are abnormal. Please check the contents under the installation directory and data directory."


Impact of Patch Installation

- After installing the patch, you need to restart the service for the patch to take effect. During the restart, the service is unavailable.
- After installing the patch, you need to download and install all clients again, including the original clients of Master nodes and the clients used by other nodes of the VPC (that is, the clients that you set up).
- For details about how to fully update the original client of the active Master node, see Fully Updating the Original Client of the Active Master Node.
- For details about how to fully update the original client of the standby Master node, see Fully Updating the Original Client of the Standby Master Node.
- For details about how to fully install the clients you set up, see Using the Client on Another Node of a VPC.

NOTE

- You are advised to back up the old clients before reinstalling the new ones.
- If you have modified client configurations based on the service scenario, modify them again after reinstalling the clients.


A ECS Specifications Used by MRS

MRS uses ECSs of the following types in different application scenarios.

- General computing (S1)
- General computing (S3)
- General computing (C2)
- General computing-plus (C3)
- Disk-intensive (D2)
- General network enhancement (C3ne)

ECS Flavor Naming Rules

ECS flavors are named using the format "AB.C.D".

Example: m2.8xlarge.8

The format is defined as follows:

- A specifies the ECS type. For example, s indicates a general-purpose ECS, c a computing ECS, and m a memory-optimized ECS.

- B specifies the type ID. For example, the 1 in s1 indicates a general-purpose first-generation ECS, and the 2 in s2 indicates a general-purpose second-generation ECS.

- C specifies a flavor size and can be any of the following options: medium, large, and xlarge.

- D specifies the ratio of memory to vCPUs expressed in a digit. For example, the value 4 indicates that the ratio of memory to vCPUs is 4.
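Applying these rules to the example above, m2.8xlarge.8 decodes as a memory-optimized (m) second-generation (2) ECS of size 8xlarge with 8 GB of memory per vCPU. Similarly, c3.4xlarge.4 in Table A-2 is a third-generation computing ECS whose 16 vCPUs and memory ratio of 4 yield 16 x 4 = 64 GB of memory, matching the table.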

Specifications

Table A-1 General computing ECS specifications

ECS Type   vCPUs   Memory (GB)   Flavor         Virtualization Type
S1         4       16            s1.xlarge      XEN
S1         16      64            s1.4xlarge     XEN
S1         32      128           s1.8xlarge     XEN
S3         8       16            s3.2xlarge.2   KVM
S3         16      32            s3.4xlarge.2   KVM
S3         4       16            s3.xlarge.4    KVM
S3         16      64            s3.4xlarge.4   KVM
C2         8       16            c2.2xlarge     XEN
C2         16      32            c2.4xlarge     XEN

Table A-2 General computing-plus (C3) ECS specifications

ECS Type   vCPUs   Memory (GB)   Flavor          Virtualization Type
C3         4       8             c3.xlarge.2     KVM
C3         16      32            c3.4xlarge.2    KVM
C3         4       16            c3.xlarge.4     KVM
C3         8       32            c3.2xlarge.4    KVM
C3         16      64            c3.4xlarge.4    KVM
C3         32      128           c3.8xlarge.4    KVM
C3         60      256           c3.15xlarge.4   KVM

Table A-3 D2 ECS specifications

ECS Type   vCPUs   Memory (GB)   Flavor         Virtualization Type   Local Disk (GB)
D2         8       64            d2.2xlarge.8   KVM                   4 × 1800
D2         16      128           d2.4xlarge.8   KVM                   8 × 1800
D2         32      256           d2.8xlarge.8   KVM                   16 × 1800

Hardware (all D2 flavors): CPU: Intel® Xeon® Gold 6151 Processor v5; Memory: 20 × 32 GB

Table A-4 General network enhancement (C3ne) ECS specifications

ECS Type   vCPUs   Memory (GB)   Flavor            Virtualization Type
C3ne       4       16            c3ne.xlarge.4     KVM
C3ne       8       32            c3ne.2xlarge.4    KVM
C3ne       16      32            c3ne.4xlarge.2    KVM
C3ne       16      64            c3ne.4xlarge.4    KVM
C3ne       32      128           c3ne.8xlarge.4    KVM
C3ne       60      256           c3ne.15xlarge.4   KVM


B Change History

Release Date What's New

2018-09-06 This issue is the fifteenth official release.
Modified the following content:
- Creating a Cluster
- Creating a Cluster (History Versions)
- ECS Specifications Used by MRS

2018-08-13 This issue is the fifteenth official release.
Added the following content:
- Security Configuration Suggestions for Clusters with Kerberos Authentication Disabled
- Managing Cluster Tags
- Bootstrap Actions
- Introduction to Bootstrap Actions
- Preparing the Bootstrap Action Script
- Adding a Bootstrap Action
- View Execution Records
- Sample Scripts
Modified the following content:
- Creating a Cluster
- Creating the Smallest Cluster
- Creating a Cluster (History Versions)
- Viewing Basic Information About an Active Cluster
- Viewing Patch Information About an Active Cluster


2018-05-04 This issue is the fourteenth official release.
Modified the following content:
- Submitting a Spark SQL Statement
- Configuring Cross-Cluster Mutual Trust Relationships
- Using Spark SQL from Scratch

2018-04-27 This issue is the thirteenth official release.
Modified the following content:
- Creating a Cluster
- Cluster List
- Creating a Cluster
- Creating the Smallest Cluster
- Creating a Cluster (History Versions)
- Viewing Basic Information About an Active Cluster
- Viewing Basic Information About a Historical Cluster

2018-04-19 This issue is the twelfth official release.
Added the following content:
- Expanding a Cluster
- List of Open Source Component Ports
- Using HBase
Modified the following content:
- Limitations
- Creating a Cluster
- Creating the Smallest Cluster
- Creating a Cluster (History Versions)
- Viewing Basic Information About an Active Cluster
- Shrinking a Cluster
- Performing Auto Scaling for a Cluster
- Viewing Basic Information About a Historical Cluster
- Backing Up Metadata
- Configuring Cross-Cluster Mutual Trust Relationships
- Configuring Users to Access Resources of a Trusted Cluster


2018-03-21 This issue is the eleventh official release.
Added the following content:
- List of MRS Component Versions
- Creating the Smallest Cluster
- Creating a Cluster (History Versions)
Modified the following content:
- Creating a Cluster
- Expanding a Cluster
- Shrinking a Cluster
- Viewing Basic Information About a Historical Cluster
- Accessing the UI of the Open Source Component

2018-03-07 This issue is the tenth official release.
Modified the following content:
- Creating a Cluster
- Changing the Password for User admin
- Changing the Password for the Kerberos Administrator
- Changing the Password for the OMS Kerberos Administrator
- Changing the Password for a Component Running User
- Changing the Password of an Operation User
- Initializing the Password of a System User


2018-02-09 This issue is the ninth official release.
Added the following content:
- Shrinking a Cluster
- Performing Auto Scaling for a Cluster
- Configuring Message Notification
- ALM-12014 Partition Lost
- ALM-12015 Partition Filesystem Readonly
- ALM-12043 DNS Resolution Duration Exceeds the Threshold
- ALM-12045 Network Read Packet Dropped Rate Exceeds the Threshold
- ALM-12046 Network Write Packet Dropped Rate Exceeds the Threshold
- ALM-12047 Network Read Packet Error Rate Exceeds the Threshold
- ALM-12048 Network Write Packet Error Rate Exceeds the Threshold
- ALM-12049 Network Read Throughput Rate Exceeds the Threshold
- ALM-12050 Network Write Throughput Rate Exceeds the Threshold
- ALM-12051 Disk Inode Usage Exceeds the Threshold
- ALM-12052 TCP Temporary Port Usage Exceeds the Threshold
- ALM-12053 File Handle Usage Exceeds the Threshold
- ALM-12054 The Certificate File Is Invalid
- ALM-12055 The Certificate File Is About to Expire
- ALM-18008 Heap Memory Usage of Yarn ResourceManager Exceeds the Threshold
- ALM-18009 Heap Memory Usage of MapReduce JobHistoryServer Exceeds the Threshold
- ALM-20002 Hue Service Unavailable
- ALM-43001 Spark Service Unavailable
- ALM-43006 Heap Memory Usage of the JobHistory Process Exceeds the Threshold
- ALM-43007 Non-Heap Memory Usage of the JobHistory Process Exceeds the Threshold
- ALM-43008 Direct Memory Usage of the JobHistory Process Exceeds the Threshold
- ALM-43009 JobHistory GC Time Exceeds the Threshold
- ALM-43010 Heap Memory Usage of the JDBCServer Process Exceeds the Threshold
- ALM-43011 Non-Heap Memory Usage of the JDBCServer Process Exceeds the Threshold
- ALM-43012 Direct Memory Usage of the JDBCServer Process Exceeds the Threshold
- ALM-43013 JDBCServer GC Time Exceeds the Threshold
Modified the following content:
- Hue
- Creating a Cluster
- Managing Files
- Creating a Job
- Overview
- Cluster List
- Creating a Cluster
- Viewing Basic Information About an Active Cluster
- Accessing the Cluster Management Page
- Expanding a Cluster
- Managing Data Files
- Viewing Basic Information About a Historical Cluster
- Accessing MRS Manager
- Changing the Password for User admin
- Changing the Password for the Kerberos Administrator
- Changing the Password for the OMS Kerberos Administrator
- Changing the Password for a Component Running User
- Changing the Password of an Operation User
- Initializing the Password of a System User
- Overview
- Using Hadoop from Scratch


2017-12-30 This issue is the eighth official release.
Added the following content:
- Shrinking a Cluster
- Configuring Message Notification
Modified the following content:
- Creating a Cluster
- Creating a Cluster
- Viewing Basic Information About an Active Cluster
- Accessing the Cluster Management Page
- Expanding a Cluster
- Viewing Basic Information About a Historical Cluster
- Accessing MRS Manager
- Changing the Password for User admin
- Changing the Password for the Kerberos Administrator
- Changing the Password for the OMS Kerberos Administrator
- Changing the Password for a Component Running User
- Changing the Password of an Operation User
- Initializing the Password of a System User
- Using Hadoop from Scratch

2017-11-22
This issue is the seventh official release.
• Modified the following content:
  – Creating a Cluster
  – Creating a Cluster
  – Viewing Basic Information About an Active Cluster
  – Viewing Basic Information About a Historical Cluster
  – Adding a Jar or Script Job
  – Replicating Jobs
  – Logging In to an ECS Using VNC
  – Changing the Password for User admin
  – Changing the Password for the Kerberos Administrator
  – Changing the Password for the OMS Kerberos Administrator
  – Changing the Password for a Component Running User
  – Changing the Password of an Operation User
  – Initializing the Password of a System User
  – Using Hadoop from Scratch
  – Using Spark from Scratch
  – Using HBase from Scratch

2017-10-24
This issue is the sixth official release.
• Modified the following content:
  – Limitations
  – Creating a Cluster
  – Creating a Cluster
  – Viewing Basic Information About an Active Cluster
  – Viewing Basic Information About a Historical Cluster
  – Accessing the UI of the Open Source Component

2017-09-29
This issue is the fifth official release.
• Added the following content:
  – ALM-12357 Failed to Export Audit Logs to the OBS
  – ALM-24000 Flume Service Unavailable
  – ALM-24001 Flume Agent Is Abnormal
  – ALM-24003 Flume Client Connection Failure
  – ALM-24004 Flume Fails to Read Data
  – ALM-24005 Data Transmission by Flume Is Abnormal
  – Viewing Flume Client Logs
  – Stopping or Uninstalling the Flume Client
  – Example: Using Flume to Collect Logs and Import Them to OBS
  – Example: Using Flume to Read OBS Files and Upload Them to HDFS
• Modified the following content:
  – Loader
  – Relationships with Other Services
  – Limitations
  – Creating a Cluster
  – Creating a Cluster
  – Viewing Basic Information About an Active Cluster
  – Adding a Jar or Script Job
  – Viewing Basic Information About a Historical Cluster
  – MRS Manager Introduction
  – ALM-16002 Successful Hive SQL Operations Are Lower than the Threshold
  – Viewing and Exporting Audit Logs
  – Configuring Audit Log Export Parameters
  – List of Default Users
  – Overview
  – Creating an SSH Channel to Connect an MRS Cluster and Configuring the Browser
  – Introduction
  – Flume Configuration Parameter Description
  – Example: Using Flume to Collect Logs and Import Them to Kafka

2017-08-21
This issue is the fourth official release.
• Added the following content:
  – Flume
  – Loader
  – Accessing the UI of the Open Source Component
  – Using Flume
  – Using Loader
• Modified the following content:
  – Hue
  – Relationships with Other Services
  – Limitations
  – Creating a Cluster
  – Creating a Cluster
  – Viewing Basic Information About an Active Cluster
  – Expanding a Cluster
  – Viewing the Alarm List
  – Viewing Basic Information About a Historical Cluster
  – Viewing Job Configurations and Logs

2017-07-14
This issue is the third official release.
• Modified the following content:
  – Spark
  – Spark SQL
  – Creating a Cluster
  – Creating a Cluster
  – Viewing Basic Information About an Active Cluster
  – Viewing Basic Information About a Historical Cluster

2017-06-23
This issue is the second official release.
• Added the following content:
  – Functions
  – Required Permission for Using MRS
  – MRS Quick Start
  – Managing Active Clusters
  – Managing Historical Clusters
  – Accessing MRS Manager Supporting Kerberos Authentication
  – Management of Clusters with Kerberos Authentication Enabled
  – Using MRS
• Modified the following content:
  – Overview
  – Cluster List
  – Creating a Cluster

2016-08-25
This issue is the first official release.
