
MapR Administrator Training

April 2012

Version 1.2.10

Quick Start
Installation
Administration
Development
Reference


Copyright © 2012, MapR Technologies, Inc. All rights reserved.

The MapR logo is a registered trademark of MapR Technologies, Inc.

DOCUMENTATION IS PROVIDED “AS IS” AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.

MapR Technologies, Inc. has intellectual property rights relating to technology embodied in the product that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more U.S. patents or pending patent applications in the U.S. and in other countries.


Table of Contents

Home .... 5
Start Here .... 6
Quick Start - Single Node .... 8
    Single Node - RHEL or CentOS .... 9
    Single Node - SUSE .... 14
    Single Node - Ubuntu .... 19
Quick Start - Small Cluster .... 24
    M3 - RHEL or CentOS .... 25
    M3 - Ubuntu .... 29
    M5 - RHEL or CentOS .... 33
    M5 - Ubuntu .... 37
    M3 - SUSE .... 42
    M5 - SUSE .... 46
Quick Start - MapR Virtual Machine .... 50
    Installing the MapR Virtual Machine .... 51
    A Tour of the MapR Virtual Machine .... 52
    Working with Snapshots, Mirrors, and Schedules .... 56
    Getting Started with HBase .... 59
    Getting Started with Hive .... 61
    Getting Started with Pig .... 63
MapR 2.0 Beta .... 64
Our Partners .... 71
    Datameer .... 72
    Karmasphere .... 73
    HParser .... 74
Installation Guide .... 75
    Requirements .... 77
        PAM Configuration .... 81
        Setting Up Disks for MapR .... 83
        ulimit .... 85
    Planning the Deployment .... 87
        Cluster Architecture .... 91
        Isolating CLDB Nodes .... 92
        Isolating ZooKeeper Nodes .... 93
    Installing MapR .... 94
    Cluster Configuration .... 100
    Component Setup .... 102
        Flume .... 103
        HBase .... 104
            HBase Best Practices .... 108
        Hive .... 109
            Hive ODBC Connector .... 118
        Mahout .... 123
        MultiTool .... 126
        Oozie .... 127
        Pig .... 129
        Using Whirr to Install on Amazon EC2 .... 131
    Setting Up the Client .... 135
    Uninstalling MapR .... 141
    Working with Multiple Clusters .... 142
Administration Guide .... 143
    Monitoring .... 144
        Alarms and Notifications .... 145
        Ganglia .... 146
        Nagios Integration .... 147
        Service Metrics .... 148
    Managing the Cluster .... 150
        Balancers .... 151
        Cluster Upgrade .... 153
            Manual Upgrade .... 154
            Rolling Upgrade .... 157
        Dial Home .... 159
        Disks .... 160
        Nodes .... 162
            Adding Nodes to a Cluster .... 166
            Adding Roles .... 167
            Node Topology .... 168
            Removing Roles .... 169
        Services .... 170
            CLDB Failover .... 171
            TaskTracker Blacklisting .... 174
        Startup and Shutdown .... 175
    Managing Data with Volumes .... 177
        Mirrors .... 181
        Schedules .... 184
        Snapshots .... 186
    Users and Groups .... 188
        Managing Permissions .... 189
        Managing Quotas .... 192
    Troubleshooting .... 194
        Disaster Recovery .... 195
        Out of Memory Troubleshooting .... 196
        Troubleshooting Alarms .... 197
Development Guide .... 205
    Working with MapReduce .... 206
        Compiling Pipes Programs .... 207
        ExpressLane .... 208
        Secured TaskTracker .... 209
        Standalone Operation .... 210
        Tuning MapReduce .... 211
    Working with MapR-FS .... 213
        Chunk Size .... 217
        Compression .... 218
    Working with Data .... 220
        Accessing Data with NFS .... 221
        Copying Data from Apache Hadoop .... 225
        Data Protection .... 227
        Provisioning Applications .... 229
            Provisioning for Capacity .... 232
            Provisioning for Performance .... 233
Migration Guide .... 234
    Planning the Migration .... 235
    Initial MapR Deployment .... 236
    Component Migration .... 237
    Application Migration .... 239
    Data Migration .... 240
    Node Migration .... 242
Reference Guide .... 243
    Release Notes .... 244
        Version 1.2 Release Notes .... 245
            Version 1.2.10 Release Notes .... 247
            Version 1.2.9 Release Notes .... 248
            Version 1.2.7 Release Notes .... 250
            Hadoop Compatibility in Version 1.2 .... 251
            Version 1.2.2 Release Notes .... 258
            Version 1.2.3 Release Notes .... 259
        Version 1.1 Release Notes .... 260
            Version 1.1.3 Release Notes .... 262
            Version 1.1.2 Release Notes .... 263
            Version 1.1.1 Release Notes .... 265
            Hadoop Compatibility in Version 1.1 .... 267
        Version 1.0 Release Notes .... 273
            Hadoop Compatibility in Version 1.0 .... 276
        Beta Release Notes .... 282
        Alpha Release Notes .... 285
    MapR Control System .... 286
        Cluster .... 288
        MapR-FS .... 299
        NFS HA .... 308
        Alarms .... 310
        System Settings .... 313
        Other Views .... 318
    Hadoop Commands .... 319
        hadoop archive .... 321
        hadoop classpath .... 322
        hadoop daemonlog .... 323
        hadoop distcp .... 325
        hadoop fs .... 328
        hadoop jar .... 331
        hadoop job .... 332
        hadoop jobtracker .... 334
        hadoop mfs .... 335
        hadoop mradmin .... 337
        hadoop pipes .... 338
        hadoop queue .... 339
        hadoop tasktracker .... 340
        hadoop version .... 344
    API Reference .... 345
        acl .... 348
            acl edit .... 349
            acl set .... 350
            acl show .... 352
        alarm .... 354
            alarm clear .... 355
            alarm clearall .... 356
            alarm config load .... 357
            alarm config save .... 359
            alarm list .... 360
            alarm names .... 362
            alarm raise .... 363
        config .... 364
            config load .... 367
            config save .... 369
        dashboard .... 370
            dashboard info .... 371
        dialhome .... 375
            dialhome ackdial .... 376
            dialhome enable .... 377
            dialhome lastdialed .... 378
            dialhome metrics .... 379
            dialhome status .... 380
        disk .... 381
            disk add .... 382
            disk list .... 384
            disk listall .... 385
            disk remove .... 386
        entity .... 388
            entity info .... 389
            entity list .... 391
            entity modify .... 393
        license .... 394
            license add .... 395
            license addcrl .... 396
            license apps .... 397
            license list .... 398
            license listcrl .... 399
            license remove .... 400
            license showid .... 401
        nagios .... 402
            nagios generate .... 403
        nfsmgmt .... 406
            nfsmgmt refreshexports .... 407
        node .... 408
            node heatmap .... 409
            node list .... 411
            node move .... 416
            node path .... 417
            node remove .... 418
            node services .... 419
            node topo .... 420
        schedule .... 421
            schedule create .... 423
            schedule list .... 424
            schedule modify .... 425
            schedule remove .... 426
        service list .... 427
        setloglevel .... 428
            setloglevel cldb .... 429
            setloglevel fileserver .... 430
            setloglevel hbmaster .... 431
            setloglevel hbregionserver .... 432
            setloglevel jobtracker .... 433
            setloglevel nfs .... 434
            setloglevel tasktracker .... 435
        trace .... 436
            trace dump .... 437
            trace info .... 438
            trace print .... 440
            trace reset .... 441
            trace resize .... 442
            trace setlevel .... 443
            trace setmode .... 444
        urls .... 445
        virtualip .... 446
            virtualip add .... 447
            virtualip edit .... 448
            virtualip list .... 449
            virtualip remove .... 450
        volume .... 451
            volume create .... 452
            volume dump create .... 455
            volume dump restore .... 457
            volume fixmountpath .... 459
            volume info .... 460
            volume link create .... 461
            volume link remove .... 462
            volume list .... 463
            volume mirror push .... 467
            volume mirror start .... 468
            volume mirror stop .... 469
            volume modify .... 470
            volume mount .... 472
            volume move .... 473
            volume remove .... 474
            volume rename .... 475
            volume snapshot create .... 476
            volume snapshot list .... 477
            volume snapshot preserve .... 479
            volume snapshot remove .... 481
            volume unmount .... 483
    Utilities .... 484
        configure.sh .... 485
        disksetup .... 487
        mapr-support-collect.sh .... 489
        rollingupgrade.sh .... 491
    Environment Variables .... 492
    Configuration Files .... 493
        .dfs_attributes .... 494
        cldb.conf .... 495
        core-site.xml .... 497
        disktab .... 498
        hadoop-metrics.properties .... 499
        mapr-clusters.conf .... 502
        mapred-default.xml .... 503
        mapred-site.xml .... 513
        mfs.conf .... 518
        taskcontroller.cfg .... 519
        warden.conf .... 520
    Ports Used by MapR .... 524
    Best Practices .... 525
    Glossary .... 531


Home

Welcome to MapR! If you are not sure how to get started, here are a few places to find the information you are looking for:

Quick Start - MapR Virtual Machine - A single-node cluster that's ready to roll, right out of the box!
Quick Start - Single Node - Set up a single-node Hadoop cluster and try some of the features that set MapR Hadoop apart
Quick Start - Small Cluster - Set up a Hadoop cluster with a small to moderate number of nodes
Installation Guide - Learn how to set up a production cluster, large or small
Development Guide - Read more about what you can do with a MapR cluster
Administration Guide - Learn how to configure and tune a MapR cluster for performance


Start Here

MapR Distribution for Apache Hadoop™ is the easiest, most dependable, and fastest Hadoop distribution on the planet. It is the only Hadoop distribution that allows direct data input and output via MapR Direct Access NFS™ with realtime analytics, and the first to provide true High Availability (HA) at all levels. MapR introduces logical volumes to Hadoop. A volume is a way to group data and apply policy across an entire data set. MapR provides hardware status and control with the MapR Control System, a comprehensive UI including a Heatmap™ that displays the health of the entire cluster at a glance. Read on to learn how the unique features of MapR provide the highest-performance, lowest-cost Hadoop available.

To get started right away, read the Quick Start guides:

Quick Start - MapR Virtual Machine
Quick Start - Single Node
Quick Start - Small Cluster

To learn more about MapR, read on!

Ease of Use

With MapR, it is easy to run Hadoop jobs reliably, while isolating resources between different departments or jobs, applying data and performance policies, and tracking resource usage and job performance. MapR enables you to perform the following tasks:

Create a volume and set policy. The MapR Control System makes it simple to set up a volume and assign granular control to users or groups. Use replication, mirroring, and snapshots for data protection, isolation, or performance.
Provision resources. You can limit the size of data on a volume, or place the volume on specific racks or nodes for performance or protection.
Run the Hadoop job normally. Realtime Hadoop analytics let you track resource usage and job performance, while Direct Access NFS gives you easy data input and direct access to the results.
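
As a rough sketch of that workflow from the command line, the example below provisions a volume and then runs an ordinary Hadoop job against it. The volume name, mount path, and job jar are hypothetical, and the maprcli parameters follow the volume create reference page, so verify them against your release:

    #!/bin/bash
    # Hypothetical example: provision a volume, then run a normal Hadoop job.

    # Create a volume mounted at /projects/wordcount with 3x replication
    # and a 100 GB hard quota (parameter names per the MapR 1.2 docs).
    maprcli volume create -name project.wordcount \
        -path /projects/wordcount -replication 3 -quota 100G

    # Run the job exactly as on any Hadoop distribution; MapR-FS paths
    # behave like ordinary Hadoop paths.
    hadoop jar wordcount-job.jar WordCount \
        /projects/wordcount/input /projects/wordcount/output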

MapR lets you control data access and placement, so that multiple concurrent Hadoop jobs can safely share the cluster.

With MapR, you can mount the cluster on any server or client and have your applications write data and log files directly into the cluster, instead of the batch processing model of the past. You do not have to wait for a file to be closed before reading it; you can tail a file as it is being written. Direct Access NFS even makes it possible to use standard shell scripts to work with Hadoop data directly.
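
A minimal sketch of that direct access, assuming a Linux client, an NFS-enabled cluster node named node01, and the cluster name my.cluster.com (all illustrative):

    # Mount the cluster via Direct Access NFS; the MapR docs recommend
    # the nolock option for Linux clients.
    sudo mkdir -p /mapr
    sudo mount -o nolock node01:/mapr /mapr

    # Tail a log file while a job is still writing it; there is no need
    # to wait for the file to be closed.
    tail -f /mapr/my.cluster.com/apps/job.log

    # Standard shell tools work on cluster data directly.
    grep -c ERROR /mapr/my.cluster.com/apps/job.log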

Provisioning resources is simple. You can easily create a volume for a project or department in a few clicks. MapR integrates with NIS and LDAP, making it easy to manage users and groups. The MapR Control System makes it a breeze to assign user or group quotas, to limit how much data a user or group can write; or volume quotas, to limit the size of a volume. You can assign topology to a volume, to limit it to a specific rack or set of nodes. Setting recovery time objective (RTO) and recovery point objective (RPO) for a data set is a simple matter of scheduling snapshots and mirrors on a volume through the MapR Control System. You can set read and write permissions on volumes directly via NFS or using hadoop fs commands, and volumes provide administrative delegation through ACLs; for example, through the MapR Control System you can control who can mount, unmount, snapshot, or mirror a volume.
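As an illustration of setting permissions both ways, the following sketch assumes the cluster is NFS-mounted at /mapr/my.cluster.com and uses a hypothetical volume mounted at /projects; both commands act on the same underlying data:

# Set permissions through the NFS mount with standard Linux tools
chmod 750 /mapr/my.cluster.com/projects

# Set the same permissions through the Hadoop shell
hadoop fs -chmod 750 /projects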

At its heart, MapR is a Hadoop distribution. You can run Hadoop jobs the way you always have.

MapR has partnered with Datameer, which provides a self-service Business Intelligence platform that runs best on the MapR Distribution for Apache Hadoop. Your download of MapR includes a 30-day trial version of Datameer Analytics Solution (DAS), which provides spreadsheet-style analytics, ETL, and data visualization capabilities.

For more information:

Read about Provisioning Applications
Learn about Direct Access NFS
Check out Datameer

Dependability

With clusters growing to thousands of nodes, hardware failures are inevitable even with the most reliable machines in place. MapR Distribution for Apache Hadoop has been designed from the ground up to tolerate hardware failure seamlessly.

MapR is the first Hadoop distribution to provide true HA and failover at all levels, including a MapR Distributed HA NameNode™. If a disk or node in the cluster fails, MapR automatically restarts any affected processes on another node without requiring administrative intervention. The HA JobTracker ensures that any tasks interrupted by a node or disk failure are restarted on another TaskTracker node. In the event of any failure, the job's completed task state is preserved and no tasks are lost. For additional data reliability, every bit of data on the wire is compressed and CRC-checked.

With volumes, you can control access to data, set the replication factor, and place specific data sets on specific racks or nodes for performance or data protection. Volumes control data access to specific users or groups with Linux-style permissions that integrate with existing LDAP and NIS directories. Volumes can be size-limited with volume quotas to prevent data overruns from using excessive storage capacity. One of the most powerful aspects of the volume concept is the ways in which a volume provides data protection:

To enable point-in-time recovery and easy backups, volumes have manual and policy-based snapshot capability.
For true business continuity, you can manually or automatically mirror volumes and synchronize them between clusters or datacenters to enable easy disaster recovery.
You can set volume read/write permission and delegate administrative functions, to control access to data.

Volumes can be exported with MapR Direct Access NFS with HA, allowing data to be read and written directly to Hadoop without the need for temporary storage or log collection. You can load-balance across NFS nodes; clients connecting to different nodes see the same view of the cluster.

The MapR Control System provides powerful hardware insight down to the node level, as well as complete control of users, volumes, quotas, mirroring, and snapshots. Filterable alarms and notifications provide immediate warnings about hardware failures or other conditions that require attention, allowing a cluster administrator to detect and resolve problems quickly.

For more information:

Take a look at the Heatmap
Learn about Volumes, Snapshots, and Mirroring
Explore Data Protection scenarios

Performance

MapR Distribution for Apache Hadoop achieves up to three times the performance of any other Hadoop distribution, and can reduce your equipment costs by half.

MapR Direct Shuffle uses the Distributed NameNode to improve Reduce phase performance drastically. Unlike Hadoop distributions that use the local filesystem for shuffle and HTTP to transport shuffle data, MapR shuffle data is readable directly from anywhere on the network. MapR stores data with Lockless Storage Services™, a sharded system that eliminates contention and overhead from data transport and retrieval. Automatic, transparent client-side compression reduces network overhead and reduces footprint on disk, while direct block device I/O provides throughput at hardware speed with no additional overhead. As an additional performance boost, with MapR Realtime Hadoop™, you can read files while they are still being written.

MapR gives you ways to tune the performance of your cluster. Using mirrors, you can load-balance reads on highly-accessed data to alleviate bottlenecks and improve read bandwidth to multiple users. You can run MapR Direct Access NFS on many nodes (all nodes in the cluster, if desired) and load-balance reads and writes across the entire cluster. Volume topology helps you further tune performance by allowing you to place resource-intensive Hadoop jobs and high-activity data on the fastest machines in the cluster.

For more information:

Read about Provisioning for Performance

Get Started

Now that you know a bit about how the features of MapR Distribution for Apache Hadoop work, take a quick tour to see for yourself how they can work for you:

If you would like to give MapR a try on a single machine, check out Quick Start - MapR Virtual Machine
To explore cluster installation scenarios, see Planning the Deployment
For more about provisioning, see Provisioning Applications
For more about data policy, see Working with Data


Quick Start - Single Node

You're just a few minutes away from installing and running a real single-node MapR Hadoop cluster. To get started, choose your operating system:

Red Hat Enterprise Linux (RHEL) or CentOS
SUSE
Ubuntu


Single Node - RHEL or CentOS

Use the following steps to install a simple single-node cluster with a basic set of services. These instructions bring up a single node with the following roles:

CLDB
FileServer
JobTracker
NFS
TaskTracker
WebServer
ZooKeeper

Step 1. Requirements

64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 500 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

If Java is already installed, check which versions of Java are installed:

java -version

If JDK 6 is installed, the output will include a version number starting with 1.6, and then below that the text Java(TM). Example:

java version "1.6.0_24"Java(TM) SE Environment (build 1.6.0_24-b07)Runtime

If necessary, install Sun Java JDK 6. Once Sun Java JDK 6 is installed, use update-alternatives to make sure it is the default Java:

sudo update-alternatives --config java

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

To get the most out of this tutorial, be sure to register for your free M3 license after installing the MapR software.

Step 2. Add the MapR Repository

1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.2.10/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

Step 3. Install the Software

Before installing the software, make sure you have created the /tmp/disks.txt file containing a list of disks and partitions for use by MapR.

Change to the root user (or use sudo) and use the following commands to install a MapR single-node cluster:

yum install mapr-single-node
/opt/mapr/server/disksetup -F /tmp/disks.txt
/etc/init.d/mapr-zookeeper start
/etc/init.d/mapr-warden start

Specify an administrative user for the cluster by running the following command (as root or with sudo), replacing <user> with your Linux username:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

If you are running MapR Single Node on a laptop or a computer that you reboot frequently, you might want to use the chkconfig --del mapr-warden command (as root or with sudo) to prevent the cluster from starting automatically when the computer restarts. You can always start the cluster manually by running /etc/init.d/mapr-zookeeper start and /etc/init.d/mapr-warden start (as shown above).


Step 4. Check out the MapR Control System

After a few minutes the MapR Control System starts.

The MapR Control System gives you power at a glance over the entire cluster. With the MapR Control System, you can see immediately which nodes are healthy; how much bandwidth, disk space, and CPU time are in use; and the status of MapReduce jobs. But there's more: the MapR Control System lets you manage your data in ways that were previously impossible with Hadoop. During the next few exercises, you will learn a few of the easy, powerful capabilities of MapR.

1. Open a browser and navigate to the MapR Control System on the server where you installed MapR. The URL is https://<host>:8443. Example:

https://localhost:8443

Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can safely ignore the warning this time.
2. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
3. Log in using the Linux username and password you specified as the administrative user in Step 3.

Welcome to the MapR Control System, which displays information about the health and performance of the cluster. Notice the navigation pane to the left and the larger view pane to the right. The navigation pane lets you quickly switch between views, to display different types of information.

When the MapR Control System starts, the first view displayed is the Dashboard. The MapR single node is represented as a square. The square is not green; the node health is not perfect: because the license has not yet been applied, the NFS service has failed to start, resulting in an alarm on the node. Later, you'll see how to obtain the license and mount the cluster with NFS.

Click the node to display the Node Properties view. Notice the types of information available about the node:

Alarms - timely information about any problems on the node
Machine Performance - resource usage
General Information - physical topology of the node and the times of the most recent heartbeats
MapReduce - map and reduce slot usage
Manage Node Services - services on the node
MapR-FS and Available Disks - disks used for storage by MapR
System Disks - disks available to add to MapR storage

Step 5. Get a License

1. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
2. Click Add Licenses via Web.
3. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there:

If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

Step 6. Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive: it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.
2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.
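If you prefer the command line, the same volume can also be created with the maprcli volume create command; a minimal sketch (treat the -topology parameter as an assumption for this release):

/opt/mapr/bin/maprcli volume create -name MyVolume -path /myvolume -topology /default-rack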

Step 7. Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and replace <host> in the steps below with the hostname of the machine where you installed MapR.


Note: To use NFS you must first add a license as discussed earlier in the Get a License step.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tip: If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx   - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x   - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x   - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x   - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser: you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the single-node cluster via NFS, but your previous NFS exports will remain available.

Step 8. Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the /myvolume/in directory and put the files there. Example:


mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.
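Because the cluster is mounted via NFS, you can also inspect the results with ordinary shell tools; a quick sketch using the mount point from Step 7:

# Show the first few word counts (one word and count per line)
head /mapr/my.cluster.com/myvolume/out/part-r-00000

# Show the ten most frequent words
sort -k2 -nr /mapr/my.cluster.com/myvolume/out/part-r-00000 | head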

Step 9. Stop the Single Node Cluster

To stop the single node cluster:

1. Stop the warden:

/etc/init.d/mapr-warden stop

2. Stop zookeeper:

/etc/init.d/mapr-zookeeper stop

Note: For information about stopping a multi-node cluster, see Startup and Shutdown.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


Single Node - SUSE

Use the following steps to install a simple single-node cluster with a basic set of services. These instructions bring up a single node with the following roles:

CLDB
FileServer
JobTracker
NFS
TaskTracker
WebServer
ZooKeeper

Step 1. Requirements

64-bit SUSE Linux Enterprise Server 11.x
RAM: 4 GB or more
At least one free unmounted drive or partition, 500 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

If Java is already installed, check which versions of Java are installed with the command

java -version

If JDK 6 is installed, the output will include a version number starting with 1.6, and then below that the text Java(TM). Example:

java version "1.6.0_24"Java(TM) SE Environment (build 1.6.0_24-b07)Runtime

If necessary, install Sun Java JDK 6. Once Sun Java JDK 6 is installed, use update-alternatives to make sure it is the default Java:

sudo update-alternatives --config java

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

To get the most out of this tutorial, be sure to register for your free M3 license after installing the MapR software.

Step 2. Add the MapR Repository

1. Change to the root user (or use sudo for the following commands).
2. Use the following command to add the MapR repository:

zypper ar http://package.mapr.com/releases/v1.2.10/redhat/ mapr

To install a previous release, see the Release Notes for the correct path.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:


http_proxy=http://<host>:<port>
export http_proxy

4. Update the system package index by running the following command:

zypper refresh

5. Execute the following command:

zypper install mapr-compat-suse

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

Step 3. Install the Software

Before installing the software, make sure you have created the /tmp/disks.txt file containing a list of disks and partitions for use by MapR.

Change to the root user (or use sudo) and use the following commands to install a MapR single-node cluster:

zypper install mapr-single-node
/opt/mapr/server/disksetup -F /tmp/disks.txt
/etc/init.d/mapr-zookeeper start
/etc/init.d/mapr-warden start

Specify an administrative user for the cluster by running the following command (as root or with sudo), replacing <user> with your Linux username:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

If you are running MapR Single Node on a laptop or a computer that you reboot frequently, you might want to use the chkconfig --del mapr-warden command (as root or with sudo) to prevent the cluster from starting automatically when the computer restarts. You can always start the cluster manually by running /etc/init.d/mapr-zookeeper start and /etc/init.d/mapr-warden start (as shown above).

Step 4. Check out the MapR Control System

After a few minutes the MapR Control System starts.

The MapR Control System gives you power at a glance over the entire cluster. With the MapR Control System, you can see immediately which nodes are healthy; how much bandwidth, disk space, and CPU time are in use; and the status of MapReduce jobs. But there's more: the MapR Control System lets you manage your data in ways that were previously impossible with Hadoop. During the next few exercises, you will learn a few of the easy, powerful capabilities of MapR.

1. Open a browser and navigate to the MapR Control System on the server where you installed MapR. The URL is https://<host>:8443. Example:

https://localhost:8443

Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can safely ignore the warning this time.
2. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
3. Log in using the Linux username and password you specified as the administrative user in Step 3.

Welcome to the MapR Control System, which displays information about the health and performance of the cluster. Notice the navigation pane to the left and the larger view pane to the right. The navigation pane lets you quickly switch between views, to display different types of information.


When the MapR Control System starts, the first view displayed is the Dashboard. The MapR single node is represented as a square. The square is not green; the node health is not perfect: because the license has not yet been applied, the NFS service has failed to start, resulting in an alarm on the node. Later, you'll see how to obtain the license and mount the cluster with NFS.

Click the node to display the Node Properties view. Notice the types of information available about the node:

Alarms - timely information about any problems on the node
Machine Performance - resource usage
General Information - physical topology of the node and the times of the most recent heartbeats
MapReduce - map and reduce slot usage
Manage Node Services - services on the node
MapR-FS and Available Disks - disks used for storage by MapR
System Disks - disks available to add to MapR storage

Step 5. Get a License

1. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
2. Click Add Licenses via Web.
3. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there:

If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

Step 6. Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive: it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.
2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Step 7. Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Note: To use NFS you must first add a license as discussed earlier in the Get a License step.


Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tip: If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx   - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x   - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x   - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x   - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser: you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the single-node cluster via NFS, but your previous NFS exports will remain available.

Step 8. Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the /myvolume/in directory and put the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in


3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Step 9. Stop the Single Node Cluster

To stop the single node cluster:

1. Stop the warden:

/etc/init.d/mapr-warden stop

2. Stop zookeeper:

/etc/init.d/mapr-zookeeper stop

Note: For information about stopping a multi-node cluster, see Startup and Shutdown.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


Single Node - Ubuntu

Use the following steps to install a simple single-node cluster with a basic set of services. These instructions bring up a single node with the following roles:

CLDB
FileServer
JobTracker
NFS
TaskTracker
WebServer
ZooKeeper

Step 1. Requirements

64-bit Ubuntu 9.04 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 500 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

If Java is already installed, check which versions of Java are installed:

java -version

If JDK 6 is installed, the output will include a version number starting with 1.6, and then below that the text Java(TM). Example:

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)

If necessary, install Sun Java JDK 6. Once Sun Java JDK 6 is installed, use update-alternatives to make sure it is the default Java:

sudo update-alternatives --config java

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

To get the most out of this tutorial, be sure to register for your free M3 license after installing the MapR software.

Step 2. Add the MapR Repository

1. Change to the root user (or use sudo for the following commands).
2. Add the following line to /etc/apt/sources.list:

deb http://package.mapr.com/releases/v1.2.10/ubuntu/ mapr optional

(To install a previous release, see the Release Notes for the correct path.)
3. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:


Acquire {
  Retries "0";
  HTTP {
    Proxy "http://<user>:<password>@<host>:<port>";
  };
};

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

Step 3. Install the Software

Before installing the software, make sure you have created the /tmp/disks.txt file containing a list of disks and partitions for use by MapR.

Change to the root user (or use sudo) and use the following commands to install a MapR single-node cluster:

apt-get update
apt-get install mapr-single-node
/opt/mapr/server/disksetup -F /tmp/disks.txt
/etc/init.d/mapr-zookeeper start
/etc/init.d/mapr-warden start

Specify an administrative user for the cluster by running the following command (as root or with sudo), replacing <user> with your Linux username:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

If you are running MapR Single Node on a laptop or a computer that you reboot frequently, you might want to use the update-rc.d -f mapr-warden remove command (as root or with sudo) to prevent the cluster from starting automatically when the computer restarts. You can always start the cluster manually by running /etc/init.d/mapr-zookeeper start and /etc/init.d/mapr-warden start (as shown above).

Step 4. Check out the MapR Control System

After a few minutes the MapR Control System starts.

The MapR Control System gives you power at a glance over the entire cluster. With the MapR Control System, you can see immediately which nodes are healthy; how much bandwidth, disk space, and CPU time are in use; and the status of MapReduce jobs. But there's more: the MapR Control System lets you manage your data in ways that were previously impossible with Hadoop. During the next few exercises, you will learn a few of the easy, powerful capabilities of MapR.

1. Open a browser and navigate to the MapR Control System on the server where you installed MapR. The URL is https://<host>:8443. Example:

https://localhost:8443

Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can safely ignore the warning this time.
2. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
3. Log in using the Linux username and password you specified as the administrative user in Step 3.

Welcome to the MapR Control System, which displays information about the health and performance of the cluster. Notice the navigation pane to the left and the larger view pane to the right. The navigation pane lets you quickly switch between views, to display different types of information.


When the MapR Control System starts, the first view displayed is the Dashboard. The MapR single node is represented as a square. The square is not green; the node health is not perfect: because the license has not yet been applied, the NFS service has failed to start, resulting in an alarm on the node. Later, you'll see how to obtain the license and mount the cluster with NFS.

Click the node to display the Node Properties view. Notice the types of information available about the node:

Alarms - timely information about any problems on the node
Machine Performance - resource usage
General Information - physical topology of the node and the times of the most recent heartbeats
MapReduce - map and reduce slot usage
Manage Node Services - services on the node
MapR-FS and Available Disks - disks used for storage by MapR
System Disks - disks available to add to MapR storage

Step 5. Get a License

1. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
2. Click Add Licenses via Web.
3. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there:

If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

Step 6. Start Working with Volumes

MapR provides volumes as a way to organize data into groups so you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive: it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within it. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.
2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Step 7. Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Note: To use NFS, you must first add a license as discussed earlier in the Get a License step.


Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tip: If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.
5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx   - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x   - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x   - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x   - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser: you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the single-node cluster via NFS, but your previous NFS exports will remain available.

Step 8. Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the /myvolume/in directory and put the files there. Example:


mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Copy and paste the following code into the command line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Step 9. Stop the Single Node Cluster

To stop the single node cluster:

1. Stop the warden:

/etc/init.d/mapr-warden stop

2. Stop zookeeper:

/etc/init.d/mapr-zookeeper stop

Note: For information about stopping a multi-node cluster, see Startup and Shutdown.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


Quick Start - Small Cluster

Choose the Quick Start guide that is right for your operating system:

M3 - RHEL or CentOS
M3 - Ubuntu
M3 - SUSE
M5 - RHEL or CentOS
M5 - Ubuntu
M5 - SUSE


M3 - RHEL or CentOS

Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster, see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M3 license after installing the MapR software.

Setup

Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

Each node must have a unique hostname, and keyless SSH set up to all other nodes.
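One common way to set up keyless SSH is with ssh-keygen and ssh-copy-id; a sketch for the administrative user, where the node addresses are the placeholders defined in the substitutions below:

# On each node, generate a key pair (accept the default location, empty passphrase)
ssh-keygen -t rsa

# Copy the public key to every other node, for example:
ssh-copy-id <user>@<node 2>
ssh-copy-id <user>@<node 3>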

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username
<node 1>, <node 2>, <node 3>, ... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run any other service (see Isolating CLDB Nodes).

Deployment

1. Change to the root user (or use sudo for the following commands).
2. On all nodes, create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.2.10/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:


http_proxy=http://<host>:<port>
export http_proxy

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

4. On node 1, execute the following command:

yum install mapr-cldb mapr-fileserver mapr-jobtracker mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper

5. On nodes 2 and 3, execute the following command:

yum install mapr-fileserver mapr-tasktracker mapr-zookeeper

6. On all other nodes (nodes 4...n), execute the following command:

yum install mapr-fileserver mapr-tasktracker

7. On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt
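For example, if the first three nodes had the hypothetical IP addresses 10.10.30.1, 10.10.30.2, and 10.10.30.3, the commands on each node would look like this:

# CLDB on node 1, ZooKeeper on nodes 1-3 (addresses are illustrative only)
/opt/mapr/server/configure.sh -C 10.10.30.1 -Z 10.10.30.1,10.10.30.2,10.10.30.3
/opt/mapr/server/disksetup -F /tmp/disks.txt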

8. On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

9. On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tip: If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
10. On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tip: The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.
11. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there:


If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

12. On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

13. On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

14. Log in to the MapR Control System.
15. Under the Cluster group in the left pane, click Dashboard.
16. Check the Services pane and make sure each service is running the correct number of instances, as listed below. A command-line check follows the list:

Instances of the FileServer and TaskTracker on all nodes
3 instances of ZooKeeper
1 instance of the CLDB, JobTracker, NFS, and WebServer
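You can also verify the cluster from a terminal with the maprcli command used earlier; a minimal sketch (the plain listing is standard, while the -columns filter and its column names are an assumption for this release):

# List the nodes in the cluster and their status
/opt/mapr/bin/maprcli node list

# Optionally narrow the output to hostnames and services (column names assumed)
/opt/mapr/bin/maprcli node list -columns hostname,service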

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive: it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.
2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Scroll to the bottom and click OK to create the volume.

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

1. Change to the root user (or use sudo for the following commands).
2. See what is exported from the machine where you installed MapR:

showmount -e <host>

3. Set up a mount point for the NFS share:

mkdir /mapr

4. Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tip: If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.


5. To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

6. List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 4 items
drwxrwxrwx   - root root 0 2011-11-22 12:44 /myvolume
drwxr-xr-x   - mapr mapr 0 2011-01-03 13:50 /tmp
drwxr-xr-x   - mapr mapr 0 2011-01-04 13:57 /user
drwxr-xr-x   - root root 0 2010-11-25 09:41 /var

7. Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

8. List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser: you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the /myvolume/in directory and put the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


M3 - Ubuntu

Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster, see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M3 license after installing the MapR software.

Setup

Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Ubuntu 9.04 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

Each node must have a unique hostname, and keyless SSH set up to all other nodes.
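Keyless SSH is typically set up by generating a key pair and copying the public key to each of the other nodes. The following is a minimal sketch using standard OpenSSH tools (the node placeholders match the rest of this guide; adjust for your environment):

# on each node, as the administrative user: generate a key pair (accept defaults, empty passphrase)
ssh-keygen -t rsa
# copy the public key to every other node in the cluster
ssh-copy-id <user>@<node 2>
ssh-copy-id <user>@<node 3>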

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username
<node 1>, <node 2>, <node 3>, ... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run any other service (see Isolating CLDB Nodes).

Deployment

Change to the root user (or use sudo for the following commands).
On all nodes, add the following line to /etc/apt/sources.list:

deb http://package.mapr.com/releases/v1.2.10/ubuntu/ mapr optional

To install a previous release, see the Release Notes for the correct path to use.
If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

On all nodes, run the following command:

apt-get update

If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:


Acquire {
  Retries "0";
  HTTP {
    Proxy "http://<proxy user>:<proxy password>@<host>:<port>";
  };
};

On node 1, execute the following command:

apt-get install mapr-cldb mapr-fileserver mapr-jobtracker mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper

On nodes 2 and 3, execute the following command:

apt-get install mapr-fileserver mapr-tasktracker mapr-zookeeper

On all other nodes (nodes 4...n), execute the following commands:

apt-get install mapr-fileserver mapr-tasktracker

On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.
On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.


If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

Log in to the MapR Control System.
Under the Cluster group in the left pane, click Dashboard.
Check the Services pane and make sure each service is running the correct number of instances:

Instances of the FileServer and TaskTracker on all nodes
3 instances of ZooKeeper
1 instance of the CLDB, JobTracker, NFS, and WebServer

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

In the Navigation pane, click Volumes in the MapR-FS group.
Click the New Volume button to display the New Volume dialog.
For the Volume Type, select Standard Volume.
Type the name MyVolume in the Volume Name field.
Type the path /myvolume in the Mount Path field.
Scroll to the bottom and click OK to create the volume.
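If you prefer the command line to the dialog, a volume with the same name and mount path can likely be created with maprcli on a cluster node. This is a sketch based on the maprcli volume create command, not a step from this tutorial:

/opt/mapr/bin/maprcli volume create -name MyVolume -path /myvolume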

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network, and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

Change to the root user (or use sudo for the following commands).
See what is exported from the machine where you installed MapR:

showmount -e <host>

Set up a mount point for the NFS share:

mkdir /mapr

Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.


To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 3 items
drwxrwxrwx  - root root  0 2011-11-22 12:44 /myvolume
drwxr-xr-x  - mapr mapr  0 2011-01-03 13:50 /tmp
drwxr-xr-x  - mapr mapr  0 2011-01-04 13:57 /user
drwxr-xr-x  - root root  0 2010-11-25 09:41 /var

Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

Open a terminal (select Applications > Accessories > Terminal).
Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


M5 - RHEL or CentOS

Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster, see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M5 Trial license after installing the MapR software.

Setup

Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username
<node 1>, <node 2>, <node 3>, ... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run any other service (see Isolating CLDB Nodes).

Deployment

Change to the root user (or use sudo for the following commands).
On all nodes, create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.2.10/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:


http_proxy=http://<host>:<port>
export http_proxy

On node 1, execute the following command:

yum install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker mapr-webserver

On nodes 2 and 3, execute the following command:

yum install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker

On all other nodes (nodes 4...n), execute the following commands:

yum install mapr-fileserver mapr-nfs mapr-tasktracker

On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1>,<node 2>,<node 3> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc
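To confirm that the permission took effect, you can display the cluster ACL. This check is a sketch, assuming the acl show command from the same maprcli toolset:

/opt/mapr/bin/maprcli acl show -type cluster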

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.
On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M5 Trial and click Register.


When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.
On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

Log in to the MapR Control System.
Under the Cluster group in the left pane, click Dashboard.
Check the Services pane and make sure each service is running the correct number of instances:

Instances of the FileServer, NFS, and TaskTracker on all nodes
3 instances of the CLDB
1 of 3 instances of the JobTracker
1 instance of the WebServer

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

In the Navigation pane, click Volumes in the MapR-FS group.
Click the New Volume button to display the New Volume dialog.
For the Volume Type, select Standard Volume.
Type the name MyVolume in the Volume Name field.
Type the path /myvolume in the Mount Path field.
Select /default-rack in the Topology field.
Scroll to the bottom and click OK to create the volume.

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network, and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

Change to the root user (or use sudo for the following commands).
See what is exported from the machine where you installed MapR:

showmount -e <host>

Set up a mount point for the NFS share:

mkdir /mapr

Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.
To see the cluster, list the /mapr directory:


# ls /mapr
my.cluster.com

List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 3 items
drwxrwxrwx  - root root  0 2011-11-22 12:44 /myvolume
drwxr-xr-x  - mapr mapr  0 2011-01-03 13:50 /tmp
drwxr-xr-x  - mapr mapr  0 2011-01-04 13:57 /user
drwxr-xr-x  - root root  0 2010-11-25 09:41 /var

Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

Open a terminal (select Applications > Accessories > Terminal).
Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


M5 - Ubuntu

Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster, see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M5 Trial license after installing the MapR software.

Setup

Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit Ubuntu 9.04 or greater
RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username
<node 1>, <node 2>, <node 3>, ... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run any other service (see Isolating CLDB Nodes).

Deployment

Change to the root user (or use sudo for the following commands).
On all nodes, add the following line to /etc/apt/sources.list:

deb http://package.mapr.com/releases/v1.2.10/ubuntu/ mapr optional

To install a previous release, see the Release Notes for the correct path to use.
If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

On all nodes, run the following command:

apt-get update

If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:


Acquire {
  Retries "0";
  HTTP {
    Proxy "http://<proxy user>:<proxy password>@<host>:<port>";
  };
};

On node 1, execute the following command:

apt-get install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker mapr-webserver

On nodes 2 and 3, execute the following command:

apt-get install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker

On all other nodes (nodes 4...n), execute the following commands:

apt-get install mapr-fileserver mapr-nfs mapr-tasktracker

On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1>,<node 2>,<node 3> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.
On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.


If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M5 Trial and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start
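To check that the NFS gateway actually came up, you can list the services running on node 1. This is a sketch, assuming the service list command from the same maprcli toolset:

/opt/mapr/bin/maprcli service list -node <node 1>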

On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

Log in to the MapR Control System.
Under the Cluster group in the left pane, click Dashboard.
Check the Services pane and make sure each service is running the correct number of instances:

Instances of the FileServer, NFS, and TaskTracker on all nodes
3 instances of the CLDB
1 of 3 instances of the JobTracker
1 instance of the WebServer

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

In the Navigation pane, click Volumes in the MapR-FS group.
Click the New Volume button to display the New Volume dialog.
For the Volume Type, select Standard Volume.
Type the name MyVolume in the Volume Name field.
Type the path /myvolume in the Mount Path field.
Select /default-rack in the Topology field.
Scroll to the bottom and click OK to create the volume.

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network, and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

Change to the root user (or use sudo for the following commands).
See what is exported from the machine where you installed MapR:

showmount -e <host>

Set up a mount point for the NFS share:

mkdir /mapr

Mount the cluster via NFS:

mount <host>:/mapr /mapr


Tips
If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.
To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com

List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 3 items
drwxrwxrwx  - root root  0 2011-11-22 12:44 /myvolume
drwxr-xr-x  - mapr mapr  0 2011-01-03 13:50 /tmp
drwxr-xr-x  - mapr mapr  0 2011-01-04 13:57 /user
drwxr-xr-x  - root root  0 2010-11-25 09:41 /var

Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

Open a terminal (select Applications > Accessories > Terminal).
Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


M3 - SUSE

Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster, see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M3 license after installing the MapR software.

Setup

Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit SUSE Linux Enterprise Server 11.x
RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username
<node 1>, <node 2>, <node 3>, ... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run any other service (see Isolating CLDB Nodes).

Deployment

Change to the root user (or use sudo for the following commands).
On all nodes, use the following command to add the MapR repository:

zypper ar http://package.mapr.com/releases/v1.2.10/redhat/ mapr

To install a previous release, see the Release Notes for the correct path.
If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy


On all nodes, update your system package index by running the following command:

zypper refresh

Execute the following command:

zypper install mapr-compat-suse

On node 1, execute the following command:

zypper install mapr-cldb mapr-fileserver mapr-jobtracker mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper

On nodes 2 and 3, execute the following command:

zypper install mapr-fileserver mapr-tasktracker mapr-zookeeper

On all other nodes (nodes 4...n), execute the following commands:

zypper install mapr-fileserver mapr-tasktracker

On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start
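To verify that ZooKeeper started on each of the three nodes, you can query its status. This is a sketch, assuming the init script supports a qstatus argument as in other MapR releases:

/etc/init.d/mapr-zookeeper qstatus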

On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.
On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M3 and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

On node 1, execute the following command:


/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

Log in to the MapR Control System.
Under the Cluster group in the left pane, click Dashboard.
Check the Services pane and make sure each service is running the correct number of instances:

Instances of the FileServer and TaskTracker on all nodes
3 instances of ZooKeeper
1 instance of the CLDB, JobTracker, NFS, and WebServer

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

In the Navigation pane, click Volumes in the MapR-FS group.
Click the New Volume button to display the New Volume dialog.
For the Volume Type, select Standard Volume.
Type the name MyVolume in the Volume Name field.
Type the path /myvolume in the Mount Path field.
Scroll to the bottom and click OK to create the volume.

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network, and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

Change to the root user (or use sudo for the following commands).
See what is exported from the machine where you installed MapR:

showmount -e <host>

Set up a mount point for the NFS share:

mkdir /mapr

Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.
To see the cluster, list the /mapr directory:

# ls /mapr
my.cluster.com


List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 3 items
drwxrwxrwx  - root root  0 2011-11-22 12:44 /myvolume
drwxr-xr-x  - mapr mapr  0 2011-01-03 13:50 /tmp
drwxr-xr-x  - mapr mapr  0 2011-01-04 13:57 /user
drwxr-xr-x  - root root  0 2010-11-25 09:41 /var

Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

Open a terminal (select Applications > Accessories > Terminal).
Copy a couple of text files into the cluster, either using the file browser or the command line. Create the directory /myvolume/in and put the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


M5 - SUSE

Use the following steps to install a simple MapR cluster of up to 100 nodes with a basic set of services. To build a larger cluster, or to build a cluster that includes additional services (such as Hive, Pig, Flume, or Oozie), see the Installation Guide. To add services to nodes on a running cluster, see Reconfiguring a Node. To get the most out of this tutorial, and to enable NFS, be sure to register for your free M5 Trial license after installing the MapR software.

Setup

Follow these instructions to install a small MapR cluster (3-100 nodes) on machines that meet the following requirements:

64-bit SUSE Linux Enterprise Server 11.x
RAM: 4 GB or more
At least one free unmounted drive or partition, 50 GB or more
At least 10 GB of free space on the operating system partition
Sun Java JDK version 1.6.0_24 (not JRE)
The root password, or sudo privileges
A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username
<node 1>, <node 2>, <node 3>, ... - the IP addresses of nodes 1, 2, 3 ...
<proxy user>, <proxy password>, <host>, <port> - proxy server credentials and settings

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run any other service (see Isolating CLDB Nodes).

Deployment

Change to the root user (or use sudo for the following commands).
On all nodes, use the following command to add the MapR repository:

zypper ar http://package.mapr.com/releases/v1.2.10/redhat/ mapr

To install a previous release, see the Release Notes for the correct path.
If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy


On all nodes, update your system package index by running the following command:

zypper refresh

Execute the following command:

zypper install mapr-compat-suse

On node 1, execute the following command:

zypper install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker mapr-webserver

On nodes 2 and 3, execute the following command:

zypper install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker

On all other nodes (nodes 4...n), execute the following commands:

zypper install mapr-fileserver mapr-nfs mapr-tasktracker

On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1>,<node 2>,<node 3> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see "WARDEN running as process <process>. Stop it" it means the warden is already running. This can happen, for example, when you reboot the machine. Use /etc/init.d/mapr-warden stop to stop it, then start it again.
On node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.
On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
Log in to the MapR Control System as the administrative user you designated earlier.
In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
Click Add Licenses via Web.
If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
On the registration page, create an account and log in.
On the Register Cluster page, choose M5 Trial and click Register.
When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

On node 1, execute the following command:


/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

Log in to the MapR Control System.
Under the Cluster group in the left pane, click Dashboard.
Check the Services pane and make sure each service is running the correct number of instances:

Instances of the FileServer, NFS, and TaskTracker on all nodes
3 instances of the CLDB
1 of 3 instances of the JobTracker
1 instance of the WebServer

Next Steps

Start Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

In the Navigation pane, click Volumes in the MapR-FS group.
Click the New Volume button to display the New Volume dialog.
For the Volume Type, select Standard Volume.
Type the name MyVolume in the Volume Name field.
Type the path /myvolume in the Mount Path field.
Select /default-rack in the Topology field.
Scroll to the bottom and click OK to create the volume.
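As a command-line alternative to the dialog, the same volume, including the topology setting, can likely be created with maprcli. This sketch assumes the -topology parameter of the maprcli volume create command; it is not a step from this tutorial:

/opt/mapr/bin/maprcli volume create -name MyVolume -path /myvolume -topology /default-rack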

Mount the Cluster via NFS

With MapR, you can export and mount the Hadoop cluster as a read/write volume via NFS from the machine where you installed MapR, or from a different machine.

If you are mounting from the machine where you installed MapR, replace <host> in the steps below with localhost.
If you are mounting from a different machine, make sure the machine where you installed MapR is reachable over the network, and replace <host> in the steps below with the hostname of the machine where you installed MapR.

Try the following steps to see how it works:

Change to the root user (or use sudo for the following commands).
See what is exported from the machine where you installed MapR:

showmount -e <host>

Set up a mount point for the NFS share:

mkdir /mapr

Mount the cluster via NFS:

mount <host>:/mapr /mapr

Tips
If you get an error such as RPC: Program not registered, it likely means that the NFS service is not running. See Setting Up NFS for tips.
To see the cluster, list the /mapr directory:


# ls /mapr
my.cluster.com

List the cluster itself, and notice that the volume you created is there:

# ls -l /mapr/my.cluster.com
Found 3 items
drwxrwxrwx  - root root  0 2011-11-22 12:44 /myvolume
drwxr-xr-x  - mapr mapr  0 2011-01-03 13:50 /tmp
drwxr-xr-x  - mapr mapr  0 2011-01-04 13:57 /user
drwxr-xr-x  - root root  0 2010-11-25 09:41 /var

Try creating a directory in your new volume via NFS:

mkdir /mapr/my.cluster.com/myvolume/foo

List the contents of /myvolume:

hadoop fs -ls /myvolume

Notice that Hadoop can see the directory you just created with NFS. Try navigating to the cluster using the computer's file browser --- you can drag and drop files directly to your new volume, and see them immediately in Hadoop!

If you are already running an NFS server, MapR will not run its own NFS gateway. In that case, you will not be able to mount the cluster via NFS, but your previous NFS exports will remain available.

Try MapReduce

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

1. Open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster, either using the file browser or the command line. Create the /myvolume/in directory and put the files there. Example:

mkdir /mapr/my.cluster.com/myvolume/in
cp <some files> /mapr/my.cluster.com/myvolume/in

3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.
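To read the results without leaving the terminal, you can print the first lines of the output file. Both commands below are our addition; the second assumes the NFS mount from the previous section is still in place:

hadoop fs -cat /myvolume/out/part-r-00000 | head
head /mapr/my.cluster.com/myvolume/out/part-r-00000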

Next Steps

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


Quick Start - MapR Virtual Machine

The MapR Virtual Machine is a fully-functional single-node Hadoop cluster capable of running MapReduce programs and working with applications like Hive, Pig, and HBase. You can try the MapR Virtual Machine on nearly any 64-bit computer by downloading the free VMware Player.

The MapR Virtual Machine desktop contains the following icons:

MapR Control System - navigates to the graphical control system for managing the cluster
MapR User Guide - navigates to the MapR online documentation
MapR NFS - navigates to the NFS-mounted cluster storage layer

Ready for a tour? The following documents will help you get started:

Installing the MapR Virtual Machine
A Tour of the MapR Virtual Machine
Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase


Installing the MapR Virtual Machine

The MapR Virtual Machine runs on VMware Player, a free desktop application that lets you run a virtual machine on a Windows or Linux PC. You can download VMware Player from the VMware web site. To install the VMware Player, see the VMware documentation.

For Linux and Windows, download the free VMware Player
For Mac, purchase VMware Fusion

Use of VMware Player is subject to the VMware Player end user license terms, and VMware provides no support for VMware Player. For self-help resources, see the VMware Player FAQ.

Requirements

The MapR Virtual Machine requires at least 20 GB free hard disk space and 2 GB of RAM on the host system. You will see higher performance with more RAM and more free hard disk space.

To run the MapR Virtual Machine, the host system must have one of the following 64-bit x86 architectures:

A 1.3 GHz or faster AMD CPU with segment-limit support in long mode
A 1.3 GHz or faster Intel CPU with VT-x support

If you have an Intel CPU with VT-x support, you must verify that VT-x support is enabled in the host system BIOS. The BIOS settings that must be enabled for VT-x support vary depending on the system vendor. See the VMware knowledge base article at http://kb.vmware.com/kb/1003944 for information about how to determine if VT-x support is enabled.

Installing and Running the MapR Virtual Machine

1. Choose whether to install the M3 Edition or the M5 Edition, and download the corresponding archive file:
   - M3 Edition - http://package.mapr.com/releases/v1.2.7/vmdemo/MapR-VM-1.2.7.14133GA-1-m3.tar.bzip2
   - M5 Edition - http://package.mapr.com/releases/v1.2.7/vmdemo/MapR-VM-1.2.7.14133GA-1-m5.tar.bzip2

2. Use the tar command to extract the archive to your home directory or another directory of your choosing:

tar -xvf <tar.bzip2 file>

3. Run the VMware Player.

4. Click Open a Virtual Machine, navigate to the directory into which you extracted the archive, then open the MapR-VM.vmx virtual machine.

Tips

If you are running VMware Fusion, make sure to select Open or Open and Run instead of creating a new virtual machine.
To log on to the MapR Control System, use the username mapr and the password mapr (all lowercase).
Once the virtual machine is fully started, you can proceed with the tour.


A Tour of the MapR Virtual Machine

In this tutorial, you'll get familiar with the MapR Control System dashboard, learn how to get data into the cluster (and organized), and run some MapReduce jobs on Hadoop. You can read the following sections in order or browse them as you explore on your own:

The Dashboard
Working with Volumes
Exploring NFS
Running a MapReduce Job

Once you feel comfortable working with the MapR Virtual Machine, you can move on to more advanced topics:

Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

The Dashboard

The dashboard, the main screen in the MapR Control System, shows the health of the cluster at a glance. To get to the dashboard, click the MapR Control System link on the desktop of the MapR Virtual Machine and log on with the username root and the password mapr. If it is your first time using the MapR Control System, you will need to accept the terms of the license agreement to proceed.

Parts of the dashboard:

- To the left, the navigation pane lets you navigate to other views that display more detailed information about nodes in the cluster, volumes in the MapR Storage Services layer, NFS settings, Alarms, and System Settings.
- In the center, the main dashboard view displays the nodes in a "heat map" that uses color to indicate node health--since there is only one node in the MapR Virtual Machine cluster, there is a single green square.
- To the right, information about cluster usage is displayed.

Try clicking the Health button at the top right of the heat map. You will see a number of different kinds of information that can be displayed in the heat map. Try clicking the green square representing the node. You will see more detailed information about the status of the node.

By the way, the browser is pre-configured with the following bookmarks, which you will find useful as you gain experience with Hadoop, MapReduce, and the MapR Control System:

MapR Control System
JobTracker Status
TaskTracker Status
HBase Master
CLDB Status

Don't worry if you aren't sure what those are yet.


Working with Volumes

MapR provides volumes as a way to organize data into groups, so that you can manage your data and apply policy all at once instead of file by file. Think of a volume as being similar to a huge hard drive---it can be mounted or unmounted, belong to a specific department or user, and have permissions set as a whole or on any directory or file within. In this section, you will create a volume that you can use for later parts of the tutorial.

Create a volume:

1. In the Navigation pane, click Volumes in the MapR-FS group.
2. Click the New Volume button to display the New Volume dialog.
3. For the Volume Type, select Standard Volume.
4. Type the name MyVolume in the Volume Name field.
5. Type the path /myvolume in the Mount Path field.
6. Select /default-rack in the Topology field.
7. Scroll to the bottom and click OK to create the volume.

Notice that the mount path and the volume name do not have to match. The volume name is a permanent identifier for the volume, and the mount path determines the file path by which the volume is accessed. The topology determines the racks available for storing the volume and its replicas.

In the next step, you'll see how to get data into the cluster with NFS.

Exploring NFS

With MapR, you can mount the cluster via NFS, and browse it as if it were a filesystem. Try clicking the MapR NFS icon on the MapR Virtual Machine desktop.


When you navigate to mapr > my.cluster.com, you can see the myvolume volume that you created in the previous example.

Try copying some files to the volume; a good place to start is the files constitution.txt and sample-table.txt, which are attached to this page. Both are text files, which will be useful when running the Word Count example later.

1. To download them, select Tools > Attachments from the menu at the top right of this document (the one you are reading now), then click the links for those two files.
2. Once they are downloaded, you can add them to the cluster.
3. Since you'll be using them as input to MapReduce jobs in a few minutes, create a directory called in in the myvolume volume and drag the files there.
4. If you do not have a volume mounted at myvolume on the cluster, use the instructions in Working with Volumes above to create it.

By the way, if you want to verify that you are really copying the files into the Hadoop cluster, you can open a terminal on the MapR Virtual Machine (select Applications > Accessories > Terminal) and type hadoop fs -ls /myvolume/in to see that the files are there.

The Terminal

When you run MapReduce jobs, and when you use Hive, Pig, or HBase, you'll be working with the Linux terminal. Open a terminal window by selecting Applications > Accessories > Terminal.

Running a MapReduce Job

In this section, you will run the well-known Word Count MapReduce example. You'll need one or more text files (like the ones you copied to the cluster in the previous section). The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory. For this exercise we will use /myvolume/in for the input, and /myvolume/out for the output. The input directory must exist and must contain the input files before running the job; the output directory must not exist, as the Word Count example creates it.

Try MapReduce

1. On the MapR Virtual Machine, open a terminal (select Applications > Accessories > Terminal).
2. Copy a couple of text files into the cluster. If you are not sure how, see the previous section. Create the /myvolume/in directory and put the files there.
3. Type the following line to run the Word Count job:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out

4. Look in the newly-created /myvolume/out for a file called part-r-00000 containing the results.

That's it! If you're ready, you can try some more advanced exercises:


Working with Snapshots, Mirrors, and Schedules
Getting Started with Hive
Getting Started with Pig
Getting Started with HBase

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


Working with Snapshots, Mirrors, and Schedules

Snapshots, mirrors, and schedules help you protect your data from user error, make backup copies, and in larger clusters provide load balancing for highly-accessed data. These features are available under the M5 license.

If you are working with an M5 virtual machine, you can use this section to get acquainted with snapshots, mirrors, and schedules.
If you are working with the M3 virtual machine, you should proceed to the sections about Getting Started with Hive, Getting Started with Pig, and Getting Started with HBase.

Taking Snapshots

A snapshot is a point-in-time image of a volume that protects data against user error. Although other strategies such as replication and mirroring provide good protection, they cannot protect against accidental file deletion or corruption. You can create a snapshot of a volume manually before embarking on risky jobs or operations, or set a snapshot schedule on the volume to ensure that you can always roll back to specific points in time.

Try creating a snapshot manually:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the MyVolume volume (which you created during the previous tutorial).
3. Expand the MapR Virtual Machine window or scroll the browser to the right until the New Snapshot button is visible.
4. Click New Snapshot to display the Snapshot Name dialog.
5. Type a name for the new snapshot in the Name field.
6. Click OK to create the snapshot.

Try scheduling snapshots:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the MyVolume volume name (which you created during the previous tutorial), or by selecting the checkbox beside MyVolume and clicking the Properties button.
3. In the Replication and Snapshot Scheduling section, choose a schedule from the Snapshot Schedule dropdown menu.
4. Click Modify Volume to save changes to the volume.

Viewing Snapshot Contents

All the snapshots of a volume are available in a directory called .snapshot at the volume's top level. For example, the snapshots of the volume MyVolume, which is mounted at /myvolume, are available in the /myvolume/.snapshot directory. You can view the snapshots using the hadoop fs -ls command or via NFS. If you list the contents of the top-level directory in the volume, you will not see .snapshot — but it's there.


- To view the snapshots for /myvolume on the command line, type hadoop fs -ls /myvolume/.snapshot
- To view the snapshots for /myvolume in the file browser via NFS, navigate to /myvolume and use CTRL-L to specify an explicit path, then add .snapshot to the end.
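For example, a command-line session might look like the following; this transcript is our illustration, and the snapshot name mysnapshot is hypothetical (you will see whatever names you or the schedule created):

hadoop fs -ls /myvolume/.snapshot
Found 1 items
drwxrwxrwx   - root root  0 2011-11-22 12:50 /myvolume/.snapshot/mysnapshot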

Creating Mirrors

A mirror is a full read-only copy of a volume, which you can use for backups, data transfer to another cluster, or load balancing. A mirror is itself a type of volume; after you create a mirror volume, you can sync it with its source volume manually or set a schedule for automatic sync.

Try creating a mirror volume:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. Select the Local Mirror Volume radio button at the top of the dialog.
4. Type my-mirror in the Mirror Name field.
5. Type MyVolume in the Source Volume Name field.
6. Type /my-mirror in the Mount Path field.
7. To schedule mirror sync, select a schedule from the Mirror Update Schedule dropdown menu.
8. Click OK to create the volume.

You can also sync a mirror manually; it works just like taking a manual snapshot. View the list of volumes, select the checkbox next to a mirror volume, and click Start Mirroring.


Working with Schedules

The MapR Virtual Machine comes pre-loaded with a few schedules, but you can create your own as well. Once you have created a schedule, you can use it for snapshots and mirrors on any volume. Each schedule contains one or more rules that determine when to trigger a snapshot or a mirror sync, and how long to keep snapshot data resulting from the rule.

Try creating a schedule:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. Type My Schedule in the Schedule Name field.
4. Define a schedule rule in the Schedule Rules section:
   a. From the first dropdown menu, select Every 5 min.
   b. Use the Retain For field to specify how long the data is to be preserved. Type 1 in the box, and select hour(s) from the dropdown menu.
5. Click Save Schedule to create the schedule.

You can use the schedule "My Schedule" to perform a snapshot or mirror operation automatically every 5 minutes. If you use "My Schedule" to automate snapshots, they will be preserved for one hour; at one snapshot every 5 minutes, that works out to 60 / 5 = 12 snapshots of the volume, on average.

Next Steps

If you haven't already, try the following tutorials:

Getting Started with Hive
Getting Started with Pig
Getting Started with HBase


Getting Started with HBase

HBase is the Hadoop database, designed to provide random, realtime read/write access to very large tables — billions of rows and millions of columns — on clusters of commodity hardware. HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. (For more information about HBase, see the HBase project page.)

We'll be working with HBase from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit.

HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. When creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column family, however, can contain a very large number of columns. Columns are named using the format family:qualifier.

Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns you need for your data on a per-row basis.

Create a table in HBase:

1. Start the HBase shell by typing the following command:

/opt/mapr/hbase/hbase-0.90.4/bin/hbase shell

2. Create a table called weblog with one column family named stats:

create 'weblog', 'stats'

3. Verify the table creation by listing everything:

list

4. Add a test value to the daily column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:daily', 'test-daily-value'

5. Add a test value to the weekly column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value'

6. Add a test value to the weekly column in the stats column family for row 2:

put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value'

7. Type scan 'weblog' to display the contents of the table. Sample output:

ROW                 COLUMN+CELL
 row1               column=stats:daily, timestamp=1321296699190, value=test-daily-value
 row1               column=stats:weekly, timestamp=1321296715892, value=test-weekly-value
 row2               column=stats:weekly, timestamp=1321296787444, value=test-weekly-value
2 row(s) in 0.0440 seconds

8. Type get 'weblog', 'row1' to display the contents of row 1. Sample output:


COLUMN              CELL
 stats:daily        timestamp=1321296699190, value=test-daily-value
 stats:weekly       timestamp=1321296715892, value=test-weekly-value
2 row(s) in 0.0330 seconds

9. Type disable 'weblog' to disable the table.
10. Type drop 'weblog' to drop the table and delete all data.
11. Type exit to exit the HBase shell.


Getting Started with Hive


Hive is a data warehouse system for Hadoop that uses a SQL-like language called HiveQL to query structured data stored in a distributed filesystem. (For more information about Hive, see the Apache Hive project page.)

You'll be working with Hive from the Linux shell. To use Hive, open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, you'll create a Hive table, load data from a tab-delimited text file, and run a couple of basic queries against the table.

First, make sure you have downloaded the sample table: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click on sample-table.txt, select Save Link As... from the pop-up menu, select a directory to save to, then click OK. If you're working on the MapR Virtual Machine, we'll be loading the file from the MapR Virtual Machine's local file system (not the cluster storage layer), so save the file in the MapR Home directory (for example, /home/mapr).

Take a look at the source data

First, take a look at the contents of the file using the terminal:

1. Make sure you are in the Home directory where you saved sample-table.txt (type cd ~ if you are not sure).
2. Type cat sample-table.txt to display the following output.

mapr@mapr-desktop:~$ cat sample-table.txt
1320352532  1001  http://www.mapr.com/doc  http://www.mapr.com      192.168.10.1
1320352533  1002  http://www.mapr.com      http://www.example.com   192.168.10.10
1320352546  1001  http://www.mapr.com      http://www.mapr.com/doc  192.168.10.1

Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file represents a web log.

Create a table in Hive and load the source data:

1. Type the following command to start the Hive shell, using tab-completion to expand the <version>:

/opt/mapr/hive/hive-<version>/bin/hive

2. At the hive> prompt, type the following command to create the table:

CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

3. Type the following command to load the data from sample-table.txt into the table:

LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log;

Run basic queries against the table:

Try the simplest query, one that displays all the data in the table:

SELECT web_log.* FROM web_log;

This query would be inadvisable with a large table, but with the small sample table it returns very quickly.

Try a simple SELECT to extract only data that matches a desired string:


SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';

This query launches a MapReduce job to filter the data.
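As one more exercise, a simple aggregation also launches a MapReduce job. This query is our addition to the tutorial; it counts page views per user in the sample data, and with the three-row sample file you should see a count of 2 for user 1001 and 1 for user 1002:

SELECT web_log.userid, COUNT(1) FROM web_log GROUP BY web_log.userid;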


Getting Started with Pig

Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. (For more information about Pig, see the Pig project page.)

You'll be working with Pig from the Linux shell. Open a terminal by selecting Applications > Accessories > Terminal (see A Tour of the MapR Virtual Machine).

Note: Although this tutorial was originally designed for users of the MapR Virtual Machine, you can easily adapt these instructions for a node in a cluster, for example by using a different directory structure.

In this tutorial, we'll use Pig to run a MapReduce job that counts the words in the file /myvolume/in/constitution.txt on the cluster, and store the results in the directory /myvolume/wordcount.

1. First, make sure you have downloaded the file: On the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click constitution.txt to save it.
2. Make sure the file is loaded onto the cluster, in the directory /myvolume/in. If you are not sure how, look at the NFS tutorial on A Tour of the MapR Virtual Machine.

Open a Pig shell and get started:

1. In the terminal, type the command pig to start the Pig shell.
2. At the grunt> prompt, type the following lines (press ENTER after each):

A = LOAD '/myvolume/in' USING TextLoader() AS (words:chararray);

B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));

C = GROUP B BY $0;

D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO '/myvolume/wordcount';

After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt.

When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of the directory /myvolume/wordcount to see the results.
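If you want to inspect the results from the same terminal, you can list and print the output files. This sketch is our addition, and the exact part-file names vary with the Pig version:

hadoop fs -ls /myvolume/wordcount
hadoop fs -cat /myvolume/wordcount/part-*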


MapR 2.0 Beta

Welcome to the MapR 2.0 Beta! This new release includes the following features:

MapR Metrics — Graphical display of job and task statistics
Central configuration — configure nodes from files stored on the cluster
Label-based scheduling — specify which jobs run on which nodes
SELinux support — keep your cluster secure
Central logging — easily diagnose the cluster

The easiest way to install the Beta is to follow the steps below. If you are installing on a moderate-sized cluster (3 to 100 nodes), simply perform the following steps:

1. PREPARATION — Make sure your nodes meet the Requirements
2. ADDING THE REPOSITORY — Add the correct MapR repository for your operating system
3. INSTALLATION —
   - Install either the M3 or M5 version of MapR
   - Install MapR Metrics (requires M5)

Preparation

Before installing the MapR 2.0 Beta, make sure your nodes meet the following requirements:

- Operating system:
  - 64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater
  - 64-bit Ubuntu 9.04 or greater

- RAM: 4 GB or more
- At least one free unmounted drive or partition, 50 GB or more
- At least 10 GB of free space on the operating system partition
- Sun Java JDK version 1.6.0_24 (not JRE)
- The root password, or sudo privileges
- A Linux user chosen to have administrative privileges on the cluster

Make sure the user has a password (using sudo passwd <user>, for example).

Each node must have a unique hostname, and keyless SSH set up to all other nodes.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

Adding the MapR Repository

The first step in deployment is to add the MapR repository. Follow the appropriate instructions for your operating system:

To add the MapR repository on Red Hat Enterprise Linux (RHEL) or CentOS:

1. Change to the root user (or use sudo for the following commands).
2. On all nodes, create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v2.0.0-beta/redhat/


enabled=1
gpgcheck=0
protect=1

To add the MapR repository on Ubuntu:

1. Change to the root user (or use sudo for the following commands).
2. On all nodes, add the following line to /etc/apt/sources.list:

deb http://package.mapr.com/releases/v2.0.0-beta/ubuntu/ mapr optional

3. On all nodes, run the following command:

apt-get update
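To confirm that the repository is now visible to the package manager, you can search for MapR packages; these checks are our addition:

apt-cache search mapr        (Ubuntu)
yum search mapr              (RHEL/CentOS)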

Installation

For the steps that follow, make the following substitutions:

<user> - the chosen administrative username
<node 1>, <node 2>, <node 3>, ... - the IP addresses of nodes 1, 2, 3 ...

If you are installing a MapR cluster on nodes that are not connected to the Internet, contact MapR for assistance. If you are installing a cluster larger than 100 nodes, see the Installation Guide. In particular, CLDB nodes on large clusters should not run any other service (see Isolating CLDB Nodes).

M3 Installation

1. Change to the root user (or use sudo for the following commands).
2. On node 1, make sure the ajaxterm package is installed:

which ajaxterm

If it is not installed, install it:

RHEL/CentOS:

yum install ajaxterm

Ubuntu:

apt-get install ajaxterm

3. On node 1, execute the following command:
RHEL/CentOS:

yum install mapr-cldb mapr-fileserver mapr-jobtracker mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper

Ubuntu:

apt-get install mapr-cldb mapr-fileserver mapr-jobtracker mapr-nfs mapr-tasktracker mapr-webserver mapr-zookeeper

4. On nodes 2 and 3, execute the following command:
RHEL/CentOS:

yum install mapr-fileserver mapr-tasktracker mapr-zookeeper

Ubuntu:


apt-get install mapr-fileserver mapr-tasktracker mapr-zookeeper

5. On all other nodes (nodes 4...n), execute the following commands:
RHEL/CentOS:

yum install mapr-fileserver mapr-tasktracker

Ubuntu:

apt-get install mapr-fileserver mapr-tasktracker

6. On all nodes, execute the following commands:

/opt/mapr/server/configure.sh -C <node 1> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt
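As a concrete illustration (the IP addresses here are hypothetical), on a cluster whose first three nodes are 10.10.50.1 through 10.10.50.3 the commands would be:

/opt/mapr/server/configure.sh -C 10.10.50.1 -Z 10.10.50.1,10.10.50.2,10.10.50.3
/opt/mapr/server/disksetup -F /tmp/disks.txt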

7. On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

8. On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see " " it means the warden is already running. This can happen, forWARDEN running as process <process>. Stop itexample, when you reboot the machine. Use to stop it, then start it again./etc/init.d/mapr-warden stopOn node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc
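For example, if the administrative user you chose is jsmith (a hypothetical username), the command would be:

/opt/mapr/bin/maprcli acl edit -type cluster -user jsmith:fc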

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.

10. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

a. In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
b. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
c. Log in to the MapR Control System as the administrative user you designated earlier.
d. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
e. Click Add Licenses via Web.
f. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

- If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
- On the registration page, create an account and log in.
- On the Register Cluster page, choose M3 and click Register.
- When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

11. On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

12. On all other nodes (nodes 2...n), execute the following command:


/etc/init.d/mapr-warden start

13. Log in to the MapR Control System.
14. Under the Cluster group in the left pane, click Dashboard.
15. Check the Services pane and make sure each service is running the correct number of instances:

- Instances of the FileServer and TaskTracker on all nodes
- 3 instances of ZooKeeper
- 1 instance of the CLDB, JobTracker, NFS, and WebServer
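You can also check service placement from the command line. This command is our addition and assumes the maprcli node list command supports the service column in this release; consult the maprcli reference if the output differs:

/opt/mapr/bin/maprcli node list -columns service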

M5 Installation

1. Change to the root user (or use sudo for the following commands).
2. On node 1, make sure the ajaxterm package is installed:

which ajaxterm

If it is not installed, install it:

RHEL/CentOS:

yum install ajaxterm

Ubuntu:

apt-get install ajaxterm

3. On node 1, execute the following command:
RHEL/CentOS:

yum install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker mapr-webserver

Ubuntu:

apt-get install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker mapr-webserver

4. On nodes 2 and 3, execute the following command:
RHEL/CentOS:

yum install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker

Ubuntu:

apt-get install mapr-cldb mapr-jobtracker mapr-nfs mapr-zookeeper mapr-tasktracker

5. On all other nodes (nodes 4...n), execute the following commands:
RHEL/CentOS:

yum install mapr-fileserver mapr-nfs mapr-tasktracker

Ubuntu:

apt-get install mapr-fileserver mapr-nfs mapr-tasktracker

6. On all nodes, execute the following commands:


/opt/mapr/server/configure.sh -C <node 1>,<node 2>,<node 3> -Z <node 1>,<node 2>,<node 3>
/opt/mapr/server/disksetup -F /tmp/disks.txt

7. On nodes 1, 2, and 3, execute the following command:

/etc/init.d/mapr-zookeeper start

8. On node 1, execute the following command:

/etc/init.d/mapr-warden start

Tips

If you see " " it means the warden is already running. This can happen, forWARDEN running as process <process>. Stop itexample, when you reboot the machine. Use to stop it, then start it again./etc/init.d/mapr-warden stopOn node 1, give full permission to the chosen administrative user using the following command:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

Tips

The Warden can take a few minutes to start. If you see the error "Couldn't connect to the CLDB service," wait a few minutes and try again.

10. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:

a. In a browser, view the MapR Control System by navigating to the node that is running the WebServer: https://<node 1>:8443. Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
b. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
c. Log in to the MapR Control System as the administrative user you designated earlier.
d. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
e. Click Add Licenses via Web.
f. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

- If the cluster is not yet registered, the message "Cluster not found" appears and the browser is redirected to a registration page.
- On the registration page, create an account and log in.
- On the Register Cluster page, choose M5 Trial and click Register.
- When the message "Cluster Registered" appears, click Return to your MapR Cluster UI.

11. On node 1, execute the following command:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

12. On all other nodes (nodes 2...n), execute the following command:

/etc/init.d/mapr-warden start

13. Log in to the MapR Control System.
14. Under the Cluster group in the left pane, click Dashboard.
15. Check the Services pane and make sure each service is running the correct number of instances:

- Instances of the FileServer, NFS, and TaskTracker on all nodes
- 3 instances of the CLDB
- 1 of 3 instances of the JobTracker
- 1 instance of the WebServer

Installing MapR Metrics

MapR Metrics provides statistical information about jobs, tasks, and task attempts in easy-to-read graphical form.


Prerequisites

- MySQL Server — MapR Metrics requires a MySQL server to store statistical data about jobs and tasks in the cluster. The MySQL server can be on a cluster node or on a separate machine.
- EPEL Repository — Extra Packages for Enterprise Linux (EPEL) provides components that MapR Metrics needs (CentOS and Red Hat only).
- M5 License — to get the most out of MapR Metrics, you'll need an M5 license. With an M3 license, you don't get the charts or histograms.

To enable the EPEL repository on CentOS or Red Hat 5.x:

1. Download the EPEL repository:

wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

2. Install the EPEL repository:

rpm -Uvh epel-release-5*.rpm

To enable the EPEL repository on CentOS or Red Hat 6.x:

1. Download the EPEL repository:

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-7.noarch.rpm

2. Install the EPEL repository:

rpm -Uvh epel-release-6*.rpm

Installation

To Install MapR Metrics:

1. Start with a functioning MapR cluster. Note which nodes run the JobTracker and WebServer.
2. To get the most out of MapR Metrics, apply an M5 license to the cluster.
3. Install MySQL Server, either on a cluster node or on a separate system (or use an existing MySQL server).
4. On all JobTracker nodes and WebServer nodes, install the mapr-metrics package, running the appropriate command as root or using sudo:
CentOS or Red Hat:

yum install mapr-metrics

Ubuntu:

apt-get install mapr-metrics

5. Restart the warden:

/etc/init.d/mapr-warden restart

6. Log on to the MapR Control System.
7. In the Navigation pane, click System Settings > Metrics to display the Configure Metrics Database dialog.

Image: Configure Metrics Database dialog


8. In the URL field, enter the hostname and port of the machine running the MySQL server.
9. In the Username and Password fields, enter the username and password of the MySQL user.
10. Log on to one of the nodes on which mapr-metrics is installed, go to the mysql> prompt, and source the script /opt/mapr/bin/setup.sql. Example:

mysql> source /opt/mapr/bin/setup.sql;
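If you are not already at a mysql> prompt, a typical way to get there (the host and credentials are placeholders) is:

mysql -h <mysql host> -u <mysql user> -p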


Our Partners

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere


Datameer

Datameer provides the world's first business intelligence platform built natively for Hadoop. Datameer delivers powerful, self-service analytics for the BI user through a simple spreadsheet UI, along with point-and-click data integration (ETL) and data visualization capabilities.

MapR provides a pre-packaged version of Datameer Analytics Solution ("DAS"). DAS is delivered as an RPM or Debian package.

- See How to setup DAS on MapR to add the DAS package to your MapR environment.
- Visit Demos for MapR to explore several demos included in the package to illustrate the usage of DAS in behavioral analytics and IT systems management use cases.
- Check out the library of video tutorials with step-by-step walk-throughs on how to use DAS, and demo videos showing various applications.

If you have questions about using DAS, please visit the DAS documentation. For information about Datameer, please visit www.datameer.com.


Karmasphere

Karmasphere provides software products for data analysts and data professionals so they can unlock the power of Big Data in Hadoop, opening a whole new world of possibilities to add value to the business. Karmasphere equips analysts with the ability to discover new patterns, relationships, and drivers in any kind of data – unstructured, semi-structured or structured - that were not possible to find before.

The Karmasphere Big Data Analytics product line supports the MapR distributions, M3 and M5 Editions, and includes:

- Karmasphere Analyst, which provides data analysts immediate entry to structured and unstructured data on Hadoop, through SQL and other familiar languages, so that they can make ad-hoc queries, interact with the results, and iterate – without the aid of IT.
- Karmasphere Studio, which provides developers that support analytic teams a graphical environment to analyze their MapReduce code as they develop custom analytic algorithms and systematize the creation of meaningful datasets for analysts.

To get started with Karmasphere Analyst or Studio:

- Request a 30-day trial of Karmasphere Analyst or Studio for MapR
- Learn more about Karmasphere Big Data Analytics products
- View videos about Karmasphere Big Data Analytics products
- Access technical resources
- Read documentation for Karmasphere products

If you have questions about Karmasphere please email [email protected] or visit www.karmasphere.com.


HParser

HParser is a data transformation (data handler) environment optimized for Hadoop. This easy-to-use, codeless parsing software enables processing of any file format inside Hadoop with scale and efficiency. It provides Hadoop developers with out-of-the-box Hadoop parsing capabilities to address the variety and complexity of data sources, including logs, industry standards, documents, and binary or hierarchical data.

MapR has partnered with Informatica to provide the Community Edition of HParser:

- The HParser package can be downloaded from Informatica as a Zip archive that includes the HParser engine, the Data Transformation HParser Jar file, HParser Studio, and the HParser Operator Guide.
- The HParser engine is also available as an RPM via the MapR repository, making it easier to install the HParser Engine on all nodes in the cluster.

HParser can be installed on a MapR cluster running CentOS or Red Hat Enterprise Linux.

To install HParser on a MapR cluster:

1. Register on the Informatica site.
2. Download the Zip file containing the Community Edition of HParser, and extract it.
3. Familiarize yourself with the installation procedure in the HParser Operator Guide.
4. On each node, install HParser Engine from the MapR repository by typing the following command as root or with sudo:
   yum install hparser-engine
5. Choose a Command Node, a node in the cluster from which you will issue HParser commands.
6. Following the instructions in the HParser Operator Guide, copy the HParser Jar file to the Command Node and create the HParser configuration file.


Installation Guide

Getting Started

If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the file MapRBuildVersion in the /opt/mapr directory. If the version is 1.0.0 and includes GA, then you must upgrade before applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.

To get started installing a basic cluster, take a look at the Quick Start guides:

M3 - RHEL or CentOS
M3 - SUSE
M3 - Ubuntu
M5 - RHEL or CentOS
M5 - SUSE
M5 - Ubuntu

To design and configure a cluster from the ground up, perform the following steps:

1. CHOOSE a cluster name, if desired.
2. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
3. PLAN which services to run on which nodes in the cluster.
4. INSTALL MapR Software:
   - On all nodes, ADD the MapR Repository.
   - On each node, INSTALL the planned MapR services.
   - On all nodes, RUN configure.sh.
   - On all nodes, FORMAT disks for use by MapR.
   - START the cluster.
   - SET UP node topology.
   - SET UP NFS for HA.
5. CONFIGURE the cluster:
   - SET UP the administrative user.
   - CHECK that the correct services are running.
   - SET UP authentication.
   - CONFIGURE cluster email settings.
   - CONFIGURE permissions.
   - SET user quotas.
   - CONFIGURE alarm notifications.

More Information

Once the cluster is up and running, you will find the following documents useful:

Component Setup - guides to third-party tool integration with MapR
Setting Up the Client - set up a laptop or desktop to work directly with a MapR cluster
Uninstalling MapR - completely remove MapR software
Cluster Upgrade - upgrade an entire cluster to the latest version of MapR software

MapR works with the leaders in the Hadoop ecosystem to provide the most powerful data analysis solutions. For more information about our partners, take a look at the following pages:

Datameer
HParser
Karmasphere

Architecture


MapR is a complete Hadoop distribution, implemented as a number of services running on individual nodes in a cluster. In a typical cluster, all or nearly all nodes are dedicated to data processing and storage, and a smaller number of nodes run other services that provide cluster coordination and management. The following table shows the services corresponding to roles in a MapR cluster.

CLDB: Maintains the container location database (CLDB) and the MapR Distributed NameNode™. The CLDB maintains the MapR FileServer storage (MapR-FS) and is aware of all the NFS and FileServer nodes in the cluster. The CLDB process coordinates data storage services among MapR FileServer nodes, MapR NFS Gateways, and MapR Clients.

FileServer: Runs the MapR FileServer (MapR-FS) and MapR Lockless Storage Services™.

HBaseMaster: HBase master (optional). Manages the region servers that make up HBase table storage.

HRegionServer: HBase region server (used with HBase master). Provides storage for an individual HBase region.

JobTracker: Hadoop JobTracker. The JobTracker coordinates the execution of MapReduce jobs by assigning tasks to TaskTracker nodes and monitoring their execution.

NFS: Provides read-write MapR Direct Access NFS™ access to the cluster, with full support for concurrent read and write access. With NFS running on multiple nodes, MapR can use virtual IP addresses to provide automatic transparent failover, ensuring high availability (HA).

TaskTracker: Hadoop TaskTracker. The process that starts and tracks MapReduce tasks on a node. The TaskTracker registers with the JobTracker to receive task assignments, and manages the execution of tasks on a node.

WebServer: Runs the MapR Control System and provides the MapR Heatmap™.

ZooKeeper: Enables high availability (HA) and fault tolerance for MapR clusters by providing coordination.

A process called the warden runs on all nodes to manage, monitor, and report on the other services on each node. The MapR cluster uses ZooKeeper to coordinate services. ZooKeeper runs on an odd number of nodes (at least three, and preferably five or more) and prevents service coordination conflicts by enforcing a rigid set of rules and conditions that determine which instance of each service is the master. The warden will not start any services unless ZooKeeper is reachable and more than half of the configured ZooKeeper nodes are live.

Hadoop Compatibility

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

For more information, see Hadoop Compatibility in Version 1.1.


Requirements

Before setting up a MapR cluster, ensure that every node satisfies the following hardware and software requirements, and consider which MapR license provides the features you need.

If you are setting up a large cluster, it is a good idea to use a configuration management tool such as Puppet or Chef, or a parallel ssh tool, to facilitate the installation of MapR packages across all the nodes in the cluster. The following sections provide details about the prerequisites for setting up the cluster.

Node Hardware

Minimum Requirements:
- 64-bit processor
- 4G DRAM
- 1 network interface
- At least one free unmounted drive or partition, 100 GB or more
- At least 10 GB of free space on the operating system partition
- Twice as much swap space as RAM (if this is not possible, see Memory Overcommit)

Recommended:
- 64-bit processor with 8-12 cores
- 32G DRAM or more
- 2 GigE network interfaces
- 3-12 disks of 1-3 TB each
- At least 20 GB of free space on the operating system partition
- 32 GB swap space or more (see also Memory Overcommit)

In practice, it is useful to have 12 or more disks per node, not only for greater total storage but also to provide a larger number of available storage pools. If you anticipate a lot of big reduces, you will need additional network bandwidth in relation to disk I/O speeds. MapR can detect multiple NICs with multiple IP addresses on each node and manage network throughput accordingly to maximize bandwidth. In general, the more network bandwidth you can provide, the faster jobs will run on the cluster. When designing a cluster for heavy CPU workloads, the processor on each node is more important than networking bandwidth and available disk space.

Disks

Set up at least three unmounted drives or partitions, separate from the operating system drives or partitions, for use by MapR-FS. For information on setting up disks for MapR-FS, see Setting Up Disks for MapR. If you do not have disks available for MapR, or to test with a small installation, you can use a flat file instead.

It is not necessary to set up RAID on disks used by MapR-FS. MapR uses a script called disksetup to set up storage pools. In most cases, you should let MapR calculate storage pools using the default stripe width of two or three disks. If you anticipate a high volume of random-access I/O, you can use the -W option with disksetup to specify larger storage pools of up to 8 disks each.
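For example, to build storage pools of up to 5 disks each from the disk list created earlier, the invocation would look like the following (the stripe width of 5 is illustrative):

/opt/mapr/server/disksetup -W 5 -F /tmp/disks.txt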

You can set up RAID on the operating system partition(s) or drive(s) at installation time, to provide higher operating system performance (RAID 0), disk mirroring for failover (RAID 1), or both (RAID 10), for example. See the following instructions from the operating system websites:

CentOS
Red Hat
Ubuntu

Directory Structure

Follow these guidelines for disk space allocated to the directories on each node:

- /tmp: 10 GB
- /opt/mapr: 40 GB
- /opt/cores: Twice the amount of physical RAM on the node, or 100 GB, whichever is more
- /opt/mapr/zk-data: A separate 500 MB partition

Software

Install a compatible 64-bit operating system on all nodes. MapR currently supports the following operating systems:

64-bit CentOS 5.4 or greater
64-bit Red Hat 5.4 or greater
64-bit Ubuntu 9.04 or greater
64-bit SUSE Linux Enterprise Server 11.x

Each node must also have the following software installed:


Sun Java JDK version 1.6.0_24 (not JRE)

If Java is already installed, check which versions of Java are installed: java -version
If JDK 6 is installed, the output will include a version number starting with 1.6, and below that the text Java(TM). Example:

java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)

Use update-alternatives to make sure JDK 6 is the default Java: sudo update-alternatives --config java

Configuration

Each node must be configured as follows:

Each node must have a unique hostname.
SELinux features must be disabled.
Each node must be able to perform forward and reverse host name resolution with every other node in the cluster.
Administrative user - a Linux user chosen to have administrative privileges on the cluster.
Make sure the user has a password (using sudo passwd <user>, for example).
Make sure the limit on the number of processes (NPROC_RLIMIT) is not set too low for the root user; the value should be at least 32786. In Red Hat or CentOS, the default may be very low (1024, for example). In Ubuntu, there may be no default; you should only set this value if you see errors related to inability to create new threads. Use the ulimit command to remove limits on file sizes or other computing resources. Each node must have a number of available file descriptors greater than four times the number of nodes in the cluster. See ulimit for more detailed information.
syslog must be enabled.
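
As an illustration, raising the process limit on Red Hat or CentOS could be done with lines like the following in /etc/security/limits.conf (a sketch, assuming the root user and the minimum value given above):

# Sketch: raise the per-user process limit for root
root soft nproc 32786
root hard nproc 32786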

In VM environments like EC2, VMware, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).

Environment Variables

The /opt/mapr/conf/env.sh script sets default values for the JAVA_HOME and MAPR_SUBNETS environment variables with /etc/environment. You can set these values manually as needed:

JAVA_HOME points to a specific version of Java, if this node needs to run multiple versions of Java.
MAPR_SUBNETS takes a comma-separated list of up to four subnets in CIDR notation with no spaces, as in the example export MAPR_SUBNETS=1.2.3.4/12,5.6/24. Every node in the cluster must be reachable on one of the subnets listed. Set this value to limit MapR-FS to a particular set of network interface controllers (NICs) on a node with multiple NICs.
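
For example, manual settings might look like the following sketch (the JDK path and subnet are placeholders; use the values appropriate to your nodes):

# Sketch: set JAVA_HOME and MAPR_SUBNETS manually (placeholder values)
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export MAPR_SUBNETS=10.10.0.0/16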

Keyless SSH

Keyless (passwordless) SSH should be set up between all nodes in the cluster; MapR uses keyless SSH for centralized management of disks (via the disk commands), support utilities, and rolling upgrades.

If you choose not to provide keyless SSH, everything in the cluster will run fine. The only inconvenience is that you will be unable to use the above features remotely; however, you can accomplish the same tasks locally on each node as follows:

Use the disk commands on each node to manage its own disks.
Use the mapr-support-collect.sh support utility with the -O or --online option, to use the warden instead of SSH for support dump collection from nodes.
Upgrade the cluster manually instead of performing a rolling upgrade.
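
If you do set up keyless SSH, a minimal sketch looks like the following (the user and node name are placeholders; repeat the ssh-copy-id step for every node in the cluster):

# Sketch: generate a key pair and copy the public key to another node
ssh-keygen -t rsa
ssh-copy-id root@node2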

NTP

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.

An internal NTP server enables your cluster to remain synchronized in the event that an outside NTP server is inaccessible.
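
A quick way to confirm that a node is synchronized is to query its NTP peers with ntpq, which ships with the standard NTP distribution:

# A '*' in the first column marks the peer currently used for synchronization
ntpq -p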

DNS Resolution

For MapR to work properly, all nodes on the cluster must be able to communicate with each other. Each node must have a unique hostname, and must be able to resolve all other hosts with both normal and reverse DNS name lookup.

You can use the hostname command on each node to check the hostname. Example:

$ hostname -f
swarm

If the hostname command returns a hostname, you can use the getent command to check whether the hostname exists in the hosts database. The getent command should return a valid IP address on the local network, associated with a fully-qualified domain name for the host. Example:

$ getent hosts `hostname`
10.250.1.53 swarm.corp.example.com

If you do not get the expected output from the hostname or getent command, correct the host and DNS settings on the node. A common problem is an incorrect loopback entry (127.0.x.x), which prevents the correct IP address from being assigned to the hostname.

Pay special attention to the format of /etc/hosts. For more information, see the hosts(5) man page. Example:

127.0.0.1 localhost
10.10.5.10 mapr-hadoopn.maprtech.prv mapr-hadoopn
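
To spot-check resolution across the cluster, you can loop over the node names from any node (the hostnames below are placeholders):

# Sketch: verify forward resolution for each cluster node
for h in node1 node2 node3; do getent hosts "$h"; done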

Users and Groups

MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a large cluster, you should consider configuring all nodes to use LDAP or another user management system. You can use the MapR Control System to give specific permissions to particular users and groups. For more information, see Managing Permissions. Each user can be restricted to a specific amount of disk usage. For more information, see Managing Quotas.

All nodes in the cluster must have the same set of users and groups, with the same uid and gid numbers on all nodes:

When adding a user to a cluster node, specify the --uid option with the useradd command to guarantee that the user has the same uid on all machines.

When adding a group to a cluster node, specify the --gid option with the groupadd command to guarantee that the group has the same gid on all machines.
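
For example, the following sketch creates a group and user with explicit, cluster-wide IDs (the name and numeric IDs are placeholders; run the same commands on every node):

# Sketch: create a group and user with fixed gid/uid on each node
groupadd --gid 5000 mapruser
useradd --uid 5000 --gid 5000 mapruser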

Choose a specific user to be the administrative user for the cluster. By default, MapR gives the root user full administrative permissions. If the nodes do not have an explicit root login (as is sometimes the case with Ubuntu, for example), you can give full permissions to the chosen administrative user after deployment. See Cluster Configuration.

On the node where you plan to run the mapr-webserver (the MapR Control System), install Pluggable Authentication Modules (PAM). See PAM Configuration.

Network Ports

The following table lists the network ports that must be open for use by MapR.

Service Port

SSH 22
NFS 2049
MFS server 5660
ZooKeeper 5181
CLDB web port 7221
CLDB 7222
Web UI HTTP 8080 (set by user)
Web UI HTTPS 8443 (set by user)
JobTracker 9001
NFS monitor (for HA) 9997
NFS management 9998
JobTracker web 50030
TaskTracker web 50060
HBase Master 60000
LDAP Set by user
SMTP Set by user

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.
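
One way to block external access to port 80 is an iptables rule like the following sketch (the exact rule depends on your firewall policy):

# Sketch: reject inbound TCP connections to port 80
iptables -A INPUT -p tcp --dport 80 -j REJECT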

Licensing

Before installing MapR, consider the capabilities you will need and make sure you have obtained the corresponding license. If you need NFS, data protection with snapshots and mirroring, or plan to set up a cluster with high availability (HA), you will need an M5 license. You can obtain and install a license through the License Manager after installation. For more information about which features are included in each license type, see MapR Editions.

If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the file MapRBuildVersion in the /opt/mapr directory. If the version is 1.0.0 and includes GA, then you must upgrade before applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.


PAM Configuration

MapR uses Pluggable Authentication Modules (PAM) for user authentication in the MapR Control System. Make sure PAM is installed and configured on the node running the mapr-webserver.

There are typically several PAM modules (profiles), configurable via configuration files in the /etc/pam.d/ directory. Each standard UNIX program normally installs its own profile. MapR can use (but does not require) its own mapr-admin PAM profile. The MapR Control System webserver tries the following three profiles in order:

1. mapr-admin (expects that the user has created the /etc/pam.d/mapr-admin profile)
2. sudo (/etc/pam.d/sudo)
3. sshd (/etc/pam.d/sshd)

The profile configuration file (for example, /etc/pam.d/sudo) should contain an entry corresponding to the authentication scheme used by your system. For example, if you are using local OS authentication, check for the following entry:

auth sufficient pam_unix.so # For local OS Auth

Example: Configuring PAM with mapr-admin

Although there are several viable ways to configure PAM to work with the MapR UI, we recommend using the mapr-admin profile. The following example shows how to configure the /etc/pam.d/mapr-admin file. If LDAP is not configured, comment out the LDAP lines.

Example /etc/pam.d/mapr-admin file

account required pam_unix.so
account sufficient pam_succeed_if.so uid < 1000 quiet
account [default=bad success=ok user_unknown=ignore] pam_ldap.so
account required pam_permit.so

auth sufficient pam_unix.so nullok_secure
auth requisite pam_succeed_if.so uid >= 1000 quiet
auth sufficient pam_ldap.so use_first_pass
auth required pam_deny.so

password sufficient pam_unix.so md5 obscure min=4 max=8 nullok try_first_pass
password sufficient pam_ldap.so
password required pam_deny.so

session required pam_limits.so
session required pam_unix.so
session optional pam_ldap.so

The following sections provide information about configuring PAM to work with LDAP or Kerberos.

The /etc/pam.d/sudo file should be modified only with care and only when absolutely necessary.

LDAP

To configure PAM with LDAP:

1. Install the appropriate PAM packages:
On Ubuntu: sudo apt-get install libpam-ldap
On Red Hat/CentOS: sudo yum install pam_ldap

2. Open /etc/pam.d/sudo and check for the following line:

auth sufficient pam_ldap.so # For LDAP Auth

Kerberos


To configure PAM with Kerberos:

1. Install the appropriate PAM packages:
On Red Hat/CentOS: sudo yum install pam_krb5
On Ubuntu: sudo apt-get install libpam-krb5

2. Open /etc/pam.d/sudo and check for the following line:

auth sufficient pam_krb5.so # For kerberos Auth


Setting Up Disks for MapR

MapR formats and uses disks for the Lockless Storage Services layer (MapR-FS), recording these disks in the file disktab. In a production environment, or when testing performance, MapR should be configured to use physical hard drives and partitions. In some cases, it is necessary to reinstall the operating system on a node so that the physical hard drives are available for direct use by MapR. Reinstalling the operating system provides an unrestricted opportunity to configure the hard drives. If the installation procedure assigns hard drives to be managed by the Linux Logical Volume Manager (LVM) by default, you should explicitly remove from LVM configuration the drives you plan to use with MapR. It is common to let LVM manage one physical drive containing the operating system partition(s) and to leave the rest unmanaged by LVM for use with MapR.

To determine if a disk or partition is ready for use by MapR:

1. Run the command sudo lsof <partition> to determine whether any processes are already using the disk or partition.
2. There should be no output when running sudo fuser <partition>, indicating there is no process accessing the specific disk or partition.
3. The disk or partition should not be mounted, as checked via the output of the mount command.
4. The disk or partition should not have an entry in the /etc/fstab file.
5. The disk or partition should be accessible to standard Linux tools such as mkfs. You should be able to successfully format the partition using a command like sudo mkfs.ext3 <partition>, as this is similar to the operations MapR performs during installation. If mkfs fails to access and format the partition, then it is highly likely MapR will encounter the same problem.

Any disk or partition that passes the above testing procedure can be added to the list of disks and partitions passed to the disksetup command.

To specify disks or partitions for use by MapR:

1. Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

2. Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

You should run disksetup only after running configure.sh.

To test without formatting physical disks:

If you do not have physical partitions or disks available for reformatting, you can test MapR by creating a flat file and including a path to the file in the disk list file. You should create a file of at least 16 GB.

The following example creates a 20 GB flat file (bs=1G specifies a block size of 1 gigabyte; count=20 writes 20 such blocks):

$ dd if=/dev/zero of=/root/storagefile bs=1G count=20

Using the above example, you would add the following to /tmp/disks.txt:

/root/storagefile
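
Putting the steps together, a flat-file test setup might look like this sketch (paths are taken from the example above):

# Sketch: create a 20 GB backing file, list it as a "disk", and format it
dd if=/dev/zero of=/root/storagefile bs=1G count=20
echo /root/storagefile >> /tmp/disks.txt
/opt/mapr/server/disksetup -F /tmp/disks.txt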

Working with a Logical Volume Manager

The Logical Volume Manager creates symbolic links to each logical volume's block device, from a directory path in the form /dev/<volume group>/<volume name>. MapR needs the actual block location, which you can find by using the ls -l command to list the symbolic links.

1. Make sure you have free, unmounted logical volumes for use by MapR:
Unmount any mounted logical volumes that can be erased and used for MapR.
Allocate any free space in an existing logical volume group to new logical volumes.
2. Make a note of the volume group and volume name of each logical volume.
3. Use ls -l with the volume group and volume name to determine the path of each logical volume's block device. Each logical volume is a symbolic link to a logical block device from a directory path that uses the volume group and volume name: /dev/<volume group>/<volume name>
The following example shows output that represents a volume group named mapr containing logical volumes named mapr1, mapr2, mapr3, and mapr4:

# ls -l /dev/mapr/mapr*
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr1 -> /dev/mapper/mapr-mapr1
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr2 -> /dev/mapper/mapr-mapr2
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr3 -> /dev/mapper/mapr-mapr3
lrwxrwxrwx 1 root root 22 Apr 12 21:48 /dev/mapr/mapr4 -> /dev/mapper/mapr-mapr4

4. Create a text file /tmp/disks.txt containing the paths to the block devices for the logical volumes (one path on each line). Example:

$ cat /tmp/disks.txt
/dev/mapper/mapr-mapr1
/dev/mapper/mapr-mapr2
/dev/mapper/mapr-mapr3
/dev/mapper/mapr-mapr4

5. Pass disks.txt to disksetup:

# sudo /opt/mapr/server/disksetup -F /tmp/disks.txt


ulimit

On each node, ulimit specifies the number of inodes that can be opened simultaneously. With the default value of 1024, the system appears to be out of disk space and shows no inodes available. The value for ulimit should be set to 64000.

Setting ulimit for CentOS/Red Hat:

1. Edit /etc/security/limits.conf and add the following lines:

root soft nofile 64000
root hard nofile 64000

2. Check that the /etc/pam.d/su file contains the following settings:

#%PAM-1.0

auth sufficient pam_rootok.so

# Uncomment the following line to implicitly trust users in the "wheel" group.

#auth sufficient pam_wheel.so trust use_uid

# Uncomment the following line to require a user to be in the "wheel" group.

#auth required pam_wheel.so use_uid

auth include system-auth

account sufficient pam_succeed_if.so uid = 0 use_uid quiet

account include system-auth

password include system-auth

session include system-auth

session optional pam_xauth.so

3. Reboot the system.
4. Run the following command to check the ulimit setting:

ulimit -n

The command should report 64000.

Setting ulimit for Ubuntu:

1. Edit /etc/security/limits.conf and add the following lines:

root soft nofile 64000
root hard nofile 64000

2. Edit /etc/pam.d/su and uncomment the following line:

session required pam_limits.so

3. Reboot the system.
4. Run the following command to check the ulimit setting:

ulimit -n


The command should report 64000.


Planning the Deployment

Planning a MapR deployment involves determining which services to run in the cluster and where to run them. The majority of nodes are worker nodes, which run the TaskTracker and MapR-FS services for data processing. A few control nodes run services that manage the cluster and coordinate MapReduce jobs.

The following table provides general guidelines for the number of instances of each service to run in a cluster:

Service Package How Many

CLDB mapr-cldb 1-3
FileServer mapr-fileserver Most or all nodes
HBase Master mapr-hbase-master 1-3
HBase RegionServer mapr-hbase-regionserver Varies
JobTracker mapr-jobtracker 1-3
NFS mapr-nfs Varies
TaskTracker mapr-tasktracker Most or all nodes
WebServer mapr-webserver One or more
ZooKeeper mapr-zookeeper 1, 3, 5, or a higher odd number

Sample Configurations

The following sections describe a few typical ways to deploy a MapR cluster.

Small M3 Cluster

A small M3 cluster runs most control services on only one node (except for ZooKeeper, which runs on three) and data services on the remaining nodes. The M3 license does not permit failover or high availability, and only allows one running CLDB.

Small M5 Cluster

A small M5 cluster runs control services on three nodes and data services on the remaining nodes, providing failover and high availability for all critical services.


Larger M5 Cluster

A large cluster (over 100 nodes) should isolate CLDB nodes from the TaskTracker and NFS nodes.

In large clusters, you should not run TaskTracker and ZooKeeper together on any nodes.

If running the TaskTracker on CLDB or ZooKeeper nodes is unavoidable, certain settings in mapred-site.xml should be configured differently to protect critical cluster-wide services from resource shortages. In particular, the number of slots advertised by the TaskTracker should be reduced and less memory should be reserved for MapReduce tasks. Edit /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml and set the following parameters:

mapred.tasktracker.map.tasks.maximum=(CPUS > 2) ? (CPUS * 0.50) : 1
mapred.tasktracker.reduce.tasks.maximum=(CPUS > 2) ? (CPUS * 0.25) : 1
mapreduce.tasktracker.prefetch.maptasks=0.25
mapreduce.tasktracker.reserved.physicalmemory.mb.low=0.50
mapreduce.tasktracker.task.slowlaunch=true
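
In mapred-site.xml, each of these settings is expressed as a standard Hadoop property element; for example, a sketch for the first parameter above:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>(CPUS > 2) ? (CPUS * 0.50) : 1</value>
</property>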

Planning NFS


The mapr-nfs service lets you access data on a licensed MapR cluster via the NFS protocol:

M3 license: one NFS node
M5 license: multiple NFS nodes, with VIPs for failover and load balancing

You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data can be faster than accessing the same data with the hadoop commands. To mount the cluster via NFS from a client machine, see Setting Up the Client.

NFS Setup Tips

Before using the MapR NFS Gateway, here are a few helpful tips:

Ensure the stock Linux NFS service is stopped, as Linux NFS and MapR NFS will conflict with each other.
Ensure portmapper is running (Example: ps a | grep portmap).
Make sure you have installed the mapr-nfs package. If you followed the Quick Start - Single Node or Quick Start - Small Cluster instructions, then it is installed. You can check by listing the /opt/mapr/roles directory and checking for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Adding a License.
Make sure the NFS service is started (see Services).
For information about mounting the cluster via NFS, see Setting Up the Client.

NFS on an M3 Cluster

At installation time, choose one node on which to run the NFS gateway. NFS is lightweight and can be run on a node running services such as CLDB or ZooKeeper. To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the node where you would like to run NFS.

NFS on an M5 Cluster

At cluster installation time, plan which nodes should provide NFS access according to your anticipated traffic. For instance, if you need 5 Gb/s of write throughput and 5 Gb/s of read throughput, here are a few ways to set up NFS:

12 NFS nodes, each of which has a single 1GbE connection
6 NFS nodes, each of which has a dual 1GbE connection
4 NFS nodes, each of which has a quad 1GbE connection

You can also set up NFS on all file server nodes, so each node can NFS-mount itself and native applications can run as tasks, or on one or more dedicated gateways outside the cluster (using round-robin DNS or behind a hardware load balancer) to allow controlled access.

You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also make high availability (HA) NFS possible; in the event an NFS node fails, data requests are satisfied by other NFS nodes in the pool. You should use a minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should be in the same IP subnet as the interfaces to which they will be assigned.

Here are a few tips:

Set up NFS on at least three nodes if possible.
All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide High Availability (HA) and failover. See Setting Up NFS HA for more information.

To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the nodes where you would like to run NFS.

NFS Memory Settings

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the percentage of the heap that it tries to use, by setting the percent, max, and min parameters in the warden.conf file on each NFS node. Example:


...
service.command.nfs.heapsize.percent=3
service.command.nfs.heapsize.max=1000
service.command.nfs.heapsize.min=64
...

The percentages need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services, unless you see specific memory-related problems occurring.

Running NFS on a Non-standard Port

To run NFS on an arbitrary port, modify the following line in warden.conf:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start

Add -p <portnumber> to the end of the line, as in the following example:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start -p 12345

After modifying warden.conf, restart the MapR NFS server by issuing the following command:

maprcli node services -nodes <nodename> -nfs restart

You can verify the port change with the rpcinfo -p localhost command.

Planning Services for HA

When properly licensed and configured for HA, the MapR cluster provides automatic failover for continuity throughout the stack. Configuring a cluster for HA involves running redundant instances of specific services, and configuring NFS properly. In HA clusters, it is advisable to have 3 nodes run CLDB and 5 run ZooKeeper. In addition, 3 Hadoop JobTrackers and/or 3 HBase Masters are appropriate depending on the purpose of the cluster. Any node or nodes in the cluster can run the MapR WebServer. In HA clusters, it is appropriate to run more than one instance of the WebServer with a load balancer to provide failover. NFS can be configured for HA using virtual IP addresses (VIPs). For more information, see Setting Up NFS HA.

The following are the minimum numbers of each service required for HA:

CLDB - 2 instances
ZooKeeper - 3 instances (to maintain a quorum in case one instance fails)
HBase Master - 2 instances
JobTracker - 2 instances
NFS - 2 instances

You should run redundant instances of important services on separate racks whenever possible, to provide failover if a rack goes down. For example, the top server in each of three racks might be a CLDB node, the next might run ZooKeeper and other control services, and the remainder of the servers might be data processing nodes. If necessary, use a worksheet to plan the services to run on each node in each rack.

Tips:

If you are installing a large cluster (100 nodes or more), CLDB nodes should not run any other service and should not contain any cluster data (see Isolating CLDB Nodes).
In HA clusters, it is advisable to have 3 nodes run CLDB and 5 run ZooKeeper. In addition, 3 Hadoop JobTrackers and/or 3 HBase Masters are appropriate depending on the purpose of the cluster.


Cluster Architecture

The architecture of the cluster hardware is an important consideration when planning a deployment. Among the considerations are anticipated data storage and network bandwidth needs, including intermediate data generated during MapReduce job execution. The type of workload is important: consider whether the planned cluster usage will be CPU-intensive, I/O-intensive, or memory-intensive. Think about how data will be loaded into and out of the cluster, and how much data is likely to be transmitted over the network.

Typically, the CPU is less of a bottleneck than network bandwidth and disk I/O. To the extent possible, network and disk transfer rates should be balanced to meet the anticipated data rates using multiple NICs per node. It is not necessary to bond or trunk the NICs together; MapR is able to take advantage of multiple NICs transparently. Each node should provide raw disks and partitions to MapR, with no RAID or logical volume manager, as MapR takes care of formatting and data protection.

Example Architecture

The following example architecture provides specifications for a standard compute/storage node for general purposes, and two sample rack configurations made up of the standard nodes. MapR is able to make effective use of more drives per node than standard Hadoop, so each node should present enough face plate area to allow a large number of drives. The standard node specification allows for either 2 or 4 1Gb/s ethernet network interfaces.

Standard Compute/Storage Node

2U chassis
Single motherboard, dual socket
2 x 4-core + 32 GB RAM or 2 x 6-core + 48 GB RAM
12 x 2 TB 7200-RPM drives
2 or 4 network interfaces (on-board NIC + additional NIC)
OS on single partition on one drive (remainder of drive used for storage)

Standard 50TB Rack Configuration

10 standard compute/storage nodes (10 x 12 x 2 TB storage; 3x replication, 25% margin)
24-port 1 Gb/s rack-top switch with 2 x 10 Gb/s uplink
Add second switch if each node uses 4 network interfaces

Standard 100TB Rack Configuration

20 standard nodes (20 x 12 x 2 TB storage; 3x replication, 25% margin)
48-port 1 Gb/s rack-top switch with 4 x 10 Gb/s uplink
Add second switch if each node uses 4 network interfaces

To grow the cluster, just add more nodes and racks, adding additional service instances as needed. MapR rebalances the cluster automatically.


Isolating CLDB Nodes

In a large cluster (100 nodes or more), create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:
a. PREPARE the node, making sure it meets the requirements.
b. ADD the MapR Repository.
2. INSTALL only the following packages:
mapr-cldb
mapr-webserver
mapr-core
mapr-fileserver
3. RUN configure.sh.
4. FORMAT the disks.
5. START the warden:

/etc/init.d/mapr-warden start

To restrict the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g. /cldbonly) using the MapR Control System or the following command:
maprcli node move -serverids <CLDB nodes> -topology /cldbonly
2. Restrict the CLDB volume to the CLDB-only topology, using the MapR Control System or the following command:
maprcli volume move -name mapr.cldb.internal -topology /cldbonly
3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in /cldbonly, using the MapR Control System or the following command:
maprcli volume modify -name mapr.cldb.internal -replication <replication factor>
4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value, using the MapR Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g. /defaultRack) using the MapR Control System or the following command:
maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack
2. Restrict all existing volumes to the /defaultRack topology using the MapR Control System or the following command:
maprcli volume move -name <volume> -topology /defaultRack
All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology.


Isolating ZooKeeper Nodes

For large clusters (100 nodes or more), isolate ZooKeeper on nodes that do not perform any other function, so that ZooKeeper does not compete for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset of packages. Importantly, do not install the FileServer package, so that MapR does not use the ZooKeeper-only node for data storage.

To set up a ZooKeeper-only node:

1. SET UP the node as usual:
a. PREPARE the node, making sure it meets the requirements.
b. ADD the MapR Repository.
2. INSTALL only the following packages:
mapr-zookeeper
mapr-zk-internal
mapr-core
3. RUN configure.sh.
4. FORMAT the disks.
5. START ZooKeeper (as root or using sudo):

/etc/init.d/mapr-zookeeper start

Do not start the warden.


Installing MapR

Before performing these steps, make sure all nodes meet the Requirements, and that you have planned which services to run on each node. You will need a list of the hostnames or IP addresses of all CLDB nodes, and the hostnames or IP addresses of all ZooKeeper nodes.

Perform the following steps, starting the installation with the control nodes running CLDB and ZooKeeper:

1. On all nodes, ADD the MapR Repository.
2. On each node, INSTALL the planned MapR services.
3. On all nodes, RUN configure.sh.
4. On all nodes, FORMAT disks for use by MapR.
5. START the cluster.
6. SET UP node topology.
7. SET UP NFS for HA.

The following sections provide details about each step.

Adding the MapR Repository

The MapR repository provides all the packages you need to install and run a MapR cluster using native tools such as yum on Red Hat or CentOS, or apt-get on Ubuntu. Perform the following steps on every node to add the MapR Repository for MapR version 1.2.10.

To Add the Repository on Red Hat or CentOS:

1. Change to the root user (or use sudo for the following commands).
2. Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.2.10/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

To Add the Repository on SUSE:

1. Change to the root user (or use sudo for the following commands).
2. Use the following zypper command to add the MapR repository:
zypper ar http://package.mapr.com/releases/v1.2.10/redhat/ mapr
To install a previous release, see the Release Notes for the correct path.
3. If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:

http_proxy=http://<host>:<port>
export http_proxy

4. Update the system package index by running the following command:
zypper refresh
5. Execute the following command:
zypper install mapr-compat-suse

If you don't have Internet connectivity, do one of the following:


Set up a local repository.
Download and install packages manually.

To Add the Repository on Ubuntu:

1. Change to the root user (or use sudo for the following commands).
2. Add the following line to /etc/apt/sources.list:
deb http://package.mapr.com/releases/v1.2.10/ubuntu/ mapr optional
(To install a previous release, see the Release Notes for the correct path.)
3. If your connection to the Internet is through a proxy server, add the following lines to /etc/apt/apt.conf:

Acquire {
Retries "0";
HTTP {
Proxy "http://<user>:<password>@<host>:<port>";
};
};

If you don't have Internet connectivity, do one of the following:

Set up a local repository.
Download and install packages manually.

Installing MapR Services

The following sections provide instructions for installing MapR.

To Install Services on Red Hat or CentOS:

1. Change to the root user (or use sudo for the following command).
2. Use the yum command to install the services planned for the node. Examples:

To install TaskTracker and MapR-FS:

yum install mapr-tasktracker mapr-fileserver

To install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, HBase and Mahout:

yum install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-hbase mapr-mahout

To Install Services on SUSE:

1. Change to the root user (or use sudo for the following command).
2. Use the zypper command to install the services planned for the node. Examples:

To install TaskTracker and MapR-FS:

zypper install mapr-tasktracker mapr-fileserver

To install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, HBase and Mahout:

zypper install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-hbase mapr-mahout

To Install Services on Ubuntu:

1. Change to the root user (or use sudo for the following commands).
2. On all nodes, issue the following command to update the Ubuntu package list:

apt-get update


3. Use the apt-get install command to install the services planned for the node. Examples:

To install TaskTracker and MapR-FS:

apt-get install mapr-tasktracker mapr-fileserver

To install CLDB, JobTracker, Webserver, ZooKeeper, Hive, Pig, HBase and Mahout:

apt-get install mapr-cldb mapr-jobtracker mapr-webserver mapr-zookeeper mapr-hive mapr-pig mapr-hbase mapr-mahout

Running configure.sh

Run the configure.sh script to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB: 7222
ZooKeeper: 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Formatting the Disks

On all nodes, use the following procedure to format disks and partitions for use by MapR.

This procedure assumes you have free, unmounted physical partitions or hard disks for use by MapR. If you are not sure, please read Setting Up Disks for MapR.

1. Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

2. Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere. Before following this procedure, make sure you have backed up any data you wish to keep.

3. Change to the root user (or use sudo for the following command).
4. Run disksetup, specifying the disk list file. Example:


/opt/mapr/server/disksetup -F /tmp/disks.txt

Bringing Up the Cluster

In order to configure the administrative user and license, bring up the CLDB, MapR Control System, and ZooKeeper; once that is done, bring up the other nodes. You will need the following information:

A list of nodes on which mapr-cldb is installed
<MCS node> - the node on which the mapr-webserver service is installed
<user> - the chosen Linux (or LDAP) user which will have administrative privileges on the cluster

To Bring Up the Cluster

1. Start ZooKeeper on all nodes where it is installed, by issuing the following command:

/etc/init.d/mapr-zookeeper start

2. On one of the CLDB nodes and the node running the mapr-webserver service, start the warden:

/etc/init.d/mapr-warden start

3. On the running CLDB node, issue the following command to give full permission to the chosen administrative user:

/opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

4. On a machine that is connected to the cluster and to the Internet, perform the following steps to install the license:
a. In a browser, view the MapR Control System by navigating to the node that is running the MapR Control System: https://<MCS node>:8443
Your computer won't have an HTTPS certificate yet, so the browser will warn you that the connection is not trustworthy. You can ignore the warning this time.
b. The first time MapR starts, you must accept the agreement and choose whether to enable the MapR Dial Home service.
c. Log in to the MapR Control System as the administrative user you designated earlier.
d. In the navigation pane of the MapR Control System, expand the System Settings group and click MapR Licenses to display the MapR License Management dialog.
e. Click Add Licenses via Web.
f. If the cluster is already registered, the license is applied automatically. Otherwise, click OK to register the cluster on MapR.com and follow the instructions there.

5. If mapr-nfs is installed on the running CLDB node, execute the following command on the running CLDB node:

/opt/mapr/bin/maprcli node services -nodes <node 1> -nfs start

6. On all other nodes, execute the following command:

/etc/init.d/mapr-warden start

7. Log in to the MapR Control System.
8. Under the Cluster group in the left pane, click Dashboard.
9. Check the Services pane and make sure each service is running the correct number of instances.

Setting up Topology

Topology tells MapR about the locations of nodes and racks in the cluster. Topology is important, because it determines where MapR places replicated copies of data. If you define the cluster topology properly, MapR scatters replication on separate racks so that your data remains available in the event an entire rack fails. Cluster topology is defined by specifying a topology path for each node in the cluster. The paths group nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data.

Topology paths can be as simple or complex as needed to correspond to your cluster layout. In a simple cluster, each topology path might consist of the rack only (e.g. /rack-1). In a deployment consisting of multiple large datacenters, each topology path can be much longer (e.g. /europe/uk/london/datacenter2/room4/row22/rack5/). MapR uses topology paths to spread out replicated copies of data, placing each copy on a separate path. By setting each path to correspond to a physical rack, you can ensure that replicated data is distributed across racks to improve fault tolerance.


After you have defined node topology for the nodes in your cluster, you can use volume topology to place volumes on specific racks, nodes, or groups of nodes. See Setting Volume Topology.

Setting Node Topology Manually

You can specify a topology path for one or more nodes using the node topo command, or in the MapR Control System using the following procedure.

To set node topology using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside each node whose topology you wish to set.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. Set the path in the New Path field:
To define a new path, type a topology path. Topology paths must begin with a forward slash ('/').
To use a path you have already defined, select it from the dropdown.
5. Click Move Node to set the new topology.

Setting Node Topology with a Script

If the cluster is large, it is more convenient to set the topology mapping using a text file or a script that specifies the topology. Each line of the text file (or the output from the script) specifies a single node and its full topology path, in the following format:

<ip or hostname> <topology>

The text file or script must be specified (and available) on the local filesystem on all CLDB nodes:

To set topology with a text file, set net.topology.table.file.name in /opt/mapr/conf/cldb.conf to the text file name.
To set topology with a script, set net.topology.script.file.name in /opt/mapr/conf/cldb.conf to the script file name.

If both are specified, the script is used and the text file is ignored.
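
For example, a topology text file might look like the following sketch (hostnames and rack paths are placeholders):

node1.example.com /rack-1
node2.example.com /rack-1
node3.example.com /rack-2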

Setting Up NFS HA

You can easily set up a pool of NFS nodes with HA and failover using virtual IP addresses (VIPs); if one node fails, the VIP will be automatically reassigned to the next NFS node in the pool. If you do not specify a list of NFS nodes, then MapR uses any available node running the MapR NFS service. You can add a server to the pool simply by starting the MapR NFS service on it. Before following this procedure, make sure you are running NFS on the servers to which you plan to assign VIPs. You should install NFS on at least three nodes. If all NFS nodes are connected to only one subnet, then adding another NFS server to the pool is as simple as starting NFS on that server; the MapR cluster automatically detects it and adds it to the pool.

You can restrict VIP assignment to specific NFS nodes or MAC addresses by adding them to the NFS pool list manually. VIPs are not assigned to any nodes that are not on the list, regardless of whether they are running NFS. If the cluster's NFS nodes have multiple network interface cards (NICs) connected to different subnets, you should restrict VIP assignment to the NICs that are on the correct subnet: for each NFS server, choose whichever MAC address is on the subnet from which the cluster will be NFS-mounted, then add it to the list. If you add a VIP that is not accessible on the subnet, then failover will not work. You can only set up VIPs for failover between network interfaces that are in the same subnet. In large clusters with multiple subnets, you can set up multiple groups of VIPs to provide NFS failover for the different subnets.

You can set up VIPs with the virtualip add command, or using the Add Virtual IPs dialog in the MapR Control System. The Add Virtual IPs dialog lets you specify a range of virtual IP addresses and assign them to the pool of servers that are running the NFS service. The available servers are displayed in the left pane in the lower half of the dialog. Servers that have been added to the NFS VIP pool are displayed in the right pane in the lower half of the dialog.
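
From the command line, assigning a VIP range might look like the following sketch (the parameter names are an assumption based on the virtualip add command; verify them with maprcli virtualip add -help before use):

# Sketch: assign a VIP range with a netmask (placeholder addresses)
maprcli virtualip add -virtualip 10.10.5.100 -virtualipend 10.10.5.102 -netmask 255.255.255.0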

To set up VIPs for NFS using the MapR Control System:

1. In the Navigation pane, expand the NFS HA group and click the NFS Setup view.
2. Click Start NFS to start the NFS Gateway service on nodes where it is installed.
3. Click Add VIP to display the Add Virtual IPs dialog.
4. Enter the start of the VIP range in the Starting IP field.
5. Enter the end of the VIP range in the Ending IP field. If you are assigning only one VIP, you can leave the field blank.
6. Enter the netmask for the VIP range in the Netmask field. Example: 255.255.255.0
7. If you wish to restrict VIP assignment to specific servers or MAC addresses:
If each NFS node has one NIC, or if all NICs are on the same subnet, select NFS servers in the left pane.
If each NFS node has multiple NICs connected to different subnets, select the server rows with the correct MAC addresses in the left pane.
8. Click Add to add the selected servers or MAC addresses to the list of servers to which the VIPs will be assigned. The servers appear in the right pane.
9. Click OK to assign the VIPs and exit.


Cluster Configuration

After installing MapR Services and bringing up the cluster, perform the following configuration steps.

Setting Up the Administrative User

Give the administrative user full control over the cluster:

1. Log on to any cluster node as root (or use sudo for the following command).
2. Execute the following command, replacing <user> with the administrative username:
sudo /opt/mapr/bin/maprcli acl edit -type cluster -user <user>:fc

For general information about users and groups in the cluster, see Users and Groups.

Checking the Services

Use the following steps to start the MapR Control System and check that all configured services are running:

1. Start the MapR Control System: in a browser, go to the following URL, replacing <host> with the hostname of the node that is running the WebServer: https://<host>:8443
2. Log in using the administrative username and password.
3. The first time you run the MapR Control System, you must accept the MapR Terms of Service. Click I Accept to proceed.
4. Under the Cluster group in the left pane, click Dashboard.
5. Check the Services pane and make sure each service is running the correct number of instances. For example: if you have configured 5 servers to run the CLDB service, you should see that 5 of 5 instances are running.

If one or more services have not started, wait a few minutes to see if the warden rectifies the problem. If not, you can try to start the services manually. See Managing Services.

If too few instances of a service have been configured, check that the service is installed on all appropriate nodes. If not, you can add the service to any nodes where it is missing. See Reconfiguring a Node.

Configuring Authentication

If you use Kerberos, LDAP, or another authentication scheme, make sure PAM is configured correctly to give MapR access. See PAM Configuration.

Configuring Email

MapR can notify users by email when certain conditions occur. There are three ways to specify the email addresses of MapR users:

From an LDAP directory
By domain
Manually, for each user

To configure email from an LDAP directory:

1. In the MapR Control System, expand the System Settings group and click Email Addresses to display the Configure Email Addresses dialog.
2. Select Use LDAP and enter the information about the LDAP directory into the appropriate fields.
3. Click Save to save the settings.

To configure email by domain:

1. In the MapR Control System, expand the System Settings group and click Email Addresses to display the Configure Email Addresses dialog.
2. Select Use Company Domain and enter the domain name in the text field.
3. Click Save to save the settings.

To configure email manually for each user:

1. Create a volume for the user.
2. In the MapR Control System, expand the MapR-FS group and click User Disk Usage.
3. Click the username to display the User Properties dialog.
4. Enter the user's email address in the Email field.
5. Click Save to save the settings.


Configuring SMTP

Use the following procedure to configure the cluster to use your SMTP server to send mail:

1. In the MapR Control System, expand the System Settings group and click SMTP to display the Configure Sending Email dialog.
2. Enter the information about how MapR will send mail:

Provider: assists in filling out the fields if you use Gmail.
SMTP Server: the SMTP server to use for sending mail.
This server requires an encrypted connection (SSL): specifies an SSL connection to SMTP.
SMTP Port: the SMTP port to use for sending mail.
Full Name: the name MapR should use when sending email. Example: MapR Cluster
Email Address: the email address MapR should use when sending email.
Username: the username MapR should use when logging on to the SMTP server.
SMTP Password: the password MapR should use when logging on to the SMTP server.

3. Click Test SMTP Connection.
4. If there is a problem, check the fields to make sure the SMTP information is correct.
5. Once the SMTP connection is successful, click Save to save the settings.

Configuring Permissions

By default, users are able to log on to the MapR Control System, but do not have permission to perform any actions. You can grant specific permissions to individual users and groups. See Managing Permissions.
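If a user needs less than full control, you can grant individual permission codes instead of fc with the same maprcli command shown earlier; a hedged example, where jsmith is an illustrative username and login,ss grants logging in plus starting and stopping services (see Managing Permissions for the full list of codes):

sudo /opt/mapr/bin/maprcli acl edit -type cluster -user jsmith:login,ss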

Setting Quotas

Set default disk usage quotas. If needed, you can set specific quotas for individual users and groups. See Managing Quotas.

Configuring Alarm Notifications

If an alarm is raised on the cluster, MapR sends an email notification by default to the user associated with the object on which the alarm was raised. For example, if a volume goes over its allotted quota, MapR raises an alarm and sends email to the volume creator. For each alarm type, you can configure MapR to send email to a custom email address in addition to or instead of the default email address, or not to send email at all. See Notifications.

Designating NICs for MapR

If you do not want MapR to use all NICs on each node, use the MAPR_SUBNETS environment variable to restrict MapR traffic to specific NICs. Set MAPR_SUBNETS to a comma-separated list of up to four subnets in CIDR notation with no spaces. Example:

export MAPR_SUBNETS=1.2.3.4/12,5.6/24

If MAPR_SUBNETS is not set, MapR uses all NICs present on the node.
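To make the setting survive reboots, one common approach is to export it from a file that the MapR services source at startup; a sketch, assuming /opt/mapr/conf/env.sh serves that purpose in your installation and with the subnets replaced by your own:

# echo 'export MAPR_SUBNETS=10.10.1.0/24,10.10.2.0/24' >> /opt/mapr/conf/env.sh
# ip -4 addr show | grep inet    # confirm which NICs fall inside the listed subnets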


Component Setup

This section provides information about integrating the following tools with a MapR cluster:

Mahout - Environment variable settings needed to run Mahout on MapR
Ganglia - Setting up Ganglia monitoring on a MapR cluster
Nagios Integration - Generating a Nagios Object Definition file for use with a MapR cluster
Using Whirr to Install on Amazon EC2 - Using a Whirr script to start a MapR cluster on Amazon EC2
Compiling Pipes Programs - Using Hadoop Pipes on a MapR cluster
HBase - Installing and using HBase on MapR
MultiTool - Starting Cascading Multitool on a MapR cluster
Flume - Installing and using Flume on a MapR cluster
Hive - Installing and using Hive on a MapR cluster, and setting up a MySQL metastore
Pig - Installing and using Pig on a MapR cluster


Flume

Flume is a reliable, distributed service for collecting, aggregating, and moving large amounts of log data, generally delivering the data to a distributed file system such as HDFS. For more information about Flume, see the Apache Flume Incubation Wiki.

Installing Flume

The following procedures use the operating system package managers to download and install from the MapR Repository. To install the packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install Flume on an Ubuntu cluster:

1. Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster; if you have not installed MapR, see the Installation Guide.
2. Update the list of available packages:

apt-get update

3. On each planned Flume node, install mapr-flume:

apt-get install mapr-flume

To install Flume on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster; if you have not installed MapR, see the Installation Guide.
2. On each planned Flume node, install mapr-flume:

yum install mapr-flume
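To verify that the package actually landed on a node, the operating system's own package tools are enough; for example:

# rpm -q mapr-flume             # Red Hat or CentOS
$ dpkg -l | grep mapr-flume     # Ubuntu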

Using Flume

For information about configuring and using Flume, see the following documents:

Flume User Guide
Flume Cookbook


HBase

HBase is the Hadoop database, which provides random, realtime read/write access to very large datasets.

See Installing HBase for information about using HBase with MapR
See Setting Up Compression with HBase for information about compressing HFile storage
See Running MapReduce Jobs with HBase for information about using MapReduce with HBase
See HBase Best Practices for HBase tips and tricks

Installing HBase

Plan which nodes should run the HBase Master service, and which nodes should run the HBase RegionServer. At least one node (generally three nodes) should run the HBase Master; for example, install HBase Master on the ZooKeeper nodes. The HBase RegionServer can run on a few of the remaining nodes or on all of them. When you install HBase RegionServer on nodes that also run TaskTracker, reduce the number of map and reduce slots to avoid oversubscribing the machine. The following procedures use the operating system package managers to download and install from the MapR Repository. To install the packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install HBase on an Ubuntu cluster:

1. Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster; if you have not installed MapR, see the Installation Guide.
2. Update the list of available packages:

apt-get update

3. On each planned HBase Master node, install mapr-hbase-master:

apt-get install mapr-hbase-master

4. On each planned HBase RegionServer node, install mapr-hbase-regionserver:

apt-get install mapr-hbase-regionserver

5. On all HBase nodes, run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster. The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

To install HBase on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.
2. On each planned HBase Master node, install mapr-hbase-master:

yum install mapr-hbase-master

3. On each planned HBase RegionServer node, install mapr-hbase-regionserver:

yum install mapr-hbase-regionserver

4. On all HBase nodes, run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster. The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start
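Before moving on, you can confirm from any node that the new HBase services are registered; a minimal sketch, assuming the maprcli node list command and its svc column behave as in this release:

# maprcli node list -columns svc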

Getting Started with HBase

In this tutorial, we'll create an HBase table on the cluster, enter some data, query the table, then clean up the data and exit.


HBase tables are organized by column, rather than by row. Furthermore, the columns are organized in groups called column families. When creating an HBase table, you must define the column families before inserting any data. Column families should not be changed often, nor should there be too many of them, so it is important to think carefully about what column families will be useful for your particular data. Each column family, however, can contain a very large number of columns. Columns are named using the format family:qualifier.

Unlike columns in a relational database, which reserve empty space for columns with no values, HBase columns simply don't exist for rows where they have no values. This not only saves space, but means that different rows need not have the same columns; you can use whatever columns you need for your data on a per-row basis.

Create a table in HBase:

1. Start the HBase shell by typing the following command:

/opt/mapr/hbase/hbase-0.90.4/bin/hbase shell

2. Create a table called weblog with one column family named stats:

create 'weblog', 'stats'

3. Verify the table creation by listing everything:

list

4. Add a test value to the daily column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:daily', 'test-daily-value'

5. Add a test value to the weekly column in the stats column family for row 1:

put 'weblog', 'row1', 'stats:weekly', 'test-weekly-value'

6. Add a test value to the weekly column in the stats column family for row 2:

put 'weblog', 'row2', 'stats:weekly', 'test-weekly-value'

7. Type scan 'weblog' to display the contents of the table. Sample output:

ROW          COLUMN+CELL
 row1        column=stats:daily, timestamp=1321296699190, value=test-daily-value
 row1        column=stats:weekly, timestamp=1321296715892, value=test-weekly-value
 row2        column=stats:weekly, timestamp=1321296787444, value=test-weekly-value
2 row(s) in 0.0440 seconds

8. Type get 'weblog', 'row1' to display the contents of row 1. Sample output:

COLUMN          CELL
 stats:daily    timestamp=1321296699190, value=test-daily-value
 stats:weekly   timestamp=1321296715892, value=test-weekly-value
2 row(s) in 0.0330 seconds

9. Type disable 'weblog' to disable the table.
10. Type drop 'weblog' to drop the table and delete all data.
11. Type exit to exit the HBase shell.

Setting Up Compression with HBase

Using compression with HBase reduces the number of bytes transmitted over the network and stored on disk. These benefits often outweigh the performance cost of compressing the data on every write and uncompressing it on every read.

GZip Compression


GZip compression is included with most Linux distributions, and works natively with HBase. To use GZip compression, specify it in the per-column family compression flag while creating tables in the HBase shell. Example:

create 'mytable', NAME=>'colfam:', COMPRESSION=>'gz'

LZO Compression

Lempel-Ziv-Oberhumer (LZO) is a lossless data compression algorithm, included in most Linux distributions, that is designed for decompression speed.

To Set Up LZO Compression for Use with HBase:

1. Make sure HBase is installed on the nodes where you plan to run it. See Planning the Deployment and Installing MapR for more information.
2. On each HBase node, ensure the native LZO base library is installed:

On Ubuntu: apt-get install liblzo2-dev
On Red Hat or CentOS: yum install liblzo2-devel

3. Check out the native connector library from http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/. For 0.20.2, check out branches/branch-0.1:

svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/branches/branch-0.1/

For 0.21 or 0.22, check out trunk:

svn checkout http://svn.codespot.com/a/apache-extras.org/hadoop-gpl-compression/trunk/

4. Set the compiler flags and build the native connector library:

$ export CFLAGS="-m64"
$ ant compile-native
$ ant jar

5. Create a directory for the native libraries (use TAB completion to fill in the <version> placeholder):

mkdir -p /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64

6. Copy the build results into the appropriate HBase directories on every HBase node. Example:

$ cp build/hadoop-gpl-compression-0.2.0-dev.jar /opt/mapr/hbase/hbase-<version>/lib/
$ cp build/native/Linux-amd64-64/lib/libgplcompression.* /opt/mapr/hbase/hbase-<version>/lib/native/Linux-amd64-64/

Once LZO is set up, you can specify it in the per-column family compression flag while creating tables in HBase shell. Example:

create 'mytable', NAME=>'colfam:', COMPRESSION=>'lzo'

Running MapReduce Jobs with HBase

To run MapReduce jobs with data stored in HBase, set the HADOOP_CLASSPATH environment variable to the output of the hbase classpath command (use TAB completion to fill in the <version> placeholder):

$ export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-<version>/bin/hbase classpath`

Note the backticks (`).

Example: Exporting a table named t1 with MapReduce


Notes: On a node in a MapR cluster, the output directory /hbase/export_t1 will be located in the MapR Hadoop filesystem, so to list the output files in the example below, use the following hadoop fs command from the node's command line:

# hadoop fs -ls /hbase/export_t1

To view the output:

# hadoop fs -cat /hbase/export_t1/part-m-00000

# cd /opt/mapr/hadoop/hadoop-0.20.2
# export HADOOP_CLASSPATH=`/opt/mapr/hbase/hbase-0.90.4/bin/hbase classpath`
# ./bin/hadoop jar /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar export t1 /hbase/export_t1
11/09/28 09:35:11 INFO mapreduce.Export: verisons=1, starttime=0, endtime=9223372036854775807
11/09/28 09:35:11 INFO fs.JobTrackerWatcher: Current running JobTracker is: lohit-ubuntu/10.250.1.91:9001
11/09/28 09:35:12 INFO mapred.JobClient: Running job: job_201109280920_0003
11/09/28 09:35:13 INFO mapred.JobClient:  map 0% reduce 0%
11/09/28 09:35:19 INFO mapred.JobClient: Job complete: job_201109280920_0003
11/09/28 09:35:19 INFO mapred.JobClient: Counters: 15
11/09/28 09:35:19 INFO mapred.JobClient:   Job Counters
11/09/28 09:35:19 INFO mapred.JobClient:     Aggregate execution time of mappers(ms)=3259
11/09/28 09:35:19 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
11/09/28 09:35:19 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
11/09/28 09:35:19 INFO mapred.JobClient:     Launched map tasks=1
11/09/28 09:35:19 INFO mapred.JobClient:     Data-local map tasks=1
11/09/28 09:35:19 INFO mapred.JobClient:     Aggregate execution time of reducers(ms)=0
11/09/28 09:35:19 INFO mapred.JobClient:   FileSystemCounters
11/09/28 09:35:19 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=61319
11/09/28 09:35:19 INFO mapred.JobClient:   Map-Reduce Framework
11/09/28 09:35:19 INFO mapred.JobClient:     Map input records=5
11/09/28 09:35:19 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=107991040
11/09/28 09:35:19 INFO mapred.JobClient:     Spilled Records=0
11/09/28 09:35:19 INFO mapred.JobClient:     CPU_MILLISECONDS=780
11/09/28 09:35:19 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=759836672
11/09/28 09:35:19 INFO mapred.JobClient:     Map output records=5
11/09/28 09:35:19 INFO mapred.JobClient:     SPLIT_RAW_BYTES=63
11/09/28 09:35:19 INFO mapred.JobClient:     GC time elapsed (ms)=35


HBase Best Practices

The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (normally mounted at /hbase). Example:

hadoop mfs -setcompression off /hbase

You can check whether compression is turned off in a directory or mounted volume by using hadoop mfs to list the file contents. Example:

hadoop mfs -ls /hbase

The letter Z in the output indicates compression is turned on; the letter U indicates compression is turned off. See hadoop mfs for more information.

On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to TaskTracker. To change the memory allocated to each service, edit the /opt/mapr/conf/warden.conf file. See Tuning MapReduce for more information.


Hive

Apache Hive is a data warehouse system for Hadoop that uses a SQL-like language called Hive Query Language (HQL) to query structured data stored in a distributed filesystem. For more information about Hive, see the Apache Hive project page.

On this page:

Installing Hive
Getting Started with Hive
Using Hive with MapR Volumes
Setting Up Hive with a MySQL Metastore
Hive-HBase Integration

Once Hive is installed, the executable is located at: /opt/mapr/hive/hive-<version>/bin/hive

Make sure the JAVA_HOME environment variable is set correctly. Example:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun

Make sure the HIVE_HOME environment variable is set correctly. Example:

# export HIVE_HOME=/opt/mapr/hive/hive-<version>
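To make both settings persistent for your login shell, you might append them to ~/.bashrc; a sketch, with the paths adjusted to your installation:

$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun' >> ~/.bashrc
$ echo 'export HIVE_HOME=/opt/mapr/hive/hive-<version>' >> ~/.bashrc
$ source ~/.bashrc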

Installing Hive

The following procedures use the operating system package managers to download and install Hive from the MapR Repository. To install the packages manually, refer to the Local Packages document for Red Hat or Ubuntu. This procedure is to be performed on a MapR cluster (see the Installation Guide) or client (see Setting Up the Client).

Default Hive MapR Hadoop Filesystem Directories

It is not necessary to create and chmod the Hive /tmp and /user/hive/warehouse directories in the MapR Hadoop filesystem. By default, MapR creates and configures these directories for you when you create your first Hive table.

These default directories are defined in the $HIVE_HOME/conf/hive-default.xml file:

<configuration>
...
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive-${user.name}</value>
  <description>Scratch space for Hive jobs</description>
</property>

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
  <description>location of default database for the warehouse</description>
</property>
...
</configuration>

If you need to modify the default names for one or both of these directories, create a $HIVE_HOME/conf/hive-site.xml file for this purpose if it doesn't already exist.

Copy the hive.exec.scratchdir and/or the hive.metastore.warehouse.dir property elements from the hive-default.xml file and paste them into an XML configuration element in the hive-site.xml file. Modify the value elements for these directories in the hive-site.xml file as desired, then save and close the hive-site.xml file and close the hive-default.xml file.
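For example, the override file can be written in one step with a shell here-document; a sketch, where /myvolume/tmp and /myvolume/warehouse are illustrative locations to replace with your own:

cat > $HIVE_HOME/conf/hive-site.xml <<'EOF'
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/myvolume/tmp</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/myvolume/warehouse</value>
  </property>
</configuration>
EOF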

To install Hive on an Ubuntu cluster:


1. Execute the following commands as root or using sudo.
2. Update the list of available packages:

apt-get update

3. On each planned Hive node, install mapr-hive:

apt-get install mapr-hive

To install Hive on a Red Hat or CentOS cluster:

1. Execute the following commands as root or using sudo.
2. On each planned Hive node, install mapr-hive:

yum install mapr-hive
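After installation, you can confirm which version directory the package created; this is the value that expands the <version> placeholder used in later commands:

$ ls /opt/mapr/hive/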

Getting Started with Hive

In this tutorial, you'll create a Hive table, load data from a tab-delimited text file, and run a couple of basic queries against the table.

First, make sure you have downloaded the sample table: on the A Tour of the MapR Virtual Machine page, select Tools > Attachments, right-click on sample-table.txt, select Save Link As... from the pop-up menu, select a directory to save to, then click OK. If you're working on the MapR Virtual Machine, we'll be loading the file from the MapR Virtual Machine's local file system (not the cluster storage layer), so save the file in the MapR Home directory (for example, /home/mapr).

Take a look at the source data

First, take a look at the contents of the file using the terminal:

1. Make sure you are in the Home directory where you saved sample-table.txt (type cd ~ if you are not sure).
2. Type cat sample-table.txt to display the following output.

mapr@mapr-desktop:~$ cat sample-table.txt
1320352532  1001  http://www.mapr.com/doc  http://www.mapr.com      192.168.10.1
1320352533  1002  http://www.mapr.com      http://www.example.com   192.168.10.10
1320352546  1001  http://www.mapr.com      http://www.mapr.com/doc  192.168.10.1

Notice that the file consists of only three lines, each of which contains a row of data fields separated by the TAB character. The data in the file represents a web log.

Create a table in Hive and load the source data:

1. Type the following command to start the Hive shell, using tab-completion to expand the <version>:

/opt/mapr/hive/hive-<version>/bin/hive

2. At the hive> prompt, type the following command to create the table:

CREATE TABLE web_log(viewTime INT, userid BIGINT, url STRING, referrer STRING, ip STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

3. Type the following command to load the data from sample-table.txt into the table:

LOAD DATA LOCAL INPATH '/home/mapr/sample-table.txt' INTO TABLE web_log;

Run basic queries against the table:

Try the simplest query, one that displays all the data in the table:


SELECT web_log.* FROM web_log;

This query would be inadvisable with a large table, but with the small sample table it returns very quickly.

Try a simple SELECT to extract only data that matches a desired string:

SELECT web_log.* FROM web_log WHERE web_log.url LIKE '%doc';

This query launches a MapReduce job to filter the data.

When the Hive shell starts, it reads an initialization file called .hiverc, which is located in the HIVE_HOME/bin/ or $HOME/ directories. You can edit this file to set custom parameters or commands that initialize the Hive command-line environment, one command per line.
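For example, you could create a .hiverc that pins the scratch directory to a volume of your choosing; a sketch, where /myvolume/tmp is illustrative:

cat > ~/.hiverc <<'EOF'
set hive.exec.scratchdir=/myvolume/tmp;
EOF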

When you run the Hive shell, you can specify an initialization script file using the -i option. Example:

hive -i <filename>

Using Hive with MapR Volumes

MapR-FS does not allow moving or renaming across volume boundaries. Be sure to set the Hive scratch directory and Hive warehouse directory in the same volume as the data for the Hive job before running the job. The following sections provide additional detail.

Hive Scratch Directory

When running an import job on data from a MapR volume, make sure to set hive.exec.scratchdir to a directory in the same volume (where the data for the job resides). Set the parameter to a directory (for example, /tmp) under the volume's mount point (as viewed in Volume Properties). You can set this parameter from the Hive shell. Example:

hive> set hive.exec.scratchdir=/myvolume/tmp

Hive Warehouse Directory

When writing queries that move data between tables, make sure the tables are in the same volume. By default, all tables are created under the path "/user/hive/warehouse" under the root volume. This value is specified by the hive.metastore.warehouse.dir property, which you can set from the Hive shell. Example:

hive> set hive.metastore.warehouse.dir=/myvolume/mydirectory

Setting Up Hive with a MySQL Metastore

The metadata for Hive tables and partitions are stored in the Hive Metastore (for more information, see the Hive project documentation). By default, the Hive Metastore stores all Hive metadata in an embedded Apache Derby database in MapR-FS. Derby only allows one connection at a time; if you want multiple concurrent Hive sessions, you can use MySQL for the Hive Metastore. You can run the Hive Metastore on any machine that is accessible from Hive.

Prerequisites

Make sure MySQL is installed on the machine on which you want to run the Metastore, and make sure you are able to connect to the MySQL Server from the Hive machine. You can test this with the following command:

mysql -h <hostname> -u <user>

The database administrator must create a database for the Hive metastore data, and the username specified in javax.jdo.option.ConnectionUserName must have permissions to access it. The database can be specified using the ConnectionURL parameter. The tables and schemas are created automatically when the metastore is first started.
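For example, the database and grant can be created from the shell in one step; a hedged sketch, where hive, hiveuser, and hivepass are all illustrative names to replace with your own:

$ mysql -u root -p -e "CREATE DATABASE hive; GRANT ALL PRIVILEGES ON hive.* TO 'hiveuser'@'%' IDENTIFIED BY 'hivepass';"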

Download and install the driver for the MySQL JDBC connector. Example:


$ curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.18.tar.gz/from/http://mysql.he.net/' | tar xz
$ sudo cp mysql-connector-java-5.1.18/mysql-connector-java-5.1.18-bin.jar /opt/mapr/hive/hive-<version>/lib/

Configuring Hive for MySQL

Create the hive-site.xml file in the Hive configuration directory (/opt/mapr/hive/hive-<version>/conf) with the contents from the example below. Then set the parameters as follows:

You can set a specific port for Thrift URIs by adding the export METASTORE_PORT=<port> command to the hive-env.sh file (if hive-env.sh does not exist, create it in the Hive configuration directory). Example:

export METASTORE_PORT=9083

To connect to an existing MySQL metastore, make sure the ConnectionURL parameter and the Thrift URIs parameters in hive-site.xml point to the metastore's host and port.

Once you have the configuration set up, start the Hive Metastore service using the following command (use tab auto-complete to fill in the <version>):

/opt/mapr/hive/hive-<version>/bin/hive --service metastore

You can use nohup hive --service metastore to run the metastore in the background.
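For example, to keep the metastore running after you log out and capture its output (the log path here is just a suggestion):

$ nohup /opt/mapr/hive/hive-<version>/bin/hive --service metastore > /tmp/metastore.log 2>&1 &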

Example hive-site.xml


<configuration>

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value><fill in with password></value>
  <description>password to use against metastore database</description>
</property>

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://localhost:3306</value>
</property>

</configuration>

Hive-HBase Integration

You can create HBase tables from Hive that can be accessed by both Hive and HBase. This allows you to run Hive queries on HBase tables. You can also convert existing HBase tables into Hive-HBase tables and run Hive queries on those tables as well.

In this section:

Install and Configure Hive and HBase
Getting Started with Hive-HBase Integration

Install and Configure Hive and HBase

1. Install and configure Hive if it is not already installed.

2. Install and configure HBase if it is not already installed.

3. Execute the jps command and ensure that all relevant Hadoop, HBase and ZooKeeper processes are running.

Example:


$ jps
21985 HRegionServer
1549 jenkins.war
15051 QuorumPeerMain
30935 Jps
15551 CommandServer
15698 HMaster
15293 JobTracker
15328 TaskTracker
15131 WardenMain

Configure the hive-site.xml File

1. Open the hive-site.xml file with your favorite editor, or create a hive-site.xml file if it doesn't already exist:

$ cd $HIVE_HOME
$ vi conf/hive-site.xml

2. Copy the following XML code and paste it into the hive-site.xml file.

Note: If you already have an existing hive-site.xml file with a configuration element block, just copy the property element block code below and paste it inside the configuration element block in the hive-site.xml file.

Example configuration:

<configuration>

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,file:///opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar,file:///opt/mapr/zookeeper/zookeeper-3.3.2/zookeeper-3.3.2.jar</value>
  <description>A comma separated list (with no spaces) of the jar files required for Hive-HBase integration</description>
</property>

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>xx.xx.x.xxx,xx.xx.x.xxx,xx.xx.x.xxx</value>
  <description>A comma separated list (with no spaces) of the IP addresses of all ZooKeeper servers in the cluster.</description>
</property>

<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
  <description>The ZooKeeper client port. The MapR default clientPort is 5181.</description>
</property>

</configuration>

3. Save and close the hive-site.xml file.

If you have successfully completed all of the steps in this Install and Configure Hive and HBase section, you're ready to begin the Getting Started with Hive-HBase Integration tutorial in the next section.

Getting Started with Hive-HBase Integration

In this tutorial we will:

Create a Hive table
Populate the Hive table with data from a text file
Query the Hive table
Create a Hive-HBase table
Introspect the Hive-HBase table from HBase


Populate the Hive-HBase table with data from the Hive table
Query the Hive-HBase table from Hive
Convert an existing HBase table into a Hive-HBase table

Be sure that you have successfully completed all of the steps in the Install and Configure Hive and HBase section before beginning this Getting Started tutorial.

This Getting Started tutorial closely parallels the Hive-HBase Integration section of the Apache Hive Wiki, and thanks go to Samuel Guo and other contributors to that effort. If you are familiar with their approach to Hive-HBase integration, you should be immediately comfortable with this material.

However, please note that there are some significant differences in this Getting Started section, especially in regard to configuration and command parameters or the lack thereof. Follow the instructions in this Getting Started tutorial to the letter so you can have an enjoyable and successful experience.

Create a Hive table with two columns:

Change to your Hive installation directory if you're not already there and start Hive:

$ cd $HIVE_HOME
$ bin/hive

Execute the CREATE TABLE command to create the Hive pokes table:

hive> CREATE TABLE pokes (foo INT, bar STRING);

To see if the pokes table has been created successfully, execute the SHOW TABLES command:

hive> SHOW TABLES;
OK
pokes
Time taken: 0.74 seconds

The pokes table appears in the list of tables.

Populate the Hive pokes table with data

Execute the LOAD DATA LOCAL INPATH command to populate the Hive pokes table with data from the kv1.txt file.

The kv1.txt file is provided in the $HIVE_HOME/examples directory.

hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

A message appears confirming that the table was created successfully, and the Hive prompt reappears:

Copying data from file:...
OK
Time taken: 0.278 seconds
hive>

Execute a SELECT query on the Hive pokes table:

hive> SELECT * FROM pokes WHERE foo = 98;

The SELECT statement executes, runs a MapReduce job, and prints the job output:


OK
98	val_98
98	val_98
Time taken: 18.059 seconds

The output of the SELECT command displays two identical rows because there are two identical rows in the Hive pokes table with a key of 98.

Note: This is a good illustration of the concept that Hive tables can have multiple identical keys. As we will see shortly, HBase tables cannot have multiple identical keys, only unique keys.

To create a Hive-HBase table, enter these four lines of code at the Hive prompt:

hive> CREATE TABLE hbase_table_1(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    > TBLPROPERTIES ("hbase.table.name" = "xyz");

After a brief delay, a message appears confirming that the table was created successfully:

OK
Time taken: 5.195 seconds

Note: The TBLPROPERTIES command is not required, but those new to Hive-HBase integration may find it easier to understand what's going on if Hive and HBase use different names for the same table.

In this example, Hive will recognize this table as "hbase_table_1" and HBase will recognize this table as "xyz".

Start the HBase shell:

Keeping the Hive terminal session open, start a new terminal session for HBase, then start the HBase shell:

$ cd $HBASE_HOME
$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4, rUnknown, Wed Nov 9 17:35:00 PST 2011

hbase(main):001:0>

Execute the list command to see a list of HBase tables:

hbase(main):001:0> list
TABLE
xyz
1 row(s) in 0.8260 seconds

HBase recognizes the Hive-HBase table named xyz. This is the same table known to Hive as hbase_table_1.

Display the description of the xyz table in the HBase shell:

hbase(main):004:0> describe "xyz"
DESCRIPTION                                                          ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE', true
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 0.0190 seconds

From the Hive prompt, insert data from the Hive table pokes into the Hive-HBase table hbase_table_1:


hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=98;
...
2 Rows loaded to hbase_table_1
OK
Time taken: 13.384 seconds

Query hbase_table_1 to see the data we have inserted into the Hive-HBase table:

hive> SELECT * FROM hbase_table_1;
OK
98	val_98
Time taken: 0.56 seconds

Even though we loaded two rows from the Hive pokes table that had the same key of 98, only one row was actually inserted into hbase_table_1. This is because hbase_table_1 is an HBase table, and although Hive tables support duplicate keys, HBase tables only support unique keys. HBase tables arbitrarily retain only one row per key, and will silently discard all of the data associated with duplicate keys.

Convert a pre-existing HBase table to a Hive-HBase table

To convert a pre-existing HBase table to a Hive-HBase table, enter the following four commands at the Hive prompt.

Note that in this example the existing HBase table is my_hbase_table.

hive> CREATE EXTERNAL TABLE hbase_table_2(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = "cf1:val")
    > TBLPROPERTIES("hbase.table.name" = "my_hbase_table");

Now we can run a Hive query against the pre-existing HBase table my_hbase_table, which Hive sees as hbase_table_2:

hive> SELECT * FROM hbase_table_2 WHERE key > 400 AND key < 410;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
...
OK
401	val_401
402	val_402
403	val_403
404	val_404
406	val_406
407	val_407
409	val_409
Time taken: 9.452 seconds


Hive ODBC Connector

Before You Begin

The MapR Hive ODBC Connector is an ODBC driver for Apache Hive 0.70 and later that conforms to the ODBC 3.52 specification. The standard query language for ODBC is SQL; Hive's query language, HiveQL, includes a subset of ANSI SQL-92. When using an application that connects via ODBC to Hive, you may need to rewrite queries to compensate for SQL features that are not present in Hive. Applications that use SQL will recognize HiveQL, but might not provide access to HiveQL-specific features such as multi-table insert. Please refer to the HiveQL wiki for up-to-date information on HiveQL.

You will need to configure a Data Source Name (DSN), a definition that specifies how to connect to Hive. DSNs are typically managed by the operating system and may be used by multiple applications. Some applications do not use DSNs. You will need to refer to your particular application's documentation to understand how it connects using ODBC.

Software and Hardware Requirements

Using the MapR Hive ODBC Connector on Windows requires:

Windows® 7 Professional or Windows® 2008 R2; both 32 and 64-bit editions are supported
Microsoft .NET Framework 4.0
The Microsoft Visual C++ 2010 Redistributable Package (runtimes required to run applications developed with Visual C++ on a computer that does not have Visual C++ 2010 installed)
A Hadoop cluster with the Hive service installed and running. You should find out from the cluster administrator the hostname or IP address for the Hive service and the port that the service is running on. (The default port for Hive is 10000.)

Installation and Configuration

There are versions of the connector for 32-bit and 64-bit applications. The 64-bit version of the connector works only with 64-bit DSNs; the 32-bit connector works only with 32-bit DSNs. 64-bit Windows machines can run both 64-bit and 32-bit applications; on a 64-bit system, you might need to install both versions of the connector in order to set up DSNs to work with both. If both the 32-bit connector and the 64-bit connector are installed, you must configure DSNs for each independently, in their separate Data Source Administrators.

To install the Hive ODBC Connector:

1. Run the installer to get started:
   To install the 64-bit connector, download and run http://package.mapr.com/tools/MapR-ODBC/MapR_odbc_1.00.100.7_x64.exe
   To install the 32-bit connector, download and run http://package.mapr.com/tools/MapR-ODBC/MapR_odbc_1.00.100.7_x86.exe
2. Perform the following steps, clicking Next after each:
   a. Accept the license agreement.
   b. Select an installation folder.
3. On the Information window, click Next.
4. On the Completing... window, click Finish.
5. Install a DSN corresponding to your Hive server.

To create a Data Source Name (DSN)

1. Open the Data Source Administrator from the Start menu. Example: Start > MapR Hive ODBC Connector > 64-Bit ODBC Driver Manager
2. Click Add to open the Create New Data Source dialog.
3. Select Hive ODBC Connector and click Finish to open the MapR Hive ODBC Connector Setup window.
4. Enter the connection information for the Hive instance:

Data Source Name — a name for the DSN
Description — an optional description for the DSN
Host — the IP or hostname of your Hive server
Port — the listening port for the Hive service
Database — use show databases at the Hive command line if you are not sure

5. Click Test to test the connection.
6. When you're sure the connection works, click Finish.

SQLPrepare Optimization

The connector currently uses query execution to determine the result-set's metadata for SQLPrepare. The downside of this is that SQLPrepare is slow, because query execution tends to be slow. You can configure the connector to speed up SQLPrepare if you do not need the result-set's metadata. To change the behavior for SQLPrepare, create a String value named NOPSQLPrepare under your DSN. If the value is set to a non-zero value, SQLPrepare will not use query execution to derive the result-set's metadata. If this registry entry is not defined, the default value is 0.

Notes


Data Types

The following data types are supported:

Type Description

TINYINT 1-byte integer

SMALLINT 2-byte integer

INT 4-byte integer

BIGINT 8-byte integer

FLOAT Single-precision floating-point number

DOUBLE Double-precision floating-point number

BOOLEAN True/false value

STRING Sequence of characters

Not yet supported:

The aggregate types (ARRAY, MAP, and STRUCT)
The new timestamp types introduced in Hive 0.80

HiveQL Notes

CAST Function

HiveQL doesn't support the CONVERT function; it uses the CAST function to perform type conversion. Example:

CAST (<expression> AS <type>)

Using CAST in HiveQL:

Use the HiveQL names for the eight data types supported by Hive in the CAST expression. For example, to convert 1.0 to an integer, use CAST (1.0 AS INT) rather than CAST (1.0 AS SQL_INTEGER).

Hive does not do a range check during CAST operations. For example, CAST (1000000 AS SQL_TINYINT) returns a TINYINT value of 64, rather than the expected error.
Unlike SQL, Hive returns null instead of an error if it fails to convert the data. For example, CAST ("STRING" AS INT) returns null.
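Both behaviors are easy to confirm from the command line; a minimal sketch, assuming the pokes table from the tutorial above still exists and HIVE_HOME is set:

$ $HIVE_HOME/bin/hive -e "SELECT CAST('123' AS INT), CAST('abc' AS INT) FROM pokes LIMIT 1;"

The first expression returns 123; the second returns NULL rather than an error.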

Using CAST with BOOLEAN values:

The boolean value TRUE converts to the numeric value 1
The boolean value FALSE converts to the numeric value 0
The numeric value 0 converts to the boolean value FALSE; any other number converts to TRUE
The empty string converts to the boolean value FALSE; any other string converts to TRUE

The HiveQL STRING type stores text strings, and corresponds to the SQL_LONGVARCHAR data type. The CAST operation successfully converts strings to numbers if the strings contain only numeric characters; otherwise the conversion fails.

You can tune the column length used for STRING columns. To change the default length reported for STRING columns, add the registry entry DefaultStringColumnLength under your DSN and specify a value. If this registry entry is not defined, the default length of 1024 characters is used.

Delimiters

The connector uses Thrift to connect to the Hive server. Hive returns the result set of a HiveQL query as newline-delimited rows whose fields are tab-delimited. Hive currently does not escape any tab character in the field. Make sure to escape any tab or newline characters in the Hive data, including platform-specific newline character sequences such as line-feed (LF) for UNIX/Linux/Mac OS X, carriage return/line-feed (CR/LF) for Windows, and carriage return (CR) for older Macintosh platforms.

Notes on Applications

Microsoft Access

Version tested: "2010" (=14.0), 32 and 64-bit.


Notes: Linked tables are not currently available.

Microsoft Excel/Query

Version tested: "2010" (=14.0), 32 and 64-bit.

Notes: From the ribbon, use Data > From Other Sources and select either From Data Connection Wizard or From Microsoft Query. The former requires a pre-defined DSN while the latter supports creating a DSN on the fly. You can use the ODBC driver via the OLE DB for ODBC Driver bridge.

Tableau Desktop

Version tested: 7.0, 32-bit only.

Notes: Prior to version 7.0.n, you will need to install a TDC to maximize the capability of the driver. From version 7.0.n onward, you can specify the driver via the MapR Hadoop Hive option from the Connect to Data tab.


Hive ODBC Connector License and Copyright Information

Third Party Trademarks

ICU License - ICU 1.8.1 and later

COPYRIGHT AND PERMISSION NOTICE

Copyright (c) 1995-2010 International Business Machines Corporation and others

All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, provided that the above copyright notice(s) and this permission notice appear in all copies of the Software and that both the above copyright notice(s) and this permission notice appear in supporting documentation.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder.

All trademarks and registered trademarks mentioned herein are the property of their respective owners.

OpenSSL

Copyright (c) 1998-2008 The OpenSSL Project. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. All advertising materials mentioning features or use of this software must display the following acknowledgment: "This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit. (http://www.openssl.org/)"

4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to endorse or promote products derived from this software without prior written permission. For written permission, please contact openssl-core@openssl.org.

5. Products derived from this software may not be called "OpenSSL" nor may "OpenSSL" appear in their names without prior written permission of the OpenSSL Project.

6. Redistributions of any form whatsoever must retain the following acknowledgment: "This product includes software developed by the OpenSSL Project for use in the OpenSSL Toolkit (http://www.openssl.org/)"

THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE OpenSSL PROJECT OR ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Expat

Copyright (c) 1998, 1999, 2000 Thai Open Source Software Center Ltd

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Apache Hive

Copyright 2008-2011 The Apache Software Foundation.

Apache Thrift

Copyright 2006-2010 The Apache Software Foundation.


Mahout

Apache Mahout™ is a scalable machine learning library. For more information about Mahout, see the Apache Mahout project.

On this page:

Installing Mahout
Configuring the Mahout Environment
Getting Started with Mahout

Installing Mahout

Mahout can be installed when MapR services are initially installed, as discussed in Installing MapR Services. If Mahout wasn't installed during the initial MapR services installation, Mahout can be installed at a later date by executing the instructions in this section. These procedures may be performed on a node in a MapR cluster (see the Installation Guide) or on a client (see Setting Up the Client).

The Mahout installation procedures below use the operating system's package manager to download and install Mahout from the MapR Repository. To install the packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

Installing Mahout on a MapR Node

Mahout only needs to be installed on the nodes in the cluster from which Mahout applications will be executed, so you may only need to install Mahout on one node. However, depending on the number of Mahout users and the number of scheduled Mahout jobs, you may need to install Mahout on more than one node.

Mahout applications may run MapReduce programs, and by default Mahout will use the cluster's default JobTracker to execute MapReduce jobs.

Install Mahout on a MapR node running Ubuntu

Install Mahout on a MapR node running Ubuntu as root or using sudo by executing the following command:

# apt-get install mapr-mahout

Install Mahout on a MapR node running Red Hat or CentOS

Install Mahout on a MapR node running Red Hat or CentOS as root or using sudo by executing the following command:

# yum install mapr-mahout

Installing Mahout on a Client

If you install Mahout on a Linux client, you can run Mahout applications from the client that execute MapReduce jobs on the cluster that your client is configured to use.

Tip: You don't have to install Mahout on the cluster in order to run Mahout applications from your client.

Install Mahout on a client running Ubuntu

Install Mahout on a client running Ubuntu as root or using sudo by executing the following command:

# apt-get install mapr-mahout

Install Mahout on a client running Red Hat or CentOS

Install Mahout on a client running Red Hat or CentOS as root or using sudo by executing the following command:

# yum install mapr-mahout

Configuring the Mahout Environment

After installation the Mahout executable is located in the following directory:


/opt/mapr/mahout/mahout-<version>/bin/mahout

Example: /opt/mapr/mahout/mahout-0.5/bin/mahout

To use Mahout with MapR, set the following environment variables:

MAHOUT_HOME - the path to the Mahout directory. Example: $ export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.5

JAVA_HOME - the path to the Java directory. Example for Ubuntu: $ export JAVA_HOME=/usr/lib/jvm/java-6-sun

JAVA_HOME - the path to the Java directory. Example for Red Hat and CentOS: $ export JAVA_HOME=/usr/java/jdk1.6.0_24

HADOOP_HOME - the path to the Hadoop directory. Example: $ export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2

HADOOP_CONF_DIR - the path to the directory containing Hadoop configuration parameters. Example: $ export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

You can set these environment variables persistently for all users by adding them to the /etc/environment file as root or using sudo. The order of the environment variables in the file doesn't matter.

Example entries for setting environment variables in the /etc/environment file for Ubuntu:

JAVA_HOME=/usr/lib/jvm/java-6-sun
MAHOUT_HOME=/opt/mapr/mahout/mahout-0.5
HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

Example entries for setting environment variables in the /etc/environment file for Red Hat and CentOS:

JAVA_HOME=/usr/java/jdk1.6.0_24
MAHOUT_HOME=/opt/mapr/mahout/mahout-0.5
HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf

After adding or editing environment variables in the /etc/environment file, you can activate them without rebooting by executing the source command:

$ source /etc/environment

Note: A user who doesn't have root or sudo permissions can add these environment variable entries to his or her ~/.bashrc file. The environment variables will be set each time the user logs in.
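
For example, such a user might append lines like the following to ~/.bashrc (a sketch using the Ubuntu paths shown above; adjust JAVA_HOME for Red Hat or CentOS):

export MAHOUT_HOME=/opt/mapr/mahout/mahout-0.5
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf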

Getting Started with Mahout

To see the sample applications bundled with Mahout, execute the following command:

$ ls $MAHOUT_HOME/examples/bin

To run the Twenty Newsgroups Classification Example, execute the following commands:

$ cd $MAHOUT_HOME
$ ./examples/bin/build-20news-bayes.sh

The output from this example will look similar to the following:


MultiTool

The mt command is the wrapper around Cascading.Multitool, a command line tool for processing large text files and datasets (like sed and grep on unix). The mt command is located in the /opt/mapr/contrib/multitool/bin directory. To use mt, change to the multitool directory. Example:

cd /opt/mapr/contrib/multitool
./bin/mt


Oozie

Oozie is a workflow system for Hadoop. Using Oozie, you can set up workflows that execute MapReduce jobs, and coordinators that manage workflows.

Installing Oozie

The following procedures use the operating system package managers to download and install Oozie from the MapR Repository. To install the packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install Oozie on a MapR cluster:

Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster with the MapR repository properly set. If you have not installed MapR, see the Installation Guide.

If you are installing on Ubuntu, update the list of available packages:

apt-get update

Install mapr-oozie:
RHEL/CentOS:

yum install mapr-oozie

SUSE:

zypper install mapr-oozie

Ubuntu:

apt-get install mapr-oozie

The warden picks up the new configuration and automatically starts the new services. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

Use the oozie-setup.sh script to set up Oozie:

/opt/mapr/oozie/oozie-3.0.0/bin/oozie-setup.sh

Start the Oozie daemon:

/etc/init.d/mapr-oozie start

The command returns immediately, but it might take a few minutes for Oozie to start.

Use the following command to see if Oozie has started:

/etc/init.d/mapr-oozie status

Checking the Status of Oozie

Once Oozie is installed, you can check the status using the command line or the Oozie web console.

To check the status of Oozie using the command line:

Use the oozie admin command:


/opt/mapr/oozie/oozie-3.0.0/bin/oozie admin -oozie http://localhost:11000/oozie -status

The following output indicates normal operation:

System mode: NORMAL

To check the status of Oozie using the web console:

Point your browser to http://localhost:11000/oozie

Examples

After verifying the status of Oozie, set up and try the examples to get familiar with Oozie.

To set up the examples and copy them to the cluster:

Extract the Oozie examples archive oozie-examples.tar.gz:

cd /opt/mapr/oozie/oozie-3.0.0
tar xvfz ./oozie-examples.tar.gz

Mount the cluster via NFS (see Accessing Data with NFS). Example:

mkdir /mnt/mapr
mount localhost:/mapr /mnt/mapr

Create a directory for the examples. Example:

mkdir /mnt/mapr/my.cluster.com/myvolume/oozie-examples

Copy the Oozie examples from the local directory to the cluster directory. Example:

cp -r /opt/mapr/oozie/oozie-3.0.0/examples /mnt/mapr/my.cluster.com/myvolume/oozie-examples

Set the OOZIE_URL environment variable so that you don't have to provide the -oozie option when you run each job:

export OOZIE_URL="http://localhost:11000/oozie"
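
With OOZIE_URL set, subsequent oozie commands can omit the -oozie option; for example, the status check shown earlier can be shortened as follows (a sketch, assuming the default Oozie port):

/opt/mapr/oozie/oozie-3.0.0/bin/oozie admin -status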

To run the examples:

Choose an example and run it with the oozie job command. Example:

/opt/mapr/oozie/oozie-3.0.0/bin/oozie job -config /opt/mapr/oozie/oozie-3.0.0/examples/apps/map-reduce/job.properties -run

Make a note of the returned job ID.
Using the job ID, check the status of the job using the command line or the Oozie web console, as shown below.

Using the command line, type the following (substituting the job ID for the <job id> placeholder):

/opt/mapr/oozie/oozie-3.0.0/bin/oozie job -info <job id>

Using the Oozie web console, point your browser to http://localhost:11000/oozie and click All Jobs.


Pig

Apache Pig is a platform for parallelized analysis of large data sets via a language called Pig Latin. For more information about Pig, see the Pig project page.

Once Pig is installed, the executable is located at: /opt/mapr/pig/pig-<version>/bin/pig

Make sure the JAVA_HOME environment variable is set correctly. Example:

# export JAVA_HOME=/usr/lib/jvm/java-6-sun

Installing Pig

The following procedures use the operating system package managers to download and install Pig from the MapR Repository. To install the packages manually, refer to the Local Packages document for Red Hat or Ubuntu.

To install Pig on an Ubuntu cluster:

Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
Update the list of available packages:

apt-get update

On each planned Pig node, install mapr-pig:

apt-get install mapr-pig

To install Pig on a Red Hat or CentOS cluster:

Execute the following commands as root or using sudo. This procedure is to be performed on a MapR cluster. If you have not installed MapR, see the Installation Guide.
On each planned Pig node, install mapr-pig:

yum install mapr-pig

Getting Started with Pig

In this tutorial, we'll use Pig to run a MapReduce job that counts the words in the file /myvolume/in/constitution.txt on the cluster, and store the results in /myvolume/wordcount.

First, make sure you have downloaded the file: on the page A Tour of the MapR Virtual Machine, select Tools > Attachments and right-click constitution.txt to save it.
Make sure the file is loaded onto the cluster, in the directory /myvolume/in. If you are not sure how, look at the NFS tutorial in A Tour of the MapR Virtual Machine.

Open a Pig shell and get started:

In the terminal, type pig to start the Pig shell.
At the grunt> prompt, type the following lines (press ENTER after each):

A = LOAD '/myvolume/in' USING TextLoader() AS (words:chararray);

B = FOREACH A GENERATE FLATTEN(TOKENIZE(*));

C = GROUP B BY $0;


D = FOREACH C GENERATE group, COUNT(B);

STORE D INTO '/myvolume/wordcount';

After you type the last line, Pig starts a MapReduce job to count the words in the file constitution.txt.

When the MapReduce job is complete, type quit to exit the Pig shell and take a look at the contents of the directory /myvolume/wordcount to see the results.
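
For example, if the cluster is mounted via NFS as in the Oozie examples above, you could inspect the results with commands along these lines (the mount point and volume path are illustrative; substitute your own):

ls /mnt/mapr/my.cluster.com/myvolume/wordcount
cat /mnt/mapr/my.cluster.com/myvolume/wordcount/part-*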


Using Whirr to Install on Amazon EC2

Whirr simplifies the task of installing a MapR cluster on Amazon Elastic Compute Cloud (Amazon EC2). To get started, you only need an Amazon Web Services (AWS) account. Once your account is set up, Whirr creates the instances, installs and configures the cluster according to your specifications, and launches the MapR software.

Installing Whirr

You can install and run Whirr on any computer running a compatible operating system:

64-bit Red Hat 5.4 or greater, or 64-bit CentOS 5.4 or greater
64-bit Ubuntu 9.04 or greater

To install Whirr on CentOS or Red Hat:

As root (or using sudo), issue the following command:

yum install mapr-whirr

To install Whirr on Ubuntu:

As root (or using sudo), issue the following commands:

apt-get update
apt-get install mapr-whirr

Planning

Before setting up a MapR cluster on Amazon EC2, select a compatible AMI. MapR recommends the following:

us-west-1/ami-e1bfefa4 - an Ubuntu 10.04 AMI (username: ubuntu)

Plan the services to run on each node just as you would for any MapR cluster. For more information, see Planning the Deployment. If you are not sure, you can use the mapr-simple schema to create a basic cluster.

Specifying Services

Once you have determined how many nodes should run which services, you will need to format this information as a list of instance templates for use in the Whirr configuration. An instance template specifies a number of nodes and a group of services to run, in the following format: <number> <service>[+<service>...]

In other words, an instance template defines a node type and how many of that node type to create in the Amazon EC2 cluster. The instance template is specified in the properties file, in the whirr.instance-templates parameter (see Configuring Whirr below). For example, consider the following configuration plan for an 8-node cluster:

CLDB, FileServer, JobTracker, WebServer, NFS, and ZooKeeper on 3 nodes
FileServer and TaskTracker on 5 nodes

To set up the above cluster on EC2, you would use the following instance template:

whirr.instance-templates=3mapr-cldb+mapr-fileserver+mapr-jobtracker+mapr-webserver+mapr-nfs+mapr-zookeeper,5mapr-fileserver+mapr-tasktracker

The MapR-Simple Schema

The mapr-simple schema installs the core MapR services on a single node, then installs only the FileServer and TaskTracker services on the remaining nodes. To create a mapr-simple cluster, specify a single instance template consisting of the number of nodes and the text mapr-simple. Example:

whirr.instance-templates=5 mapr-simple


Because the mapr-simple schema includes only one CLDB, one ZooKeeper, and one JobTracker, it is not suitable for clusters which require High Availability (HA).

Configuring Whirr

Whirr is configured with a properties file, a text file that specifies information about the cluster to provision on Amazon EC2. The directory /opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes contains a few properties files (MapR recipes) to get you started.

Collect the following information, which is needed for the Whirr properties file:

EC2 access key - The access key for your Amazon account
EC2 secret key - The secret key for your Amazon account
Amazon AMI - The AMI to use for the instances
Instance templates - A list of instance templates
Amazon EC2 region - The region on which to start the cluster
Number of nodes - The number of instances to configure with each set of services
Private ssh key - The private ssh key to set up on each instance
Public ssh key - The public ssh key to set up on each instance

You can use one of the MapR recipes as a starting point for your own properties file:

If you do not need to change the instance template or cluster name, you can use the MapR recipes unmodified by setting the environment variables referenced in the file:

$ export AWS_ACCESS_KEY_ID=<EC2 access key>
$ export AWS_SECRET_ACCESS_KEY=<EC2 secret key>

To change the cluster name, the number of nodes, or other configuration settings, edit one of the MapR recipes or create a properties file from scratch. The properties file should look like the following sample properties file, with the values above substituted for the placeholders in angle brackets (<>).

Sample Properties File

#EC2 settings
whirr.provider=ec2
whirr.identity=<EC2 access key>
whirr.credential=<EC2 secret key>
whirr.image-id=<Amazon AMI>
whirr.hardware-id=<instance type>
whirr.location-id=<Amazon EC2 region>

#Cluster settings
whirr.instance-templates=<instance templates>
whirr.private-key-file=<private ssh key>
whirr.public-key-file=<public ssh key to set up on instances>

#MapR settings
whirr.run-url-base=http://package.mapr.com/scripts/whirr/
whirr.mapr-install-runurl=mapr/install
whirr.mapr-configure-runurl=mapr/configure

Running Whirr

You can use Whirr to launch the cluster, list the nodes, and destroy the cluster when finished.

Do not use sudo with Whirr, as the environment variables associated with your shell will not be available.

Launching the Cluster

Use the whirr launch-cluster command, specifying the properties file, to launch your cluster. Whirr configures an AMI instance on each node with the MapR services specified, then starts the MapR software.

Example:


$ /opt/mapr/whirr/whirr-0.3.0-incubating/bin/whirr launch-cluster --config /opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes/mysimple.properties

When Whirr finishes, it displays a list of instances in the cluster. Example:

Started cluster of 2 instances
Cluster instances=[Instance roles=[mapr-simple], publicAddress=/10.250.1.126, privateAddress=/10.250.1.192, id=us-west-1/i-501d1514, Instance roles=[mapr-simple], publicAddress=/10.250.1.136, privateAddress=/10.250.1.217, id=us-west-1/i-521d1516], configuration=hadoop.job.ugi=root,root, mapred.job.tracker=ec2-10-250-1-126.us-west-1.compute.amazonaws.com:9001, hadoop.socks.server=localhost:6666, fs.s3n.awsAccessKeyId=ASDFASDFSDFASDFASDF, fs.s3.awsSecretAccessKey=AasKLJjkLa89asdf8970as, fs.s3.awsAccessKeyId=ASDFASDFSDFASDFASDF, hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory, fs.default.name=maprfs://ec2-10-250-1-126.us-west-1.compute.amazonaws.com:7222/, fs.s3n.awsSecretAccessKey=AasKLJjkLa89asdf8970asj

Each listed instance includes a list of roles fulfilled by the node. The above example shows a two-node cluster created with the mapr-simple schema; both nodes have the following roles: roles=[mapr-simple]

You can find the public IP address of each node in the publicAddress parameter.

Listing the Cluster Nodes

Use the whirr list-cluster command, specifying the properties file, to list the nodes in your cluster.

Example:

$ /opt/mapr/whirr/whirr-0.3.0-incubating/bin/whirr list-cluster --config /opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes/mysimple.properties

Destroying the Cluster

Use the whirr destroy-cluster command, specifying the properties file, to destroy the cluster.

Example:

$ /opt/mapr/whirr/whirr-0.3.0-incubating/bin/whirr destroy-cluster --config /opt/mapr/whirr/whirr-0.3.0-incubating/mapr-recipes/mysimple.properties

After Installation

Find the line containing a handler name followed by the phrase "Web UI available at" in the Whirr output to determine the public address of the node that is running the webserver. Example:

MapRSimpleHandler: Web UI available at https://ec2-50-18-17-126.us-west-1.compute.amazonaws.com:8443

Log in to the node via SSH using your instance username (for example, ubuntu or ec2-user). No password is needed.
Become root with one of the following commands: su - or sudo bash.
Set the password for root by issuing the passwd root command.
Set up keyless SSH from the webserver node to all other nodes in the cluster, as sketched below.
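
A minimal sketch of one way to set up keyless SSH from the webserver node, assuming ssh-copy-id is available on the instance (the node address is a placeholder):

ssh-keygen -t rsa
ssh-copy-id root@<other node>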

Known Issues

Installation on more than nine nodes fails with an AWSException (message size exceeded). To create a cluster larger than nine nodes, install in batches of nine and use configure.sh to specify the correct CLDB and ZooKeeper IP addresses on each node.

Using the Cluster

View the MapR Control System by navigating to the following URL (substituting the IP address of the EC2 node running the mapr-webserver service): https://<IP address>:8443

To launch MapReduce jobs, either use ssh to log into any node in the EC2 cluster, or set up a MapR Client.


Setting Up the Client

MapR provides several interfaces for working with a cluster from a client computer:

MapR Control System - manage the cluster, including nodes, volumes, users, and alarms
Direct Access NFS™ - mount the cluster in a local directory
MapR client - work with MapR Hadoop directly

Mac OS X
Red Hat/CentOS
Ubuntu
Windows

MapR Control System

The MapR Control System is web-based, and works with the following browsers:

Chrome
Safari
Firefox 3.0 and above
Internet Explorer 7 and 8

To use the MapR Control System, navigate to the host that is running the WebServer in the cluster. MapR Control System access to the cluster is typically via HTTP on port 8080 or via HTTPS on port 8443; you can specify the protocol and port in the Configure HTTP dialog. You should disable pop-up blockers in your browser to allow MapR to open help links in new browser tabs.
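
For example, if the WebServer is running on a host named node1 (a hypothetical hostname), you would navigate to https://node1:8443 for HTTPS access or http://node1:8080 for HTTP.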

Direct Access NFS™

You can mount a MapR cluster locally as a directory on a Mac, Linux, or Windows computer.

Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount. Example:

usa-node01:/mapr - for mounting from the command line
nfs://usa-node01/mapr - for mounting from the Mac Finder

Make sure the client machine has the appropriate username and password to access the NFS share. For best results, the username and password for accessing the MapR cluster should be the same username and password used to log into the client machine.

Linux

Make sure the NFS client is installed. Examples:
sudo yum install nfs-utils (Red Hat or CentOS)
sudo apt-get install nfs-common (Ubuntu)

List the NFS shares exported on the server. Example:
showmount -e usa-node01
Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
Mount the cluster via NFS. Example:
sudo mount usa-node01:/mapr /mapr

You can also add an NFS mount to /etc/fstab so that it mounts automatically when your system starts up. Example:

# device mountpoint fs-type options dump fsckorder
...
usa-node01:/mapr /mapr nfs rw 0 0
...

Mac

To mount the cluster from the Finder:

Open the Disk Utility: go to Applications > Utilities > Disk Utility.
Select File > NFS Mounts.
Click the + at the bottom of the NFS Mounts window.
In the dialog that appears, enter the following information:


Remote NFS URL: The URL for the NFS mount. If you do not know the URL, use the showmount command described below. Example: nfs://usa-node01/mapr
Mount location: The mount point where the NFS mount should appear in the local filesystem.

Click the triangle next to Advanced Mount Parameters.
Enter nolocks in the text field.
Click Verify.
Important: On the dialog that appears, click Don't Verify to skip the verification process.

The MapR cluster should now appear at the location you specified as the mount point.

To mount the cluster from the command line:

List the NFS shares exported on the server. Example:
showmount -e usa-node01
Set up a mount point for an NFS share. Example:
sudo mkdir /mapr
Mount the cluster via NFS. Example:
sudo mount -o nolock usa-node01:/mapr /mapr

Windows

Because of Windows directory caching, there may appear to be no .snapshot directory in each volume's root directory. To work around the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by creating an empty file or directory in the volume's root directory).

To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise:


Open Start > Control Panel > Programs.
Select Turn Windows features on or off.
Select Services for NFS.
Click OK.
Mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:

To mount the cluster on other Windows versions:

Download and install Microsoft Windows Services for Unix (SFU). You only need to install the NFS Client and the User Name Mapping.
Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can map local Windows users to cluster Linux users, if desired.
Once SFU is installed and configured, mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:
mount -o nolock usa-node01:/mapr z:

To map a network drive with the Map Network Drive tool:

Open Start > My Computer.


Select Tools > Map Network Drive.
In the Map Network Drive window, choose an unused drive letter from the Drive drop-down list.
Specify the Folder by browsing for the MapR cluster, or by typing the hostname and directory into the text field.
Browse for the MapR cluster or type the name of the folder to map. This name must follow UNC. Alternatively, click the Browse... button to find the correct folder by browsing available network shares.
Select Reconnect at login to reconnect automatically to the MapR cluster whenever you log into the computer.
Click Finish.

See Accessing Data with NFS for more information.

MapR Client

The MapR client lets you interact with MapR Hadoop directly. With the MapR client, you can submit MapReduce jobs and run hadoop fs and hadoop mfs commands. The MapR client is compatible with the following operating systems:

CentOS 5.5 or above
Mac OS X (Intel)
Red Hat Enterprise Linux 5.5 or above
Ubuntu 9.04 or above
Windows 7

Do not install the client on a cluster node. It is intended for use on a computer that has no other MapR software installed. Do not install other MapR software on a MapR client computer. To run MapR CLI commands, establish an ssh session to a node in the cluster.

To configure the client, you will need the cluster name and the IP addresses and ports of the CLDB nodes on the cluster. The configure.sh configuration script has the following syntax:

Linux —

configure.sh [-N <cluster name>] -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...]

Windows —

server\configure.bat -c -C <CLDB node>[:<port>][,<CLDB node>[:<port>]...]

Linux or Mac Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Windows Example:

server\configure.bat -c -C 10.10.100.1:7222

Installing the MapR Client on CentOS or Red Hat

Change to the root user (or use sudo for the following commands).
Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following contents:

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/v1.2.10/redhat/
enabled=1
gpgcheck=0
protect=1

To install a previous release, see the Release Notes for the correct path to use in the baseurl parameter.

If your connection to the Internet is through a proxy server, you must set the http_proxy environment variable before installation:


http_proxy=http://<host>:<port>
export http_proxy

Remove any previous MapR software. You can use rpm -qa | grep mapr to get a list of installed MapR packages, then type the packages separated by spaces after the rpm -e command. Example:

rpm -qa | grep mapr
rpm -e mapr-fileserver mapr-core

Install the MapR client: yum install mapr-client
Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on Ubuntu

Change to the root user (or use sudo for the following commands).
Add the following line to /etc/apt/sources.list:
deb http://package.mapr.com/releases/v1.2.10/ubuntu/ mapr optional
To install a previous release, see the Release Notes for the correct path to use.
If your connection to the Internet is through a proxy server, add the following lines to /etc/apt.conf:

Acquire {
  Retries "0";
  HTTP {
    Proxy "http://<user>:<password>@<host>:<port>";
  };
};

Remove any previous MapR software. You can use dpkg --list | grep mapr to get a list of installed MapR packages, then type the packages separated by spaces after the dpkg -r command. Example:

dpkg -l | grep mapr
dpkg -r mapr-core mapr-fileserver

Update your Ubuntu repositories. Example:

apt-get update

Install the MapR client: apt-get install mapr-client
Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

Installing the MapR Client on Mac OS X

Download the archive package.mapr.com/releases/v1.2.9/mac/mapr-client-1.2.9.14720.GA-0.x86_64.tar.gz
Open the Terminal application.
Create the directory /opt:
sudo mkdir -p /opt
Extract mapr-client-1.2.9.14720.GA-0.x86_64.tar.gz into the /opt directory. Example:
sudo tar -C /opt -xvf mapr-client-1.2.9.14720.GA-0.x86_64.tar.gz
Run configure.sh to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:
sudo /opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222


Installing the MapR Client on Windows 7

Make sure Java is installed on the computer, and JAVA_HOME is set correctly.
Open the command line.
Create the directory \opt\mapr on your c: drive (or another hard drive of your choosing); either use Windows Explorer, or type the following at the command prompt:
mkdir c:\opt\mapr
Set MAPR_HOME to the directory you created in the previous step. Example:
SET MAPR_HOME=c:\opt\mapr
Navigate to MAPR_HOME:
cd %MAPR_HOME%
Download the correct archive into MAPR_HOME:

On a 64-bit Windows machine, download http://package.mapr.com/releases/v1.2.9/windows/mapr-client-1.2.9.14720GA-0.amd64.zip
On a 32-bit Windows machine, download http://package.mapr.com/releases/v1.2.9/windows/mapr-client-1.2.9.14720GA-0.x86.zip

Extract the archive:
On a 64-bit Windows machine: jar -xvf mapr-client-1.2.9.14720GA-0.amd64.zip
On a 32-bit Windows machine: jar -xvf mapr-client-1.2.9.14720GA-0.x86.zip

Run configure.bat to configure the client, using the -C (uppercase) option to specify the CLDB nodes, and the -c (lowercase) option to specify a client configuration. Example:
server\configure.bat -c -C 10.10.100.1:7222

On the Windows client, you can run MapReduce jobs using the hadoop.bat command the way you would normally use the hadoop command. For example, to list the contents of a directory, instead of hadoop fs -ls you would type: hadoop.bat fs -ls

Before running jobs on the Windows client, set the following properties in %MAPR_HOME%\hadoop\hadoop-<version>\conf\core-site.xml on the Windows machine to match the username, user ID, and group ID that have been set up for you on the cluster:

hadoop.spoofed.user.uid=<uid>
hadoop.spoofed.user.gid=<gid>
hadoop.spoofed.user.username=<username>

You can determine the correct uid and gid values for your username by logging into a cluster node and typing the id command. Example:

$ id
uid=1000(pconrad) gid=1000(pconrad) groups=4(adm),20(dialout),24(cdrom),46(plugdev),105(lpadmin),119(admin),122(sambashare),1000(pconrad)
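
Using that output, the corresponding core-site.xml entries might look like the following (a sketch based on the example user pconrad above; substitute your own uid, gid, and username):

<property>
  <name>hadoop.spoofed.user.uid</name>
  <value>1000</value>
</property>
<property>
  <name>hadoop.spoofed.user.gid</name>
  <value>1000</value>
</property>
<property>
  <name>hadoop.spoofed.user.username</name>
  <value>pconrad</value>
</property>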

On the Windows client, because the native Hadoop library is not present, the hadoop fs -getmerge command is not available.


Uninstalling MapR

To re-purpose machines, you may wish to remove nodes and uninstall MapR software.

Removing Nodes from a Cluster

To remove nodes from a cluster: first uninstall the desired nodes, then run configure.sh on the remaining nodes. Finally, if you are using Ganglia, restart all gmeta and gmon daemons in the cluster.

To uninstall a node:

On each node you want to uninstall, perform the following steps:

Change to the root user (or use sudo for the following commands).
Stop the Warden:
/etc/init.d/mapr-warden stop
If ZooKeeper is installed on the node, stop it:
/etc/init.d/mapr-zookeeper stop
Determine which MapR packages are installed on the node:

dpkg --list | grep mapr (Ubuntu)
rpm -qa | grep mapr (Red Hat or CentOS)

Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples:
apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)

If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run configure.sh on all other nodes in the cluster (see Configuring a Node).

To reconfigure the cluster:

Run the configure.sh script to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.


Working with Multiple Clusters

In order to mirror volumes between clusters, you must use configure.sh to create an additional entry in mapr-clusters.conf on the source cluster, for each cluster to which it will mirror. You can cross-mirror between clusters: to mirror some volumes from cluster A to cluster B and other volumes from cluster B to cluster A, you would set up mapr-clusters.conf as follows:

Entries in mapr-clusters.conf on cluster A nodes:
First line contains name and CLDB servers of cluster A
Second line contains name and CLDB servers of cluster B

Entries in mapr-clusters.conf on cluster B nodes:
First line contains name and CLDB servers of cluster B
Second line contains name and CLDB servers of cluster A
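
For illustration, mapr-clusters.conf on cluster A nodes might then look like the following sketch (hypothetical cluster names and CLDB hosts; each line starts with a cluster name followed by that cluster's CLDB nodes, and the exact field format may vary by MapR version):

clusterA nodea1:7222 nodea2:7222
clusterB nodeb1:7222 nodeb2:7222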

By creating additional entries, you can mirror from one cluster to several others.

Each cluster must already be set up and running, and must have a unique name. All nodes in each cluster should be able to resolve all other nodes (via DNS or entries in /etc/hosts).

To set up multiple clusters:

On each cluster:
Make a note of the cluster name and CLDB nodes (the first line in mapr-clusters.conf)
Make a note of the ZooKeeper nodes (the cldb.zookeeper.servers parameter in cldb.conf)

On each cluster, use configure.sh with the -N option to add entries in mapr-clusters.conf on all nodes, for each cluster to which it will be mirroring. Example:

configure.sh -C <cluster 2 CLDB nodes> -Z <cluster 2 ZooKeeper nodes> -N <cluster 2 name>

On each cluster, restart the mapr-webserver service on all nodes where it is running.


Administration Guide

Welcome to the MapR Administration Guide! This guide is for system administrators tasked with managing MapR clusters. Topics include how to manage data by using volumes; how to monitor the cluster for performance; how to manage users and groups; how to add and remove nodes from the cluster; and more.

The focus of the Administration Guide is managing the nodes and services that make up a cluster. For details of fine-tuning MapR for specific jobs, see the Development Guide. The Administration Guide does not cover the details of installing MapR software on a cluster. See the Installation Guide for details on planning and installing a MapR cluster.

Click on one of the sub-sections below to get started.

Monitoring
Alarms and Notifications
Ganglia
Nagios Integration
Service Metrics

Managing the Cluster
Balancers
Cluster Upgrade
Dial Home
Disks
Nodes
Services
Startup and Shutdown

Managing Data with Volumes
Mirrors
Schedules
Snapshots

Users and Groups
Managing Permissions
Managing Quotas

Troubleshooting
Disaster Recovery
Out of Memory Troubleshooting
Troubleshooting Alarms


Monitoring

This section provides information about monitoring the cluster:

Alarms and Notifications
Ganglia
Nagios Integration
Service Metrics


Alarms and Notifications

On a cluster with an M5 license, MapR raises alarms and sends notifications to alert you to information about the cluster:

Cluster health, including disk failures
Volumes that are under-replicated or over quota
Services not running

You can see any currently raised alarms in the Alarms view of the MapR Control System, or using the alarm list command. For a list of all alarms, see Troubleshooting Alarms.
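
For example, to list raised alarms from a terminal (a sketch; run on a cluster node as a user with appropriate cluster permissions):

maprcli alarm list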

To view cluster alarms using the MapR Control System:

In the Navigation pane, expand the Cluster group and click the Dashboard view.
All alarms for the cluster and its nodes and volumes are displayed in the Alarms pane.

To view node alarms using the MapR Control System:

In the Navigation pane, expand the Alarms group and click the Node Alarms view.

You can also view node alarms in the Node Properties view, the NFS Alarm Status view, and the Alarms pane of the Dashboard view.

To view volume alarms using the MapR Control System:

In the Navigation pane, expand the Alarms group and click the Volume Alarms view.

You can also view volume alarms in the Alarms pane of the Dashboard view.

Notifications

When an alarm is raised, MapR can send an email notification to either or both of the following addresses:

The owner of the cluster, node, volume, or entity for which the alarm was raised (standard notification)
A custom email address for the named alarm

You can set up alarm notifications using the alarm config save command or from the Alarms view in the MapR Control System.

To set up alarm notifications using the MapR Control System:

In the Navigation pane, expand the Alarms group and click the Alarm Notifications view.
Display the Configure Alarm Subscriptions dialog by clicking Alarm Notifications.
For each Alarm:

To send notifications to the owner of the cluster, node, volume, or entity, select the Standard Notification checkbox.
To send notifications to an additional email address, type an email address in the Additional Email Address field.

Click Save to save the configuration changes.


Ganglia

Ganglia is a scalable distributed system monitoring tool that allows remote viewing of live or historical statistics for a cluster. The Ganglia system consists of the following components:

A PHP-based web front end
Ganglia monitoring daemon (gmond): a multi-threaded monitoring daemon
Ganglia meta daemon (gmetad): a multi-threaded aggregation daemon
A few small utility programs

The gmetad daemon aggregates metrics from the gmond instances, storing them in a database. The front end pulls metrics from the database and graphs them. You can aggregate data from multiple clusters by setting up a separate gmetad for each, and then a master gmetad to aggregate data from the others. If you configure Ganglia to monitor multiple clusters, remember to use a separate port for each cluster.

MapR with Ganglia

The CLDB reports metrics about its own load, as well as cluster-wide metrics such as CPU and memory utilization, the number of active FileServer nodes, the number of volumes created, etc. For a complete list of metrics, see Service Metrics.

MapRGangliaContext collects and sends CLDB metrics, FileServer metrics, and cluster-wide metrics to Gmon or Gmeta, depending on the configuration. On the Ganglia front end, these metrics are displayed separately for each FileServer by hostname. The Ganglia monitor only needs to be installed on CLDB nodes to collect all the metrics required for monitoring a MapR cluster. To monitor other services such as HBase and MapReduce, install Gmon on nodes running the services and configure them as you normally would.

The Ganglia properties for the cldb and fileserver contexts are configured in the file $INSTALL_DIR/conf/hadoop-metrics.properties. Any changes to this file require a CLDB restart.
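
One way to apply a change is the warden stop/start pattern used elsewhere in this guide; note that this restarts all MapR services on the node, not only the CLDB:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start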

Installing Ganglia

To install Ganglia on Ubuntu:

On each CLDB node, install ganglia-monitor: sudo apt-get install ganglia-monitor
On the machine where you plan to run the Gmeta daemon, install gmetad: sudo apt-get install gmetad
On the machine where you plan to run the Ganglia front end, install ganglia-webfrontend: sudo apt-get install ganglia-webfrontend

To install Ganglia on Red Hat:

Download the following RPM packages for Ganglia version 3.1 or later:
ganglia-gmond
ganglia-gmetad
ganglia-web

On each CLDB node, install ganglia-monitor: rpm -ivh <ganglia-gmond>
On the machine where you plan to run the Ganglia meta daemon, install gmetad: rpm -ivh <gmetad>
On the machine where you plan to run the Ganglia front end, install ganglia-webfrontend: rpm -ivh <ganglia-web>

For more details about Ganglia configuration and installation, see the Ganglia documentation.

To start sending CLDB metrics to Ganglia:

Make sure the CLDB is configured to send metrics to Ganglia (see Service Metrics).
As root (or using sudo), run the following commands:

maprcli config save -values '"cldb.ganglia.cldb.metrics":"1"'
maprcli config save -values '"cldb.ganglia.fileserver.metrics":"1"'

To stop sending CLDB metrics to Ganglia:

As root (or using sudo), run the following commands:

maprcli config save -values '"cldb.ganglia.cldb.metrics":"0"'
maprcli config save -values '"cldb.ganglia.fileserver.metrics":"0"'


Nagios Integration

Nagios is an open-source cluster monitoring tool. MapR can generate a Nagios Object Definition File that describes the nodes in the cluster and the services running on each. You can generate the file using the MapR Control System or the nagios generate command, then save the file in the proper location in your Nagios environment.

MapR recommends Nagios version 3.3.1 and version 1.4.15 of the plugins.

To generate a Nagios file using the MapR Control System:

In the Navigation pane, click Nagios.
Copy and paste the output, and save as the appropriate Object Definition File in your Nagios environment.

For more information, see the Nagios documentation.


Service Metrics

MapR services produce metrics that can be written to an output file or consumed by Ganglia. The file metrics output is directed by the hadoop-metrics.properties files.

By default, the CLDB and FileServer metrics are sent via unicast to the Ganglia gmon server running on localhost. To send the metrics directly to a Gmeta server, change the cldb.servers property to the hostname of the Gmeta server. To send the metrics to a multicast channel, change the cldb.servers property to the IP address of the multicast channel.
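
For example, to send metrics directly to a Gmeta server on a host named gmeta-host (a hypothetical hostname, using the default port from the example at the end of this section), the relevant line in hadoop-metrics.properties would be:

cldb.servers=gmeta-host:8649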

Metrics Collected

Below are the kinds of metrics that can be collected.

CLDB metrics:
Number of FileServers
Number of Volumes
Number of Containers
Cluster Disk Space Used GB
Cluster Disk Space Available GB
Cluster Disk Capacity GB
Cluster Memory Capacity MB
Cluster Memory Used MB
Cluster Cpu Busy %
Cluster Cpu Total
Number of FS Container Failure Reports
Number of Client Container Failure Reports
Number of FS RW Container Reports
Number of Active Container Reports
Number of FS Volume Reports
Number of FS Register
Number of container lookups
Number of container assign
Number of container corrupt reports
Number of rpc failed
Number of rpc received

FileServer metrics:
FS Disk Used GB
FS Disk Available GB
Cpu Busy %
Memory Total MB
Memory Used MB
Memory Free MB
Network Bytes Received
Network Bytes Sent

Setting Up Service Metrics

To configure metrics for a service:

Edit the appropriate hadoop-metrics.properties file on all CLDB nodes, depending on the service:
For MapR-specific services, edit /opt/mapr/conf/hadoop-metrics.properties
For standard Hadoop services, edit /opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties

In the sections specific to the service:
Un-comment the lines pertaining to the context to which you wish the service to send metrics.
Comment out the lines pertaining to other contexts.

Restart the service.

To enable service metrics:

As root (or using sudo), run the following commands:

maprcli config save -values '"cldb.ganglia.cldb.metrics":"1"'
maprcli config save -values '"cldb.ganglia.fileserver.metrics":"1"'

To disable service metrics:

As root (or using sudo), run the following commands:

maprcli config save -values '"cldb.ganglia.cldb.metrics":"0"'
maprcli config save -values '"cldb.ganglia.fileserver.metrics":"0"'

Example


In the following example, CLDB service metrics will be sent to the Ganglia context:

# CLDB metrics config - Pick one out of null, file or ganglia.
# Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context

# Configuration of the "cldb" context for null
#cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#cldb.period=10

# Configuration of the "cldb" context for file
#cldb.class=org.apache.hadoop.metrics.file.FileContext
#cldb.period=60
#cldb.fileName=/tmp/cldbmetrics.log

# Configuration of the "cldb" context for ganglia
cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
cldb.period=10
cldb.servers=localhost:8649
cldb.spoof=1


Managing the Cluster

This section describes the tools and processes involved in managing a MapR cluster. Topics include upgrading the MapR software version; adding and removing disks and nodes; managing data replication and disk space with balancers; managing the services on a node; managing the topology of a cluster; and more.

Choose a subtopic below for more detail.

Balancers
Cluster Upgrade
Manual Upgrade
Rolling Upgrade
Dial Home
Disks
Nodes
Adding Nodes to a Cluster
Adding Roles
Node Topology
Removing Roles
Services
CLDB Failover
TaskTracker Blacklisting
Startup and Shutdown


Balancers

The disk space balancer and the replication role balancer redistribute data in the MapR storage layer to ensure maximum performance and efficient use of space:

The disk space balancer works to ensure that the percentage of space used on all disks in the node is similar, so that no nodes are overloaded.
The replication role balancer changes the replication roles of cluster containers so that the replication process uses network bandwidth evenly.

To view balancer status:

Use the maprcli dump balancerinfo command to view the space used and free on each storage pool, and the active container moves. Example:

# maprcli dump balancerinfo
usedMB fsid spid percentage outTransitMB inTransitMB capacityMB
209 5567847133641152120 01f8625ba1d15db7004e52b9570a8ff3 1 0 0 15200
209 1009596296559861611 816709672a690c96004e52b95f09b58d 1 0 0 15200

To view balancer configuration values:

Pipe the maprcli config load command through grep. Example:

# maprcli config load -json | grep balancer
"cldb.balancer.disk.max.switches.in.nodes.percentage":"10",
"cldb.balancer.disk.paused":"1",
"cldb.balancer.disk.sleep.interval.sec":"120",
"cldb.balancer.disk.threshold.percentage":"70",
"cldb.balancer.logging":"0",
"cldb.balancer.role.max.switches.in.nodes.percentage":"10",
"cldb.balancer.role.paused":"1",
"cldb.balancer.role.sleep.interval.sec":"900",
"cldb.balancer.startup.interval.sec":"1800",

To set balancer configuration values:

Use the config save command to set the appropriate values. Example:

# maprcli config save -values '"cldb.balancer.disk.max.switches.in.nodes.percentage":"20"'

By default, the balancers are turned off.

To turn on the disk space balancer, use config save to set cldb.balancer.disk.paused to 0.
To turn on the replication role balancer, use config save to set cldb.balancer.role.paused to 0.
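
For example, following the config save pattern shown above (a sketch; verify the exact quoting of the -values argument for your shell and MapR version):

# maprcli config save -values '"cldb.balancer.disk.paused":"0"'
# maprcli config save -values '"cldb.balancer.role.paused":"0"'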

Disk Space Balancer

The disk space balancer attempts to move containers from storage pools that are over 70% full to other storage pools that have lower utilization than average for the cluster. You can view disk usage on all nodes in the Disks view, by clicking Cluster > Nodes in the Navigation pane and then choosing Disks from the dropdown.

Disk Space Balancer Configuration Parameters

Parameter Value Description


cldb.balancer.disk.threshold.percentage 70 Threshold for moving containers out of a given storage pool, expressed as a utilization percentage.

cldb.balancer.disk.paused 1 Specifies whether the disk space balancer runs:

0 - Not paused (normal operation)
1 - Paused (does not perform any container moves)

cldb.balancer.disk.max.switches.in.nodes.percentage 10 This can be used to throttle the disk balancer. If it is set to 10, the balancer will throttle the number of concurrent container moves to 10% of the total nodes in the cluster (minimum 2).

Replication Role Balancer

Each container in a MapR volume is either a master container (part of the original copy of the volume) or an intermediate or tail container (one of the replicas in the replication chain). For a volume's name container (the volume's first container), replication occurs simultaneously from the master to the intermediate and tail containers. For other containers in the volume, replication proceeds from the master to the intermediate container(s) until it reaches the tail. Replication occurs over the network between nodes, often in separate racks. For balanced network bandwidth, every node must have an equal share of masters, intermediates and tails. The replication role balancer switches container roles to achieve this balance.

Replication Role Balancer Configuration Parameters

Parameter Value Description

cldb.balancer.role.paused 1 Specifies whether the role balancer runs:

0 - Not paused (normal operation)
1 - Paused (does not perform any container replication role switches)

cldb.balancer.role.max.switches.in.nodes.percentage 10 This can be used to throttle the role balancer. If it is set to 10, the balancer will throttle the number of concurrent role switches to 10% of the total nodes in the cluster (minimum 2).


Cluster Upgrade

The following sections provide information about upgrading the cluster:

Rolling Upgrade provides information about automatically applying MapR software upgrades to a cluster.
Manual Upgrade provides information about stopping all nodes, installing updated packages manually, and restarting all nodes.


Manual Upgrade

Upgrading the MapR cluster manually entails stopping all nodes, installing updated packages, and restarting the nodes. Here are a few tips:

Make sure to add the correct repository directory for the version of MapR software you wish to install.
Work on all nodes at the same time; that is, stop the warden on all nodes before proceeding to the next step, and so on.
Use the procedure corresponding to the operating system on your cluster.

When performing a manual upgrade to MapR 1.2, it is necessary to run configure.sh on any nodes that are running the HBase region server or HBase master.

If you are upgrading a node that is running the NFS service: Use the mount command to determine whether the node has mounted MapR-FS through its own physical IP (not a VIP). If so, unmount any such mount points before beginning the upgrade process.

If you are upgrading a node that is running the HBase RegionServer and the warden stop command does not return after five minutes, kill the HBase RegionServer process manually:

Determine the process ID of the HBase RegionServer:

cat /opt/mapr/logs/hbase-root-regionserver.pid

Kill the HBase RegionServer using the following command, substituting the process ID from the previous step for the <PID> placeholder:

kill -9 <PID>

CentOS and Red Hat

Perform all three procedures:

Upgrading the cluster - installing the new versions of the packages
Setting the version - manually updating the software configuration to reflect the correct version
Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To upgrade the cluster:

On each node, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Add the MapR yum repository for the latest version of MapR software, removing any old versions. For more information, see Adding the MapR Repository.
4. Stop the warden:

/etc/init.d/mapr-warden stop

5. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

6. Upgrade the MapR packages with the following command:

yum upgrade 'mapr-*'


7. If ZooKeeper is installed on the node, start it:

/etc/init.d/mapr-zookeeper start

8. Start the warden:

/etc/init.d/mapr-warden start

To update the configuration files:

1. If you have made any changes to mapred-site.xml or core-site.xml in the /opt/mapr/hadoop/hadoop-<version>/conf directory, then make the same changes to the same files in the /opt/mapr/hadoop/hadoop-<version>/conf.new directory.
2. Rename /opt/mapr/hadoop/hadoop-<version>/conf to /opt/mapr/hadoop/hadoop-<version>/conf.old to deactivate it.
3. Rename /opt/mapr/hadoop/hadoop-<version>/conf.new to /opt/mapr/hadoop/hadoop-<version>/conf to activate it as the configuration directory. A shell sketch of steps 2 and 3 follows.
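For example, assuming a hypothetical Hadoop directory named hadoop-0.20.2 (substitute your actual version), steps 2 and 3 might look like this:

mv /opt/mapr/hadoop/hadoop-0.20.2/conf /opt/mapr/hadoop/hadoop-0.20.2/conf.old
mv /opt/mapr/hadoop/hadoop-0.20.2/conf.new /opt/mapr/hadoop/hadoop-0.20.2/conf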

Ubuntu

Perform all three procedures:

Upgrading the cluster - installing the new versions of the packages
Setting the version - manually updating the software configuration to reflect the correct version
Updating the configuration - switching to the new versions of the configuration files, and preserving any custom settings

To upgrade the cluster:

The easiest way to upgrade all MapR packages on Ubuntu is to temporarily move the normal sources.list and replace it with a special sources.list that specifies only the MapR software repository, then use apt-get upgrade to upgrade all packages from the special sources.list file. On each node, perform the following steps:

1. Change to the root user or use sudo for the following commands.
2. Make sure the MapR software is correctly installed and configured.
3. Rename the normal /etc/apt/sources.list to prevent apt-get upgrade from reading packages from other repositories:

mv /etc/apt/sources.list /etc/apt/sources.list.orig

4. Create a new /etc/apt/sources.list file and add the MapR apt-get repository for the latest version of MapR software, removing any old versions. For more information, see Adding the MapR Repository.
5. Stop the warden:

/etc/init.d/mapr-warden stop

6. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

7. Clear the APT cache:

apt-get clean

8. Update the list of available packages:

apt-get update

9. Upgrade all MapR packages:

apt-get upgrade


10. Rename the special /etc/apt/sources.list and restore the original:

mv /etc/apt/sources.list /etc/apt/sources.list.mapr
mv /etc/apt/sources.list.orig /etc/apt/sources.list

11. If ZooKeeper is installed on the node, start it:

/etc/init.d/mapr-zookeeper start

12. Start the warden:

/etc/init.d/mapr-warden start

Setting the Version

After completing the upgrade on all nodes, use the following steps on any node to set the correct version:

1. Check the software version by looking at the contents of the MapRBuildVersion file. Example:

$ cat /opt/mapr/MapRBuildVersion

2. Set the version accordingly using the config save command. Example:

maprcli config save -values mapr.targetversion:"1.2.10GA"

Updating the Configuration

1. If you have made any changes to mapred-site.xml or core-site.xml in the /opt/mapr/hadoop/hadoop-<version>/conf directory, then make the same changes to the same files in the /opt/mapr/hadoop/hadoop-<version>/conf.new directory.
2. Rename /opt/mapr/hadoop/hadoop-<version>/conf to /opt/mapr/hadoop/hadoop-<version>/conf.old to deactivate it.
3. Rename /opt/mapr/hadoop/hadoop-<version>/conf.new to /opt/mapr/hadoop/hadoop-<version>/conf to activate it as the configuration directory.


Rolling Upgrade

A rolling upgrade installs the latest version of MapR software on all nodes in the cluster. You perform a rolling upgrade by running the rollingupgrade.sh script from a node in the cluster. The script upgrades all packages on each node, logging output to the rolling upgrade log (/opt/mapr/logs/rollingupgrade.log). You must specify either a directory containing packages (using the -p option) or a version to fetch from the MapR repository (using the -v option). Here are a few tips:

If you specify a local directory with the -p option, you must either ensure that the same directory containing the packages exists on all the nodes, or use the -x option to copy packages out to each node via SCP automatically (requires the -s option). If you use the -x option, the upgrade process copies the packages from the directory specified with -p into the same directory path on each node.
In a multi-cluster setting, use -c to specify which cluster to upgrade. If -c is not specified, the default cluster is upgraded.
When specifying the version with the -v parameter, use the x.y.z format to specify the major, minor, and revision numbers of the target version. Example: 1.2.10
The rpmrebuild package (Red Hat) or dpkg-repack package (Ubuntu) enables automatic rollback if the upgrade fails. You can also accomplish this, without installing the package, by running rollingupgrade.sh with the -n option.

There are two ways to perform a rolling upgrade:

Via SSH - If keyless SSH is set up between all nodes, use the -s option to automatically upgrade all nodes without user intervention.
Node by node - If SSH is not available, the script prepares the cluster for upgrade and guides the user through upgrading each node. In a node-by-node installation, you must individually run the commands to upgrade each node when instructed by the rollingupgrade.sh script.

Before upgrading the cluster, make sure that the following packages are installed on the appropriate nodes:

On all Red Hat and CentOS nodes, rpmrebuild 2.4 or higher
On all Ubuntu nodes, dpkg-repack

To determine whether or not the appropriate package is installed on each node, run the following command to see a list of all installed versions of the package:

On Red Hat and CentOS nodes:

rpm -qa | grep rpmrebuild

On Ubuntu nodes:

dpkg -l | grep dpkg-repack

Requirements

On the computer from which you will be starting the upgrade, perform the following steps:

1. Change to the root user (or use sudo for the following commands).
2. If you are starting the upgrade from a computer that is not a MapR client or a MapR cluster node, you must add the MapR repository (CentOS or Red Hat, or Ubuntu) and install mapr-core:

CentOS or Red Hat: yum install mapr-core
Ubuntu: apt-get install mapr-core

3. Run configure.sh, using -C to specify the cluster CLDB nodes and -Z to specify the cluster ZooKeeper nodes. Example:

/opt/mapr/server/configure.sh -C 10.10.100.1,10.10.100.2,10.10.100.3 -Z 10.10.100.1,10.10.100.2,10.10.100.3

4. To enable a fully automatic rolling upgrade, ensure passwordless SSH is enabled to all nodes for the root user, from the computer on which the upgrade will be started.

On all nodes and the computer from which you will be starting the upgrade, perform the following steps:

1. Change to the root user (or use sudo for the following commands).
2. Add the MapR software repository (CentOS or Red Hat, or Ubuntu).
3. Install the rolling upgrade scripts:

CentOS or Red Hat: yum install mapr-upgrade
Ubuntu: apt-get install mapr-upgrade

4. Install the following packages to enable automatic rollback:

CentOS or Red Hat: yum install rpmrebuild
Ubuntu: apt-get install dpkg-repack


5. If you are planning to upgrade from downloaded packages instead of the repository, prepare a directory containing the packages to upgrade, using the -p option with rollingupgrade.sh. This directory should reside at the same absolute path on each node. For the download location, see Local Packages - Ubuntu or Local Packages - Red Hat.

Upgrading the Cluster via SSH

On the node from which you will be starting the upgrade, issue the rollingupgrade.sh command as root (or with sudo) to upgrade the cluster:

If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for the <directory> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -s -p <directory>

If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the <version> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -s -v <version>
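For example, to upgrade all nodes over SSH to version 1.2.10 (the release this document covers), the command would look like:

/opt/upgrade-mapr/rollingupgrade.sh -s -v 1.2.10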

Upgrading the Cluster Node by Node

On the node from which you will be starting the upgrade, use the rollingupgrade.sh command as root (or with sudo) to upgrade the cluster:

1. Start the upgrade:
If you have prepared a directory of packages to upgrade, issue the following command, substituting the path to the directory for the <directory> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -p <directory>

If you are upgrading from the MapR software repository, issue the following command, substituting the version (x.y.z) for the <version> placeholder:

/opt/upgrade-mapr/rollingupgrade.sh -v <version>

2. When prompted, run singlenodeupgrade.sh on all nodes other than the master CLDB node, following the on-screen instructions.
3. When prompted, run singlenodeupgrade.sh on the master CLDB node, following the on-screen instructions.


Dial Home

MapR provides a service called Dial Home, which automatically collects certain metrics about the cluster for use by support engineers and to help us improve and evolve our product. When you first install MapR, you are presented with the option to enable or disable Dial Home. We recommend enabling it. You can enable or disable Dial Home later, using the dialhome enable and dialhome disable commands.
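For example, assuming the command is issued through maprcli like the other commands in this guide:

maprcli dialhome enable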


Disks

MapR-FS groups disks into storage pools, usually made up of two or three disks.

When adding disks to MapR-FS, it is a good idea to add at least two or three at a time so that MapR can create properly-sized storage pools. Each node in a MapR cluster can support up to 36 storage pools.

When you remove a disk from MapR-FS, any other disks in the storage pool are also removed from MapR-FS automatically; the disk you removed, as well as the others in its storage pool, are then available to be added to MapR-FS again. You can either replace the disk and re-add it along with the other disks that were in the storage pool, or just re-add the other disks if you do not plan to replace the disk you removed.

MapR maintains a list of disks used by MapR-FS in a file called disktab on each node.

The following sections provide procedures for working with disks:

Adding Disks - adding disks for use by MapR-FS
Removing Disks - removing disks from use by MapR-FS
Handling Disk Failure - replacing a disk in case of failure
Tolerating Slow Disks - increasing the disk timeout to handle slow disks

Before removing or replacing disks, make sure the Replication Alarm (VOLUME_ALARM_DATA_UNDER_REPLICATED) and Data Alarm (VOLUME_ALARM_DATA_UNAVAILABLE) are not raised. These alarms can indicate potential or actual data loss! If either alarm is raised, it may be necessary to attempt repair using fsck before removing or replacing disks.

Adding Disks

You can add one or more available disks to MapR-FS using the disk add command or the MapR Control System. In both cases, MapR automatically takes care of formatting the disks and creating storage pools. A command-line sketch appears after the note below.

If you are running MapR 1.2.2 or earlier, do not use the disk add command or the MapR Control System to add disks to MapR-FS. You must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the disk add command):

1. Use the MapR Control System to remove the failed disk. All other disks in the same storage pool are removed at the same time. Make a note of which disks have been removed.
2. Create a text file /tmp/disks.txt containing a list of the disks you just removed. See Setting Up Disks for MapR.
3. Add the disks to MapR-FS by typing the following command (as root or with sudo):

/opt/mapr/server/disksetup -F /tmp/disks.txt
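On MapR 1.2.3 or later, a minimal disk add sketch might look like the following; the hostname and device names are illustrative, and the -host/-disks syntax is the same one shown under Handling Disk Failure below:

maprcli disk add -host 10.10.100.1 -disks /dev/sdb,/dev/sdc,/dev/sdd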

To add disks using the MapR Control System:

1. Add physical disks to the node or nodes according to the correct hardware procedure.
2. In the Navigation pane, expand the Cluster group and click the Nodes view.
3. Click the name of the node on which you wish to add disks.
4. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to add.
5. Click Add Disks to MapR-FS to add the disks. Properly-sized storage pools are allocated automatically.

Removing Disks

You can remove one or more disks from MapR-FS using the disk remove command or the MapR Control System. When you remove disks from MapR-FS, any other disks in the same storage pool are also removed from MapR-FS and become available (not in use, and eligible to be re-added to MapR-FS).

If you are removing and replacing failed disks, you can install the replacements, then re-add the replacement disks and the other disks that were in the same storage pool(s) as the failed disks.
If you are removing disks but not replacing them, you can just re-add the other disks that were in the same storage pool(s) as the failed disks.

To remove disks using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Click the name of the node from which you wish to remove disks.
3. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to remove.
4. Click Remove Disks from MapR-FS to remove the disks from MapR-FS.
5. Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are taken offline and marked as available (not in use by MapR).


6. Remove the physical disks from the node or nodes according to the correct hardware procedure.

Handling Disk Failure

When disks fail, MapR raises an alarm and identifies which disks on which nodes have failed.

If a disk failure alarm is raised, check the Failure Reason field in /opt/mapr/logs/faileddisk.log to determine the reason for failure. There are two failure cases that may not require disk replacement:

Failure Reason: Timeout - try increasing the mfs.io.disk.timeout in mfs.conf.
Failure Reason: Disk GUID mismatch - if a node has restarted, the drive labels (sda, etc.) can be reassigned by the operating system, and will no longer match the entries in disktab. Edit disktab according to the instructions in the log to repair the problem.

To replace disks using the MapR command-line interface:

1. On the node with the failed disk(s), look at the Disk entries in /opt/mapr/logs/faileddisk.log to determine which disk or disks have failed.
2. Look at the Failure Reason entries to determine whether the disk(s) should be replaced.
3. Use the disk remove command to remove the disk(s). Use the following syntax, substituting the hostname or IP address for <host> and a list of disks for <disks>:

maprcli disk remove -host <host> -disks <disks>

4. Wait several minutes while the removal process completes. Note any disks that appear in the output from fdisk -l but not in the disktab file (the disks from the same storage pool(s) as the failed disk(s), which have been removed from MapR-FS in the previous step).
5. Replace the failed disks on the node or nodes according to the correct hardware procedure.
6. Use the disk add command to add the replacement disk(s) and the others that were in the same storage pool(s). Use the following syntax, substituting the hostname or IP address for <host> and a list of disks for <disks>:

maprcli disk add -host <host> -disks <disks>

Properly-sized storage pools are allocated automatically.

To replace disks using the MapR Control System:

1. Identify the failed disk or disks:
a. In the Navigation pane, expand the Cluster group and click the Nodes view.
b. Click the name of the node on which you wish to replace disks, and look in the MapR-FS and Available Disks pane.
2. Remove the failed disk or disks from MapR-FS:
a. In the MapR-FS and Available Disks pane, select the checkboxes beside the failed disks.
b. Click Remove Disks from MapR-FS to remove the disks from MapR-FS.
c. Wait several minutes while the removal process completes. After you remove the disks, any other disks in the same storage pools are taken offline and marked as available (not in use by MapR).
3. Replace the failed disks on the node or nodes according to the correct hardware procedure.
4. Add the replacement and available disks to MapR-FS:
a. In the Navigation pane, expand the Cluster group and click the Nodes view.
b. Click the name of the node on which you replaced the disks.
c. In the MapR-FS and Available Disks pane, select the checkboxes beside the disks you wish to add.
d. Click Add Disks to MapR-FS to add the disks. Properly-sized storage pools are allocated automatically.

Tolerating Slow Disks

The mfs.io.disk.timeout parameter in mfs.conf determines how long MapR waits for a disk to respond before assuming it has failed. If healthy disks are too slow, and are erroneously marked as failed, you can increase the value of this parameter.
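For example, a sketch of the relevant line in mfs.conf; the file path and the timeout value are illustrative assumptions, so check your node's actual mfs.conf location and current setting:

# /opt/mapr/conf/mfs.conf -- raise the disk timeout (seconds) to tolerate slow disks
mfs.io.disk.timeout=60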


Nodes

This section provides information about managing nodes in the cluster:

Viewing a List of Nodes - displaying all the nodes recognized by the MapR cluster
Adding a Node - installing a new node on the cluster (requires fc or a permission)
Managing Services - starting or stopping services on a node (requires ss, fc, or a permission)
Reformatting a Node - reformatting a node's disks
Removing a Node - removing a node temporarily for maintenance (requires fc or a permission)
Decommissioning a Node - permanently uninstalling a node (requires fc or a permission)
Reconfiguring a Node - installing, upgrading, or removing hardware or software, or changing roles
Renaming a Node - changing a node's hostname

Viewing a List of Nodes

You can view all nodes using the node list command, or view them in the MapR Control System using the following procedure.

To view all nodes using the MapR Control System:

In the Navigation pane, expand the Cluster group and click the Nodes view.

Adding a Node

To Add Nodes to a Cluster

1. PREPARE all nodes, making sure they meet the hardware, software, and configuration requirements.
2. PLAN which services to run on the new nodes.
3. INSTALL MapR Software:
On all new nodes, ADD the MapR Repository.
On each new node, INSTALL the planned MapR services.
On all new nodes, RUN configure.sh.
On all new nodes, FORMAT disks for use by MapR.
If any configuration files on your existing cluster's nodes have been modified (for example, warden.conf or mapred-site.xml), replace the default configuration files on all new nodes with the appropriate modified files.
4. Start ZooKeeper on all new nodes that have ZooKeeper installed:

/etc/init.d/mapr-zookeeper start

5. Start the warden on all new nodes:

/etc/init.d/mapr-warden start

6. If any of the new nodes are CLDB and/or ZooKeeper nodes, RUN configure.sh on all new and existing nodes in the cluster, specifying all CLDB and ZooKeeper nodes.
7. SET UP node topology for the new nodes.
8. On any new nodes running NFS, SET UP NFS for HA.

Managing Services

You can manage node services using the node services command, or in the MapR Control System using the following procedure.

To manage node services using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside the node or nodes whose services you wish to manage.
3. Click the Manage Services button to display the Manage Node Services dialog.
4. For each service you wish to start or stop, select the appropriate option from the corresponding drop-down menu.
5. Click Change Node to start and stop the services according to your selections.

You can also display the Manage Node Services dialog by clicking Manage Services in the Node Properties view.

Reformatting a Node

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:


/etc/init.d/mapr-warden stop

3. Remove the disktab file:

rm /opt/mapr/conf/disktab

4. Create a text file /tmp/disks.txt that lists all the disks and partitions to format for use by MapR (a sketch follows this procedure). See Setting Up Disks for MapR.
5. Use disksetup to re-format the disks:

disksetup -F /tmp/disks.txt

6. Start the Warden:

/etc/init.d/mapr-warden start
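A sketch of a /tmp/disks.txt file; the device names are illustrative, and the exact format (including how disks are grouped per line) is described in Setting Up Disks for MapR:

/dev/sdb
/dev/sdc
/dev/sdd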

Removing a Node

You can remove a node using the node remove command, or in the MapR Control System using the following procedure. Removing a node detaches the node from the cluster, but does not remove the MapR software from the node.

To remove a node using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside the node or nodes you wish to remove.
3. Click Manage Services and stop all services on the node.
4. Wait 5 minutes. The Remove button becomes active.
5. Click the Remove button to display the Remove Node dialog.
6. Click Remove Node to remove the node.

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

You can also remove a node by clicking Remove Node in the Node Properties view.

Decommissioning a Node

Use the following procedures to remove a node and uninstall the MapR software. This procedure detaches the node from the cluster and removes the MapR packages, log files, and configuration files, but does not format the disks.

Before Decommissioning a Node
Make sure any data on the node is replicated and any needed services are running elsewhere. For example, if decommissioning the node would result in too few instances of the CLDB, start CLDB on another node beforehand; if you are decommissioning a ZooKeeper node, make sure you have enough ZooKeeper instances to meet a quorum after the node is removed. See Planning the Deployment for recommendations.

To decommission a node permanently:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:

/etc/init.d/mapr-warden stop

3. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

4. Determine which MapR packages are installed on the node:

dpkg --list | grep mapr (Ubuntu)
rpm -qa | grep mapr (Red Hat or CentOS)

5. Remove the packages by issuing the appropriate command for the operating system, followed by the list of services. Examples:

apt-get purge mapr-core mapr-cldb mapr-fileserver (Ubuntu)
yum erase mapr-core mapr-cldb mapr-fileserver (Red Hat or CentOS)

6. If the node you have decommissioned is a CLDB node or a ZooKeeper node, then run configure.sh on all other nodes in the cluster (see Configuring a Node).

If you are using Ganglia, restart all gmeta and gmon daemons in the cluster. See Ganglia.

Reconfiguring a Node

You can add, upgrade, or remove services on a node to perform a manual software upgrade or to change the roles a node serves. This procedure consists of the following steps:

Stopping the Node


Formatting the Disks (optional)
Installing or Removing Software or Hardware
Configuring the Node
Starting the Node

This procedure is designed to make changes to existing MapR software on a machine that has already been set up as a MapR cluster node. If you need to install software for the first time on a machine to create a new node, please see Adding a Node instead.

Stopping a Node

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:

/etc/init.d/mapr-warden stop

3. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

Installing or Removing Software or Hardware

Before installing or removing software or hardware, stop the node using the procedure described in Stopping the Node.

Once the node is stopped, you can add, upgrade, or remove software or hardware. At some point after adding or removing services, it is recommended to restart the warden to re-optimize memory allocation among all the services on the node. It is not crucial to perform this step immediately; you can restart the warden at a time when the cluster is not busy.

To add or remove individual MapR packages, use the standard package management commands for your Linux distribution:

apt-get (Ubuntu)
yum (Red Hat or CentOS)

For information about the packages to install, see Planning the Deployment.

To add roles to an existing node:

1. Install the packages corresponding to the new roles using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have added the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.
4. The warden picks up the new configuration and automatically starts the new services. If you are adding the CLDB role, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

If you are not adding the CLDB role, you can wait until a convenient time to restart the warden.

To remove roles from an existing node:

1. Purge the packages corresponding to the roles using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have removed the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.

The warden picks up the new configuration automatically. When it is convenient, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

Setting Up a Node

Formatting the Disks

The disksetup script removes all data from the specified disks. Make sure you specify the disks correctly, and that any data you wish to keep has been backed up elsewhere. Before following this procedure, make sure you have backed up any data you wish to keep.

1. Change to the root user (or use sudo for the following command).
2. Run disksetup, specifying the disk list file.


Example:

/opt/mapr/server/disksetup -F /tmp/disks.txt

Configuring the Node

Run the configure.sh script to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Starting the Node

1. If ZooKeeper is installed on the node, start it:

/etc/init.d/mapr-zookeeper start

2. Start the Warden:

/etc/init.d/mapr-warden start

Renaming a Node

To rename a node:

1. Stop the warden on the node. Example:

/etc/init.d/mapr-warden stop

2. If the node is a ZooKeeper node, stop ZooKeeper on the node. Example:

/etc/init.d/mapr-zookeeper stop

3. Rename the host:
On Red Hat or CentOS, edit the HOSTNAME parameter in the /etc/sysconfig/network file and restart the xinetd service or reboot the node.
On Ubuntu, change the old hostname to the new hostname in the /etc/hostname and /etc/hosts files.
4. If the node is a ZooKeeper node or a CLDB node, run configure.sh with a list of CLDB and ZooKeeper nodes. See configure.sh.
5. If the node is a ZooKeeper node, start ZooKeeper on the node. Example:

/etc/init.d/mapr-zookeeper start

6. Start the warden on the node. Example:

/etc/init.d/mapr-warden start



Node Topology

Topology tells MapR about the locations of nodes and racks in the cluster. Topology is important, because it determines where MapR places replicated copies of data. If you define the cluster topology properly, MapR scatters replication on separate racks so that your data remains available in the event an entire rack fails. Cluster topology is defined by specifying a topology path for each node in the cluster. The paths group nodes by rack or switch, depending on how the physical cluster is arranged and how you want MapR to place replicated data.

Topology paths can be as simple or complex as needed to correspond to your cluster layout. In a simple cluster, each topology path might consist of the rack only (e.g. /rack-1). In a deployment consisting of multiple large datacenters, each topology path can be much longer (e.g. /europe/uk/london/datacenter2/room4/row22/rack5/). MapR uses topology paths to spread out replicated copies of data, placing each copy on a separate path. By setting each path to correspond to a physical rack, you can ensure that replicated data is distributed across racks to improve fault tolerance.

After you have defined node topology for the nodes in your cluster, you can use volume topology to place volumes on specific racks, nodes, or groups of nodes. See Setting Volume Topology.

Setting Node Topology Manually

You can specify a topology path for one or more nodes using the node topo command, or in the MapR Control System using the following procedure.

To set node topology using the MapR Control System:

1. In the Navigation pane, expand the Cluster group and click the Nodes view.
2. Select the checkbox beside each node whose topology you wish to set.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. Set the path in the New Path field:
To define a new path, type a topology path. Topology paths must begin with a forward slash ('/').
To use a path you have already defined, select it from the dropdown.
5. Click Move Node to set the new topology.

Setting Node Topology with a Script

If the cluster is large, it is more convenient to set the topology mapping using a text file or a script that specifies the topology. Each line of the text file (or the output from the script) specifies a single node and its full topology path, in the following format:

<ip or hostname> <topology>

The text file or script must be specified (and available) on the local filesystem on all CLDB nodes:

To set topology with a text file, set net.topology.table.file.name in /opt/mapr/conf/cldb.conf to the text file name. An illustrative text file appears below.
To set topology with a script, set net.topology.script.file.name in /opt/mapr/conf/cldb.conf to the script file name.

If both are specified, the script is used and the text file is ignored.
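A sketch of a topology text file, following the <ip or hostname> <topology> format above; the node addresses and rack paths are illustrative:

10.10.50.252 /rack-1
10.10.50.253 /rack-2
10.10.50.254 /rack-2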


Services

Viewing Services on the Cluster

You can view services on the cluster using the dashboard info command, or using the MapR Control System. In the MapR Control System, the running services on the cluster are displayed in the Services pane of the Dashboard.

To view the running services on the cluster using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster pane and click Dashboard.

Viewing Services on a Node

You can view services on a single node using the service list command, or using the MapR Control System. In the MapR Control System, the running services on a node are displayed in the Node Properties view.

To view the running services on a node using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.

Starting Services

You can start services using the node services command, or using the MapR Control System.

To start specific services on a node using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.
4. Click the checkbox next to each service you would like to start, and click Start Service.

Stopping Services

You can stop services using the node services command, or using the MapR Control System.

To stop specific services on a node using the MapR Control System:

1. Log on to the MapR Control System.
2. In the Navigation pane, expand the Cluster pane and click Nodes.
3. Click the hostname of the node you would like to view. The services are displayed in the Manage Node Services pane.
4. Click the checkbox next to each service you would like to stop, and click Stop Service.

Adding Services

Services determine which roles a node fulfills. You can view a list of the roles configured for a given node by listing the /opt/mapr/roles directory on the node. To add roles to a node, you must install the corresponding services.

To add roles to an existing node:

1. Install the packages corresponding to the new roles using apt-get or yum.
2. Run configure.sh with a list of the CLDB nodes and ZooKeeper nodes in the cluster.
3. If you have added the CLDB or ZooKeeper role, run configure.sh on all nodes in the cluster.
4. The warden picks up the new configuration and automatically starts the new services. If you are adding the CLDB role, restart the warden:

# /etc/init.d/mapr-warden stop
# /etc/init.d/mapr-warden start

If you are not adding the CLDB role, you can wait until a convenient time to restart the warden.


CLDB Failover

The CLDB automatically replicates its data to other nodes in the cluster, preserving at least two (and generally three) copies of the CLDB data. If the CLDB process dies, it is automatically restarted on the node. All jobs and processes wait for the CLDB to return, and resume from where they left off, with no data or job loss.

If the node itself fails, the CLDB data is still safe, and the cluster can continue normally as soon as the CLDB is started on another node. In an M5-licensed cluster, a failed CLDB node automatically fails over to another CLDB node without user intervention and without data loss. It is possible to recover from a failed CLDB node on an M3 cluster, but the procedure is somewhat different.

Recovering from a Failed CLDB Node on an M3 Cluster

To recover from a failed CLDB node, perform the steps listed below:

1. Restore ZooKeeper - if necessary, install ZooKeeper on an additional node.
2. Locate the CLDB data - locate the nodes where replicates of CLDB data are stored, and choose one to serve as the new CLDB node.
3. Stop the selected node - stop the node you have chosen, to prepare for installing the CLDB service.
4. Install the CLDB on the selected node - install the CLDB service on the new CLDB node.
5. Configure the selected node - run configure.sh to inform the CLDB node where the CLDB and ZooKeeper services are running.
6. Start the selected node - start the new CLDB node.
7. Restart all nodes - stop each node in the cluster, run configure.sh on it, and start it.

After the CLDB restarts, there is a 15-minute delay before replication resumes, in order to allow all nodes to register and heartbeat. This delay can be configured using the config save command to set the cldb.replication.manager.start.mins parameter.
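For example, a sketch that shortens the delay to 10 minutes; the value is illustrative, and the value-passing style mirrors this document's other config save example:

maprcli config save -values cldb.replication.manager.start.mins:"10"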

Restore ZooKeeper

If the CLDB node that failed was also running ZooKeeper, install ZooKeeper on another node to maintain the minimum required number of ZooKeeper nodes.

Locate the CLDB Data

Perform the following steps on any cluster node:

1. Login as root or use sudo for the following commands.
2. Issue the dump cldbnodes command, passing the ZooKeeper connect string (an illustrative invocation appears below), to determine which nodes contain the CLDB data:

maprcli dump cldbnodes -zkconnect <ZooKeeper connect string> -json

In the output, the nodes containing CLDB data are listed in the valid parameter.

Choose one of the nodes from the dump cldbnodes output, and perform the procedures listed below on it to install a new CLDB.
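The ZooKeeper connect string is a comma-separated list of host:port pairs, using the default ZooKeeper port 5181 noted elsewhere in this guide. An illustrative example (hosts are placeholders):

maprcli dump cldbnodes -zkconnect 10.10.100.1:5181,10.10.100.2:5181,10.10.100.3:5181 -json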

Stop the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:

/etc/init.d/mapr-warden stop

3. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

Install the CLDB on the Selected Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. Login as root or use sudo for the following commands.
2. Install the CLDB service on the node:

RHEL/CentOS: yum install mapr-cldb
Ubuntu: apt-get install mapr-cldb

3. Wait until the failover delay expires. If you try to start the CLDB before the failover delay expires, the following message appears:

CLDB HA check failed: not licensed, failover denied: elapsed time since last failure=<time in minutes> minutes

Configure the Selected Node


Perform the following steps on the node you have selected for installation of the CLDB:

Run the configure.sh script to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Start the Node

Perform the following steps on the node you have selected for installation of the CLDB:

1. If ZooKeeper is installed on the node, start it:

/etc/init.d/mapr-zookeeper start

2. Start the Warden:

/etc/init.d/mapr-warden start

Restart All Nodes

On all nodes in the cluster, perform the following procedures:

Stop the node:

1. Change to the root user (or use sudo for the following commands).
2. Stop the Warden:

/etc/init.d/mapr-warden stop

3. If ZooKeeper is installed on the node, stop it:

/etc/init.d/mapr-zookeeper stop

Configure the node with the new CLDB and ZooKeeper addresses:

Run the configure.sh script to create /opt/mapr/conf/mapr-clusters.conf and update the corresponding *.conf and *.xml files. Before performing this step, make sure you have a list of the hostnames of the CLDB and ZooKeeper nodes. Optionally, you can specify the ports for the CLDB and ZooKeeper nodes as well. If you do not specify them, the default ports are:

CLDB – 7222
ZooKeeper – 5181

The configure.sh script takes an optional cluster name and log file, and comma-separated lists of CLDB and ZooKeeper host names or IP addresses (and optionally ports), using the following syntax:

/opt/mapr/server/configure.sh -C <host>[:<port>][,<host>[:<port>]...] -Z <host>[:<port>][,<host>[:<port>]...] [-L <logfile>] [-N <cluster name>]

Example:

/opt/mapr/server/configure.sh -C r1n1.sj.us:7222,r3n1.sj.us:7222,r5n1.sj.us:7222 -Z r1n1.sj.us:5181,r2n1.sj.us:5181,r3n1.sj.us:5181,r4n1.sj.us:5181,r5n1.sj.us:5181 -N MyCluster

If you have not chosen a cluster name, you can run configure.sh again later to rename the cluster.

Start the node:

1. If ZooKeeper is installed on the node, start it:

/etc/init.d/mapr-zookeeper start


2. Start the Warden:

/etc/init.d/mapr-warden start


TaskTracker Blacklisting

In the event that a TaskTracker is not performing properly, it can be blacklisted so that no jobs will be scheduled to run on it. There are two types of TaskTracker blacklisting:

Per-job blacklisting, which prevents scheduling new tasks from a particular job
Cluster-wide blacklisting, which prevents scheduling new tasks from all jobs

Per-Job Blacklisting

The mapred.max.tracker.failures configuration value in mapred-site.xml specifies a number of task failures in a specific job after which the TaskTracker is blacklisted for that job. The TaskTracker can still accept tasks from other jobs, as long as it is not blacklisted cluster-wide (see below).

A job can only blacklist up to 25% of TaskTrackers in the cluster.
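For example, a sketch of the corresponding mapred-site.xml property; the value shown is illustrative:

<property>
  <name>mapred.max.tracker.failures</name>
  <value>4</value>
</property>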

Cluster-Wide Blacklisting

A TaskTracker can be blacklisted cluster-wide for any of the following reasons:

The number of blacklists from successful jobs (the fault count) exceeds mapred.max.tracker.blacklists
The TaskTracker has been manually blacklisted using hadoop job -blacklist-tracker <host>
The status of the TaskTracker (as reported by a user-provided health-check script) is not healthy

If a TaskTracker is blacklisted, any currently running tasks are allowed to finish, but no further tasks are scheduled. If a TaskTracker has been blacklisted due to mapred.max.tracker.blacklists or using the hadoop job -blacklist-tracker <host> command, un-blacklisting requires a TaskTracker restart.

Only 50% of the TaskTrackers in a cluster can be blacklisted at any one time.

After 24 hours, the TaskTracker is automatically removed from the blacklist and can accept jobs again.

Blacklisting a TaskTracker Manually

To blacklist a TaskTracker manually, run the following command as the administrative user:

hadoop job -blacklist-tracker <hostname>

Manually blacklisting a TaskTracker prevents additional tasks from being scheduled on the TaskTracker. Any currently running tasks are allowed to finish.

Un-blacklisting a TaskTracker Manually

If a TaskTracker is blacklisted per job, you can un-blacklist it by running the following command as the administrative user:

hadoop job -unblacklist <jobid> <hostname>

If a TaskTracker has been blacklisted cluster-wide due to mapred.max.tracker.blacklists or using the hadoop job -blacklist-tracker <host> command, un-blacklisting requires a TaskTracker restart. If a TaskTracker has been blacklisted cluster-wide due to a non-healthy status, correct the indicated problem and run the health check script again. When the script picks up the healthy status, the TaskTracker is un-blacklisted.


Startup and Shutdown

To safely shut down and restart an entire cluster, preserving all data and full replication, you must follow a specific sequence that stops writes so that the cluster does not shut down in the middle of an operation:

1. Shut down the NFS service everywhere it is running.
2. Shut down the CLDB nodes.
3. Shut down all remaining nodes.

This procedure ensures that on restart the data is replicated and synchronized, so that there is no single point of failure for any data.

To shut down the cluster:

1. Change to the root user (or use sudo for the following commands).
2. Before shutting down the cluster, you will need a list of NFS nodes, CLDB nodes, and all remaining nodes. Once the CLDB is shut down, you cannot retrieve a list of nodes; it is important to obtain this information at the beginning of the process. Use the node list command as follows:

Determine which nodes are running the NFS gateway. Example:

/opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc==nfs]" -columns id,h,hn,svc,rp
id                   service                                             hostname             health  ip
6475182753920016590  fileserver,tasktracker,nfs,hoststats                node-252.cluster.us  0       10.10.50.252
8077173244974255917  tasktracker,cldb,fileserver,nfs,hoststats           node-253.cluster.us  0       10.10.50.253
5323478955232132984  webserver,cldb,fileserver,nfs,hoststats,jobtracker  node-254.cluster.us  0       10.10.50.254

Determine which nodes are running the CLDB. Example:

/opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc==cldb]" -columns id,h,hn,svc,rp

List all non-CLDB nodes. Example:

/opt/mapr/bin/maprcli node list -filter "[rp==/*]and[svc!=cldb]" -columns id,h,hn,svc,rp

3. Shut down all NFS instances. Example:

/opt/mapr/bin/maprcli node services -nfs stop -nodes node-252.cluster.us,node-253.cluster.us,node-254.cluster.us

4. SSH into each CLDB node and stop the warden. Example:

/etc/init.d/mapr-warden stop

5. SSH into each of the remaining nodes and stop the warden. Example:

/etc/init.d/mapr-warden stop

6. If desired, you can shut down the nodes using the Linux halt command.

To start up the cluster:

1. If the cluster nodes are not running, start them.
2. Change to the root user (or use sudo for the following commands).
3. Start ZooKeeper on nodes where it is installed. Example:

/etc/init.d/mapr-zookeeper start

4. On all nodes, start the warden. Example:


/etc/init.d/mapr-warden start

5. Over a period of time (depending on the cluster size and other factors) the cluster comes up automatically. After the CLDB restarts, there is a 15-minute delay before replication resumes, in order to allow all nodes to register and heartbeat. This delay can be configured using the config save command to set the cldb.replication.manager.start.mins parameter.


Managing Data with Volumes

MapR provides volumes as a way to organize data and manage cluster performance. Volumes are critical to efficient usage of the cluster, and as your cluster grows you will work more and more with volumes to provision for efficient, high-availability access to data.

A volume is a logical unit that allows you to apply policies to a set of files, directories, and sub-volumes. Using volumes, you can enforce disk usage limits, set replication levels, establish ownership and accountability, and measure the cost generated by different projects or departments. Create a volume for each user, department, or project; you can mount volumes under other volumes, to build a structure that reflects the needs of your organization.

On a cluster with an M5 license, you can create a special type of volume called a mirror, a local or remote copy of an entire volume. Mirrors are useful for load balancing or disaster recovery. With an M5 license, you can also create a snapshot, an image of a volume at a specific point in time. Snapshots are useful for rollback to a known data set. You can create snapshots manually or using a schedule.

See also:

Mirrors
Snapshots
Schedules

MapR lets you control and configure volumes in a number of ways:

Replication - set the number of physical copies of the data, for robustness and performance
Topology - restrict a volume to certain physical racks or nodes (requires M5 license and m permission on the volume)
Quota - set a hard disk usage limit for a volume (requires M5 license)
Advisory Quota - receive a notification when a volume exceeds a soft disk usage quota (requires M5 license)
Ownership - set a user or group as the accounting entity for the volume
Permissions - give users or groups permission to perform specified volume operations
File Permissions - Unix-style read/write permissions on volumes

The following sections describe procedures associated with volumes:

To create a new volume, see Creating a Volume (requires cv permission on the volume)
To view a list of volumes, see Viewing a List of Volumes
To view a single volume's properties, see Viewing Volume Properties
To modify a volume, see Modifying a Volume (requires m permission on the volume)
To mount a volume, see Mounting a Volume (requires mnt permission on the volume)
To unmount a volume, see Unmounting a Volume (requires m permission on the volume)
To remove a volume, see Removing a Volume (requires d permission on the volume)
To set volume topology, see Setting Volume Topology (requires m permission on the volume)

Creating a Volume

When creating a volume, the only required parameters are the volume type (normal or mirror) and the volume name. You can set the ownership, permissions, quotas, and other parameters at the time of volume creation, or use the Volume Properties dialog to set them later. If you plan to schedule snapshots or mirrors, it is useful to create a schedule ahead of time; the schedule will appear in a drop-down menu in the Volume Properties dialog.

By default, the root user and the volume creator have full control permissions on the volume. You can grant specific permissions to other usersand groups:

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)

You can create a volume using the volume create command, or use the following procedure to create a volume using the MapR Control System.

To create a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. Use the Volume Type radio button at the top of the dialog to choose whether to create a standard volume, a local mirror, or a remote mirror.


4. Type a name for the volume or source volume in the Volume Name or Mirror Name field.
5. If you are creating a mirror volume:
   a. Type the name of the source volume in the Source Volume Name field.
   b. If you are creating a remote mirror volume, type the name of the cluster where the source volume resides in the Source Cluster Name field.
6. You can set a mount path for the volume by typing a path in the Mount Path field.
7. You can specify which rack or nodes the volume will occupy by typing a path in the Topology field.
8. You can set permissions using the fields in the Ownership & Permissions section:
   a. Click [ + Add Permission ] to display fields for a new permission.
   b. In the left field, type either u: and a user name, or g: and a group name.
   c. In the right field, select permissions to grant to the user or group.
9. You can associate a standard volume with an accountable entity and set quotas in the Usage Tracking section:
   a. In the Group/User field, select User or Group from the dropdown menu and type the user or group name in the text field.
   b. To set an advisory quota, select the checkbox beside Volume Advisory Quota and type a quota (in megabytes) in the text field.
   c. To set a quota, select the checkbox beside Volume Quota and type a quota (in megabytes) in the text field.
10. You can set the replication factor and choose a snapshot or mirror schedule in the Replication and Snapshot section:
   a. Type the desired replication factor in the Replication Factor field.
   b. Type the minimum replication factor in the Minimum Replication field. When the number of replicas drops to or below this number, the volume is aggressively re-replicated to bring it above the minimum replication factor.
   c. To schedule snapshots or mirrors, select a schedule from the Snapshot Schedule dropdown menu or the Mirror Update Schedule dropdown menu, respectively.
11. Click OK to create the volume.
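For example, a minimal command-line sketch of the same operation (the volume name, mount path, replication factor, and quota are illustrative assumptions):

/opt/mapr/bin/maprcli volume create -name project-data -path /projects/project-data -mount 1 -replication 3 -quota 100G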

Viewing a List of Volumes

You can view all volumes using the volume list command, or view them in the MapR Control System using the following procedure.
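For example, a command-line sketch (the -columns selection is an assumption about which fields you might want to display):

/opt/mapr/bin/maprcli volume list -columns volumename,mountdir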

To view all volumes using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Volumes view.

Viewing Volume Properties

You can view volume properties using the volume info command, or use the following procedure to view them using the MapR Control System.

To view the properties of a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name, then clicking the Properties button.
3. After examining the volume properties, click Close to exit without saving changes to the volume.

Modifying a Volume

You can modify any attributes of an existing volume, except for the following restriction:

You cannot convert a normal volume to a mirror volume.

You can modify a volume using the volume modify command, or use the following procedure to modify a volume using the MapR Control System.

To modify a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking the Properties button.
3. Make changes to the fields. See Creating a Volume for more information about the fields.
4. After making your changes, click Modify Volume to save changes to the volume.

Mounting a Volume

You can mount a volume using the volume mount command, or use the following procedure to mount a volume using the MapR Control System.

To mount a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to mount.


3. Click the Mount button.

You can also mount or unmount a volume using the Mounted checkbox in the Volume Properties dialog. See Modifying a Volume for more information.

Unmounting a Volume

You can unmount a volume using the volume unmount command, or use the following procedure to unmount a volume using the MapR Control System.

To unmount a volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to unmount.
3. Click the Unmount button.

You can also mount or unmount a volume using the Mounted checkbox in the Volume Properties dialog. See Modifying a Volume for more information.

Removing a Volume or Mirror

You can remove a volume using the volume remove command, or use the following procedure to remove a volume using the MapR Control System.

To remove a volume or mirror using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the checkbox next to the volume you wish to remove.
3. Click the Remove button to display the Remove Volume dialog.
4. In the Remove Volume dialog, click the Remove Volume button.

Setting Volume Topology

You can place a volume on specific racks, nodes, or groups of nodes by setting its topology to an existing node topology. For more information about node topology, see Node Topology.

To set volume topology, choose the path that corresponds to the node topology of the rack or nodes where you would like the volume to reside. You can set volume topology using the MapR Control System or with the volume move command.
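For example, a command-line sketch (the volume name and topology path are hypothetical):

/opt/mapr/bin/maprcli volume move -name project-data -topology /data/rack1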

To set volume topology using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name, then clicking the Properties button.
3. Click Move Volume to display the Move Volume dialog.
4. Select a topology path that corresponds to the rack or nodes where you would like the volume to reside.
5. Click Move Volume to return to the Volume Properties dialog.
6. Click Modify Volume to save changes to the volume.

Setting Default Volume Topology

By default, new volumes are created with a topology of / (root directory). To change the default topology, use the config save command to change the cldb.default.volume.topology configuration parameter. Example:

maprcli config save -values '{"cldb.default.volume.topology":"/default-rack"}'

After running the above command, new volumes have the /default-rack volume topology by default.

Example: Setting Up CLDB-Only Nodes

In a large cluster (100 nodes or more), create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume


topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only path and the non-CLDB path are children of the root topology path, new non-CLDB volumes are not guaranteed to keep off the CLDB-only nodes. To avoid this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:
   a. PREPARE the node, making sure it meets the requirements.
   b. ADD the MapR Repository.
2. INSTALL only the following packages:
   mapr-cldb
   mapr-webserver
   mapr-core
   mapr-fileserver
3. RUN configure.sh.
4. FORMAT the disks.
5. START the warden:

/etc/init.d/mapr-warden start

To restrict the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g. /cldbonly) using the MapR Control System or the following command:
   maprcli node move -serverids <CLDB nodes> -topology /cldbonly
2. Restrict the CLDB volume to the CLDB-only topology. Use the MapR Control System or the following command:
   maprcli volume move -name mapr.cldb.internal -topology /cldbonly
3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in /cldbonly using the MapR Control System or the following command:
   maprcli volume modify -name mapr.cldb.internal -replication <replication factor>
4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value using the MapR Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g. /defaultRack) using the MapR Control System or the following command:
   maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack
2. Restrict all existing volumes to the /defaultRack topology using the MapR Control System or the following command:
   maprcli volume move -name <volume> -topology /defaultRack

All volumes except the root volume (mapr.cluster.root) get re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology.


Mirrors

A mirror volume is a read-only physical copy of another volume, the source volume. Creating mirrors on the same cluster (local mirroring) is useful for load balancing and local backup. Creating mirrors on another cluster (remote mirroring) is useful for wide distribution and disaster preparedness. Creating a mirror is similar to creating a normal (read/write) volume, except that you must specify a source volume from which the mirror retrieves its contents (the mirroring operation). When you mirror a volume, read requests to the source volume can be served by any of its mirrors on the same cluster via a volume link of type mirror. A volume link is similar to a normal volume mount point, except that you can specify whether it points to the source volume or its mirrors.

To write to (and read from) the source volume, mount the source volume normally. As long as the source volume is mounted below a non-mirrored volume, you can read and write to the volume normally via its direct mount path. You can also use a volume link of type writeable to write directly to the source volume regardless of its mount point.

To read from the mirrors, use the volume link create command to make a volume link (of type mirror) to the source volume. Any reads via the volume link will be distributed among the volume's mirrors. It is not necessary to also mount the mirrors, because the volume link handles access to the mirrors.

Any mount path that consists entirely of mirrored volumes will refer to a mirrored copy of the target volume; otherwise the mount path refers to the specified volume itself. For example, assume a mirrored volume c mounted at /a/b/c. If the root volume is mirrored, then the mount path / refers to a mirror of the root volume; if a in turn is mirrored, then the path /a refers to a mirror of a, and so on. If all volumes preceding c in the mount path are mirrored, then the path /a/b/c refers to one of the mirrors of c. However, if any volume in the path is not mirrored, then the source volume is selected for that volume and subsequent volumes in the path. If a is not mirrored, then although / still selects a mirror, /a refers to the source volume a itself (because there is only one) and /a/b refers to the source volume b (because it was not accessed via a mirror). In that case, /a/b/c refers to the source volume c.

Any mirror that is accessed via a parent mirror (all parents are mirror volumes) is implicitly mounted. For example, assume a volume a that is mirrored to a-mirror, and a volume b that is mirrored to b-mirror-1 and b-mirror-2; a is mounted at /a, b is mounted at /a/b, and a-mirror is mounted at /a-mirror. In this case, reads via /a-mirror/b will access one of the mirrors b-mirror-1 or b-mirror-2 without the requirement to explicitly mount them.

At the start of a mirroring operation, a temporary snapshot of the source volume is created; the mirroring process reads from the snapshot so that the source volume remains available for both reads and writes during mirroring. If the mirroring operation is schedule-based, the snapshot expires according to the Retain For parameter of the schedule; snapshots created during manual mirroring remain until they are deleted manually. To save bandwidth, the mirroring process transmits only the deltas between the source volume and the mirror; after the initial mirroring operation (which creates a copy of the entire source volume), subsequent updates can be extremely fast.

Mirroring is extremely resilient. In the case of a network partition (some or all machines where the source volume resides cannot communicate with machines where the mirror volume resides), the mirroring operation will periodically retry the connection, and will complete mirroring when the network is restored.

Working with Mirrors

Choose which volumes you mirror and where you locate the mirrors, depending on what you plan to use the mirrors for. Backup mirrors for disaster recovery can be located on physical media outside the cluster or in a remote cluster. In the event of a disaster affecting the source cluster, you can check the time of last successful synchronization to determine how up-to-date the backups are (see Mirror Status below). Local mirrors for load balancing can be located in specific servers or racks with especially high bandwidth, and mounted in a public directory separate from where the source volume is mounted. You can structure several mirrors in a cascade (or chain) from the source volume, or point all mirrors to the source volume individually, depending on whether it is more important to conserve bandwidth or to ensure that all mirrors are in sync with each other as well as the source volume. In most cases, it is convenient to set a schedule to automate mirror synchronization, rather than using the volume mirror start command to synchronize data manually. Mirroring completion time depends on the available bandwidth and the size of the data being transmitted. For best performance, set the mirroring schedule according to the anticipated rate of data changes and the available bandwidth for mirroring.

The following sections provide information about various mirroring use cases.

Local and Remote Mirroring

Local mirroring (creating mirrors on the same cluster) is useful for load balancing, or for providing a read-only copy of a data set.

Although it is not possible to directly mount a volume from one cluster to another, you can mirror a volume to a remote cluster (remote mirroring). By mirroring the cluster's root volume and all other volumes in the cluster, you can create an entire mirrored cluster that keeps in sync with the source cluster. Mount points are resolved within each cluster; any volumes that are mirrors of a source volume on another cluster are read-only, because a source volume from another cluster cannot be resolved locally.

To transfer large amounts of data between physically distant clusters, you can use the volume dump create command to create volume copies for transport on physical media. The volume dump create command creates backup files containing the volumes, which can be reconstituted into mirrors at the remote cluster using the volume dump restore command. These mirrors can be re-associated with the source volumes (using the volume modify command to specify the source for each mirror volume) for live mirroring.

Local Mirroring Example


Assume a volume containing a table of data that will be read very frequently by many clients, but updated infrequently. The data is contained in a volume named table-data, which is to be mounted under a non-mirrored user volume belonging to jsmith. The mount path for the writeable copy of the data is to be /home/private/users/jsmith/private-table, and the public, readable mirrors of the data are to be mounted at /public/data/table. You would set it up as follows:

1. Create as many mirror volumes as needed for the data, using the MapR Control System or the volume create command (see Creating a Volume).
2. Mount the source volume at the desired location (in this case, /home/private/users/jsmith/private-table) using the MapR Control System or the volume mount command.
3. Use the volume link create command to create a volume link at /public/data/table pointing to the source volume. Example:

maprcli volume link create -volume table-data -type mirror -path /public/data/table

4. Write the data to the source volume via the mount path /home/private/users/jsmith/private-table as needed.
5. When the data is ready for public consumption, use the volume mirror push command to push the data out to all the mirrors.
6. Create additional mirrors as needed and push the data to them. No additional steps are required; as soon as a mirror is created and synchronized, it is available via the volume link.

When a user reads via the path /public/data/table, the data is served by a randomly selected mirror of the source volume. Reads are evenly spread over all mirrors.

Remote Mirroring Example

Assume two clusters, cluster-1 and cluster-2, and a volume volume-a on cluster-1 to be mirrored to cluster-2. Create a mirror volume on cluster-2, specifying the remote cluster and volume. You can create remote mirrors using the MapR Control System or the volume create command:

In the MapR Control System on cluster-2, specify the following values in the New Volume dialog:
   a. Select Remote Mirror Volume.
   b. Enter volume-a or another name in the Volume Name field.
   c. Enter volume-a in the Source Volume field.
   d. Enter cluster-1 in the Source Cluster field.

Using the volume create command on cluster-2, specify the source volume and cluster in the format <volume>@<cluster>, provide a name for the mirror volume, and specify a type of 1. Example:

maprcli volume create -name volume-a -source volume-a@cluster-1 -type 1

After creating the mirror volume, you can synchronize the data using volume mirror start from cluster-2 to pull data to the mirror volume on cluster-2 from its source volume on cluster-1.

When you mount a mirror volume on a remote cluster, any mirror volumes below it are automatically mounted. For example, assume volumes a and b on cluster-1 (mounted at /a and /a/b) are mirrored to a-mirror and b-mirror on cluster-2. When you mount the volume a-mirror at /a-mirror on cluster-2, it contains a mount point for /b which gets mapped to the mirror of b, making it available at /a-mirror/b. Any mirror volumes below b will be similarly mounted, and so on.

Mirror Status

You can see a list of all mirror volumes and their current status on the Mirror Volumes view (in the MapR Control System, select MapR-FS then Mirror Volumes) or using the volume list command. You can see additional information about mirror volumes on the CLDB status page (in the MapR Control System, select CLDB), which shows the status and last successful synchronization of all mirrors, as well as the container locations for all volumes. You can also find container locations using the hadoop mfs commands.

Mirroring the Root Volume

The most frequently accessed volumes in a cluster are likely to be the root volume and its immediate children. In order to load-balance reads on these volumes, it is possible to mirror the root volume (typically mapr.cluster.root, which is mounted at /). There is a special writeable volume link called .rw inside the root volume, to provide access to the source volume. In other words, if the root volume is mirrored:

The path / refers to one of the mirrors of the root volume
The path /.rw refers to the source (writeable) root volume

Mirror Cascades

A mirror cascade (or chain mirroring) is a series of mirrors that form a chain from a single source volume: the first mirror receives updates from the source volume, the second mirror receives updates from the first, and so on. Mirror cascades are useful for propagating data over a distance, then re-propagating the data locally instead of transferring the same data remotely again for each copy of the mirror.


You can create or break a mirror cascade made from existing mirror volumes by changing the source volume of each mirror in the Volume Properties dialog.

Creating, Modifying, and Removing Mirror Volumes

On an M5-licensed cluster, you can create a mirror manually or automate the process with a schedule. You can set the topology of a mirror volume to determine the placement of the data, if desired. The following sections describe procedures associated with mirrors:

To create a new mirror volume, see Creating a Volume (requires M5 license and cv permission)
To modify a mirror (including changing its source), see Modifying a Volume
To remove a mirror, see Removing a Volume or Mirror

You can change a mirror's source volume by changing the source volume in the Volume Properties dialog.

Starting a Mirror

To start a mirror means to pull the data from the source volume. Before starting a mirror, you must create a mirror volume and associate it with a source volume. You should start a mirror operation shortly after creating the mirror volume, and then again each time you want to synchronize the mirror with the source volume. You can use a schedule to automate the synchronization. If you create a mirror and synchronize it only once, it is like a snapshot except that it uses the same amount of disk space used by the source volume at the point in time when the mirror was started. You can start a mirror using the volume mirror start command, or use the following procedure to start mirroring using the MapR Control System.

To start mirroring using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to mirror.
3. Click the Start Mirroring button.
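For example, a command-line sketch (the mirror volume name is hypothetical):

/opt/mapr/bin/maprcli volume mirror start -name volume-a-mirror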

Stopping a Mirror

To stop a mirror means to cancel the replication or synchronization process. Stopping a mirror does not delete or remove the mirror volume; it only stops any synchronization currently in progress. You can stop a mirror using the volume mirror stop command, or use the following procedure to stop mirroring using the MapR Control System.

To stop mirroring using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume you wish to stop mirroring.
3. Click the Stop Mirroring button.

Pushing Changes to Mirrors

To push a mirror means to start pushing data from the source volume to all its local mirrors. You can push source volume changes out to all mirrors using the volume mirror push command, which returns after the data has been pushed.
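For example, a command-line sketch using the table-data volume from the local mirroring example above:

/opt/mapr/bin/maprcli volume mirror push -name table-data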


Schedules

A schedule is a group of rules that specify recurring points in time at which certain actions occur. You can use schedules to automate the creation of snapshots and mirrors; after you create a schedule, it appears as a choice in the scheduling menu when you are editing the properties of a task that can be scheduled:

To apply a schedule to snapshots, see Scheduling a Snapshot.
To apply a schedule to volume mirroring, see Creating a Volume.

Schedules require the M5 license. The following sections provide information about the actions you can perform on schedules:

To create a schedule, see Creating a Schedule
To view a list of schedules, see Viewing a List of Schedules
To modify a schedule, see Modifying a Schedule
To remove a schedule, see Removing a Schedule

Creating a Schedule

You can create a schedule using the schedule create command, or use the following procedure to create a schedule using the MapR Control System.

To create a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. Type a name for the new schedule in the Schedule Name field.
4. Define one or more schedule rules in the Schedule Rules section:
   a. From the first dropdown menu, select a frequency (Once, Yearly, Monthly, etc.)
   b. From the next dropdown menu, select a time point within the specified frequency. For example: if you selected Monthly in the first dropdown menu, select the day of the month in the second dropdown menu.
   c. Continue with each dropdown menu, proceeding to the right, to specify the time at which the scheduled action is to occur.
   d. Use the Retain For field to specify how long the data is to be preserved. For example: if the schedule is attached to a volume for creating snapshots, the Retain For field specifies how far after creation the snapshot expiration date is set.
5. Click [ + Add Rule ] to specify additional schedule rules, as desired.
6. Click Save Schedule to create the schedule.

Viewing a List of Schedules

You can view a list of schedules using the schedule list command, or use the following procedure to view a list of schedules using the MapR Control System.

To view a list of schedules using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Schedules view.

Modifying a Schedule

When you modify a schedule, the new set of rules replaces any existing rules for the schedule.

You can modify a schedule using the schedule modify command, or use the following procedure to modify a schedule using the MapR Control System.

To modify a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click the name of the schedule to modify.
3. Modify the schedule as desired:
   a. Change the schedule name in the Schedule Name field.
   b. Add, remove, or modify rules in the Schedule Rules section.
4. Click Save Schedule to save changes to the schedule.

For more information, see Creating a Schedule.

Removing a Schedule

You can remove a schedule using the schedule remove command, or use the following procedure to remove a schedule using the MapR Control System.


To remove a schedule using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click the name of the schedule to remove.
3. Click Remove Schedule to display the Remove Schedule dialog.
4. Click Yes to remove the schedule.


Snapshots

A snapshot is a read-only image of a volume at a specific point in time. On an M5-licensed cluster, you can create a snapshot manually or automate the process with a schedule. Snapshots are useful any time you need to be able to roll back to a known good data set at a specific point in time. For example, before performing a risky operation on a volume, you can create a snapshot to enable "undo" capability for the entire volume. A snapshot takes no time to create, and initially uses no disk space, because it stores only the incremental changes needed to roll the volume back to the point in time when the snapshot was created.

The following sections describe procedures associated with snapshots:

To view the contents of a snapshot, see Viewing the Contents of a Snapshot
To create a snapshot, see Creating a Volume Snapshot (requires M5 license)
To view a list of snapshots, see Viewing a List of Snapshots
To remove a snapshot, see Removing a Volume Snapshot

Viewing the Contents of a Snapshot

At the top level of each volume is a directory called .snapshot containing all the snapshots for the volume. You can view the directory with hadoop fs commands or by mounting the cluster with NFS. To prevent recursion problems, ls and hadoop fs -ls do not show the .snapshot directory when the top-level volume directory contents are listed. You must navigate explicitly to the .snapshot directory to view and list the snapshots for the volume.

Example:

root@node41:/opt/mapr/bin# hadoop fs -ls /myvol/.snapshot
Found 1 items
drwxrwxrwx   - root root          1 2011-06-01 09:57  /myvol/.snapshot/2011-06-01.09-57-49

Creating a Volume Snapshot

You can create a snapshot manually or use a schedule to automate snapshot creation. Each snapshot has an expiration date that determines how long the snapshot will be retained:

When you create the snapshot manually, specify an expiration date.
When you schedule snapshots, the expiration date is determined by the Retain For parameter of the schedule.

For more information about scheduling snapshots, see Scheduling a Snapshot.

Creating a Snapshot Manually

You can create a snapshot using the volume snapshot create command, or use the following procedure to create a snapshot using the MapR Control System.

To create a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of each volume for which you want a snapshot, then click the New Snapshot button to display the Snapshot Name dialog.
3. Type a name for the new snapshot in the Name... field.
4. Click OK to create the snapshot.
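For example, a command-line sketch (volume and snapshot names are hypothetical):

/opt/mapr/bin/maprcli volume snapshot create -volume myvol -snapshotname mysnap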

Scheduling a Snapshot

You schedule a snapshot by associating an existing schedule with a normal (non-mirror) volume. You cannot schedule snapshots on mirror volumes; in fact, since mirrors are read-only, creating a snapshot of a mirror would provide no benefit. You can schedule a snapshot by passing the ID of a schedule to the volume modify command, or you can use the following procedure to choose a schedule for a volume using the MapR Control System.

To schedule a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the name of the volume then clicking the Properties button.
3. In the Replication and Snapshot Scheduling section, choose a schedule from the Snapshot Schedule dropdown menu.
4. Click Modify Volume to save changes to the volume.

For information about creating a schedule, see Schedules.


Viewing a List of Snapshots

Viewing all Snapshots

You can view snapshots for a volume with the volume snapshot list command or using the MapR Control System.

To view snapshots using the MapR Control System:

In the Navigation pane, expand the MapR-FS group and click the Snapshots view.

Viewing Snapshots for a Volume

You can view snapshots for a volume by passing the volume name to the volume snapshot list command or using the MapR Control System.

To view snapshots using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Click the Snapshots button to display the Snapshots for Volume dialog.

Removing a Volume Snapshot

Each snapshot has an expiration date and time, at which point it is deleted automatically. You can remove a snapshot manually before its expiration, or you can preserve a snapshot to prevent it from expiring.

Removing a Volume Snapshot Manually

You can remove a snapshot using the volume snapshot remove command, or use the following procedure to remove a snapshot using the MapR Control System.

To remove a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Snapshots view.
2. Select the checkbox beside each snapshot you wish to remove.
3. Click Remove Snapshot to display the Remove Snapshots dialog.
4. Click Yes to remove the snapshot or snapshots.

To remove a snapshot from a specific volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume name.
3. Click Snapshots to display the Snapshots for Volume dialog.
4. Select the checkbox beside each snapshot you wish to remove.
5. Click Remove to display the Remove Snapshots dialog.
6. Click Yes to remove the snapshot or snapshots.
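For example, a command-line sketch of removing a snapshot (volume and snapshot names are hypothetical):

/opt/mapr/bin/maprcli volume snapshot remove -volume myvol -snapshotname mysnap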

Preserving a Volume Snapshot

You can preserve a snapshot using the volume snapshot preserve command, or use the following procedure to preserve a snapshot using the MapR Control System.

To preserve a snapshot using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Snapshots view.
2. Select the checkbox beside each snapshot you wish to preserve.
3. Click Preserve Snapshot to preserve the snapshot or snapshots.

To preserve a snapshot on a specific volume using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the volume name.
3. Click Snapshots to display the Snapshots for Volume dialog.
4. Select the checkbox beside each snapshot you wish to preserve.
5. Click Preserve to preserve the snapshot or snapshots.


Users and Groups

MapR uses each node's native operating system configuration to authenticate users and groups for access to the cluster. If you are deploying a large cluster, you should consider configuring all nodes to use LDAP or another user management system. You can use the MapR Control System to give specific permissions to particular users and groups. For more information, see Managing Permissions. Each user can be restricted to a specific amount of disk usage. For more information, see Managing Quotas.

All nodes in the cluster must have the same set of users and groups, with the same uid and gid numbers on all nodes:

When adding a user to a cluster node, specify the --uid option with the useradd command to guarantee that the user has the same uid on all machines.

When adding a group to a cluster node, specify the --gid option with the groupadd command to guarantee that the group has the same gid on all machines.
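For example, a sketch of creating a matching group and user on a node (the names and ID numbers are illustrative; run the same commands on every node):

sudo groupadd --gid 5000 analysts
sudo useradd --uid 2001 --gid 5000 jsmith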

Choose a specific user to be the administrative user for the cluster. By default, MapR gives the root user full administrative permissions. If the nodes do not have an explicit root login (as is sometimes the case with Ubuntu, for example), you can give full permissions to the chosen administrative user after deployment. See Cluster Configuration.

On the node where you plan to run the mapr-webserver (the MapR Control System), install Pluggable Authentication Modules (PAM). See PAM Configuration.

To create a volume for a user or group:

1. In the Volumes view, click New Volume.
2. In the New Volume dialog, set the volume attributes:
   a. In Volume Setup, type a volume name. Make sure the Volume Type is set to Normal Volume.
   b. In Ownership & Permissions, set the volume owner and specify the users and groups who can perform actions on the volume.
   c. In Usage Tracking, set the accountable group or user, and set a quota or advisory quota if needed.
   d. In Replication & Snapshot Scheduling, set the replication factor and choose a snapshot schedule.
3. Click OK to save the settings.

See Managing Data with Volumes for more information. You can also create a volume using the volume create command.

You can see users and groups that own volumes in the User Disk Usage view or using the entity list command.


Managing Permissions

MapR manages permissions using two mechanisms:

Cluster and volume permissions use access control lists (ACLs), which specify actions particular users are allowed to perform on a certain cluster or volume.

MapR-FS permissions control access to directories and files in a manner similar to Linux file permissions. To manage permissions, you must have fc permissions.

Cluster and Volume Permissions

Cluster and volume permissions use ACLs, which you can edit using the MapR Control System or the acl commands.

Cluster Permissions

The following table lists the actions a user can perform on a cluster, and the corresponding codes used in the cluster ACL.

Code    Allowed Action                                                                                              Includes
login   Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes    cv
ss      Start/stop services
cv      Create volumes
a       Admin access                                                                                                All permissions except fc
fc      Full control (administrative access and permission to change the cluster ACL)                               a

Setting Cluster Permissions

You can modify cluster permissions using the acl edit and acl set commands, or using the MapR Control System.

To add cluster permissions using the MapR Control System:

1. Expand the System Settings group and click Permissions to display the Edit Permissions dialog.
2. Click [ + Add Permission ] to add a new row. Each row lets you assign permissions to a single user or group.
3. Type the name of the user or group in the empty text field:
   If you are adding permissions for a user, type u:<user>, replacing <user> with the username.
   If you are adding permissions for a group, type g:<group>, replacing <group> with the group name.
4. Click the Open Arrow to expand the Permissions dropdown.
5. Select the permissions you wish to grant to the user or group.
6. Click OK to save the changes.
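For example, a command-line sketch of granting the login and cv permissions (the username is hypothetical):

/opt/mapr/bin/maprcli acl edit -type cluster -user jsmith:login,cv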

To remove cluster permissions using the MapR Control System:

1. Expand the System Settings group and click Permissions to display the Edit Permissions dialog.
2. Remove the desired permissions:
   To remove all permissions for a user or group, click the delete button next to the corresponding row.
   To change the permissions for a user or group, click the Open Arrow to expand the Permissions dropdown, then unselect the permissions you wish to revoke from the user or group.
3. Click OK to save the changes.

Volume Permissions

The following table lists the actions a user can perform on a volume, and the corresponding codes used in the volume ACL.

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume


m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)

To mount or unmount volumes under a directory, the user must have read/write permissions on the directory (see MapR-FS Permissions).

You can set volume permissions using the acl edit and acl set commands, or using the MapR Control System.

To add volume permissions using the MapR Control System:

1. Expand the MapR-FS group and click Volumes.
2. To create a new volume and set permissions, click New Volume to display the New Volume dialog. To edit permissions on an existing volume, click the volume name to display the Volume Properties dialog.
3. In the Permissions section, click [ + Add Permission ] to add a new row. Each row lets you assign permissions to a single user or group.
4. Type the name of the user or group in the empty text field:
   If you are adding permissions for a user, type u:<user>, replacing <user> with the username.
   If you are adding permissions for a group, type g:<group>, replacing <group> with the group name.
5. Click the Open Arrow to expand the Permissions dropdown.
6. Select the permissions you wish to grant to the user or group.
7. Click OK to save the changes.
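For example, a command-line sketch of granting the m and d permissions on a volume (volume and user names are hypothetical):

/opt/mapr/bin/maprcli acl edit -type volume -name project-data -user jsmith:m,d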

To remove volume permissions using the MapR Control System:

1. Expand the MapR-FS group and click Volumes.
2. Click the volume name to display the Volume Properties dialog.
3. Remove the desired permissions:
   To remove all permissions for a user or group, click the delete button next to the corresponding row.
   To change the permissions for a user or group, click the Open Arrow to expand the Permissions dropdown, then unselect the permissions you wish to revoke from the user or group.
4. Click OK to save the changes.

MapR-FS Permissions

MapR-FS permissions are similar to the POSIX permissions model. Each file and directory is associated with a user (the owner) and a group. You can set read, write, and execute permissions separately for:

The owner of the file or directory
Members of the group associated with the file or directory
All other users

The permissions for a file or directory are called its mode. The mode of a file or directory can be expressed in two ways:

Text - a string that indicates the presence of the read (r), write (w), and execute (x) permission or their absence (-) for the owner, group, and other users respectively. Example: rwxr-xr-x
Octal - three octal digits (for the owner, group, and other users), that use individual bits to represent the three permissions. Example: 755

Both rwxr-xr-x and 755 represent the same mode: the owner has all permissions, and the group and other users have read and execute permissions only.

Text Modes

String modes are constructed from the characters in the following table.

Text Description

u The file's owner.

g The group associated with the file or directory.

o Other users (users that are not the owner, and not in the group).

a All (owner, group and others).


= Assigns the permissions. Example: "a=rw" sets read and write permissions and disables execution for all.

- Removes a specific permission. Example: "a-x" revokes execution permission from all users without changing read and write permissions.

+ Adds a specific permission. Example: "a+x" grants execution permission to all users without changing read and write permissions.

r Read permission

w Write permission

x Execute permission

Octal Modes

To construct each octal digit, add together the values for the permissions you wish to grant:

Read: 4
Write: 2
Execute: 1
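For example, the mode 755 breaks down as 7 = 4 + 2 + 1 (read, write, and execute) for the owner, and 5 = 4 + 1 (read and execute) each for the group and for other users.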

Syntax

You can change the modes of directories and files in the MapR storage using either the hadoop fs command with the -chmod option, or using the chmod command via NFS. The syntax for both commands is similar:

hadoop fs -chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...]

chmod [-R] <MODE>[,<MODE>]... | <OCTALMODE> <URI> [<URI> ...]

Parameters and Options

Parameter/Option Description

-R If specified, this option applies the new mode recursively throughout the directory structure.

MODE A string that specifies a mode.

OCTALMODE A three-digit octal number that specifies the new mode for the file or directory.

URI A relative or absolute path to the file or directory for which to change the mode.

Examples

The following examples are all equivalent:

chmod 755 script.sh

chmod u=rwx,g=rx,o=rx script.sh

chmod u=rwx,go=rx script.sh


Managing Quotas

Quotas limit the disk space used by a volume or an entity (user or group) on an M5-licensed cluster, by specifying the amount of disk space the volume or entity is allowed to use:

A volume quota limits the space used by a volume.
A user/group quota limits the space used by all volumes owned by a user or group.

Quotas are expressed as an integer value plus a single letter to represent the unit:

B - bytes
K - kilobytes
M - megabytes
G - gigabytes
T - terabytes
P - petabytes

Example: 500G specifies a 500 gigabyte quota.

If a volume or entity exceeds its quota, further disk writes are prevented and a corresponding alarm is raised:

AE_ALARM_AEQUOTA_EXCEEDED - an entity exceeded its quota
VOLUME_ALARM_QUOTA_EXCEEDED - a volume exceeded its quota

A quota that prevents writes above a certain threshold is also called a hard quota. In addition to the hard quota, you can also set an advisory quota for a user, group, or volume. An advisory quota does not enforce disk usage limits, but raises an alarm when it is exceeded:

AE_ALARM_AEADVISORY_QUOTA_EXCEEDED - an entity exceeded its advisory quota
VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED - a volume exceeded its advisory quota

In most cases, it is useful to set the advisory quota somewhat lower than the hard quota, to give advance warning that disk usage is approaching the allowed limit.

To manage quotas, you must have a or fc permissions.

Quota Defaults

You can set hard quota and advisory quota defaults for users and groups. When a user or group is created, the default quota and advisory quotaapply unless overridden by specific quotas.

Setting Volume Quotas and Advisory Quotas

You can set a volume quota using the volume modify command, or use the following procedure to set a volume quota using the MapR Control System.

To set a volume quota using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking the Properties button.
3. In the Usage Tracking section, select the Volume Quota checkbox and type a quota (value and unit) in the field. Example: 500G
4. To set the advisory quota, select the Volume Advisory Quota checkbox and type a quota (value and unit) in the field. Example: 250G
5. After setting the quota, click Modify Volume to save changes to the volume.
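For example, a command-line sketch of the same settings (the volume name is hypothetical):

/opt/mapr/bin/maprcli volume modify -name project-data -quota 500G -advisoryquota 250G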

Setting User/Group Quotas and Advisory Quotas

You can set a user/group quota using the entity modify command, or use the following procedure to set a user/group quota using the MapR Control System.

To set a user or group quota using the MapR Control System:

1. In the Navigation pane, expand the MapR-FS group and click the User Disk Usage view.
2. Select the checkbox beside the user or group name for which you wish to set a quota, then click the Edit Properties button to display the User Properties dialog.
3. In the Usage Tracking section, select the User/Group Quota checkbox and type a quota (value and unit) in the field. Example: 500G
4. To set the advisory quota, select the User/Group Advisory Quota checkbox and type a quota (value and unit) in the field. Example: 250G
5. After setting the quota, click OK to save changes to the entity.
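For example, a command-line sketch (the username is hypothetical; -type 0 denotes a user and -type 1 a group):

/opt/mapr/bin/maprcli entity modify -name jsmith -type 0 -quota 500G -advisoryquota 250G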


Setting Quota Defaults

You can set an entity quota using the entity modify command, or use the following procedure to set quota defaults using the MapR Control System.

To set quota defaults using the MapR Control System:

1. In the Navigation pane, expand the System Settings group.
2. Click the Quota Defaults view to display the Configure Quota Defaults dialog.
3. To set the user quota default, select the Default User Total Quota checkbox in the User Quota Defaults section, then type a quota (value and unit) in the field.
4. To set the user advisory quota default, select the Default User Advisory Quota checkbox in the User Quota Defaults section, then type a quota (value and unit) in the field.
5. To set the group quota default, select the Default Group Total Quota checkbox in the Group Quota Defaults section, then type a quota (value and unit) in the field.
6. To set the group advisory quota default, select the Default Group Advisory Quota checkbox in the Group Quota Defaults section, then type a quota (value and unit) in the field.
7. After setting the quota defaults, click Save to save the changes.


Troubleshooting

This section provides information about troubleshooting cluster problems. Click a subtopic below for more detail.

Disaster Recovery
Out of Memory Troubleshooting
Troubleshooting Alarms


Disaster Recovery

It is a good idea to set up an automatic backup of the CLDB volume at regular intervals; in the event that all CLDB nodes fail, you can restore the CLDB from a backup. If you have more than one MapR cluster, you can back up the CLDB volume for each cluster onto the other clusters; otherwise, you can save the CLDB locally to external media such as a USB drive.

To back up a CLDB volume from a remote cluster:

1. Set up a cron job on the remote cluster to save the container information to a file by running the following command:
   /opt/mapr/bin/maprcli dump cldbnodes -zkconnect <IP:port of ZooKeeper leader> > <path to file>
2. Set up a cron job to copy the container information file to a volume on the local cluster.
3. Create a mirror volume on the local cluster, choosing the mapr.cldb.internal volume from the remote cluster as the source volume.
4. Set the mirror sync schedule so that it will run at the same time as the cron job.

To back up a CLDB volume locally:

1. Set up a cron job to save the container information to a file on external media by running the following command:
   /opt/mapr/bin/maprcli dump cldbnodes -zkconnect <IP:port of ZooKeeper leader> > <path to file>
2. Set up a cron job to create a dump file of the local mapr.cldb.internal volume on external media. Example:
   /opt/mapr/bin/maprcli volume dump create -name mapr.cldb.internal -dumpfile <path_to_file>
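For instance, a hypothetical crontab entry that saves the container information nightly at 2:30 a.m. (the ZooKeeper host and output path are illustrative):

30 2 * * * /opt/mapr/bin/maprcli dump cldbnodes -zkconnect zk01:5181 > /media/usb/cldbnodes.out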

For information about restoring from a backup of the CLDB, contact MapR Support.


Out of Memory Troubleshooting

When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems:

1. If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly.

2. If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side MapReduce configuration.

If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising toomany TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes.

To reduce the number of slots on a node:

1. Stop the TaskTracker service on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker stop

2. Edit the file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml: reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum, and reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum (a sketch of these entries appears after step 3).

3. Start the TaskTracker on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker start
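As a sketch, the mapred-site.xml entries for step 2 might look like the following; the values 4 and 2 are illustrative, not recommendations:

<!-- Illustrative values; choose numbers appropriate to the node's memory -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>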


Troubleshooting Alarms

User/Group Alarms

User/group alarms indicate problems with user or group quotas. The following tables describe the MapR user/group alarms.

Entity Advisory Quota Alarm

UI Column User Advisory Quota Alarm

Logged As AE_ALARM_AEADVISORY_QUOTA_EXCEEDED

Meaning A user or group has exceeded its advisory quota. See Managing Quotas for more information about user/group quotas.

Resolution No immediate action is required. To avoid exceeding the hard quota, clear space on volumes created by the user or group, or stop further data writes to those volumes.

Entity Quota Alarm

UI Column User Quota Alarm

Logged As AE_ALARM_AEQUOTA_EXCEEDED

Meaning A user or group has exceeded its quota. Further writes by the user or group will fail. See Managing Quotas for more information about user/group quotas.

Resolution Free some space on the volumes created by the user or group, or increase the user or group quota.

Cluster Alarms

Cluster alarms indicate problems that affect the cluster as a whole. The following tables describe the MapR cluster alarms.

Blacklist Alarm

UI Column Blacklist Alarm

Logged As CLUSTER_ALARM_BLACKLIST_TTS

Meaning The JobTracker has blacklisted a TaskTracker node because tasks on the node have failed too many times.

Resolution To determine which node or nodes have been blacklisted, see the JobTracker status page (click JobTracker in the Navigation Pane). The JobTracker status page provides links to the TaskTracker log for each node; look at the log for the blacklisted node or nodes to determine why tasks are failing on the node.

License Near Expiration

UI Column License Near Expiration Alarm

Logged As CLUSTER_ALARM_LICENSE_NEAR_EXPIRATION

Meaning The M5 license associated with the cluster is within 30 days of expiration.

Resolution Renew the M5 license.

License Expired

UI Column License Expiration Alarm

Logged As CLUSTER_ALARM_LICENSE_EXPIRED

Meaning The M5 license associated with the cluster has expired. M5 features have been disabled.


Resolution Renew the M5 license.

Cluster Almost Full

UI Column Cluster Almost Full

Logged As CLUSTER_ALARM_CLUSTER_ALMOST_FULL

Meaning The cluster storage is almost full. The percentage of storage used before this alarm is triggered is 90% by default, and is controlled by the configuration parameter cldb.cluster.almost.full.percentage.

Resolution Reduce the amount of data stored in the cluster. If the cluster storage is less than 90% full, check the cldb.cluster.almost.full.percentage parameter via the config load command, and adjust it if necessary via the config save command.
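For example, a sketch of the two commands; the value 95 is illustrative:

maprcli config load -keys cldb.cluster.almost.full.percentage
maprcli config save -values '{"cldb.cluster.almost.full.percentage":"95"}'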

Cluster Full

UI Column Cluster Full

Logged As CLUSTER_ALARM_CLUSTER_FULL

Meaning The cluster storage is full. MapReduce operations have been halted.

Resolution Free up some space on the cluster.

Maximum Licensed Nodes Exceeded Alarm

UI Column Licensed Nodes Exceeded Alarm

Logged As CLUSTER_ALARM_LICENSE_MAXNODES_EXCEEDED

Meaning The cluster has exceeded the number of nodes specified in the license.

Resolution Remove some nodes, or upgrade the license to accommodate the added nodes.

Upgrade in Progress

UI Column Software Installation & Upgrades

Logged As CLUSTER_ALARM_UPGRADE_IN_PROGRESS

Meaning A rolling upgrade of the cluster is in progress.

Resolution No action is required. Performance may be affected during the upgrade, but the cluster should still function normally. After the upgrade is complete, the alarm is cleared.

VIP Assignment Failure

UI Column VIP Assignment Alarm

Logged As CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS

Meaning MapR was unable to assign a VIP to any NFS servers.

Resolution Check the VIP configuration, and make sure at least one of the NFS servers in the VIP pool is up and running. See Configuring NFS for HA. This alarm can also indicate that a VIP's hostname exceeds the maximum allowed length of 16 characters. Check the /opt/mapr/logs/nfsmon.log log file for additional information.

Node Alarms

Node alarms indicate problems in individual nodes. The following tables describe the MapR node alarms.

CLDB Service Alarm


UI Column CLDB Alarm

Logged As NODE_ALARM_SERVICE_CLDB_DOWN

Meaning The CLDB service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the CLDB service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the CLDB service, the alarm is cleared. If the warden is unable to restart the CLDB service, it may be necessary to contact technical support.

Core Present Alarm

UI Column Core files present

Logged As NODE_ALARM_CORE_PRESENT

Meaning A service on the node has crashed and created a core dump file. When all core files are removed, the alarm is cleared.

Resolution Contact technical support.

Debug Logging Active

UI Column Excess Logs Alarm

Logged As NODE_ALARM_DEBUG_LOGGING

Meaning Debug logging is enabled on the node.

Resolution Debug logging generates enormous amounts of data, and can fill up disk space. If debug logging is not absolutely necessary, turn it off: either use the Manage Services pane in the Node Properties view or the setloglevel command. If it is absolutely necessary, make sure that the logs in /opt/mapr/logs are not in danger of filling the entire disk.

Disk Failure

UI Column Disk Failure Alarm

Logged As NODE_ALARM_DISK_FAILURE

Meaning A disk has failed on the node.

Resolution Check the disk health log (/opt/mapr/logs/faileddisk.log) to determine which disk failed and view any SMART data provided by the disk. See Handling Disk Failure.

FileServer Service Alarm

UI Column FileServer Alarm

Logged As NODE_ALARM_SERVICE_FILESERVER_DOWN

Meaning The FileServer service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the FileServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the FileServer service, the alarm is cleared. If the warden is unable to restart the FileServer service, it may be necessary to contact technical support.

HBMaster Service Alarm

UI Column HBase Master Alarm


Logged As NODE_ALARM_SERVICE_HBMASTER_DOWN

Meaning The HBMaster service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the HBMaster service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the HBMaster service, the alarm is cleared. If the warden is unable to restart the HBMaster service, it may be necessary to contact technical support.

HBRegion Service Alarm

UI Column HBase RegionServer Alarm

Logged As NODE_ALARM_SERVICE_HBREGION_DOWN

Meaning The HBRegion service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the HBRegion service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the HBRegion service, the alarm is cleared. If the warden is unable to restart the HBRegion service, it may be necessary to contact technical support.

Hoststats Alarm

UI Column Hoststats process down

Logged As NODE_ALARM_HOSTSTATS_DOWN

Meaning The Hoststats service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the Hoststats service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the service, the alarm is cleared. If the warden is unable to restart the service, it may be necessary to contact technical support.

Installation Directory Full Alarm

UI Column Installation Directory full

Logged As NODE_ALARM_OPT_MAPR_FULL

Meaning The /opt/mapr partition on the node is running out of space (95% full).

Resolution Free up some space in /opt/mapr on the node.

JobTracker Service Alarm

UI Column JobTracker Alarm

Logged As NODE_ALARM_SERVICE_JT_DOWN

Meaning The JobTracker service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the JobTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the JobTracker service, the alarm is cleared. If the warden is unable to restart the JobTracker service, it may be necessary to contact technical support.

MapR-FS High Memory Alarm


UI Column High FileServer Memory Alarm

Logged As NODE_ALARM_HIGH_MFS_MEMORY

Meaning Memory consumed by the fileserver service on the node is high.

Resolution Log on as root to the node for which the alarm is raised, and restart the Warden:

/etc/init.d/mapr-warden restart

NFS Service Alarm

UI Column NFS Alarm

Logged As NODE_ALARM_SERVICE_NFS_DOWN

Meaning The NFS service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the NFS service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the NFS service, the alarm is cleared. If the warden is unable to restart the NFS service, it may be necessary to contact technical support.

PAM Misconfigured Alarm

UI Column PAM Alarm

Logged As NODE_ALARM_PAM_MISCONFIGURED

Meaning The PAM authentication on the node is configured incorrectly.

Resolution See PAM Configuration.

Root Partition Full Alarm

UI Column Root partition full

Logged As NODE_ALARM_ROOT_PARTITION_FULL

Meaning The root partition ('/') on the node is running out of space (99% full).

Resolution Free up some space in the root partition of the node.

TaskTracker Service Alarm

UI Column TaskTracker Alarm

Logged As NODE_ALARM_SERVICE_TT_DOWN

Meaning The TaskTracker service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the TaskTracker service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the TaskTracker service, the alarm is cleared. If the warden is unable to restart the TaskTracker service, it may be necessary to contact technical support.

TaskTracker Local Directory Full Alarm

UI Column TaskTracker Local Directory Full Alarm

Logged As NODE_ALARM_TT_LOCALDIR_FULL

Meaning The local directory used by the TaskTracker on the specified node(s) is full, and the TaskTracker cannot operate as a result.


Resolution Delete or move data from the local disks, or add storage to the specified node(s), and try the jobs again.

Time Skew Alarm

UI Column Time Skew Alarm

Logged As NODE_ALARM_TIME_SKEW

Meaning The clock on the node is out of sync with the master CLDB by more than 20 seconds.

Resolution Use NTP to synchronize the time on all the nodes in the cluster.

Version Alarm

UI Column Version Alarm

Logged As NODE_ALARM_VERSION_MISMATCH

Meaning One or more services on the node are running an unexpected version.

Resolution Stop the node, restore the correct version of any services you have modified, and restart the node. See Managing Nodes.

WebServer Service Alarm

UI Column WebServer Alarm

Logged As NODE_ALARM_SERVICE_WEBSERVER_DOWN

Meaning The WebServer service on the node has stopped running.

Resolution Go to the Manage Services pane of the Node Properties view to check whether the WebServer service is running. The warden will try three times to restart the service automatically. After an interval (30 minutes by default) the warden will again try three times to restart the service. The interval can be configured using the services.retryinterval.time.sec parameter in warden.conf. If the warden successfully restarts the WebServer service, the alarm is cleared. If the warden is unable to restart the WebServer service, it may be necessary to contact technical support.

Volume Alarms

Volume alarms indicate problems in individual volumes. The following tables describe the MapR volume alarms.

Data Unavailable

UI Column Data Alarm

Logged As VOLUME_ALARM_DATA_UNAVAILABLE

Meaning This is a potentially very serious alarm that may indicate data loss. Some of the data on the volume cannot be located. This alarm indicates that enough nodes have failed to bring the replication factor of part or all of the volume to zero. For example, if the volume is stored on a single node and has a replication factor of one, the Data Unavailable alarm will be raised if that node fails or is taken out of service unexpectedly. If a volume is replicated properly (and therefore is stored on multiple nodes), then the Data Unavailable alarm can indicate that a significant number of nodes is down.

Resolution Investigate any nodes that have failed or are out of service.

You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the Dashboard. Check the cluster(s) for any snapshots or mirrors that can be used to re-create the volume. You can see snapshots and mirrors in the MapR-FS view.

Data Under-Replicated

UI Column Replication Alarm


Logged As VOLUME_ALARM_DATA_UNDER_REPLICATED

Meaning The volume replication factor is lower than the minimum replication factor set in Volume Properties. This can be caused by failing disks or nodes, or the cluster may be running out of storage space.

Resolution Investigate any nodes that are failing. You can see which nodes have failed by looking at the Cluster Node Heatmap pane of the Dashboard. Determine whether it is necessary to add disks or nodes to the cluster. This alarm is generally raised when the nodes that store the volumes or replicas have not sent a heartbeat for five minutes. To prevent re-replication during normal maintenance procedures, MapR waits a specified interval (by default, one hour) before considering the node dead and re-replicating its data. You can control this interval by setting the cldb.fs.mark.rereplicate.sec parameter using the config save command.

Mirror Failure

UI Column Mirror Alarm

Logged As VOLUME_ALARM_MIRROR_FAILURE

Meaning A mirror operation failed.

Resolution Make sure the CLDB is running on both the source cluster and the destination cluster. Look at the CLDB log (/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) on both clusters for more information. If the attempted mirror operation was between two clusters, make sure that both clusters are reachable over the network. Make sure the source volume is available and reachable from the cluster that is performing the mirror operation.

No Nodes in Topology

UI Column No Nodes in Vol Topo

Logged As VOLUME_ALARM_NO_NODES_IN_TOPOLOGY

Meaning The path specified in the volume's topology no longer corresponds to a physical topology that contains any nodes, either due to node failures or changes to node topology settings. While this alarm is raised, MapR places data for the volume on nodes outside the volume's topology to prevent write failures.

Resolution Add nodes to the specified volume topology, either by moving existing nodes or adding nodes to the cluster. See Node Topology.

Snapshot Failure

UI Column Snapshot Alarm

Logged As VOLUME_ALARM_SNAPSHOT_FAILURE

Meaning A snapshot operation failed.

Resolution Make sure the CLDB is running. Look at the CLDB log (/opt/mapr/logs/cldb.log) and the MapR-FS log (/opt/mapr/logs/mfs.log) on both clusters for more information. If the attempted snapshot was a scheduled snapshot that was running in the background, try a manual snapshot.

Topology Almost Full

UI Column Vol Topo Almost Full

Logged As VOLUME_ALARM_TOPOLOGY_ALMOST_FULL

Meaning The nodes in the specified topology are running out of storage space.

Resolution Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the specified topology.

Topology Full Alarm

UI Column Vol Topo Full


Logged As VOLUME_ALARM_TOPOLOGY_FULL

Meaning The nodes in the specified topology have run out of storage space.

Resolution Move volumes to another topology, enlarge the specified topology by adding more nodes, or add disks to the nodes in the specified topology.

Volume Advisory Quota Alarm

UI Column Vol Advisory Quota Alarm

Logged As VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED

Meaning A volume has exceeded its advisory quota.

Resolution No immediate action is required. To avoid exceeding the hard quota, clear space on the volume or stop further data writes.

Volume Quota Alarm

UI Column Vol Quota Alarm

Logged As VOLUME_ALARM_QUOTA_EXCEEDED

Meaning A volume has exceeded its quota. Further writes to the volume will fail.

Resolution Free some space on the volume or increase the volume hard quota.


Development Guide

Welcome to the MapR Development Guide! This guide is for Hadoop developers who create, manage, and optimize MapReduce jobs on a MapR cluster. The topics in this guide include tuning MapReduce settings, working with the MapR file system (MapR-FS), and more.

The focus of the Development Guide is job management. For details on configuring the cluster topology and services, see the Administration Guide. See the Installation Guide for details on planning and installing a MapR cluster.

Click on one of the sub-sections below to get started.

Working with MapReduce
Compiling Pipes Programs
ExpressLane
Secured TaskTracker
Standalone Operation
Tuning MapReduce

Working with MapR-FS
Chunk Size
Compression

Working with Data
Accessing Data with NFS
Copying Data from Apache Hadoop
Data Protection
Provisioning Applications


Working with MapReduce

If you have used Hadoop in the past to run MapReduce jobs, then running jobs on the MapR Distribution for Apache Hadoop will be very familiar to you. MapR is a full Hadoop distribution, API-compatible with all versions of Hadoop. MapR provides additional capabilities not present in any other Hadoop distribution. This section contains information about the following topics:

Secured TaskTracker - Controlling which users are able to submit jobs to the TaskTracker
Standalone Operation - Running MapReduce jobs locally, using the local filesystem
Tuning MapReduce - Strategies for optimizing resources to meet the goals of your application



Compiling Pipes Programs

To facilitate running jobs on various platforms, MapR provides the hadoop pipes, hadoop pipes util, and pipes-example sources.

When using pipes, all nodes must run the same distribution of the operating system. If you run different distributions (Red Hat and CentOS, for example) on nodes in the same cluster, the compiled application might run on some nodes but not others.

To compile the pipes example:

1. Install libssl on all nodes.
2. Change to the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/utils directory, and execute the following commands:

chmod +x configure
./configure    # resolve any errors
make install

3. Change to the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/pipes directory, and execute the following commands:

chmod +x configure
./configure    # resolve any errors
make install

The APIs and libraries will be in the /opt/mapr/hadoop/hadoop-0.20.2/src/c++/install directory.

4. Compile pipes-example:

cd /opt/mapr/hadoop/hadoop-0.20.2/src/c++
g++ pipes-example/impl/wordcount-simple.cc -Iinstall/include/ -Linstall/lib/ -lhadooputils -lhadooppipes -lssl -lpthread -o wc-simple

To run the pipes example:

1. Copy the pipes program into MapR-FS.
2. Run the hadoop pipes command:

hadoop pipes -Dhadoop.pipes.java.recordreader=true -Dhadoop.pipes.java.recordwriter=true -input <input-dir> -output <output-dir> -program <MapR-FS path to program>


ExpressLane

MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in mapred-site.xml:

Parameter Value Description

mapred.fairscheduler.smalljob.schedule.enable  true  Enable small job fast scheduling inside the fair scheduler. TaskTrackers should reserve a slot, called an ephemeral slot, which is used for small jobs when the cluster is busy.

mapred.fairscheduler.smalljob.max.maps  10  Small job definition. Maximum number of maps allowed in a small job.

mapred.fairscheduler.smalljob.max.reducers  10  Small job definition. Maximum number of reducers allowed in a small job.

mapred.fairscheduler.smalljob.max.inputsize  10737418240  Small job definition. Maximum input size in bytes allowed for a small job. Default is 10 GB.

mapred.fairscheduler.smalljob.max.reducer.inputsize  1073741824  Small job definition. Maximum estimated input size for a reducer allowed in a small job. Default is 1 GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb  200  Small job definition. Maximum memory in MB reserved for an ephemeral slot. Default is 200 MB. This value must be the same on JobTracker and TaskTracker nodes.

MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution.
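As a sketch, enabling ExpressLane and tightening the small-job definition in mapred-site.xml might look like the following; the max.maps value of 5 is illustrative, not a recommendation:

<property>
  <name>mapred.fairscheduler.smalljob.schedule.enable</name>
  <value>true</value>
</property>
<property>
  <name>mapred.fairscheduler.smalljob.max.maps</name>
  <value>5</value>
</property>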



Secured TaskTracker

You can control which users are able to submit jobs to the TaskTracker. By default, the TaskTracker is secured: all TaskTracker nodes should have the same user and group databases, and only users who are present on all TaskTracker nodes (with the same user ID on all nodes) can submit jobs. You can disallow certain users (including root or other superusers) from submitting jobs, or remove user restrictions from the TaskTracker completely. These settings are made in /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml and in taskcontroller.cfg, as described below.

To disallow root:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg and set min.user.id=0 on all TaskTracker nodes.
3. Restart all TaskTrackers.

To disallow all superusers:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg and set min.user.id=1000 on all TaskTracker nodes (a sketch of both edits follows this list).
3. Restart all TaskTrackers.
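For illustration, a sketch of the two edits described above; the exact location of taskcontroller.cfg under the Hadoop conf directory is an assumption:

<!-- mapred-site.xml, on every TaskTracker node -->
<property>
  <name>mapred.tasktracker.task-controller.config.overwrite</name>
  <value>false</value>
</property>

# taskcontroller.cfg, on every TaskTracker node
# disallows job submission by users with IDs below 1000 (assumption based on the text above)
min.user.id=1000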

To disallow specific users:

1. Edit mapred-site.xml and set mapred.tasktracker.task-controller.config.overwrite = false on all TaskTracker nodes.
2. Edit taskcontroller.cfg and add the banned.users parameter on all TaskTracker nodes, setting it to a comma-separated list of usernames. Example:

banned.users=foo,bar

3. Restart all TaskTrackers.

To remove all user restrictions and run all jobs as root:

1. Edit mapred-site.xml and set mapred.task.tracker.task.controller = org.apache.hadoop.mapred.DefaultTaskController on all TaskTracker nodes (a sketch of this entry follows these steps).

2. Restart all TaskTrackers.
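A sketch of the corresponding mapred-site.xml entry, using the property name and value given above:

<property>
  <name>mapred.task.tracker.task.controller</name>
  <value>org.apache.hadoop.mapred.DefaultTaskController</value>
</property>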

When you make the above setting, the tasks generated by all jobs submitted by any user will run with the same privileges as the TaskTracker (root privileges), and will have the ability to overwrite, delete, or damage data regardless of ownership or permissions.


Standalone Operation

You can run MapReduce jobs locally, using the local filesystem, by setting mapred.job.tracker=local in mapred-site.xml. With that parameter set, you can use the local filesystem for both input and output, read input from MapR-FS and write output to the local filesystem, or read input from the local filesystem and write output to MapR-FS.

Examples

Input and output on local filesystem

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local file:///opt/mapr/hadoop/hadoop-0.20.2/input file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+'

Input from MapR-FS

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local input file:///opt/mapr/hadoop/hadoop-0.20.2/output 'dfs[a-z.]+'

Output to MapR-FS

./bin/hadoop jar hadoop-0.20.2-dev-examples.jar grep -Dmapred.job.tracker=local file:///opt/mapr/hadoop/hadoop-0.20.2/input output 'dfs[a-z.]+'


Tuning MapReduce

MapR automatically tunes the cluster for most purposes. A service called the warden determines machine resources on nodes configured to run the TaskTracker service, and sets MapReduce parameters accordingly.

On nodes with multiple CPUs, MapR uses taskset to reserve CPUs for MapR services:

On nodes with five to eight CPUs, CPU 0 is reserved for MapR services
On nodes with nine or more CPUs, CPU 0 and CPU 1 are reserved for MapR services

In certain circumstances, you might wish to manually tune MapR to provide higher performance. For example, when running a job consisting of unusually large tasks, it is helpful to reduce the number of slots on each TaskTracker and adjust the Java heap size. The following sections provide MapReduce tuning tips. If you change any settings in mapred-site.xml, restart the TaskTracker.

Memory Settings

Memory for MapR Services

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate percent, max, and min parameters in the warden.conf file:

...
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
...

The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services, unless you see specific memory-related problems occurring.

MapReduce Memory

The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If necessary, you can use the mapreduce.tasktracker.reserved.physicalmemory.mb parameter to set the maximum physical memory reserved by MapReduce tasks, or you can set it to -1 to disable physical memory accounting and task management.

If the node runs out of memory, MapReduce tasks are killed by the OOM-killer to free memory. You can use the mapred.child.oom_adj parameter (copy it from mapred-default.xml) to adjust the oom_adj value for MapReduce tasks. The possible values of oom_adj range from -17 to +15. The higher the score, the more likely the associated process is to be killed by the OOM-killer.
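As a sketch, the parameter could be set in mapred-site.xml like this; the value 10 is illustrative, not a recommendation:

<property>
  <name>mapred.child.oom_adj</name>
  <value>10</value>
</property>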

Job Configuration

Map Tasks

Map tasks use memory mainly in two ways:

The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs.
The application consumes memory to run the map function.

MapReduce framework memory is controlled by io.sort.mb. If io.sort.mb is less than the data emitted from the mapper, the task ends up spilling data to disk. If io.sort.mb is too large, the task can run out of memory or waste allocated memory. By default, io.sort.mb is 100 MB. It should be approximately 1.25 times the number of data bytes emitted from the mapper. If you cannot resolve memory problems by adjusting io.sort.mb, then try to re-write the application to use less memory in its map function.
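For instance, io.sort.mb can be overridden per job on the command line, following the -D pattern used elsewhere in this guide; the program name, value, and paths here are illustrative:

hadoop jar hadoop-0.20.2-dev-examples.jar wordcount -Dio.sort.mb=160 /input /output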

Compression

To turn off MapR compression for map outputs, set mapreduce.maprfs.use.compression=false
To turn on LZO or any other compression, set mapreduce.maprfs.use.compression=false and mapred.compress.map.output=true

Reduce Tasks

If tasks fail because of an Out of Heap Space error, increase the heap space (the -Xmx option in mapred.reduce.child.java.opts) to give more memory to the tasks. If map tasks are failing, you can also try reducing io.sort.mb (see mapred.map.child.java.opts in mapred-site.xml).
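A sketch of raising the reduce-task heap in mapred-site.xml; the -Xmx2000m value is illustrative:

<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>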

TaskTracker Configuration

MapR sets up map and reduce slots on each TaskTracker node using formulas based on the number of CPUs present on the node. The default formulas are stored in the following parameters in mapred-site.xml:

mapred.tasktracker.map.tasks.maximum: (CPUS > 2) ? (CPUS * 0.75) : 1 (at least one map slot, up to 0.75 times the number of CPUs)
mapred.tasktracker.reduce.tasks.maximum: (CPUS > 2) ? (CPUS * 0.50) : 1 (at least one reduce slot, up to 0.50 times the number of CPUs)

You can adjust the maximum number of map and reduce slots by editing the formulas used in mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. The following variables are used in the formulas:

CPUS - number of CPUs present on the node
DISKS - number of disks present on the node
MEM - memory reserved for MapReduce tasks

Ideally, the number of map and reduce slots should be decided based on the needs of the application. Map slots should be based on how many map tasks can fit in memory, and reduce slots should be based on the number of CPUs. If each task in a MapReduce job takes 3 GB, and each node has 9 GB reserved for MapReduce tasks, then the total number of map slots should be 3. The amount of data each map task must process also affects how many map slots should be configured. If each map task processes 256 MB (the default chunk size in MapR), then each map task should have 800 MB of memory. If there are 4 GB reserved for map tasks, then the number of map slots should be 4000 MB/800 MB, or 5 slots.

MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes in advance of the availability of slots, creating a pipeline. This optimization allows the TaskTracker to launch each map task as soon as the previous running map task finishes. The number of tasks to over-schedule should be about 25-50% of the total number of map slots. You can adjust this number with the mapreduce.tasktracker.prefetch.maptasks parameter.


Working with MapR-FS


The mapr-clusters.conf file contains a list of the clusters that can be used by your application. The first cluster in the list is treated as the default cluster.

In core-site.xml, the fs.default.name parameter determines the default filesystem used by your application. Normally, this should be set to one of the following values:

maprfs:/// - resolves to the default cluster in mapr-clusters.conf
maprfs:///mapr/<cluster name>/ or /mapr/<cluster name>/ - resolves to the specified cluster

In general, the first two options (maprfs:/// and maprfs:///mapr/<cluster name>/) provide the most flexibility, because they are not tied to an IP address and will continue to function even if the IP address of the master CLDB changes (during failover, for example).
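For example, a minimal core-site.xml entry pointing applications at the default cluster (a sketch, using the parameter and value named above):

<property>
  <name>fs.default.name</name>
  <value>maprfs:///</value>
</property>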

Using Java to Interface with MapR-FS

In your Java application, you will use a Configuration object to interface with MapR-FS. When you run your Java application, add the Hadoop configuration directory /opt/mapr/hadoop/hadoop-<version>/conf to the Java classpath. When you instantiate a Configuration object, it is created with default values drawn from configuration files in that directory.

Sample Code

The following sample code shows how to interface with MapR-FS using Java. The example creates a directory, writes a file, then reads the contents of the file.

Compiling the sample code requires only the Hadoop core JAR:

javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java

Running the sample code uses the following library path:

java -Djava.library.path=/opt/mapr/lib -cp .:\
/opt/mapr/hadoop/hadoop-0.20.2/conf:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar:\
/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar \
MapRTest /test

Sample Code

/* Copyright (c) 2009 & onwards. MapR Tech, Inc., All rights reserved */

//package com.mapr.fs;

import java.net.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;

/**
 * Assumes mapr installed in /opt/mapr
 *
 * Compilation needs only the hadoop jars:
 * javac -cp /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar MapRTest.java
 *
 * Run:
 * java -Djava.library.path=/opt/mapr/lib -cp /opt/mapr/hadoop/hadoop-0.20.2/conf:/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar:.:/opt/mapr/hadoop/hadoop-0.20.2/lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar MapRTest /test
 */

public class MapRTest
{
    public static void main(String args[]) throws Exception
    {
        byte buf[] = new byte[65*1024];
        int ac = 0;

        if (args.length != 1) {
            System.out.println("usage: MapRTest pathname");
            return;
        }

        // maprfs:/// -> uses the first entry in /opt/mapr/conf/mapr-clusters.conf
        // maprfs:///mapr/my.cluster.com/
        // /mapr/my.cluster.com/

        // String uri = "maprfs:///";
        String dirname = args[ac++];

        Configuration conf = new Configuration();

        // FileSystem fs = FileSystem.get(URI.create(uri), conf); // if wanting to use a different cluster
        FileSystem fs = FileSystem.get(conf);

        Path dirpath = new Path(dirname + "/dir");
        Path wfilepath = new Path(dirname + "/file.w");
        // Path rfilepath = new Path(dirname + "/file.r");
        Path rfilepath = wfilepath;

        // try mkdir
        boolean res = fs.mkdirs(dirpath);
        if (!res) {
            System.out.println("mkdir failed, path: " + dirpath);
            return;
        }

        System.out.println("mkdir( " + dirpath + ") went ok, now writing file");

        // create wfile
        FSDataOutputStream ostr = fs.create(wfilepath,
                                            true,                 // overwrite
                                            512,                  // buffersize
                                            (short) 1,            // replication
                                            (long)(64*1024*1024)  // chunksize
                                            );
        ostr.write(buf);
        ostr.close();

        System.out.println("write( " + wfilepath + ") went ok");

        // read rfile
        System.out.println("reading file: " + rfilepath);
        FSDataInputStream istr = fs.open(rfilepath);
        int bb = istr.readInt();
        istr.close();
        System.out.println("Read ok");
    }
}


Using C to Interface with MapR-FS

Apache Hadoop comes with a library (libhdfs.so) that provides a C API for operations on any filesystem (maprfs, localfs, hdfs, and others). MapR Distribution for Apache Hadoop provides an additional custom library (libMapRClient) that provides the same API for operations on MapR-FS. MapR recommends libMapRClient.so for operations exclusive to MapR-FS, but includes libhdfs.so to perform operations on other filesystems and for backward compatibility.

The APIs are defined in the header file /opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs/hdfs.h, which includes documentation for each API. Three sample programs are included in the same directory: hdfs_test.c, hdfs_write.c, and hdfs_read.c.

There are two ways to link your program with libMapRClient:

Since libMapRClient.so is a common library for both MapReduce and the filesystem, it has a dependency on libjvm. You can operate with libMapRClient by linking your program with libjvm (see run1.sh for an example). This is the recommended approach.
If you do not have a JVM on your system, you can pass the -Wl,--allow-shlib-undefined options to gcc to ignore the undefined JNI symbols while linking your program (see run2.sh for an example).

Finally, before running your program, some environment variables need to be set, depending on which option is chosen. For examples, look at run1.sh and run2.sh.

run1.sh

#!/bin/bash

# Ensure JAVA_HOME is defined
if [ "$JAVA_HOME" = "" ]; then
    echo "JAVA_HOME not defined"
    exit 1
fi

# Setup environment
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mapr/lib:$JAVA_HOME/jre/lib/amd64/server/
GCC_OPTS="-I. -I$HADOOP_HOME/src/c++/libhdfs -I$JAVA_HOME/include -I$JAVA_HOME/include/linux -L$HADOOP_HOME/c++/lib -L$JAVA_HOME/jre/lib/amd64/server/ -L/opt/mapr/lib -lMapRClient -ljvm"

# Compile
gcc $GCC_OPTS $HADOOP_HOME/src/c++/libhdfs/hdfs_test.c -o hdfs_test
gcc $GCC_OPTS $HADOOP_HOME/src/c++/libhdfs/hdfs_read.c -o hdfs_read
gcc $GCC_OPTS $HADOOP_HOME/src/c++/libhdfs/hdfs_write.c -o hdfs_write

# Run tests
./hdfs_test -m

run2.sh


#!/bin/bash

# Setup environment
export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2/
GCC_OPTS="-Wl,--allow-shlib-undefined -I. -I$HADOOP_HOME/src/c++/libhdfs -L/opt/mapr/lib -lMapRClient"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/mapr/lib

# Compile and Link
gcc $GCC_OPTS $HADOOP_HOME/src/c++/libhdfs/hdfs_test.c -o hdfs_test
gcc $GCC_OPTS $HADOOP_HOME/src/c++/libhdfs/hdfs_read.c -o hdfs_read
gcc $GCC_OPTS $HADOOP_HOME/src/c++/libhdfs/hdfs_write.c -o hdfs_write

# Run tests
./hdfs_test -m


Chunk Size

Files in MapR-FS are split into chunks (similar to Hadoop blocks) that are 256 MB by default. Any multiple of 65,536 bytes is a valid chunk size, but tuning the size correctly is important:

Smaller chunk sizes result in larger numbers of map tasks, which can result in lower performance due to task scheduling overhead.
Larger chunk sizes require more memory to sort the map task output, which can crash the JVM or add significant garbage collection overhead.
MapR can deliver a single stream at upwards of 300 MB per second, making it possible to use larger chunks than in stock Hadoop. Generally, it is wise to set the chunk size between 64 MB and 256 MB.

Chunk size is set at the directory level. Files inherit the chunk size settings of the directory that contains them, as do subdirectories on which chunk size has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, use the chunk size specified by the settings for the directory where the file is written. If you change a directory's chunk size settings after writing a file, the file will keep the old chunk size settings. Further writes to the file will use the file's existing chunk size.

Setting Chunk Size

You can set the chunk size for a given directory in two ways:

Change the ChunkSize attribute in the .dfs_attributes file at the top level of the directory
Use the command hadoop mfs -setchunksize <size> <dir>

For example, if the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test, you can set the chunk size to 268,435,456 bytes by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting ChunkSize=268435456. To accomplish the same thing from the hadoop shell, use the following command:

hadoop mfs -setchunksize 268435456 /projects/test
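Equivalently, the .dfs_attributes file for the directory might then contain lines like these; the Compression line is shown only for context and its value is an assumption:

# /mapr/my.cluster.com/projects/test/.dfs_attributes
Compression=true
ChunkSize=268435456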


Compression

MapR provides compression for files stored in the cluster. Compression is set at the directory level. Files inherit the compression settings of the directory that contains them, as do subdirectories on which compression has not been explicitly set. Any files written by a Hadoop application, whether via the file APIs or over NFS, are compressed according to the settings for the directory where the file is written. If you change a directory's compression settings after writing a file, the file keeps the old compression settings---that is, if you write a file in an uncompressed directory and then turn compression on, the file does not automatically end up compressed, and vice versa. Further writes to the file will use the file's existing compression setting.

Only the owner of a directory can change its compression settings or other attributes. Write permission is not sufficient.

By default, MapR does not compress files whose filename extensions indicate they are already compressed. The default list of filename extensions is as follows:

bz2, gz, lzo, tgz, tbz2, zip, z, Z, mp3, jpg, jpeg, mpg, mpeg, avi, gif, png

The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter, and can be modified with the config save command. Example:

maprcli config save -values '{"mapr.fs.nocompression":"bz2,gz,lzo,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"}'

The list can be viewed with the config load command. Example:

maprcli config load -keys mapr.fs.nocompression

Setting Compression on Directories

You can turn compression on or off for a given directory in two ways:

Change the Compression attribute in the .dfs_attributes file at the top level of the directory
Use the command hadoop mfs -setcompression on|off <dir>

For example, if the volume test is NFS-mounted at /mapr/my.cluster.com/projects/test, you can turn off compression by editing the file /mapr/my.cluster.com/projects/test/.dfs_attributes and setting Compression=false. To accomplish the same thing from the hadoop shell, use the following command:

hadoop mfs -setcompression off /projects/test

You can view the compression settings for directories using the hadoop mfs -ls command.

Setting Compression During Shuffle

You can use the -Dmapreduce.maprfs.use.compression switch to turn compression off during the Shuffle phase of a MapReduce job. Example:


hadoop jar xxx.jar -Dmapreduce.maprfs.use.compression=false


Working with Data

This section contains information about working with data:

Copying Data from Apache Hadoop - using distcp to copy data to MapR from an Apache cluster
Data Protection - how to protect data from corruption or deletion
Accessing Data with NFS - how to mount the cluster via NFS
Managing Data with Volumes - using volumes to manage data
Mirrors - local or remote copies of volumes
Schedules - scheduling for snapshots and mirrors
Snapshots - point-in-time images of volumes



Accessing Data with NFS

Unlike other Hadoop distributions, which only allow cluster data import or export as a batch operation, MapR lets you mount the cluster itself via NFS so that your applications can read and write data directly. MapR allows direct file modification and multiple concurrent reads and writes via POSIX semantics. With an NFS-mounted cluster, you can read and write data directly with standard tools, applications, and scripts. For example, you could run a MapReduce job that outputs to a CSV file, then import the CSV file directly into SQL via NFS.

MapR exports each cluster as the directory /mapr/<cluster name> (for example, /mapr/default). If you create a mount point with the local path /mapr, then Hadoop FS paths and NFS paths to the cluster will be the same. This makes it easy to work on the same files via NFS and Hadoop. In a multi-cluster setting, the clusters share a single namespace, and you can see them all by mounting the top-level /mapr directory.

Mounting the Cluster

Before you begin, make sure you know the hostname and directory of the NFS share you plan to mount. Example:

usa-node01:/mapr - for mounting from the command line
nfs://usa-node01/mapr - for mounting from the Mac Finder

Make sure the client machine has the appropriate username and password to access the NFS share. For best results, the username and password for accessing the MapR cluster should be the same username and password used to log into the client machine.

Linux

1. Make sure the NFS client is installed. Examples:

sudo yum install nfs-utils (Red Hat or CentOS)
sudo apt-get install nfs-common (Ubuntu)

2. List the NFS shares exported on the server. Example:

showmount -e usa-node01

3. Set up a mount point for an NFS share. Example:

sudo mkdir /mapr

4. Mount the cluster via NFS. Example:

sudo mount usa-node01:/mapr /mapr

You can also add an NFS mount to /etc/fstab so that it mounts automatically when your system starts up. Example:

# device            mountpoint  fs-type  options  dump  fsckorder
...
usa-node01:/mapr    /mapr       nfs      rw       0     0
...

Mac

To mount the cluster from the Finder:

1. Open the Disk Utility: go to Applications > Utilities > Disk Utility.
2. Select File > NFS Mounts.
3. Click the + at the bottom of the NFS Mounts window.
4. In the dialog that appears, enter the following information:

Remote NFS URL: The URL for the NFS mount. If you do not know the URL, use the showmount command described below. Example: nfs://usa-node01/mapr
Mount location: The mount point where the NFS mount should appear in the local filesystem.


5. Click the triangle next to Advanced Mount Parameters.
6. Enter nolocks in the text field.
7. Click Verify.
8. Important: On the dialog that appears, click Don't Verify to skip the verification process.

The MapR cluster should now appear at the location you specified as the mount point.

To mount the cluster from the command line:

1. List the NFS shares exported on the server. Example:

showmount -e usa-node01

2. Set up a mount point for an NFS share. Example:

sudo mkdir /mapr

3. Mount the cluster via NFS. Example:

sudo mount -o nolock usa-node01:/mapr /mapr

Windows

Because of Windows directory caching, there may appear to be no .snapshot directory in each volume's root directory. To work around the problem, force Windows to re-load the volume's root directory by updating its modification time (for example, by creating an empty file or directory in the volume's root directory).

To mount the cluster on Windows 7 Ultimate or Windows 7 Enterprise:



1. Open Start > Control Panel > Programs.
2. Select Turn Windows features on or off.
3. Select Services for NFS.
4. Click OK.
5. Mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:

mount -o nolock usa-node01:/mapr z:

To mount the cluster on other Windows versions:

1. Download and install Microsoft Windows Services for Unix (SFU). You only need to install the NFS Client and the User Name Mapping.
2. Configure the user authentication in SFU to match the authentication used by the cluster (LDAP or operating system users). You can map local Windows users to cluster Linux users, if desired.
3. Once SFU is installed and configured, mount the cluster and map it to a drive using the Map Network Drive tool or from the command line. Example:

mount -o nolock usa-node01:/mapr z:

To map a network drive with the Map Network Drive tool:

1. Open Start > My Computer.



2. Select Tools > Map Network Drive.
3. In the Map Network Drive window, choose an unused drive letter from the Drive drop-down list.
4. Specify the Folder by browsing for the MapR cluster, or by typing the hostname and directory into the text field.
5. Browse for the MapR cluster or type the name of the folder to map. This name must follow UNC. Alternatively, click the Browse... button to find the correct folder by browsing available network shares.
6. Select Reconnect at login to reconnect automatically to the MapR cluster whenever you log into the computer.
7. Click Finish.

Setting Compression and Chunk Size

Each directory in MapR storage contains a hidden file called .dfs_attributes that controls compression and chunk size. To change these attributes, change the corresponding values in the file.

Valid values:

Compression: true or false
Chunk size (in bytes): a multiple of 65536 (64 KB) or zero (no chunks). Example: 131072

You can also set compression and chunk size using the hadoop mfs command.
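As a sketch, the contents of a .dfs_attributes file might look like the following (the exact attribute names are an assumption here, so check the file on your own cluster before editing):

Compression=true
ChunkSize=131072

Assuming the hadoop mfs options -setcompression and -setchunksize (verify against the hadoop mfs reference for your release), the equivalent command-line changes would be:

hadoop mfs -setcompression on /myvolume/mydir
hadoop mfs -setchunksize 131072 /myvolume/mydir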

By default, MapR does not compress files whose filename extensions indicate they are already compressed. The default list of filename extensions is as follows:

bz2, gz, lzo, tgz, tbz2, zip, z, Z, mp3, jpg, jpeg, mpg, mpeg, avi, gif, png

The list of filename extensions not to compress is stored as comma-separated values in the mapr.fs.nocompression configuration parameter, and can be modified with the config save command. Example:

maprcli config save -values '{"mapr.fs.nocompression":"bz2,gz,lzo,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"}'

The list can be viewed with the config load command. Example:

maprcli config load -keys mapr.fs.nocompression


Copying Data from Apache Hadoop

There are three ways to copy data from an HDFS cluster to MapR:

If the HDFS cluster uses the same version of the RPC protocol that MapR uses (currently version 4), use distcp normally to copy data with the following procedure.
If you are copying very small amounts of data, use hftp.
If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason the above steps do not work, you can push data from the HDFS cluster.

To copy data from HDFS to MapR using distcp:

<NameNode IP> - the IP address of the NameNode in the HDFS cluster
<NameNode Port> - the port for connecting to the NameNode in the HDFS cluster
<HDFS path> - the path to the HDFS directory from which you plan to copy data
<MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data
<file> - a file in the HDFS path

1. From a node in the MapR cluster, try hadoop fs -ls to determine whether the MapR cluster can successfully communicate with the HDFS cluster:

hadoop fs -ls <NameNode IP>:<NameNode port>/<path>

2. If the hadoop fs -ls command is successful, try hadoop fs -cat to determine whether the MapR cluster can read file contents from the specified path on the HDFS cluster:

hadoop fs -cat <NameNode IP>:<NameNode port>/<HDFS path>/<file>

3. If you are able to communicate with the HDFS cluster and read file contents, use distcp to copy data from the HDFS cluster to the MapR cluster:

hadoop distcp <NameNode IP>:<NameNode port>/<HDFS path> <MapR-FS path>

Using hftp

<NameNode IP> - the IP address of the NameNode in the HDFS cluster
<NameNode HTTP Port> - the HTTP port on the NameNode in the HDFS cluster
<HDFS path> - the path to the HDFS directory from which you plan to copy data
<MapR-FS path> - the path in the MapR cluster to which you plan to copy HDFS data

Use distcp over HFTP to copy files:

hadoop distcp hftp://<NameNode IP>:<NameNode HTTP Port>/<HDFS path> <MapR-FS path>

To push data from an HDFS cluster

Perform the following steps from a MapR client or node (any computer that has either mapr-core or mapr-client installed). For more information about setting up a MapR client, see Setting Up the Client.

<input path> - the HDFS path to the source data
<output path> - the MapR-FS path to the target directory
<MapR CLDB IP> - the IP address of the master CLDB node on the MapR cluster

1. Log in as the root user (or use sudo for the following commands).
2. Create the directory /tmp/maprfs-client/ on the Apache Hadoop JobClient node.
3. Copy the following files from a MapR client or any MapR node to the /tmp/maprfs-client/ directory:

/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar
/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar
/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64/libMapRClient.so

4. Install the files in the correct places on the Apache Hadoop JobClient node:


cp /tmp/maprfs-client/maprfs-0.1.jar $HADOOP_HOME/lib/
cp /tmp/maprfs-client/zookeeper-3.3.2.jar $HADOOP_HOME/lib/
cp /tmp/maprfs-client/libMapRClient.so $HADOOP_HOME/lib/native/Linux-amd64-64/libMapRClient.so

Note: If you are on a 32-bit client, use Linux-i386-32 in place of Linux-amd64-64 above.
5. If the JobTracker is a different node from the JobClient node, copy and install the files to the JobTracker node as well using the above steps.
6. On the JobTracker node, set fs.maprfs.impl=com.mapr.fs.MapRFileSystem in $HADOOP_HOME/conf/core-site.xml (a configuration sketch follows the example below).
7. Restart the JobTracker.
8. You can now copy data to the MapR cluster by running distcp on the JobClient node of the Apache Hadoop cluster. Example:

./bin/hadoop distcp -Dfs.maprfs.impl=com.mapr.fs.MapRFileSystem -libjars /tmp/maprfs-client/maprfs-0.1.jar,/tmp/maprfs-client/zookeeper-3.3.2.jar -files /tmp/maprfs-client/libMapRClient.so <input path> maprfs://<MapR CLDB IP>:7222/<output path>
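The core-site.xml change in step 6 amounts to adding a standard Hadoop property entry; a minimal sketch:

<property>
  <name>fs.maprfs.impl</name>
  <value>com.mapr.fs.MapRFileSystem</value>
</property>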


Data Protection

You can use MapR to protect your data from hardware failures, accidental overwrites, and natural disasters. MapR organizes data into volumes so that you can apply different data protection strategies to different types of data. The following scenarios describe a few common problems and how easily and effectively MapR protects your data from loss.

Scenario: Hardware Failure

Even with the most reliable hardware, growing cluster and datacenter sizes will make frequent hardware failures a real threat to business continuity. In a cluster with 10,000 disks on 1,000 nodes, it is reasonable to expect a disk failure more than once a day and a node failure every few days.

Solution: Topology and Replication Factor

MapR automatically replicates data and places the copies on different nodes to safeguard against data loss in the event of hardware failure. By default, MapR assumes that all nodes are in a single rack. You can provide MapR with information about the rack locations of all nodes by setting topology paths. MapR interprets each topology path as a separate rack, and attempts to replicate data onto different racks to provide continuity in case of a power failure affecting an entire rack. These replicas are maintained, copied, and made available seamlessly without user intervention.

To set up topology and replication:

1. In the MapR Control System, open the MapR-FS group and click Nodes to display the Nodes view.
2. Set up each rack with its own path. For each rack, perform the following steps:

a. Click the checkboxes next to the nodes in the rack.
b. Click the Change Topology button to display the Change Node Topology dialog.
c. In the Change Node Topology dialog, type a path to represent the rack. For example, if the cluster name is cluster1 and the nodes are in rack 14, type /cluster1/rack14.

3. When creating volumes, choose a Replication Factor of 3 or more to provide sufficient data redundancy.
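The same replication policy can be applied when creating a volume from the command line; a sketch (the volume name and path are placeholders):

maprcli volume create -name appdata -path /appdata -replication 3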

Scenario: Accidental Overwrite

Even in a cluster with data replication, important data can be overwritten or deleted accidentally. If a data set is accidentally removed, the removal itself propagates across the replicas and the data is lost. Users or applications can corrupt data, and once the corruption spreads to the replicas the damage is permanent.

Solution: Snapshots

With MapR, you can create a point-in-time snapshot of a volume, allowing recovery from a known good data set. You can create a manual snapshot to enable recovery to a specific point in time, or schedule snapshots to occur regularly to maintain a recent recovery point. If data is lost, you can restore the data using the most recent snapshot (or any snapshot you choose). Snapshots do not add a performance penalty, because they do not involve additional data copying operations; a snapshot can be created almost instantly regardless of data size.

Example: Creating a Snapshot Manually

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Select the checkbox beside the name of the volume, then click the New Snapshot button to display the Snapshot Name dialog.
3. Type a name for the new snapshot in the Name... field.
4. Click OK to create the snapshot.
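The same snapshot can be created with the volume snapshot create command; a sketch (the volume and snapshot names are placeholders):

maprcli volume snapshot create -volume appdata -snapshotname appdata-snap-20120401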

Example: Scheduling Snapshots

This example schedules snapshots for a volume hourly and retains them for 24 hours.

To create a schedule:

1. In the Navigation pane, expand the MapR-FS group and click the Schedules view.
2. Click New Schedule.
3. In the Schedule Name field, type "Every Hour".
4. From the first dropdown menu in the Schedule Rules section, select Hourly.
5. In the Retain For field, specify 24 Hours.
6. Click Save Schedule to create the schedule.

To apply the schedule to the volume:

1. In the Navigation pane, expand the MapR-FS group and click the Volumes view.
2. Display the Volume Properties dialog by clicking the volume name, or by selecting the checkbox beside the volume name then clicking the Properties button.


3. In the Replication and Snapshot Scheduling section, choose "Every Hour."
4. Click Modify Volume to apply the changes and close the dialog.

Scenario: Disaster Recovery

A severe natural disaster can cripple an entire datacenter, leading to permanent data loss unless a disaster plan is in place.

Solution: Mirroring to Another Cluster

MapR makes it easy to protect against loss of an entire datacenter by mirroring entire volumes to a different datacenter. A mirror is a full read-only copy of a volume that can be synced on a schedule to provide point-in-time recovery for critical data. If the volumes on the original cluster contain a large amount of data, you can store them on physical media using the volume dump create command and transport them to the mirror cluster. Otherwise, you can simply create mirror volumes that point to the volumes on the original cluster and copy the data over the network. The mirroring operation conserves bandwidth by transmitting only the deltas between the source and the mirror, and by compressing the data over the wire. In addition, MapR uses checksums and a latency-tolerant protocol to ensure success even on high-latency WANs. You can set up a cascade of mirrors to replicate data over a distance. For instance, you can mirror data from New York to London, then use lower-cost links to replicate the data from London to Paris and Rome.

To set up mirroring to another cluster:

1. Use the volume dump create command to create a full volume dump for each volume you want to mirror.
2. Transport the volume dump to the mirror cluster.
3. For each volume on the original cluster, set up a corresponding volume on the mirror cluster.

a. Restore the volume using the volume dump restore command.
b. In the MapR Control System, click Volumes under the MapR-FS group to display the Volumes view.
c. Click the name of the volume to display the Volume Properties dialog.
d. Set the Volume Type to Remote Mirror Volume.
e. Set the Source Volume Name to the source volume name.
f. Set the Source Cluster Name to the cluster where the source volume resides.
g. In the Replication and Mirror Scheduling section, choose a schedule to determine how often the mirror will sync.

To recover volumes from mirrors:

1. Use the volume dump create command to create a full volume dump for each mirror volume you want to restore. Example:
maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume@cluster
2. Transport the volume dump to the rebuilt cluster.
3. For each volume on the mirror cluster, set up a corresponding volume on the rebuilt cluster.

a. Restore the volume using the volume dump restore command. Example:
maprcli volume dump restore -name volume@cluster -dumpfile fulldump1
b. Copy the files to a standard (non-mirror) volume.


Provisioning Applications

Provisioning a new application involves meeting the business goals of performance, continuity, and security while providing necessary resources to a client, department, or project. You'll want to know how much disk space is needed, and what the priorities are in terms of performance and reliability. Once you have gathered all the requirements, you will create a volume to manage the application data. A volume provides convenient control over data placement, performance, protection, and policy for an entire data set.

Make sure the cluster has the storage and processing capacity for the application. You'll need to take into account the starting and predicted size of the data, the performance and protection requirements, and the memory required to run all the processes required on each node. Here is the information to gather before beginning:

Access: How often will the data be read and written? What is the ratio of reads to writes?
Continuity: What is the desired recovery point objective (RPO)? What is the desired recovery time objective (RTO)?
Performance: Is the data static, or will it change frequently? Is the goal data storage or data processing?
Size: How much data capacity is required to start? What is the predicted growth of the data?

The considerations above will determine the best way to set up a volume for the application.

About Volumes

Volumes provide a number of ways to help you meet the performance, access, and continuity goals of an application, while managing application data size:

Mirroring - create read-only copies of the data for highly accessed data or multi-datacenter access
Permissions - allow users and groups to perform specific actions on a volume
Quotas - monitor and manage the data size by project, department, or user
Replication - maintain multiple synchronized copies of data for high availability and failure protection
Snapshots - create a real-time point-in-time data image to enable rollback
Topology - place data on a high-performance rack or limit data to a particular set of machines

See Managing Data with Volumes.

Mirroring

Mirroring means creating mirror volumes, full physical read-only copies of normal volumes for fault tolerance and high performance. When you create a mirror volume, you specify a source volume from which to copy data, and you can also specify a schedule to automate re-synchronization of the data to keep the mirror up-to-date. After a mirror is initially copied, the synchronization process saves bandwidth and reads on the source volume by transferring only the deltas needed to bring the mirror volume to the same state as its source volume. A mirror volume need not be on the same cluster as its source volume; MapR can sync data on another cluster (as long as it is reachable over the network). When creating multiple mirrors, you can further reduce the mirroring bandwidth overhead by daisy-chaining the mirrors. That is, set the source volume of the first mirror to the original volume, the source volume of the second mirror to the first mirror, and so on. Each mirror is a full copy of the volume, so remember to take the number of mirrors into account when planning application data size. See Mirrors.

Permissions

MapR provides fine-grained control over which users and groups can perform specific tasks on volumes and clusters. When you create a volume, keep in mind which users or groups should have these types of access to the volume. You may want to create a specific group to associate with a project or department, then add users to the group so that you can apply permissions to them all at the same time. See Managing Permissions.

Quotas

You can use quotas to limit the amount of disk space an application can use. There are two types of quotas:

User/Group quotas limit the amount of disk space available to a user or group
Volume quotas limit the amount of disk space available to a volume

When the data owned by a user, group, or volume exceeds the quota, MapR prevents further writes until either the data size falls below the quota again, or the quota is raised to accommodate the data.

Volumes, users, and groups can also be assigned advisory quotas. An advisory quota does not limit the disk space available, but raises an alarm and sends a notification when the space used exceeds a certain point. When you set a quota, you can use a slightly lower advisory quota as a warning that the data is about to exceed the quota, preventing further writes.


Remember that volume quotas do not take into account disk space used by sub-volumes (because volume paths are logical, not physical).

You can set a User/Group quota to manage and track the disk space used by an accounting entity (a department, project, or application):

1. Create a group to represent the accounting entity.
2. Create one or more volumes and use the group as the Accounting Entity for each.
3. Set a User/Group quota for the group.
4. Add the appropriate users to the group.
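Step 3 can also be done from the command line with the entity modify command; this is only a sketch, since the group name and quota values are placeholders and the -type encoding for a group is an assumption to verify against the entity modify reference:

maprcli entity modify -name engineering -type 1 -quota 2T -advisoryquota 1800G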

When a user writes to one of the volumes associated with the group, any data written counts against the group's quota. Any writes to volumes not associated with the group are not counted toward the group's quota. See Managing Quotas.

Replication

When you create a volume, you can choose a replication factor to safeguard important data. MapR manages the replication automatically, raising an alarm and notification if replication falls below the minimum level you have set. A replicate of a volume is a full copy of the volume; remember to take that into account when planning application data size.

Snapshots

A snapshot is an instant image of a volume at a particular point in time. Snapshots take no time to create, because they only record changes to data over time rather than the data itself. You can manually create a snapshot to enable rollback to a particular known data state, or schedule periodic automatic snapshots to ensure a specific recovery point objective (RPO). You can use snapshots and mirrors to achieve a near-zero recovery time objective (RTO). Snapshots store only the deltas between a volume's current state and its state when the snapshot is taken. Initially, snapshots take no space on disk, but they can grow arbitrarily as a volume's data changes. When planning application data size, take into account how much the data is likely to change, and how often snapshots will be taken. See Snapshots.

Topology

You can restrict a volume to a particular rack by setting its physical topology attribute. This is useful for placing an application's data on a high-performance rack (for critical applications) or a low-performance rack (to keep it out of the way of critical applications). See Setting Volume Topology.

Scenarios

Here are a few ways to configure the application volume based on different types of data. If the application requires more than one type of data, you can set up multiple volumes.

Important data: high replication factor; frequent snapshots to minimize RPO and RTO; mirroring in a remote cluster.

Highly accessed data: high replication factor; mirroring for high-performance reads; topology set for data placement on high-performance machines.

Scratch data: no snapshots, mirrors, or replication; topology set for data placement on low-performance machines.

Static data: mirroring and replication set by performance and availability requirements; one snapshot (to protect against accidental changes); volume set to read-only.

The following documents provide examples of different ways to provision an application to meet business goals:

Provisioning for Capacity
Provisioning for Performance

Setting Up the Application

Once you know the course of action to take based on the application's data and performance needs, you can use the MapR Control System to set up the application.

Creating a Group and a Volume
Setting Up Mirroring
Setting Up Snapshots
Setting Up User or Group Quotas


Creating a Group and a Volume

Create a group and a volume for the application. If you already have a snapshot schedule prepared, you can apply it to the volume at creation time. Otherwise, use the procedure in Setting Up Snapshots below, after you have created the volume.

Setting Up Mirroring

1. If you want the mirror to sync automatically, use the procedure in Creating a Schedule to create a schedule.
2. Use the procedure in Creating a Volume to create a mirror volume. Make sure to set the following fields:

Volume Type - Mirror Volume
Source Volume - the volume you created for the application
Responsible Group/User - in most cases, the same as for the source volume

Setting Up Snapshots

To set up automatic snapshots for the volume, use the procedure in Scheduling a Snapshot.


Provisioning for Capacity

You can easily provision a volume for maximum data storage capacity by setting a low replication factor, setting hard and advisory quotas, and tracking storage use by users, groups, and volumes. You can also set permissions to limit who can write data to the volume.

The replication factor determines how many complete copies of a volume are stored in the cluster. The actual storage requirement for a volume is the volume size multiplied by its replication factor. To maximize storage capacity, set the replication factor on the volume to 1 at the time you create the volume.

Volume quotas and user or group quotas limit the amount of data that can be written by a user or group, or the maximum size of a specific volume. When the data size exceeds the advisory quota, MapR raises an alarm and notification but does not prevent additional data writes. Once the data exceeds the hard quota, no further writes are allowed for the volume, user, or group. The advisory quota is generally somewhat lower than the hard quota, to provide advance warning that the data is in danger of exceeding the hard quota. For a high-capacity volume, the volume quotas should be as large as possible. You can use the advisory quota to warn you when the volume is approaching its maximum size.

To use the volume capacity wisely, you can limit write access to a particular user or group. Create a new user or group on all nodes in the cluster.

In this scenario, storage capacity takes precedence over high performance and data recovery; to maximize data storage, there will be no snapshots or mirrors set up in the cluster. A low replication factor means that the data is less effectively protected against loss in the event that disks or nodes fail. Because of these tradeoffs, this strategy is most suitable for risk-tolerant large data sets, and should not be used for data with stringent protection, recovery, or performance requirements.

To create a high-capacity volume:

1. Set up a user or group that will be responsible for the volume. For more information, see Users & Groups.
2. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
3. Click the New Volume button to display the New Volume dialog.
4. In the Volume Setup pane, set the volume name and mount path.
5. In the Usage Tracking pane:

a. In the Group/User section, select User or Group and enter the user or group responsible for the volume.
b. In the Quotas section, check Volume Quota and enter the maximum capacity of the volume, based on the storage capacity of your cluster. Example: 1 TB
c. Check Volume Advisory Quota and enter a lower number than the volume quota, to serve as advance warning when the data approaches the hard quota. Example: 900 GB

6. In the Replication & Snapshot Scheduling pane:
a. Set Replication to 1.
b. Do not select a snapshot schedule.

7. Click OK to create the volume.
8. Set the volume permissions on the volume via NFS or using hadoop fs. You can limit writes to root and the responsible user or group.
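For step 8, once the cluster is mounted over NFS the standard filesystem tools apply; a sketch (the cluster mount point, group, and volume path are placeholders):

sudo chown root:capacitygroup /mapr/cluster1/appdata
sudo chmod 775 /mapr/cluster1/appdata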

See Managing Data with Volumes for more information.


Provisioning for Performance

You can provision a high-performance volume by creating multiple mirrors of the data and defining volume topology to control data placement: store the data on your fastest servers (for example, servers that use SSDs instead of hard disks).

When you create mirrors of a volume, make sure your application load-balances reads across the mirrors to increase performance. Each mirror is an actual volume, so you can control data placement and replication on each mirror independently. The most efficient way to create multiple mirrors is to cascade them rather than creating all the mirrors from the same source volume. Create the first mirror from the original volume, then create the second mirror using the first mirror as the source volume, and so on. You can mirror the volume within the same cluster or to another cluster, possibly in a different datacenter.

You can set node topology paths to specify the physical locations of nodes in the cluster, and volume topology paths to limit volumes to specific nodes or racks.

To set node topology:

Use the following steps to create a rack path representing the high-performance nodes in your cluster.

1. In the MapR Control System, open the MapR-FS group and click Nodes to display the Nodes view.
2. Click the checkboxes next to the high-performance nodes.
3. Click the Change Topology button to display the Change Node Topology dialog.
4. In the Change Node Topology dialog, type a path to represent the high-performance rack. For example, if the cluster name is cluster1 and the high-performance nodes make up rack 14, type /cluster1/rack14.

To set up the source volume:

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Set the Topology to limit the volume to the high-performance rack. Example: /default/rack14

To Set Up the First Mirror

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Choose Local Mirror Volume.
5. Set the Source Volume Name to the original volume name. Example: original-volume
6. Set the Topology to a different rack from the source volume.

To Set Up Subsequent Mirrors

1. In the MapR Control System, open the MapR-FS group and click Volumes to display the Volumes view.
2. Click the New Volume button to display the New Volume dialog.
3. In the Volume Setup pane, set the volume name and mount path normally.
4. Choose Local Mirror Volume.
5. Set the Source Volume Name to the previous mirror volume name. Example: mirror1
6. Set the Topology to a different rack from the source volume and the other mirror.
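An equivalent mirror volume could be created from the command line; this is only a sketch, since the -source format and the -type encoding for a mirror volume are assumptions to verify against the volume create reference for your release:

maprcli volume create -name mirror2 -path /mirror2 -source mirror1@cluster1 -type 1 -topology /cluster1/rack15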

See Managing Data with Volumes for more information.


Migration Guide

This guide provides instructions for migrating business-critical data and applications from an Apache Hadoop cluster to a MapR cluster.

The MapR distribution is 100% API-compatible with Apache Hadoop, and migration is a relatively straightforward process. The additional features available in MapR provide new ways to interact with your data. In particular, MapR provides a fully read/write storage layer that can be mounted as a filesystem via NFS, allowing existing processes, legacy workflows, and desktop applications full access to the entire cluster.

Migration consists of the following steps:

Planning the Migration — Identify the goals of the migration, understand the differences between your current cluster and the MapR cluster, and identify potential gotchas.
Initial MapR Deployment — Install, configure, and test the MapR cluster.
Component Migration — Migrate your customized components to the MapR cluster.
Application Migration — Migrate your applications to the MapR cluster and test using a small set of data.
Data Migration — Migrate your data to the MapR cluster and test the cluster against performance benchmarks.
Node Migration — Take down old nodes from the previous cluster and install them as MapR nodes.


Planning the Migration

The first phase of migration is planning. In this phase you will identify the requirements and goals of the migration, identify potential issues in the migration, and define a strategy.

The requirements and goals of the migration depend on a number of factors:

Data migration — can you move your datasets individually, or must the data be moved all at once?
Downtime — can you tolerate downtime, or is it important to complete the migration with no interruption in service?
Customization — what custom patches or applications are running on the cluster?
Storage — is there enough space to store the data during the migration?

The MapR Hadoop distribution is 100% plug-and-play compatible with Apache Hadoop, so you do not need to make changes to your applications to run them on a MapR cluster. MapR Hadoop automatically configures compression and memory settings, task heap sizes, and local volumes for shuffle data.


Initial MapR Deployment

The initial MapR deployment phase consists of installing, configuring, and testing the MapR cluster and any ecosystem components (such as Hive, HBase, or Pig) on an initial set of nodes. Once you have the MapR cluster deployed, you will be able to begin migrating data and applications.

To deploy the MapR cluster on the selected nodes, follow the steps in the Installation Guide.


Component Migration

MapR Hadoop features the complete Hadoop distribution including components such as Hive and HBase. There are a few things to know about migrating Hive and HBase, or about migrating custom components you have patched yourself.

Hive Migration

Hive facilitates the analysis of large datasets stored in the Hadoop filesystem by organizing that data into tables that can be queried and analyzed using a dialect of SQL called HiveQL. The schemas that define these tables and all other Hive metadata are stored in a centralized repository called the metastore.

If you would like to continue using Hive tables developed on an HDFS cluster in a MapR cluster, you can import Hive metadata from the metastore to recreate those tables in MapR. Depending on your needs, you can choose to import a subset of table schemas or the entire metastore in a single go.

Importing table schemas into a MapR cluster

Use this procedure to import a subset of the Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test a subset of applications using a smaller subset of data.

Use the following procedure to import Hive metastore data into a new metastore running on a node in the MapR cluster. You will need to redirect all links that formerly pointed to HDFS (hdfs://<namenode>:<port number>/<path>) to point to MapR-FS (maprfs:///<path>).

Importing an entire Hive metastore into a MapR cluster

Use this procedure to import an entire Hive metastore from an HDFS cluster to a MapR cluster. This method is preferred when you want to test all applications using a complete dataset. MySQL is a very popular choice for the Hive metastore and so we'll use it as an example. If you are using another RDBMS, consult the relevant documentation.

1. Ensure that both Hive and your database are installed on one of the nodes in the MapR cluster. For step-by-step instructions on setting up a standalone MySQL metastore, see Setting Up Hive with a MySQL Metastore.
2. On the HDFS cluster, back up the metastore to a file:

mysqldump [options] --databases db_name... > filename

3. Ensure that queries in the dumpfile point to MapR-FS rather than HDFS. Search the dumpfile and edit all of the URIs that point to hdfs:// so that they point to maprfs:/// instead.
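This edit can be scripted; a sketch using sed (the NameNode host and port, and the file names, are placeholders):

sed 's|hdfs://namenode1:8020|maprfs:///|g' metastore_dump.sql > metastore_dump_mapr.sql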

4. Import the data from the dumpfile into the metastore running on the node in the MapR cluster:

mysql [options] db_name < filename

Using Hive with MapR volumes

MapR-FS does not allow moving or renaming across volume boundaries. Be sure to set the Hive Scratch Directory and Hive Warehouse Directory in the same volume where the data for the Hive job resides before running the job. For more information see Using Hive with MapR Volumes.
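In hive-site.xml, these two directories correspond to the standard Hive properties hive.exec.scratchdir and hive.metastore.warehouse.dir; a sketch, with the paths as placeholders inside a hypothetical volume mounted at /data/hivejob:

<property>
  <name>hive.exec.scratchdir</name>
  <value>/data/hivejob/scratch</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/data/hivejob/warehouse</value>
</property>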

HBase Migration

HBase is the Hadoop database, which provides random, real-time read/write access to very large datasets. The MapR Hadoop distribution includes HBase and is fully integrated with MapR enhancements for speed, usability, and dependability. MapR provides a volume (normally mounted at /hbase) to store HBase data.

HBase bulk load jobs: If you are currently using HBase bulk load jobs to import data into the HDFS, make sure to load your data into a path under the /hbase volume.
Compression: The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume. For more information, see HBase Best Practices.

Custom Components

If you have applied your own patches to a component and wish to continue to use that customized component with the MapR distribution, you should keep the following considerations in mind:

MapR libraries: All Hadoop components must point to MapR for the Hadoop libraries. Change any absolute paths. Do not hardcode hdfs:// or maprfs:// into your applications. This is also true of Hadoop ecosystem components that are not included in the MapR Hadoop distribution (such as Cascading). For more information see Working with MapR-FS.


Component compatibility: Before you commit to the migration of a customized component (for example, customized HBase), check the MapR release notes to see if MapR Technologies has issued a patch that satisfies your business requirements. MapR Technologies publishes a list of Hadoop common patches and MapR patches with each release and makes those patches available for our customers to take, build, and deploy. For more information, see the Release Notes.
ZooKeeper coordination service: Certain components, such as HBase, depend on ZooKeeper. When you migrate your customized component from the HDFS cluster to the MapR cluster, make sure it points correctly to the MapR ZooKeeper service.


Application Migration

In this phase you will migrate your applications to the MapR cluster test environment. The goal of this phase is to get your applications running smoothly on the MapR cluster using a subset of data. Once you have confirmed that all applications and components are running as expected you can begin migrating your data.

Migrating your applications from HDFS to MapR is relatively easy. MapR Hadoop is 100% plug-and-play compatible with Apache Hadoop, so you do not need to make changes to your applications to run them on a MapR cluster.

Application Migration Guidelines

Keep the following considerations in mind when you migrate your applications:

MapR Libraries — Ensure that your applications can find the libraries and configurations they are expecting. Make sure the java classpath includes the path to maprfs.jar and the java.library.path includes libMapRClient.so (see the sketch after this list).
MapR Storage — Every application must point to MapR-FS (maprfs:///) rather than HDFS (hdfs://). If your application uses fs.default.name then it will work automatically. If you have hardcoded HDFS links into your applications, you must redirect those links so that they point to MapR-FS. Setting a default path of maprfs:/// tells your applications to use the cluster specified in the first line of mapr-clusters.conf. You can also specify a specific cluster with maprfs://<cluster name>/.
Permissions — The distcp command does not copy permissions; permissions defined in HDFS do not transfer automatically to MapR-FS. MapR uses a combination of access control lists (ACLs) to specify cluster or volume-level permissions and file permissions to manage directory and file access. You must define these permissions in MapR when you migrate your customized components, applications, and data. For more information, see Managing Permissions.
Memory — Remove explicit memory settings defined in your applications. If memory is set explicitly in the application, the jobs may fail after migration to MapR.

Application Migration Roadmap

Generally, the best approach to migrating your applications to MapR is to import a small subset of data and test and tune your application using that data in a test environment before you import your production data.

The following procedure offers a simple roadmap for migrating and running your applications in a MapR cluster test environment.

1. Copy over a small amount of data to the MapR cluster. Use hadoop distcp over hftp to copy a small number of files:

$ hadoop distcp hftp://namenode1:50070/foo maprfs:///bar

You must specify the namenode IP address, port number, and source directory on the HDFS cluster. For more information, see Copying Data from Apache Hadoop.

2. Run the application.
3. Add more data and test again.
4. When the application is running to your satisfaction, use the same process to test and tune another application.


Data Migration

Once you have installed and configured your MapR cluster in a test environment and migrated your applications to the MapR cluster, you can begin to copy over your data from the Apache Hadoop HDFS to the MapR cluster.

In the application migration phase, you should have already moved over small amounts of data using the hadoop distcp hftp method; see Application Migration Roadmap. While this method is ideal for copying over the very small amounts of data required for an initial test, you must use different methods to migrate your data.

There are two ways to migrate large datasets from an HDFS cluster to MapR:

Distributed Copy — Use the hadoop distcp command to copy data from the HDFS to the MapR-FS. This is the preferred method for moving large amounts of data.
Push Data — If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason you cannot use the hadoop distcp command, you can push data from HDFS to MapR-FS.

Important: Ensure that you have laid out your volumes and defined policies before you migrate your data from the HDFS cluster to the MapR cluster. Note that you cannot copy over permissions defined in HDFS.

Distributed Copy

The hadoop distcp command (distributed copy) enables you to use a MapReduce job to copy large amounts of data between clusters. "The hadoop distcp command expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list."

You can use the hadoop distcp command to migrate data from a Hadoop HDFS cluster to the MapR-FS only if the HDFS cluster uses the same version of the RPC protocol as that used by the MapR cluster. Currently, MapR uses version 4. (If the clusters do not share the same version of the RPC protocol, you must use the push data method described below.)

To copy data from HDFS to MapR using hadoop distcp:

1. From a node in the MapR cluster, try hadoop fs -ls to determine whether the MapR cluster can successfully communicate with the HDFS cluster:

hadoop fs -ls <NameNode IP>:<NameNode port>/<path>

2. If the hadoop fs -ls command is successful, try hadoop fs -cat to determine whether the MapR cluster can read file contents from the specified path on the HDFS cluster:

hadoop fs -cat <NameNode IP>:<NameNode port>/<HDFS path>/<file>

3. If you are able to communicate with the HDFS cluster and read file contents, use distcp to copy data from the HDFS cluster to the MapR cluster:

hadoop distcp <NameNode IP>:<NameNode port>/<HDFS path> <MapR-FS path>

Pushing Data from HDFS to MapR-FS

If the HDFS cluster and the MapR cluster do not use the same version of the RPC protocol, or if for some other reason you cannot use the hadoop distcp command to copy files from HDFS to MapR-FS, you can push data from the HDFS cluster to the MapR cluster.

Perform the following steps from a MapR client or node (any computer that has either mapr-core or mapr-client installed). For more information about setting up a MapR client, see Setting Up the Client.

<input path>: The HDFS path to the source data.
<output path>: The MapR-FS path to the target directory.
<MapR CLDB IP>: The IP address of the master CLDB node on the MapR cluster.

1. Log into a MapR client or node as the root user (or use sudo for the following commands).
2. Create the directory /tmp/maprfs-client/ on the Apache Hadoop JobClient node.
3. Copy the following files from a MapR client or any MapR node to the /tmp/maprfs-client/ directory:

/opt/mapr/hadoop/hadoop-0.20.2/lib/maprfs-0.1.jar


/opt/mapr/hadoop/hadoop-0.20.2/lib/zookeeper-3.3.2.jar
/opt/mapr/hadoop/hadoop-0.20.2/lib/native/Linux-amd64-64/libMapRClient.so

4. Install the files in the correct places on the Apache Hadoop JobClient node:

cp /tmp/maprfs-client/maprfs-0.1.jar $HADOOP_HOME/lib/
cp /tmp/maprfs-client/zookeeper-3.3.2.jar $HADOOP_HOME/lib/
cp /tmp/maprfs-client/libMapRClient.so $HADOOP_HOME/lib/native/Linux-amd64-64/libMapRClient.so

Note: If you are on a 32-bit client, use Linux-i386-32 in place of Linux-amd64-64 above.

5. If the JobTracker is a different node from the JobClient node, copy and install the files to the JobTracker node as well using the above steps.
6. On the JobTracker node, set fs.maprfs.impl=com.mapr.fs.MapRFileSystem in $HADOOP_HOME/conf/core-site.xml.
7. Restart the JobTracker.
8. Copy data to the MapR cluster by running the hadoop distcp command on the JobClient node of the Apache Hadoop cluster, as in the example below.
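Step 8 can reuse the invocation shown earlier in Copying Data from Apache Hadoop. For example:

./bin/hadoop distcp -Dfs.maprfs.impl=com.mapr.fs.MapRFileSystem -libjars /tmp/maprfs-client/maprfs-0.1.jar,/tmp/maprfs-client/zookeeper-3.3.2.jar -files /tmp/maprfs-client/libMapRClient.so <input path> maprfs://<MapR CLDB IP>:7222/<output path>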


Node Migration

Once you have loaded your data and tested and tuned your applications, you can decommission the HDFS data-nodes and add them to the MapR cluster.

This is a three-step process:

1. Decommissioning nodes on an Apache Hadoop cluster: The Hadoop decommission feature enables you to gracefully remove a set of existing data-nodes from a cluster while it is running, without data loss. For more information, see the Hadoop Wiki FAQ.
2. Meeting minimum hardware and software requirements: Ensure that every data-node you want to add to the MapR cluster meets the hardware, software, and configuration requirements described in Requirements.
3. Adding nodes to a MapR cluster: You can add those data-nodes to the MapR cluster. For more information, see Adding Nodes to a Cluster.


Reference Guide

The MapR Reference Guide contains in-depth reference information for MapR Software. Choose a subtopic below for more detail.

Release Notes - Known issues and new features, by release
MapR Control System - User interface reference
API Reference - Information about the command-line interface and the REST API
Utilities - MapR tool and utility reference
Environment Variables - Environment variables specific to MapR
Configuration Files - Information about MapR settings
Ports Used by MapR - List of network ports used by MapR services
Glossary - Essential MapR terms and definitions
Hadoop Commands - Listing of Hadoop commands and options


Release Notes

This section contains Release Notes for all releases of MapR Distribution for Apache Hadoop:

Version 1.2.10 Release Notes - October 2, 2012
Version 1.2.9 Release Notes - July 12, 2012
Version 1.2.7 Release Notes - June 4, 2012
Version 1.2.3 Release Notes - March 14, 2012
Version 1.2.2 Release Notes - February 2, 2012
Version 1.2 Release Notes - December 12, 2011
Version 1.1.3 Release Notes - September 29, 2011
Version 1.1.2 Release Notes - September 7, 2011
Version 1.1.1 Release Notes - August 22, 2011
Version 1.1 Release Notes - July 28, 2011
Version 1.0 Release Notes - June 29, 2011
Beta Release Notes - April 1, 2011
Alpha Release Notes - February 15, 2011

Repository Paths

Version 1.2.10
http://package.mapr.com/releases/v1.2.10/redhat/ (CentOS, Red Hat, or SUSE)
http://package.mapr.com/releases/v1.2.10/ubuntu/ (Ubuntu)

Version 1.2.9
http://package.mapr.com/releases/v1.2.9/mac/ (Mac)
http://package.mapr.com/releases/v1.2.9/redhat/ (CentOS, Red Hat, or SUSE)
http://package.mapr.com/releases/v1.2.9/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v2.1.0/windows/ (Windows)

Version 1.2.7
http://package.mapr.com/releases/v1.2.7/mac/ (Mac)
http://package.mapr.com/releases/v1.2.7/redhat/ (CentOS, Red Hat, or SUSE)
http://package.mapr.com/releases/v1.2.7/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.7/windows/ (Windows)

Version 1.2.3
http://package.mapr.com/releases/v1.2.3/mac/ (Mac)
http://package.mapr.com/releases/v1.2.3/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.3/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.3/windows/ (Windows)

Version 1.2.2
http://package.mapr.com/releases/v1.2.2/mac/ (Mac)
http://package.mapr.com/releases/v1.2.2/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.2/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.2/windows/ (Windows)

Version 1.2.0
http://package.mapr.com/releases/v1.2.0/mac/ (Mac)
http://package.mapr.com/releases/v1.2.0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.2.0/ubuntu/ (Ubuntu)
http://package.mapr.com/releases/v1.2.0/windows/ (Windows)

Version 1.1.3
http://package.mapr.com/releases/v1.1.3/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.3/ubuntu/ (Ubuntu)

Version 1.1.2 - Internal maintenance release

Version 1.1.1
http://package.mapr.com/releases/v1.1.1/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.1/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.1/ubuntu/ (Ubuntu)

Version 1.1.0
http://package.mapr.com/releases/v1.1.0-sp0/mac/ (Mac client)
http://package.mapr.com/releases/v1.1.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.1.0-sp0/ubuntu/ (Ubuntu)

Version 1.0.0
http://package.mapr.com/releases/v1.0.0-sp0/redhat/ (Red Hat or CentOS)
http://package.mapr.com/releases/v1.0.0-sp0/ubuntu/ (Ubuntu)


Version 1.2 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

New in This Release

Dial Home

Dial Home is a feature that collects information about the cluster for MapR support and engineering. You can opt in or out of the Dial Home feature when you first install or upgrade MapR. To change the Dial Home status of your cluster at any time, see the Dial Home commands.

Rolling Upgrade

The rollingupgrade.sh script performs a software upgrade of an entire MapR cluster. See Cluster Upgrade for details.

Improvements to the Support Tools

The mapr-support-collect.sh script has been enhanced to generate and gather support output from specified cluster nodes into a single output file via MapR-FS. To support this feature, mapr-support-collect.sh has a new option:

-O, --online Specifies a space-separated list of nodes from which to gather support output.

There is now a "mini-dump" option for both support-dump.sh and support-collect.sh to limit the size of the support output. When the -m or --mini-dump option is specified along with a size, support-dump.sh collects only a head and tail, each limited to the specified size, from any log file that is larger than twice the specified size. The total size of the output is therefore limited to approximately 2 * size * number of logs. The size can be specified in bytes, or using the following suffixes:

b - blocks (512 bytes)
k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)

-m, --mini-dump <size> For any log file greater than 2 * <size>, collects only a head and tail each of the specified size. The <size> may have a suffix specifying units:

b - blocks (512 bytes)
k - kilobytes (1024 bytes)
m - megabytes (1024 kilobytes)

MapR Virtual Machine

The MapR Virtual Machine is a fully-functional single-node Hadoop cluster capable of running MapReduce programs and working with applications like Hive, Pig, and HBase. You can try the MapR Virtual Machine on nearly any 64-bit computer by downloading the free VMware Player.

Windows 7 Client

A Windows 7 version of the MapR Client is now available. The MapR client lets you interact with MapR Hadoop directly. With the MapR client, you can submit Map/Reduce jobs and run hadoop fs and hadoop mfs commands.

Resolved Issues

(Issue 4307) Snapshot create fails with error EEXIST


Known Issues

(Issue 5590)

When mapr-core is upgraded from 1.1.3 to 1.2, /opt/mapr/bin/versions.sh is updated to contain hbase-version 0.90.4, even if HBase has not been upgraded. This can create problems for any process that uses versions.sh to determine the correct version of HBase. After upgrading to Version 1.2, check that the version of HBase specified in versions.sh is correct.

(Issue 5489) HBase nodes require configure.sh during rolling upgrade

When performing an upgrade to MapR 1.2, the HBase package is upgraded from version 0.90.2 to 0.90.4, and it is necessary to run configure.sh on any nodes that are running the HBase region server or HBase master.

(Issue 4269) Bulk operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.


Version 1.2.10 Release Notes

Release Information

MapR provides the following packages:

Apache Hadoop 0.20.2
Flume 0.9.4
HBase 0.90.6
Hive 0.7.1
Mahout 0.5
Oozie 3.0.0
Pig 0.9.0
Sqoop 1.3.0
Whirr 0.3.0

New in This Release

This is a maintenance release. No new features.

Resolved Issues

MapReduce

(Issue 6132) High-availability JobTracker now logs jstack and automatically restarts when the JT process exists but is unresponsive for a configurable amount of time. The default configuration is ten minutes.
(Issue 7628) TaskTracker won't run if the node hostname resolves to an IP that is not covered by the value in the MAPR_SUBNETS environment variable.
(Issue 8115) Split calculation now correctly handles a MapR-FS chunk size of zero. The client JVM no longer spins at 100% CPU utilization while submitting a Map/Reduce job.

File System

(Issue 6717) Potential memory leak in flushing dirty inodes in file system.
(Issue 7894) Mirroring no longer causes an assert failure by losing track of the most recent replica of a blank source container.
(Issue 7925) A node going down no longer creates a large number of containers in a volume.
(Issue 7931) IOMgr no longer causes assert failures in the filesystem while trying to access the cache pages after failing.
(Issue 7984) Name containers for large volumes are no longer stuck as under-replicated due to message timeout.
(Issue 8071) CLDB no longer shows a stale container as master due to messages over 64k not being sent.
(Issue 8092) fsck now works on Storage Pools over 6TB.
(Issue 8139) CLDB no longer takes over 20 minutes to come up on clusters with large numbers of containers.
(Issue 5537) Allocating buffers while buffers are still draining no longer causes segmentation faults.
(Issue 8179) "disk online" operations no longer crash with an EIO error.
(Issue 7502) Mirroring: Rollforward operations now correctly update the container epoch, allowing for mirror restarts.

Logging

(Issue 7677) Debug log levels can be set from the command line.
(Issue 7793) Updating the hostname entry in serverTab_ before calling createMapRBlockLocation() no longer generates garbage characters in the JobTracker log.
(Issue 7653) Excessive MFS logging fixed.

MCS and CLI

(Issue 8206) Spurious volume alarms no longer being raised.
(Issue 7989) The output of the maprcli dump rereplicationinfo -json command now includes the start time of the resyncing operation.
(Issue 8047) Nodes in maintenance mode now display correctly in MCS.
(Issue 8404) The Volumes tab in the MCS UI now behaves properly.

NFS

(Issue 8260) NFS client no longer hangs while accessing files with chunk size set to 0.
(Issue 8136) The maprcli nfsmgmt refreshexports command no longer generates a buffer overflow when /opt/mapr/conf/exports is over 1024 bytes in size.


Version 1.2.9 Release Notes

Release Information

MapR provides the following packages:

Apache Hadoop 0.20.2
Flume 0.9.4
HBase 0.90.6
Hive 0.7.1
Mahout 0.5
Oozie 3.0.0
Pig 0.9.0
Sqoop 1.3.0
Whirr 0.3.0

New in This Release

This is a maintenance release. No new features.

Resolved Issues

General

(Issue 5862) MapRclient.dll now correctly exports the libhdfs API
(Issue 5941) Fixed email sending inconsistency
(Issue 7531) Fixed problems with inline setup and "permission denied" errors
(Issue 7582) Fix to ignore invalid hostnames in /opt/mapr/conf/mapr-clusters.conf and reprocess them on demand

JobTracker

(Issue 5761) Fixed JobTracker registration problem with zookeeper after reboot
(Issue 6132) Fixed JobTracker hang that caused inconsistent failover behavior
(Issue 6861) Users submitting jobs without a queue name can no longer cause the JobTracker to fail
(Issue 6901) Fixed issue where TaskTrackers fail to kill tasks and then the TaskTracker hangs

NFS

Various fixes and refinements to NFS feature to enhance performance, improve reliability, increase failover performance and improve scalability.

CLDB

Various fixes and refinements to improve CLDB failover performance and replication behavior including:

(Issue 7052) Resolved problem with ZooKeeper disconnecting CLDB and causing failover.

MapR-FS

(Issue 7218) Fixed CLDB rpc problems on multi-homed servers (caused CLDB shutdown on ZooKeeper restart)
(Issue 7465) Resolved intermittent container resync failures
(Issue 7558) Fixed cause of MapR file system cores seen in customer clusters
(Issue 7586) Fix to avoid creating directories with chunksize zero
(Issue 7605) Fixed problem with Hadoop jobs failing due to MFS error 110

Known Issues

(Issue 7630)

If a user submits a job with a relative path on a fresh installation, the job may fail with a permission denied error for the /user directory because that path does not yet exist in a fresh installation. As a workaround, the administrator should create a volume called users mounted at /user. Example:

maprcli volume create -name users -path /user

It is a good idea to create a volume for each user, mounted within /user.
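
A minimal sketch of that convention (the usernames and volume-name scheme are hypothetical):

# Create and mount one volume per user under /user
for u in alice bob; do
  maprcli volume create -name "users.$u" -path "/user/$u"
done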


Version 1.2.7 Release Notes

Important Notes

Linux Leap Second Bug

A bug in the hrtimer component of the Linux ntp subsystem was discovered when the leap second was applied to Coordinated Universal Time (UTC) on June 30, 2012 at 23:59:60 UTC. To work around the bug, MapR recommends running the following command as root on all nodes:

/etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date

Wait a day before re-enabling NTP:

/etc/init.d/ntp start

Inline Setup

Inline setup is a setting that causes each job's setup task to run as a thread directly inside of the JobTracker instead of being forked out as a separate task by a TaskTracker. This means that jobs that need a setup task will start running faster in some cases because they don't need to wait for the TaskTrackers to get scheduled and then run the setup task.

MapR recommends turning off inline setup (mapreduce.jobtracker.inline.setup.cleanup in mapred-site.xml) on production clusters, because it is dangerous to have the JobTracker execute user-defined code as the privileged JT user (root). If you originally installed version 1.2.7 or earlier, inline setup defaults to true and you should set it to false by adding the following to mapred-site.xml:

<property>
  <name>mapreduce.jobtracker.inline.setup.cleanup</name>
  <value>false</value>
  <description></description>
</property>

Release Information

MapR provides the following packages:

Apache Hadoop 0.20.2Flume 0.9.4Hbase 0.90.6Hive 0.7.1Mahout 0.5Oozie 3.0.0Pig 0.9.0Sqoop 1.3.0Whirr 0.3.0

New in This Release

Support for SUSE Linux
CLDB enhancements to improve reliability, scalability and performance
Fixes in the MapReduce layer related to security and data integrity
FSCK performance and logging improvements
Updates to MapR storage services layer to improve performance, security and stability
Support for whitelist of subnets MapR-FS will accept requests from
Support for HBase 0.90.6
NFS improvements to increase performance, reliability and failure recovery
Miscellaneous defect fixes to rolling upgrade and the MapR GUI
MapR now works with Accumulo


Hadoop Compatibility in Version 1.2

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

MapR HBase Patches

In the /opt/mapr/hbase/hbase-0.90.4/mapr-hbase-patches directory, MapR provides the following patches for HBase:

0000-hbase-with-mapr.patch
0001-HBASE-4196-0.90.4.patch
0002-HBASE-4144-0.90.4.patch
0003-HBASE-4148-0.90.4.patch
0004-HBASE-4159-0.90.4.patch
0005-HBASE-4168-0.90.4.patch
0006-HBASE-4196-0.90.4.patch
0007-HBASE-4095-0.90.4.patch
0008-HBASE-4222-0.90.4.patch
0009-HBASE-4270-0.90.4.patch
0010-HBASE-4238-0.90.4.patch
0011-HBASE-4387-0.90.4.patch
0012-HBASE-4295-0.90.4.patch
0013-HBASE-4563-0.90.4.patch
0014-HBASE-4570-0.90.4.patch
0015-HBASE-4562-0.90.4.patch

MapR Pig Patches

In the /opt/mapr/pig/pig-0.9.0/mapr-pig-patches directory, MapR provides the following patches for Pig:

0000-pig-mapr-compat.patch
0001-remove-hardcoded-hdfs-refs.patch
0002-pigmix2.patch
0003-pig-hbase-compatibility.patch

MapR Mahout Patches

In the /opt/mapr/mahout/mahout-0.5/mapr-mahout-patches directory, MapR provides the following patches for Mahout:

0000-mahout-mapr-compat.patch

MapR Hive Patches

In the /opt/mapr/hive/hive-0.7.1/mapr-hive-patches directory, MapR provides the following patches for Hive:

0000-symlink-support-in-hive-binary.patch
0001-remove-unnecessary-fsscheme-check.patch
0002-remove-unnecessary-fsscheme-check-1.patch

MapR Flume Patches

In the /opt/mapr/flume/flume-0.9.4/mapr-flume-patches directory, MapR provides the following patches for Flume:


0000-flume-mapr-compat.patch

MapR Sqoop Patches

In the /opt/mapr/sqoop/sqoop-1.3.0/mapr-sqoop-patches directory, MapR provides the following patches for Sqoop:

0000-setting-hadoop-hbase-versions-to-mapr-shipped-versions.patch

MapR Oozie Patches

In the /opt/mapr/oozie/oozie-3.0.0/mapr-oozie-patches directory, MapR provides the following patches for Oozie:

0000-oozie-with-mapr.patch
0001-OOZIE-022-3.0.0.patch
0002-OOZIE-139-3.0.0.patch

HBase Common Patches

MapR 1.2 includes the following Apache HBase patches that are not included in the Apache HBase base version 0.90.4:

[HBASE-4169] FSUtils LeaseRecovery for non HDFS FileSystems.
[HBASE-4168] A client continues to try and connect to a powered down regionserver
[HBASE-4196] TableRecordReader may skip first row of region
[HBASE-4144] RS does not abort if the initialization of RS fails
[HBASE-4148] HFileOutputFormat doesn't fill in TIMERANGE_KEY metadata
[HBASE-4159] HBaseServer - IPC Reader threads are not daemons
[HBASE-4095] Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
[HBASE-4270] IOE ignored during flush-on-close causes dataloss
[HBASE-4238] CatalogJanitor can clear a daughter that split before processing its parent
[HBASE-4387] Error while syncing: DFSOutputStream is closed
[HBASE-4295] rowcounter does not return the correct number of rows in certain circumstances
[HBASE-4563] When error occurs in this.parent.close(false) of split, the split region cannot write or read
[HBASE-4570] Fix a race condition that could cause inconsistent results from scans during concurrent writes.
[HBASE-4562] When split doing offlineParentInMeta encounters error, it'll cause data loss
[HBASE-4222] Make HLog more resilient to write pipeline failures

Oozie Common Patches

MapR 1.2 includes the following Apache Oozie patches that are not included in the Apache Oozie base version 3.0.0:

[GH-0022] Add Hive action
[GH-0139] Add Sqoop action

Hadoop Common Patches

MapR 1.2 includes the following Apache Hadoop patches that are not included in the Apache Hadoop base version 0.20.2:

[HADOOP-1722] Make streaming to handle non-utf8 byte array
[HADOOP-1849] IPC server max queue size should be configurable
[HADOOP-2141] speculative execution start up condition based on completion time
[HADOOP-2366] Space in the value for dfs.data.dir can cause great problems
[HADOOP-2721] Use job control for tasks (and therefore for pipes and streaming)
[HADOOP-2838] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni
[HADOOP-3327] Shuffling fetchers waited too long between map output fetch re-tries
[HADOOP-3659] Patch to allow hadoop native to compile on Mac OS X
[HADOOP-4012] Providing splitting support for bzip2 compressed files
[HADOOP-4041] IsolationRunner does not work as documented
[HADOOP-4490] Map and Reduce tasks should run as the user who submitted the job
[HADOOP-4655] FileSystem.CACHE should be ref-counted
[HADOOP-4656] Add a user to groups mapping service
[HADOOP-4675] Current Ganglia metrics implementation is incompatible with Ganglia 3.1
[HADOOP-4829] Allow FileSystem shutdown hook to be disabled
[HADOOP-4842] Streaming combiner should allow command, not just JavaClass
[HADOOP-4930] Implement setuid executable for Linux to assist in launching tasks as job owners
[HADOOP-4933] ConcurrentModificationException in JobHistory.java
[HADOOP-5170] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide


[HADOOP-5175] Option to prohibit jars unpacking
[HADOOP-5203] TT's version build is too restrictive
[HADOOP-5396] Queue ACLs should be refreshed without requiring a restart of the job tracker
[HADOOP-5419] Provide a way for users to find out what operations they can do on which M/R queues
[HADOOP-5420] Support killing of process groups in LinuxTaskController binary
[HADOOP-5442] The job history display needs to be paged
[HADOOP-5450] Add support for application-specific typecodes to typed bytes
[HADOOP-5469] Exposing Hadoop metrics via HTTP
[HADOOP-5476] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken
[HADOOP-5488] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully
[HADOOP-5528] Binary partitioner
[HADOOP-5582] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact representation).
[HADOOP-5592] Hadoop Streaming - GzipCodec
[HADOOP-5613] change S3Exception to checked exception
[HADOOP-5643] Ability to blacklist tasktracker
[HADOOP-5656] Counter for S3N Read Bytes does not work
[HADOOP-5675] DistCp should not launch a job if it is not necessary
[HADOOP-5733] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics
[HADOOP-5737] UGI checks in testcases are broken
[HADOOP-5738] Split waiting tasks field in JobTracker metrics to individual tasks
[HADOOP-5745] Allow setting the default value of maxRunningJobs for all pools
[HADOOP-5784] The length of the heartbeat cycle should be configurable.
[HADOOP-5801] JobTracker should refresh the hosts list upon recovery
[HADOOP-5805] problem using top level s3 buckets as input/output directories
[HADOOP-5861] s3n files are not getting split by default
[HADOOP-5879] GzipCodec should read compression level etc from configuration
[HADOOP-5913] Allow administrators to be able to start and stop queues
[HADOOP-5958] Use JDK 1.6 File APIs in DF.java wherever possible
[HADOOP-5976] create script to provide classpath for external tools
[HADOOP-5980] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController
[HADOOP-5981] HADOOP-2838 doesnt work as expected
[HADOOP-6132] RPC client opens an extra connection for VersionedProtocol
[HADOOP-6133] ReflectionUtils performance regression
[HADOOP-6148] Implement a pure Java CRC32 calculator
[HADOOP-6161] Add get/setEnum to Configuration
[HADOOP-6166] Improve PureJavaCrc32
[HADOOP-6184] Provide a configuration dump in json format.
[HADOOP-6227] Configuration does not lock parameters marked final if they have no value.
[HADOOP-6234] Permission configuration files should use octal and symbolic
[HADOOP-6254] s3n fails with SocketTimeoutException
[HADOOP-6269] Missing synchronization for defaultResources in Configuration.addResource
[HADOOP-6279] Add JVM memory usage to JvmMetrics
[HADOOP-6284] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full
[HADOOP-6299] Use JAAS LoginContext for our login
[HADOOP-6312] Configuration sends too much data to log4j
[HADOOP-6337] Update FilterInitializer class to be more visible and take a conf for further development
[HADOOP-6343] Stack trace of any runtime exceptions should be recorded in the server logs.
[HADOOP-6400] Log errors getting Unix UGI
[HADOOP-6408] Add a /conf servlet to dump running configuration
[HADOOP-6419] Change RPC layer to support SASL based mutual authentication
[HADOOP-6433] Add AsyncDiskService that is used in both hdfs and mapreduce
[HADOOP-6441] Prevent remote CSS attacks in Hostname and UTF-7.
[HADOOP-6453] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH
[HADOOP-6471] StringBuffer -> StringBuilder - conversion of references as necessary
[HADOOP-6496] HttpServer sends wrong content-type for CSS files (and others)
[HADOOP-6510] doAs for proxy user
[HADOOP-6521] FsPermission:SetUMask not updated to use new-style umask setting.
[HADOOP-6534] LocalDirAllocator should use whitespace trimming configuration getters
[HADOOP-6543] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers
[HADOOP-6558] archive does not work with distcp -update
[HADOOP-6568] Authorization for default servlets
[HADOOP-6569] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read
[HADOOP-6572] RPC responses may be out-of-order with respect to SASL
[HADOOP-6577] IPC server response buffer reset threshold should be configurable
[HADOOP-6578] Configuration should trim whitespace around a lot of value types
[HADOOP-6599] Split RPC metrics into summary and detailed metrics
[HADOOP-6609] Deadlock in DFSClient#getBlockLocations even with the security disabled
[HADOOP-6613] RPC server should check for version mismatch first
[HADOOP-6627] "Bad Connection to FS" message in FSShell should print message from the exception
[HADOOP-6631] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.
[HADOOP-6634] AccessControlList uses full-principal names to verify acls causing queue-acls to fail
[HADOOP-6637] Benchmark overhead of RPC session establishment
[HADOOP-6640] FileSystem.get() does RPC retries within a static synchronized block
[HADOOP-6644] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention


[HADOOP-6649] login object in UGI should be inside the subject
[HADOOP-6652] ShellBasedUnixGroupsMapping shouldn't have a cache
[HADOOP-6653] NullPointerException in setupSaslConnection when browsing directories
[HADOOP-6663] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file
[HADOOP-6667] RPC.waitForProxy should retry through NoRouteToHostException
[HADOOP-6669] zlib.compress.level ignored for DefaultCodec initialization
[HADOOP-6670] UserGroupInformation doesn't support use in hash tables
[HADOOP-6674] Performance Improvement in Secure RPC
[HADOOP-6687] user object in the subject in UGI should be reused in case of a relogin.
[HADOOP-6701] Incorrect exit codes for "dfs -chown", "dfs -chgrp"
[HADOOP-6706] Relogin behavior for RPC clients could be improved
[HADOOP-6710] Symbolic umask for file creation is not consistent with posix
[HADOOP-6714] FsShell 'hadoop fs -text' does not support compression codecs
[HADOOP-6718] Client does not close connection when an exception happens during SASL negotiation
[HADOOP-6722] NetUtils.connect should check that it hasn't connected a socket to itself
[HADOOP-6723] unchecked exceptions thrown in IPC Connection orphan clients
[HADOOP-6724] IPC doesn't properly handle IOEs thrown by socket factory
[HADOOP-6745] adding some java doc to Server.RpcMetrics, UGI
[HADOOP-6757] NullPointerException for hadoop clients launched from streaming tasks
[HADOOP-6760] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race
[HADOOP-6762] exception while doing RPC I/O closes channel
[HADOOP-6776] UserGroupInformation.createProxyUser's javadoc is broken
[HADOOP-6813] Add a new newInstance method in FileSystem that takes a "user" as argument
[HADOOP-6815] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh
[HADOOP-6818] Provide a JNI-based implementation of GroupMappingServiceProvider
[HADOOP-6832] Provide a web server plugin that uses a static user for the web UI
[HADOOP-6833] IPC leaks call parameters when exceptions thrown
[HADOOP-6859] Introduce additional statistics to FileSystem
[HADOOP-6864] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of GroupMappingServiceProvider)
[HADOOP-6881] The efficient comparators aren't always used except for BytesWritable and Text
[HADOOP-6899] RawLocalFileSystem#setWorkingDir() does not work for relative names
[HADOOP-6907] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal
[HADOOP-6925] BZip2Codec incorrectly implements read()
[HADOOP-6928] Fix BooleanWritable comparator in 0.20
[HADOOP-6943] The GroupMappingServiceProvider interface should be public
[HADOOP-6950] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template
[HADOOP-6995] Allow wildcards to be used in ProxyUsers configurations
[HADOOP-7082] Configuration.writeXML should not hold lock while outputting
[HADOOP-7101] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context
[HADOOP-7104] Remove unnecessary DNS reverse lookups from RPC layer
[HADOOP-7110] Implement chmod with JNI
[HADOOP-7114] FsShell should dump all exceptions at DEBUG level
[HADOOP-7115] Add a cache for getpwuid_r and getpwgid_r calls
[HADOOP-7118] NPE in Configuration.writeXml
[HADOOP-7122] Timed out shell commands leak Timer threads
[HADOOP-7156] getpwuid_r is not thread-safe on RHEL6
[HADOOP-7172] SecureIO should not check owner on non-secure clusters that have no native support
[HADOOP-7173] Remove unused fstat() call from NativeIO
[HADOOP-7183] WritableComparator.get should not cache comparator objects
[HADOOP-7184] Remove deprecated local.cache.size from core-default.xml

MapReduce Patches

MapR 1.2 includes the following Apache MapReduce patches that are not included in the Apache Hadoop base version 0.20.2:

[MAPREDUCE-112] Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API
[MAPREDUCE-118] Job.getJobID() will always return null
[MAPREDUCE-144] TaskMemoryManager should log process-tree's status while killing tasks.
[MAPREDUCE-181] Secure job submission
[MAPREDUCE-211] Provide a node health check script and run it periodically to check the node health status
[MAPREDUCE-220] Collecting cpu and memory usage for MapReduce tasks
[MAPREDUCE-270] TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes
[MAPREDUCE-277] Job history counters should be avaible on the UI.
[MAPREDUCE-339] JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to fail.
[MAPREDUCE-364] Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.
[MAPREDUCE-369] Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api.
[MAPREDUCE-370] Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.
[MAPREDUCE-415] JobControl Job does always has an unassigned name
[MAPREDUCE-416] Move the completed jobs' history files to a DONE subdirectory inside the configured history directory
[MAPREDUCE-461] Enable ServicePlugins for the JobTracker
[MAPREDUCE-463] The job setup and cleanup tasks should be optional
[MAPREDUCE-467] Collect information about number of tasks succeeded / total per time unit for a tasktracker.


[MAPREDUCE-476] extend DistributedCache to work locally (LocalJobRunner)
[MAPREDUCE-478] separate jvm param for mapper and reducer
[MAPREDUCE-516] Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs
[MAPREDUCE-517] The capacity-scheduler should assign multiple tasks per heartbeat
[MAPREDUCE-521] After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks.
[MAPREDUCE-532] Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue
[MAPREDUCE-551] Add preemption to the fair scheduler
[MAPREDUCE-572] If #link is missing from uri format of -cacheArchive then streaming does not throw error.
[MAPREDUCE-655] Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api.
[MAPREDUCE-676] Existing diagnostic rules fail for MAP ONLY jobs
[MAPREDUCE-679] XML-based metrics as JSP servlet for JobTracker
[MAPREDUCE-680] Reuse of Writable objects is improperly handled by MRUnit
[MAPREDUCE-682] Reserved tasktrackers should be removed when a node is globally blacklisted
[MAPREDUCE-693] Conf files not moved to "done" subdirectory after JT restart
[MAPREDUCE-698] Per-pool task limits for the fair scheduler
[MAPREDUCE-706] Support for FIFO pools in the fair scheduler
[MAPREDUCE-707] Provide a jobconf property for explicitly assigning a job to a pool
[MAPREDUCE-709] node health check script does not display the correct message on timeout
[MAPREDUCE-714] JobConf.findContainingJar unescapes unnecessarily on Linux
[MAPREDUCE-716] org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle
[MAPREDUCE-722] More slots are getting reserved for HiRAM job tasks then required
[MAPREDUCE-732] node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode
[MAPREDUCE-734] java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs
[MAPREDUCE-739] Allow relative paths to be created inside archives.
[MAPREDUCE-740] Provide summary information per job once a job is finished.
[MAPREDUCE-744] Support in DistributedCache to share cache files with other users after HADOOP-4493
[MAPREDUCE-754] NPE in expiry thread when a TT is lost
[MAPREDUCE-764] TypedBytesInput's readRaw() does not preserve custom type codes
[MAPREDUCE-768] Configuration information should generate dump in a standard format.
[MAPREDUCE-771] Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks
[MAPREDUCE-782] Use PureJavaCrc32 in mapreduce spills
[MAPREDUCE-787] -files, -archives should honor user given symlink path
[MAPREDUCE-809] Job summary logs show status of completed jobs as RUNNING
[MAPREDUCE-814] Move completed Job history files to HDFS
[MAPREDUCE-817] Add a cache for retired jobs with minimal job info and provide a way to access history file url
[MAPREDUCE-825] JobClient completion poll interval of 5s causes slow tests in local mode
[MAPREDUCE-840] DBInputFormat leaves open transaction
[MAPREDUCE-842] Per-job local data on the TaskTracker node should have right access-control
[MAPREDUCE-856] Localized files from DistributedCache should have right access-control
[MAPREDUCE-871] Job/Task local files have incorrect group ownership set by LinuxTaskController binary
[MAPREDUCE-875] Make DBRecordReader execute queries lazily
[MAPREDUCE-885] More efficient SQL queries for DBInputFormat
[MAPREDUCE-890] After HADOOP-4491, the user who started mapred system is not able to run job.
[MAPREDUCE-896] Users can set non-writable permissions on temporary files for TT and can abuse disk usage.
[MAPREDUCE-899] When using LinuxTaskController, localized files may become accessible to unintended users if permissions are misconfigured.
[MAPREDUCE-927] Cleanup of task-logs should happen in TaskTracker instead of the Child
[MAPREDUCE-947] OutputCommitter should have an abortJob method
[MAPREDUCE-964] Inaccurate values in jobSummary logs
[MAPREDUCE-967] TaskTracker does not need to fully unjar job jars
[MAPREDUCE-968] NPE in distcp encountered when placing _logs directory on S3FileSystem
[MAPREDUCE-971] distcp does not always remove distcp.tmp.dir
[MAPREDUCE-1028] Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.
[MAPREDUCE-1030] Reduce tasks are getting starved in capacity scheduler
[MAPREDUCE-1048] Show total slot usage in cluster summary on jobtracker webui
[MAPREDUCE-1059] distcp can generate uneven map task assignments
[MAPREDUCE-1083] Use the user-to-groups mapping service in the JobTracker
[MAPREDUCE-1085] For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit
[MAPREDUCE-1086] hadoop commands in streaming tasks are trying to write to tasktracker's log
[MAPREDUCE-1088] JobHistory files should have narrower 0600 perms
[MAPREDUCE-1089] Fair Scheduler preemption triggers NPE when tasks are scheduled but not running
[MAPREDUCE-1090] Modify log statement in Tasktracker log related to memory monitoring to include attempt id.
[MAPREDUCE-1098] Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.
[MAPREDUCE-1100] User's task-logs filling up local disks on the TaskTrackers
[MAPREDUCE-1103] Additional JobTracker metrics
[MAPREDUCE-1105] CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity
[MAPREDUCE-1118] Capacity Scheduler scheduling information is hard to read / should be tabular format
[MAPREDUCE-1131] Using profilers other than hprof can cause JobClient to report job failure
[MAPREDUCE-1140] Per cache-file refcount can become negative when tasks release distributed-cache files
[MAPREDUCE-1143] runningMapTasks counter is not properly decremented in case of failed Tasks.
[MAPREDUCE-1155] Streaming tests swallow exceptions
[MAPREDUCE-1158] running_maps is not decremented when the tasks of a job is killed/failed
[MAPREDUCE-1160] Two log statements at INFO level fill up jobtracker logs
[MAPREDUCE-1171] Lots of fetch failures
[MAPREDUCE-1178] MultipleInputs fails with ClassCastException


[MAPREDUCE-1185] URL to JT webconsole for running job and job history should be the same
[MAPREDUCE-1186] While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir
[MAPREDUCE-1196] MAPREDUCE-947 incompatibly changed FileOutputCommitter
[MAPREDUCE-1198] Alternatively schedule different types of tasks in fair share scheduler
[MAPREDUCE-1213] TaskTrackers restart is very slow because it deletes distributed cache directory synchronously
[MAPREDUCE-1219] JobTracker Metrics causes undue load on JobTracker
[MAPREDUCE-1221] Kill tasks on a node if the free physical memory on that machine falls below a configured threshold
[MAPREDUCE-1231] Distcp is very slow
[MAPREDUCE-1250] Refactor job token to use a common token interface
[MAPREDUCE-1258] Fair scheduler event log not logging job info
[MAPREDUCE-1285] DistCp cannot handle -delete if destination is local filesystem
[MAPREDUCE-1288] DistributedCache localizes only once per cache URI
[MAPREDUCE-1293] AutoInputFormat doesn't work with non-default FileSystems
[MAPREDUCE-1302] TrackerDistributedCacheManager can delete file asynchronously
[MAPREDUCE-1304] Add counters for task time spent in GC
[MAPREDUCE-1307] Introduce the concept of Job Permissions
[MAPREDUCE-1313] NPE in FieldFormatter if escape character is set and field is null
[MAPREDUCE-1316] JobTracker holds stale references to retired jobs via unreported tasks
[MAPREDUCE-1342] Potential JT deadlock in faulty TT tracking
[MAPREDUCE-1354] Incremental enhancements to the JobTracker for better scalability
[MAPREDUCE-1372] ConcurrentModificationException in JobInProgress
[MAPREDUCE-1378] Args in job details links on jobhistory.jsp are not URL encoded
[MAPREDUCE-1382] MRAsyncDiscService should tolerate missing local.dir
[MAPREDUCE-1397] NullPointerException observed during task failures
[MAPREDUCE-1398] TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed.
[MAPREDUCE-1399] The archive command shows a null error message
[MAPREDUCE-1403] Save file-sizes of each of the artifacts in DistributedCache in the JobConf
[MAPREDUCE-1421] LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385
[MAPREDUCE-1422] Changing permissions of files/dirs under job-work-dir may be needed so that cleaning up of job-dir in all mapred-local-directories succeeds always
[MAPREDUCE-1423] Improve performance of CombineFileInputFormat when multiple pools are configured
[MAPREDUCE-1425] archive throws OutOfMemoryError
[MAPREDUCE-1435] symlinks in cwd of the task are not handled properly after MAPREDUCE-896
[MAPREDUCE-1436] Deadlock in preemption code in fair scheduler
[MAPREDUCE-1440] MapReduce should use the short form of the user names
[MAPREDUCE-1441] Configuration of directory lists should trim whitespace
[MAPREDUCE-1442] StackOverflowError when JobHistory parses a really long line
[MAPREDUCE-1443] DBInputFormat can leak connections
[MAPREDUCE-1454] The servlets should quote server generated strings sent in the response
[MAPREDUCE-1455] Authorization for servlets
[MAPREDUCE-1457] For secure job execution, couple of more UserGroupInformation.doAs needs to be added
[MAPREDUCE-1464] In JobTokenIdentifier change method getUsername to getUser which returns UGI
[MAPREDUCE-1466] FileInputFormat should save #input-files in JobConf
[MAPREDUCE-1476] committer.needsTaskCommit should not be called for a task cleanup attempt
[MAPREDUCE-1480] CombineFileRecordReader does not properly initialize child RecordReader
[MAPREDUCE-1493] Authorization for job-history pages
[MAPREDUCE-1503] Push HADOOP-6551 into MapReduce
[MAPREDUCE-1505] Cluster class should create the rpc client only when needed
[MAPREDUCE-1521] Protection against incorrectly configured reduces
[MAPREDUCE-1522] FileInputFormat may change the file system of an input path
[MAPREDUCE-1526] Cache the job related information while submitting the job, this would avoid many RPC calls to JobTracker.
[MAPREDUCE-1533] Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString()
[MAPREDUCE-1538] TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit
[MAPREDUCE-1543] Log messages of JobACLsManager should use security logging of HADOOP-6586
[MAPREDUCE-1545] Add 'first-task-launched' to job-summary
[MAPREDUCE-1550] UGI.doAs should not be used for getting the history file of jobs
[MAPREDUCE-1563] Task diagnostic info would get missed sometimes.
[MAPREDUCE-1570] Shuffle stage - Key and Group Comparators
[MAPREDUCE-1607] Task controller may not set permissions for a task cleanup attempt's log directory
[MAPREDUCE-1609] TaskTracker.localizeJob should not set permissions on job log directory recursively
[MAPREDUCE-1611] Refresh nodes and refresh queues doesnt work with service authorization enabled
[MAPREDUCE-1612] job conf file is not accessible from job history web page
[MAPREDUCE-1621] Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output
[MAPREDUCE-1635] ResourceEstimator does not work after MAPREDUCE-842
[MAPREDUCE-1641] Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives
[MAPREDUCE-1656] JobStory should provide queue info.
[MAPREDUCE-1657] After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs
[MAPREDUCE-1664] Job Acls affect Queue Acls
[MAPREDUCE-1680] Add a metrics to track the number of heartbeats processed
[MAPREDUCE-1682] Tasks should not be scheduled after tip is killed/failed.
[MAPREDUCE-1683] Remove JNI calls from ClusterStatus cstr
[MAPREDUCE-1699] JobHistory shouldn't be disabled for any reason
[MAPREDUCE-1707] TaskRunner can get NPE in getting ugi from TaskTracker
[MAPREDUCE-1716] Truncate logs of finished tasks to prevent node thrash due to excessive logging


[MAPREDUCE-1733] Authentication between pipes processes and java counterparts.
[MAPREDUCE-1734] Un-deprecate the old MapReduce API in the 0.20 branch
[MAPREDUCE-1744] DistributedCache creates its own FileSytem instance when adding a file/archive to the path
[MAPREDUCE-1754] Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators
[MAPREDUCE-1759] Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved
[MAPREDUCE-1778] CompletedJobStatusStore initialization should fail if mapred.job.tracker.persist.jobstatus.dir is unwritable
[MAPREDUCE-1784] IFile should check for null compressor
[MAPREDUCE-1785] Add streaming config option for not emitting the key
[MAPREDUCE-1832] Support for file sizes less than 1MB in DFSIO benchmark.
[MAPREDUCE-1845] FairScheduler.tasksToPeempt() can return negative number
[MAPREDUCE-1850] Include job submit host information (name and ip) in jobconf and jobdetails display
[MAPREDUCE-1853] MultipleOutputs does not cache TaskAttemptContext
[MAPREDUCE-1868] Add read timeout on userlog pull
[MAPREDUCE-1872] Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler
[MAPREDUCE-1887] MRAsyncDiskService does not properly absolutize volume root paths
[MAPREDUCE-1900] MapReduce daemons should close FileSystems that are not needed anymore
[MAPREDUCE-1914] TrackerDistributedCacheManager never cleans its input directories
[MAPREDUCE-1938] Ability for having user's classes take precedence over the system classes for tasks' classpath
[MAPREDUCE-1960] Limit the size of jobconf.
[MAPREDUCE-1961] ConcurrentModificationException when shutting down Gridmix
[MAPREDUCE-1985] java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps
[MAPREDUCE-2023] TestDFSIO read test may not read specified bytes.
[MAPREDUCE-2082] Race condition in writing the jobtoken password file when launching pipes jobs
[MAPREDUCE-2096] Secure local filesystem IO from symlink vulnerabilities
[MAPREDUCE-2103] task-controller shouldn't require o-r permissions
[MAPREDUCE-2157] safely handle InterruptedException and interrupted status in MR code
[MAPREDUCE-2178] Race condition in LinuxTaskController permissions handling
[MAPREDUCE-2219] JT should not try to remove mapred.system.dir during startup
[MAPREDUCE-2234] If Localizer can't create task log directory, it should fail on the spot
[MAPREDUCE-2235] JobTracker "over-synchronization" makes it hang up in certain cases
[MAPREDUCE-2242] LinuxTaskController doesn't properly escape environment variables
[MAPREDUCE-2253] Servlets should specify content type
[MAPREDUCE-2256] FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below fairshare.
[MAPREDUCE-2289] Permissions race can make getStagingDir fail on local filesystem
[MAPREDUCE-2321] TT should fail to start on secure cluster when SecureIO isn't available
[MAPREDUCE-2323] Add metrics to the fair scheduler
[MAPREDUCE-2328] memory-related configurations missing from mapred-default.xml
[MAPREDUCE-2332] Improve error messages when MR dirs on local FS have bad ownership
[MAPREDUCE-2351] mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI
[MAPREDUCE-2353] Make the MR changes to reflect the API changes in SecureIO library
[MAPREDUCE-2356] A task succeeded even though there were errors on all attempts.
[MAPREDUCE-2364] Shouldn't hold lock on rjob while localizing resources.
[MAPREDUCE-2366] TaskTracker can't retrieve stdout and stderr from web UI
[MAPREDUCE-2371] TaskLogsTruncater does not need to check log ownership when running as Child
[MAPREDUCE-2372] TaskLogAppender mechanism shouldn't be set in log4j.properties
[MAPREDUCE-2373] When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout
[MAPREDUCE-2374] Should not use PrintWriter to write taskjvm.sh
[MAPREDUCE-2377] task-controller fails to parse configuration if it doesn't end in \n
[MAPREDUCE-2379] Distributed cache sizing configurations are missing from mapred-default.xml


Version 1.2.2 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(5840) Mfs process stays at 100% after upgrade
(5848) Container resyncs not executed in a timely manner
(5866) Mfs generates core file.
(5897) CLDB exception causes failover
(5907) CLDB exception causes failover
(5961) File System loses track of which container is master.
(5971) Memory leak in MapR client
(6044) CLDB over-replicates data


Version 1.2.3 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.4
hbase-0.90.4
hive-0.7.1
mahout-0.5
oozie-3.0.0
pig-0.9.0
sqoop-1.3.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(1438) Home directory support
(5833) Improve CLDB failover time
(5857) Streamline container replication and reporting
(5934) Ensure disktab is correct after reboot
(6014) Optimize Java garbage collection to mitigate CLDB disruptions
(6074) Enhance CLDB exception handling
(6140) Improve mfs exception handling
(6144) CLDB timeout in M3 reduced to 1 hour on new node (instant on same node)
(6166) Add API to blacklist a tasktracker manually
(6171) Fixed container stuck offline problem
(6198) Corrected overcommit documentation page
(6211) Rolling upgrade enhancements
(6235) BTree improvements
(6273) Fixed getBlockLocations() in MapReduce layer


Version 1.1 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

Mac OS Client

A Mac OS client is now available. For more information, see Installing the MapR Client on Mac OS X

Resolved Issues

(Issue 4415) Select and Kill Controls in JobTracker UI
(Issue 2809) Add Release Note for NFS Dependencies

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.
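
For example, a minimal sketch of the delete-and-retry workaround (the volume and snapshot names are hypothetical):

# Remove the existing (or failed) snapshot, then create it again
maprcli volume snapshot remove -volume myvol -snapshotname mysnap
maprcli volume snapshot create -volume myvol -snapshotname mysnap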

(Issue 4269) Bulk Operations

The MapR Control System provides both a Select All checkbox and a link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.



(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html
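
A minimal sketch of that cleanup (the local source directory is hypothetical):

# Delete stale local checksum files, then retry the copy
find /path/to/local/source -name '*.crc' -type f -delete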

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments like EC2, VMWare, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).
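
A minimal sketch of that edit, run as root on each node (the sed pattern assumes the default ENABLED=1 entry is present):

# Disable the IRQ balancer, then reboot for the change to take effect
sed -i 's/^ENABLED=.*/ENABLED=0/' /etc/default/irqbalance
reboot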

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.) then the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.
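
A minimal sketch of the recovery sequence (the volume name and dump file are hypothetical):

# Clear the "Mirroring in Progress" state, then retry the restore
maprcli volume mirror stop -name myvol
maprcli volume dump restore -name myvol -dumpfile /tmp/myvol.dump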

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start


Version 1.1.3 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(Issue 4307) Snapshot create fails with error EEXIST

Known Issues

(Issue 4269) Bulk Operations

The MapR Control System provides both a Select All checkbox and a link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.


Version 1.1.2 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

This is a maintenance release with no new features.

Resolved Issues

(Issue 4037) Starting Newly Added Services (Documentation fix)
(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links
(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name
(Issue 3524) Apache Port 80 Open (Documentation fix)
(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines (Documentation fix)
(Issue 3244) Volume Mirror Issue
(Issue 3028) Changing the Time on a ZooKeeper Node (Documentation fix)

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a Select All checkbox and a link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.


(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.


Version 1.1.1 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

New in This Release

EMC License support

Packages built for EMC have "EMC" in the MapRBuildVersion (example: 1.1.0.10806EMC-1).
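To check which build is installed on a node, one option (assuming the standard /opt/mapr install location) is to read the build version file:

# Print the MapR build version; EMC builds include "EMC" in the string
cat /opt/mapr/MapRBuildVersion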

HBase LeaseRecovery

FSUtils LeaseRecovery is supported in HBase trunk and HBase 0.90.2. To run a different version of HBase with MapR, apply the following patches and compile HBase:

https://issues.apache.org/jira/secure/attachment/12489782/4169-v5.txt
https://issues.apache.org/jira/secure/attachment/12489818/4169-correction.txt
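An illustrative sketch only (the patch strip level and build steps depend on your HBase source tree and build tooling):

# Download the patches attached to HBase JIRA 4169
wget https://issues.apache.org/jira/secure/attachment/12489782/4169-v5.txt
wget https://issues.apache.org/jira/secure/attachment/12489818/4169-correction.txt
# Apply them in the HBase source tree, then rebuild HBase
cd hbase
patch -p0 < ../4169-v5.txt
patch -p0 < ../4169-correction.txt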

Resolved Issues

(Issue 4792) Synchronization in reading JobTracker address
(Issue 4910) File attributes sometimes not updated correctly in client cache
(HBase Jira 4169) FSUtils LeaseRecovery for non HDFS FileSystems
(Issue 4905) FSUtils LeaseRecovery for MapR

Known Issues

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.
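A minimal sketch of the warden restart, assuming the same init-script convention these notes use for ZooKeeper:

# Stop and start all MapR services on the node so memory is reconfigured
/etc/init.d/mapr-warden stop
/etc/init.d/mapr-warden start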

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.
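One way to locate broken symbolic links in the source tree before copying (a GNU find sketch; the path is a placeholder):

# -xtype l matches symbolic links whose targets no longer exist
find /path/to/source -xtype l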

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html
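For example, to find and remove the local checksum files before retrying (the directory is a placeholder; review the matches before deleting):

# List, then delete, the .crc checksum files in the local source directory
find /local/source -name '*.crc' -type f
find /local/source -name '*.crc' -type f -delete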

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.
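One possible way to close the port with iptables (a sketch; adapt it to however your site manages firewall rules):

# Reject inbound TCP connections to port 80; the MapR Control System remains on 8443
iptables -A INPUT -p tcp --dport 80 -j REJECT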

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments like EC2, VMWare, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).
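A minimal sketch of the edit, run as root on each node and followed by a reboot:

# Disable the Ubuntu IRQ balancer
sed -i 's/^ENABLED=.*/ENABLED=0/' /etc/default/irqbalance
reboot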

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.), the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.
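A hedged sketch with maprcli (the volume name and dump file are placeholders; confirm the option names against the command reference):

# Clear the "Mirroring in Progress" state, then retry the restore
maprcli volume mirror stop -name myvolume
maprcli volume dump restore -name myvolume -dumpfile /mapr/dumps/myvolume.dump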

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start
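Condensed as a shell sequence for a single node (ntpdate and the time server are assumptions; use your site's time source):

/etc/init.d/mapr-zookeeper stop
ntpdate pool.ntp.org    # or set the clock manually with date
/etc/init.d/mapr-zookeeper start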


Hadoop Compatibility in Version 1.1

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

Hadoop Common Patches

MapR 1.1 includes the following Apache Hadoop issues that are not included in the Apache Hadoop base version 0.20.2:

[ Make streaming to handle non-utf8 byte arrayHADOOP-1722] IPC server max queue size should be configurable[HADOOP-1849] speculative execution start up condition based on completion time[HADOOP-2141] Space in the value for dfs.data.dir can cause great problems[HADOOP-2366] Use job control for tasks (and therefore for pipes and streaming)[HADOOP-2721] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni[HADOOP-2838] Shuffling fetchers waited too long between map output fetch re-tries[HADOOP-3327] Patch to allow hadoop native to compile on Mac OS X[HADOOP-3659] Providing splitting support for bzip2 compressed files[HADOOP-4012] IsolationRunner does not work as documented[HADOOP-4041] Map and Reduce tasks should run as the user who submitted the job[HADOOP-4490] FileSystem.CACHE should be ref-counted[HADOOP-4655] Add a user to groups mapping service[HADOOP-4656] Current Ganglia metrics implementation is incompatible with Ganglia 3.1[HADOOP-4675] Allow FileSystem shutdown hook to be disabled[HADOOP-4829] Streaming combiner should allow command, not just JavaClass[HADOOP-4842] Implement setuid executable for Linux to assist in launching tasks as job owners[HADOOP-4930] ConcurrentModificationException in JobHistory.java[HADOOP-4933] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide[HADOOP-5170] Option to prohibit jars unpacking[HADOOP-5175] TT's version build is too restrictive[HADOOP-5203] Queue ACLs should be refreshed without requiring a restart of the job tracker[HADOOP-5396] Provide a way for users to find out what operations they can do on which M/R queues[HADOOP-5419] Support killing of process groups in LinuxTaskController binary[HADOOP-5420] The job history display needs to be paged[HADOOP-5442] Add support for application-specific typecodes to typed bytes[HADOOP-5450] Exposing Hadoop metrics via HTTP[HADOOP-5469] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken[HADOOP-5476] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully[HADOOP-5488] Binary partitioner[HADOOP-5528] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact[HADOOP-5582]

representation). Hadoop Streaming - GzipCodec[HADOOP-5592] change S3Exception to checked exception[HADOOP-5613] Ability to blacklist tasktracker[HADOOP-5643] Counter for S3N Read Bytes does not work[HADOOP-5656] DistCp should not launch a job if it is not necessary[HADOOP-5675] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics[HADOOP-5733] UGI checks in testcases are broken[HADOOP-5737] Split waiting tasks field in JobTracker metrics to individual tasks[HADOOP-5738] Allow setting the default value of maxRunningJobs for all pools[HADOOP-5745] The length of the heartbeat cycle should be configurable.[HADOOP-5784] JobTracker should refresh the hosts list upon recovery[HADOOP-5801] problem using top level s3 buckets as input/output directories[HADOOP-5805] s3n files are not getting split by default[HADOOP-5861] GzipCodec should read compression level etc from configuration[HADOOP-5879] Allow administrators to be able to start and stop queues[HADOOP-5913] Use JDK 1.6 File APIs in DF.java wherever possible[HADOOP-5958] create script to provide classpath for external tools[HADOOP-5976] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController[HADOOP-5980] HADOOP-2838 doesnt work as expected[HADOOP-5981] RPC client opens an extra connection for VersionedProtocol[HADOOP-6132] ReflectionUtils performance regression[HADOOP-6133] Implement a pure Java CRC32 calculator[HADOOP-6148] Add get/setEnum to Configuration[HADOOP-6161] Improve PureJavaCrc32[HADOOP-6166]


Provide a configuration dump in json format.[HADOOP-6184] Configuration does not lock parameters marked final if they have no value.[HADOOP-6227] Permission configuration files should use octal and symbolic[HADOOP-6234] s3n fails with SocketTimeoutException[HADOOP-6254] Missing synchronization for defaultResources in Configuration.addResource[HADOOP-6269] Add JVM memory usage to JvmMetrics[HADOOP-6279] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full[HADOOP-6284] Use JAAS LoginContext for our login[HADOOP-6299] Configuration sends too much data to log4j[HADOOP-6312] Update FilterInitializer class to be more visible and take a conf for further development[HADOOP-6337] Stack trace of any runtime exceptions should be recorded in the server logs.[HADOOP-6343] Log errors getting Unix UGI[HADOOP-6400] Add a /conf servlet to dump running configuration[HADOOP-6408] Change RPC layer to support SASL based mutual authentication[HADOOP-6419] Add AsyncDiskService that is used in both hdfs and mapreduce[HADOOP-6433] Prevent remote CSS attacks in Hostname and UTF-7.[HADOOP-6441] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH[HADOOP-6453] StringBuffer -> StringBuilder - conversion of references as necessary[HADOOP-6471] HttpServer sends wrong content-type for CSS files (and others)[HADOOP-6496] doAs for proxy user[HADOOP-6510] FsPermission:SetUMask not updated to use new-style umask setting.[HADOOP-6521] LocalDirAllocator should use whitespace trimming configuration getters[HADOOP-6534] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers[HADOOP-6543] archive does not work with distcp -update[HADOOP-6558] Authorization for default servlets[HADOOP-6568] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read[HADOOP-6569] RPC responses may be out-of-order with respect to SASL[HADOOP-6572] IPC server response buffer reset threshold should be configurable[HADOOP-6577] Configuration should trim whitespace around a lot of value types[HADOOP-6578] Split RPC metrics into summary and detailed metrics[HADOOP-6599] Deadlock in DFSClient#getBlockLocations even with the security disabled[HADOOP-6609] RPC server should check for version mismatch first[HADOOP-6613] "Bad Connection to FS" message in FSShell should print message from the exception[HADOOP-6627] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.[HADOOP-6631] AccessControlList uses full-principal names to verify acls causing queue-acls to fail[HADOOP-6634] Benchmark overhead of RPC session establishment[HADOOP-6637] FileSystem.get() does RPC retries within a static synchronized block[HADOOP-6640] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention[HADOOP-6644] login object in UGI should be inside the subject[HADOOP-6649] ShellBasedUnixGroupsMapping shouldn't have a cache[HADOOP-6652] NullPointerException in setupSaslConnection when browsing directories[HADOOP-6653] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file[HADOOP-6663] RPC.waitForProxy should retry through NoRouteToHostException[HADOOP-6667] zlib.compress.level ignored for DefaultCodec initialization[HADOOP-6669] UserGroupInformation doesn't support use in hash tables[HADOOP-6670] Performance Improvement in Secure RPC[HADOOP-6674] user object in the subject in UGI should be reused in case of a relogin.[HADOOP-6687] Incorrect exit codes for "dfs -chown", "dfs 
-chgrp"[HADOOP-6701] Relogin behavior for RPC clients could be improved[HADOOP-6706] Symbolic umask for file creation is not consistent with posix[HADOOP-6710] FsShell 'hadoop fs -text' does not support compression codecs[HADOOP-6714] Client does not close connection when an exception happens during SASL negotiation[HADOOP-6718] NetUtils.connect should check that it hasn't connected a socket to itself[HADOOP-6722] unchecked exceptions thrown in IPC Connection orphan clients[HADOOP-6723] IPC doesn't properly handle IOEs thrown by socket factory[HADOOP-6724] adding some java doc to Server.RpcMetrics, UGI[HADOOP-6745] NullPointerException for hadoop clients launched from streaming tasks[HADOOP-6757] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race[HADOOP-6760] exception while doing RPC I/O closes channel[HADOOP-6762] UserGroupInformation.createProxyUser's javadoc is broken[HADOOP-6776] Add a new newInstance method in FileSystem that takes a "user" as argument[HADOOP-6813] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh[HADOOP-6815] Provide a JNI-based implementation of GroupMappingServiceProvider[HADOOP-6818] Provide a web server plugin that uses a static user for the web UI[HADOOP-6832] IPC leaks call parameters when exceptions thrown[HADOOP-6833] Introduce additional statistics to FileSystem[HADOOP-6859] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of[HADOOP-6864]

GroupMappingServiceProvider) The efficient comparators aren't always used except for BytesWritable and Text[HADOOP-6881] RawLocalFileSystem#setWorkingDir() does not work for relative names[HADOOP-6899] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal[HADOOP-6907] BZip2Codec incorrectly implements read()[HADOOP-6925] Fix BooleanWritable comparator in 0.20[HADOOP-6928] The GroupMappingServiceProvider interface should be public[HADOOP-6943] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template[HADOOP-6950]


Allow wildcards to be used in ProxyUsers configurations[HADOOP-6995] Configuration.writeXML should not hold lock while outputting[HADOOP-7082] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context[HADOOP-7101] Remove unnecessary DNS reverse lookups from RPC layer[HADOOP-7104] Implement chmod with JNI[HADOOP-7110] FsShell should dump all exceptions at DEBUG level[HADOOP-7114] Add a cache for getpwuid_r and getpwgid_r calls[HADOOP-7115] NPE in Configuration.writeXml[HADOOP-7118] Timed out shell commands leak Timer threads[HADOOP-7122] getpwuid_r is not thread-safe on RHEL6[HADOOP-7156] SecureIO should not check owner on non-secure clusters that have no native support[HADOOP-7172] Remove unused fstat() call from NativeIO[HADOOP-7173] WritableComparator.get should not cache comparator objects[HADOOP-7183] Remove deprecated local.cache.size from core-default.xml[HADOOP-7184]

MapReduce Patches

MapR 1.1 includes the following Apache MapReduce issues that are not included in the Apache Hadoop base version 0.20.2:

[MAPREDUCE-112] Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API Job.getJobID() will always return null[MAPREDUCE-118] TaskMemoryManager should log process-tree's status while killing tasks.[MAPREDUCE-144] Secure job submission[MAPREDUCE-181] Provide a node health check script and run it periodically to check the node health status[MAPREDUCE-211] Collecting cpu and memory usage for MapReduce tasks[MAPREDUCE-220] TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes[MAPREDUCE-270] Job history counters should be avaible on the UI.[MAPREDUCE-277] JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to[MAPREDUCE-339]

fail. Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api.[MAPREDUCE-364] Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api.[MAPREDUCE-369] Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api.[MAPREDUCE-370] JobControl Job does always has an unassigned name[MAPREDUCE-415] Move the completed jobs' history files to a DONE subdirectory inside the configured history directory[MAPREDUCE-416] Enable ServicePlugins for the JobTracker[MAPREDUCE-461] The job setup and cleanup tasks should be optional[MAPREDUCE-463] Collect information about number of tasks succeeded / total per time unit for a tasktracker.[MAPREDUCE-467] extend DistributedCache to work locally (LocalJobRunner)[MAPREDUCE-476] separate jvm param for mapper and reducer[MAPREDUCE-478] Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs[MAPREDUCE-516] The capacity-scheduler should assign multiple tasks per heartbeat[MAPREDUCE-517] After JobTracker restart Capacity Schduler does not schedules pending tasks from already running tasks.[MAPREDUCE-521] Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue[MAPREDUCE-532] Add preemption to the fair scheduler[MAPREDUCE-551] If #link is missing from uri format of -cacheArchive then streaming does not throw error.[MAPREDUCE-572] Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api.[MAPREDUCE-655] Existing diagnostic rules fail for MAP ONLY jobs[MAPREDUCE-676] XML-based metrics as JSP servlet for JobTracker[MAPREDUCE-679] Reuse of Writable objects is improperly handled by MRUnit[MAPREDUCE-680] Reserved tasktrackers should be removed when a node is globally blacklisted[MAPREDUCE-682] Conf files not moved to "done" subdirectory after JT restart[MAPREDUCE-693] Per-pool task limits for the fair scheduler[MAPREDUCE-698] Support for FIFO pools in the fair scheduler[MAPREDUCE-706] Provide a jobconf property for explicitly assigning a job to a pool[MAPREDUCE-707] node health check script does not display the correct message on timeout[MAPREDUCE-709] JobConf.findContainingJar unescapes unnecessarily on Linux[MAPREDUCE-714] org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle[MAPREDUCE-716] More slots are getting reserved for HiRAM job tasks then required[MAPREDUCE-722] node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode[MAPREDUCE-732] java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs[MAPREDUCE-734] Allow relative paths to be created inside archives.[MAPREDUCE-739] Provide summary information per job once a job is finished.[MAPREDUCE-740] Support in DistributedCache to share cache files with other users after HADOOP-4493[MAPREDUCE-744] NPE in expiry thread when a TT is lost[MAPREDUCE-754] TypedBytesInput's readRaw() does not preserve custom type codes[MAPREDUCE-764] Configuration information should generate dump in a standard format.[MAPREDUCE-768] Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks[MAPREDUCE-771] Use PureJavaCrc32 in mapreduce spills[MAPREDUCE-782] -files, -archives should honor user given symlink path[MAPREDUCE-787] Job summary logs show status of completed jobs as RUNNING[MAPREDUCE-809] Move completed Job history files to HDFS[MAPREDUCE-814] Add a cache for retired jobs with minimal job info and provide a way to access history file url[MAPREDUCE-817] JobClient completion poll interval of 5s causes 
slow tests in local mode[MAPREDUCE-825] DBInputFormat leaves open transaction[MAPREDUCE-840]


Per-job local data on the TaskTracker node should have right access-control[MAPREDUCE-842] Localized files from DistributedCache should have right access-control[MAPREDUCE-856] Job/Task local files have incorrect group ownership set by LinuxTaskController binary[MAPREDUCE-871] Make DBRecordReader execute queries lazily[MAPREDUCE-875] More efficient SQL queries for DBInputFormat[MAPREDUCE-885] After HADOOP-4491, the user who started mapred system is not able to run job.[MAPREDUCE-890] Users can set non-writable permissions on temporary files for TT and can abuse disk usage.[MAPREDUCE-896] When using LinuxTaskController, localized files may become accessible to unintended users if permissions are[MAPREDUCE-899]

misconfigured. Cleanup of task-logs should happen in TaskTracker instead of the Child[MAPREDUCE-927] OutputCommitter should have an abortJob method[MAPREDUCE-947] Inaccurate values in jobSummary logs[MAPREDUCE-964] TaskTracker does not need to fully unjar job jars[MAPREDUCE-967] NPE in distcp encountered when placing _logs directory on S3FileSystem[MAPREDUCE-968] distcp does not always remove distcp.tmp.dir[MAPREDUCE-971] Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state.[MAPREDUCE-1028] Reduce tasks are getting starved in capacity scheduler[MAPREDUCE-1030] Show total slot usage in cluster summary on jobtracker webui[MAPREDUCE-1048] distcp can generate uneven map task assignments[MAPREDUCE-1059] Use the user-to-groups mapping service in the JobTracker[MAPREDUCE-1083] For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit[MAPREDUCE-1085] hadoop commands in streaming tasks are trying to write to tasktracker's log[MAPREDUCE-1086] JobHistory files should have narrower 0600 perms[MAPREDUCE-1088] Fair Scheduler preemption triggers NPE when tasks are scheduled but not running[MAPREDUCE-1089] Modify log statement in Tasktracker log related to memory monitoring to include attempt id.[MAPREDUCE-1090] Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks.[MAPREDUCE-1098] User's task-logs filling up local disks on the TaskTrackers[MAPREDUCE-1100] Additional JobTracker metrics[MAPREDUCE-1103] CapacityScheduler: It should be possible to set queue hard-limit beyond it's actual capacity[MAPREDUCE-1105] Capacity Scheduler scheduling information is hard to read / should be tabular format[MAPREDUCE-1118] Using profilers other than hprof can cause JobClient to report job failure[MAPREDUCE-1131] Per cache-file refcount can become negative when tasks release distributed-cache files[MAPREDUCE-1140] runningMapTasks counter is not properly decremented in case of failed Tasks.[MAPREDUCE-1143] Streaming tests swallow exceptions[MAPREDUCE-1155] running_maps is not decremented when the tasks of a job is killed/failed[MAPREDUCE-1158] Two log statements at INFO level fill up jobtracker logs[MAPREDUCE-1160] Lots of fetch failures[MAPREDUCE-1171] MultipleInputs fails with ClassCastException[MAPREDUCE-1178] URL to JT webconsole for running job and job history should be the same[MAPREDUCE-1185] While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir[MAPREDUCE-1186] MAPREDUCE-947 incompatibly changed FileOutputCommitter[MAPREDUCE-1196] Alternatively schedule different types of tasks in fair share scheduler[MAPREDUCE-1198] TaskTrackers restart is very slow because it deletes distributed cache directory synchronously[MAPREDUCE-1213] JobTracker Metrics causes undue load on JobTracker[MAPREDUCE-1219] Kill tasks on a node if the free physical memory on that machine falls below a configured threshold[MAPREDUCE-1221] Distcp is very slow[MAPREDUCE-1231] Refactor job token to use a common token interface[MAPREDUCE-1250] Fair scheduler event log not logging job info[MAPREDUCE-1258] DistCp cannot handle -delete if destination is local filesystem[MAPREDUCE-1285] DistributedCache localizes only once per cache URI[MAPREDUCE-1288] AutoInputFormat doesn't work with non-default FileSystems[MAPREDUCE-1293] TrackerDistributedCacheManager can delete file asynchronously[MAPREDUCE-1302] Add counters for task time spent in GC[MAPREDUCE-1304] Introduce the concept of Job 
Permissions[MAPREDUCE-1307] NPE in FieldFormatter if escape character is set and field is null[MAPREDUCE-1313] JobTracker holds stale references to retired jobs via unreported tasks[MAPREDUCE-1316] Potential JT deadlock in faulty TT tracking[MAPREDUCE-1342] Incremental enhancements to the JobTracker for better scalability[MAPREDUCE-1354] ConcurrentModificationException in JobInProgress[MAPREDUCE-1372] Args in job details links on jobhistory.jsp are not URL encoded[MAPREDUCE-1378] MRAsyncDiscService should tolerate missing local.dir[MAPREDUCE-1382] NullPointerException observed during task failures[MAPREDUCE-1397] TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed.[MAPREDUCE-1398] The archive command shows a null error message[MAPREDUCE-1399] Save file-sizes of each of the artifacts in DistributedCache in the JobConf[MAPREDUCE-1403] LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385[MAPREDUCE-1421] Changing permissions of files/dirs under job-work-dir may be needed sothat cleaning up of job-dir in all[MAPREDUCE-1422]

mapred-local-directories succeeds always Improve performance of CombineFileInputFormat when multiple pools are configured[MAPREDUCE-1423] archive throws OutOfMemoryError[MAPREDUCE-1425] symlinks in cwd of the task are not handled properly after MAPREDUCE-896[MAPREDUCE-1435] Deadlock in preemption code in fair scheduler[MAPREDUCE-1436] MapReduce should use the short form of the user names[MAPREDUCE-1440] Configuration of directory lists should trim whitespace[MAPREDUCE-1441] StackOverflowError when JobHistory parses a really long line[MAPREDUCE-1442]


DBInputFormat can leak connections[MAPREDUCE-1443] The servlets should quote server generated strings sent in the response[MAPREDUCE-1454] Authorization for servlets[MAPREDUCE-1455] For secure job execution, couple of more UserGroupInformation.doAs needs to be added[MAPREDUCE-1457] In JobTokenIdentifier change method getUsername to getUser which returns UGI[MAPREDUCE-1464] FileInputFormat should save #input-files in JobConf[MAPREDUCE-1466] committer.needsTaskCommit should not be called for a task cleanup attempt[MAPREDUCE-1476] CombineFileRecordReader does not properly initialize child RecordReader[MAPREDUCE-1480] Authorization for job-history pages[MAPREDUCE-1493] Push HADOOP-6551 into MapReduce[MAPREDUCE-1503] Cluster class should create the rpc client only when needed[MAPREDUCE-1505] Protection against incorrectly configured reduces[MAPREDUCE-1521] FileInputFormat may change the file system of an input path[MAPREDUCE-1522] Cache the job related information while submitting the job , this would avoid many RPC calls to JobTracker.[MAPREDUCE-1526] Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and[MAPREDUCE-1533]

Counters.makeEscapedString() TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit[MAPREDUCE-1538] Log messages of JobACLsManager should use security logging of HADOOP-6586[MAPREDUCE-1543] Add 'first-task-launched' to job-summary[MAPREDUCE-1545] UGI.doAs should not be used for getting the history file of jobs[MAPREDUCE-1550] Task diagnostic info would get missed sometimes.[MAPREDUCE-1563] Shuffle stage - Key and Group Comparators[MAPREDUCE-1570] Task controller may not set permissions for a task cleanup attempt's log directory[MAPREDUCE-1607] TaskTracker.localizeJob should not set permissions on job log directory recursively[MAPREDUCE-1609] Refresh nodes and refresh queues doesnt work with service authorization enabled[MAPREDUCE-1611] job conf file is not accessible from job history web page[MAPREDUCE-1612] Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output[MAPREDUCE-1621] ResourceEstimator does not work after MAPREDUCE-842[MAPREDUCE-1635] Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives[MAPREDUCE-1641] JobStory should provide queue info.[MAPREDUCE-1656] After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs[MAPREDUCE-1657] Job Acls affect Queue Acls[MAPREDUCE-1664] Add a metrics to track the number of heartbeats processed[MAPREDUCE-1680] Tasks should not be scheduled after tip is killed/failed.[MAPREDUCE-1682] Remove JNI calls from ClusterStatus cstr[MAPREDUCE-1683] JobHistory shouldn't be disabled for any reason[MAPREDUCE-1699] TaskRunner can get NPE in getting ugi from TaskTracker[MAPREDUCE-1707] Truncate logs of finished tasks to prevent node thrash due to excessive logging[MAPREDUCE-1716] Authentication between pipes processes and java counterparts.[MAPREDUCE-1733] Un-deprecate the old MapReduce API in the 0.20 branch[MAPREDUCE-1734] DistributedCache creates its own FileSytem instance when adding a file/archive to the path[MAPREDUCE-1744] Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators[MAPREDUCE-1754] Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved[MAPREDUCE-1759] CompletedJobStatusStore initialization should fail if mapred.job.tracker.persist.jobstatus.dir is unwritable[MAPREDUCE-1778] IFile should check for null compressor[MAPREDUCE-1784] Add streaming config option for not emitting the key[MAPREDUCE-1785] Support for file sizes less than 1MB in DFSIO benchmark.[MAPREDUCE-1832] FairScheduler.tasksToPeempt() can return negative number[MAPREDUCE-1845] Include job submit host information (name and ip) in jobconf and jobdetails display[MAPREDUCE-1850] MultipleOutputs does not cache TaskAttemptContext[MAPREDUCE-1853] Add read timeout on userlog pull[MAPREDUCE-1868] Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler[MAPREDUCE-1872] MRAsyncDiskService does not properly absolutize volume root paths[MAPREDUCE-1887] MapReduce daemons should close FileSystems that are not needed anymore[MAPREDUCE-1900] TrackerDistributedCacheManager never cleans its input directories[MAPREDUCE-1914] Ability for having user's classes take precedence over the system classes for tasks' classpath[MAPREDUCE-1938] Limit the size of jobconf.[MAPREDUCE-1960] ConcurrentModificationException when shutting down Gridmix[MAPREDUCE-1961] java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps[MAPREDUCE-1985] TestDFSIO 
read test may not read specified bytes.[MAPREDUCE-2023] Race condition in writing the jobtoken password file when launching pipes jobs[MAPREDUCE-2082] Secure local filesystem IO from symlink vulnerabilities[MAPREDUCE-2096] task-controller shouldn't require o-r permissions[MAPREDUCE-2103] safely handle InterruptedException and interrupted status in MR code[MAPREDUCE-2157] Race condition in LinuxTaskController permissions handling[MAPREDUCE-2178] JT should not try to remove mapred.system.dir during startup[MAPREDUCE-2219] If Localizer can't create task log directory, it should fail on the spot[MAPREDUCE-2234] JobTracker "over-synchronization" makes it hang up in certain cases[MAPREDUCE-2235] LinuxTaskController doesn't properly escape environment variables[MAPREDUCE-2242] Servlets should specify content type[MAPREDUCE-2253] FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below[MAPREDUCE-2256]

fairshare. Permissions race can make getStagingDir fail on local filesystem[MAPREDUCE-2289] TT should fail to start on secure cluster when SecureIO isn't available[MAPREDUCE-2321] Add metrics to the fair scheduler[MAPREDUCE-2323]


memory-related configurations missing from mapred-default.xml[MAPREDUCE-2328] Improve error messages when MR dirs on local FS have bad ownership[MAPREDUCE-2332] mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI[MAPREDUCE-2351] Make the MR changes to reflect the API changes in SecureIO library[MAPREDUCE-2353] A task succeeded even though there were errors on all attempts.[MAPREDUCE-2356] Shouldn't hold lock on rjob while localizing resources.[MAPREDUCE-2364] TaskTracker can't retrieve stdout and stderr from web UI[MAPREDUCE-2366] TaskLogsTruncater does not need to check log ownership when running as Child[MAPREDUCE-2371] TaskLogAppender mechanism shouldn't be set in log4j.properties[MAPREDUCE-2372] When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout[MAPREDUCE-2373] Should not use PrintWriter to write taskjvm.sh[MAPREDUCE-2374] task-controller fails to parse configuration if it doesn't end in \n[MAPREDUCE-2377] Distributed cache sizing configurations are missing from mapred-default.xml[MAPREDUCE-2379]


Version 1.0 Release Notes

General Information

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

MapR GA Version 1.0 Documentation

New in This Release

Rolling Upgrade

The rollingupgrade.sh script upgrades a MapR cluster to a specified version of the MapR software, or to a specific set of MapR packages, either via SSH or node by node. This makes it easy to upgrade a MapR cluster with a minimum of downtime.

32-Bit Client

The MapR Client can now be installed on both 64-bit and 32-bit computers. See MapR Client.

Core File Removal

In the event of a core dump on a node, MapR writes the core file to the /opt/cores directory. If disk space on the node is nearly full, MapR automatically reclaims space by deleting core files. To prevent a specific core file from being deleted, rename the file to start with a period (.). Example:

mv mfs.core.2127.node12 .mfs.core.2127.node12

Resolved Issues

Removing Nodes (Issue 4068)
Upgrading Red Hat (Issue 3984)
HBase Upgrade (Issue 3965)
Volume Dump Restore Failure (Issue 3890)
Sqoop Requires HBase (Issue 3560)
Intermittent Scheduled Mirror Failure (Issue 2949)
NFS Mounting Issue on Ubuntu (Issue 2815)
File Cleanup is Slow

Known Issues

(Issue 4415) Select and Kill Controls in JobTracker UI

The Select and Kill controls in the JobTracker UI appear when the webinterface.private.actions parameter in mapred-site.xml is set to true. In MapR clusters upgraded from the beta version of the software, the parameter must be added manually for the controls to appear.

To enable the Select and Kill controls in the JobTracker UI, copy the following lines from /opt/mapr/hadoop/hadoop-0.20.2/conf.new/mapred-site.xml to /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml:


<!-- JobTracker's Web Interface Configuration -->
<property>
  <name>webinterface.private.actions</name>
  <value>true</value>
  <description>
    If set to true, jobs can be killed from JT's web interface.
    Enable this option if the interfaces are only reachable by
    those who have the right authorization.
  </description>
</property>

(Issue 4307) Snapshot create fails with error EEXIST

The EEXIST error indicates an attempt to create a new snapshot with the same name as an existing snapshot, but can occur in the following cases as well:

If the node with the snapshot's name container fails during snapshot creation, the failed snapshot remains until it is removed by the CLDB after 30 minutes.
If snapshot creation fails after reserving the name, then the name exists but the snapshot does not.
If the response to a successful snapshot is delayed by a network glitch, and the snapshot operation is retried as a result, EEXIST correctly indicates that the snapshot exists although it does not appear to.

In any of the above cases, either retry the snapshot with a different name, or delete the existing (or failed) snapshot and create it again.

(Issue 4269) Bulk Operations

The MapR Control System provides both a checkbox and a Select All link for selecting all alarms, nodes, snapshots, or volumes matching a filter, even if there are too many results to display on a single screen. However, the following operations can only be performed on individually selected results, or results selected using the Select Visible link at the bottom of the MapR Control System screen:

Volumes - Edit Volumes
Volumes - Remove Volumes
Volumes - New Snapshot
Volumes - Unmount
Mirror Volumes - Edit Volumes
Mirror Volumes - Remove Volumes
Mirror Volumes - Unmount
User Disk Usage - Edit
Snapshots - Remove
Snapshots - Preserve
Node Alarms - Change Topology
Nodes - Change Topology
Volume Alarms - Edit
Volume Alarms - Unmount
Volume Alarms - Remove
User/Group Alarms - Edit

In order to perform these operations on a large number of alarms, nodes, snapshots, or volumes, it is necessary to select each screenful of results using Select Visible and perform the operation before selecting the next screenful of results.

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name


Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines

In VM environments like EC2, VMWare, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires reboot to take effect).

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.), the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. Similarly, when creating a dump file from a volume that has been repaired with fsck, use -full true on the volume dump create command.

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start

(Issue 2809) NFS Dependencies

If you are installing the MapR NFS service on a node that cannot connect to the standard apt-get or yum repositories, you should install the following packages by hand:

CentOS: iputils, portmap, glibc-common-2.5-49.el5_5.7

Red Hat: rpcbind, iputils

Ubuntu: nfs-common, iputils-arping
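As a rough sketch of the by-hand installation (package names from the list above; exact versions come from your distribution's repositories):

# CentOS (the notes call out glibc-common-2.5-49.el5_5.7 specifically)
yum install iputils portmap glibc-common
# Red Hat
yum install rpcbind iputils
# Ubuntu
apt-get install nfs-common iputils-arping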


Hadoop Compatibility in Version 1.0

MapR provides the following packages:

Apache Hadoop 0.20.2
flume-0.9.3
hbase-0.90.2
hive-0.7.0
oozie-3.0.0
pig-0.8
sqoop-1.2.0
whirr-0.3.0

Hadoop Common Patches

MapR 1.0 includes the following Apache Hadoop issues that are not included in the Apache Hadoop base version 0.20.2:

[ Make streaming to handle non-utf8 byte arrayHADOOP-1722] IPC server max queue size should be configurable[HADOOP-1849] speculative execution start up condition based on completion time[HADOOP-2141] Space in the value for dfs.data.dir can cause great problems[HADOOP-2366] Use job control for tasks (and therefore for pipes and streaming)[HADOOP-2721] Add HADOOP_LIBRARY_PATH config setting so Hadoop will include external directories for jni[HADOOP-2838] Shuffling fetchers waited too long between map output fetch re-tries[HADOOP-3327] Patch to allow hadoop native to compile on Mac OS X[HADOOP-3659] Providing splitting support for bzip2 compressed files[HADOOP-4012] IsolationRunner does not work as documented[HADOOP-4041] Map and Reduce tasks should run as the user who submitted the job[HADOOP-4490] FileSystem.CACHE should be ref-counted[HADOOP-4655] Add a user to groups mapping service[HADOOP-4656] Current Ganglia metrics implementation is incompatible with Ganglia 3.1[HADOOP-4675] Allow FileSystem shutdown hook to be disabled[HADOOP-4829] Streaming combiner should allow command, not just JavaClass[HADOOP-4842] Implement setuid executable for Linux to assist in launching tasks as job owners[HADOOP-4930] ConcurrentModificationException in JobHistory.java[HADOOP-4933] Set max map/reduce tasks on a per-job basis, either per-node or cluster-wide[HADOOP-5170] Option to prohibit jars unpacking[HADOOP-5175] TT's version build is too restrictive[HADOOP-5203] Queue ACLs should be refreshed without requiring a restart of the job tracker[HADOOP-5396] Provide a way for users to find out what operations they can do on which M/R queues[HADOOP-5419] Support killing of process groups in LinuxTaskController binary[HADOOP-5420] The job history display needs to be paged[HADOOP-5442] Add support for application-specific typecodes to typed bytes[HADOOP-5450] Exposing Hadoop metrics via HTTP[HADOOP-5469] calling new SequenceFile.Reader(...) leaves an InputStream open, if the given sequence file is broken[HADOOP-5476] HADOOP-2721 doesn't clean up descendant processes of a jvm that exits cleanly after running a task successfully[HADOOP-5488] Binary partitioner[HADOOP-5528] Hadoop Vaidya throws number format exception due to changes in the job history counters string format (escaped compact[HADOOP-5582]

representation). Hadoop Streaming - GzipCodec[HADOOP-5592] change S3Exception to checked exception[HADOOP-5613] Ability to blacklist tasktracker[HADOOP-5643] Counter for S3N Read Bytes does not work[HADOOP-5656] DistCp should not launch a job if it is not necessary[HADOOP-5675] Add map/reduce slot capacity and lost map/reduce slot capacity to JobTracker metrics[HADOOP-5733] UGI checks in testcases are broken[HADOOP-5737] Split waiting tasks field in JobTracker metrics to individual tasks[HADOOP-5738] Allow setting the default value of maxRunningJobs for all pools[HADOOP-5745] The length of the heartbeat cycle should be configurable.[HADOOP-5784] JobTracker should refresh the hosts list upon recovery[HADOOP-5801] problem using top level s3 buckets as input/output directories[HADOOP-5805] s3n files are not getting split by default[HADOOP-5861] GzipCodec should read compression level etc from configuration[HADOOP-5879] Allow administrators to be able to start and stop queues[HADOOP-5913] Use JDK 1.6 File APIs in DF.java wherever possible[HADOOP-5958] create script to provide classpath for external tools[HADOOP-5976] LD_LIBRARY_PATH not passed to tasks spawned off by LinuxTaskController[HADOOP-5980] HADOOP-2838 doesnt work as expected[HADOOP-5981] RPC client opens an extra connection for VersionedProtocol[HADOOP-6132] ReflectionUtils performance regression[HADOOP-6133] Implement a pure Java CRC32 calculator[HADOOP-6148] Add get/setEnum to Configuration[HADOOP-6161] Improve PureJavaCrc32[HADOOP-6166]


Provide a configuration dump in json format.[HADOOP-6184] Configuration does not lock parameters marked final if they have no value.[HADOOP-6227] Permission configuration files should use octal and symbolic[HADOOP-6234] s3n fails with SocketTimeoutException[HADOOP-6254] Missing synchronization for defaultResources in Configuration.addResource[HADOOP-6269] Add JVM memory usage to JvmMetrics[HADOOP-6279] Any hadoop commands crashing jvm (SIGBUS) when /tmp (tmpfs) is full[HADOOP-6284] Use JAAS LoginContext for our login[HADOOP-6299] Configuration sends too much data to log4j[HADOOP-6312] Update FilterInitializer class to be more visible and take a conf for further development[HADOOP-6337] Stack trace of any runtime exceptions should be recorded in the server logs.[HADOOP-6343] Log errors getting Unix UGI[HADOOP-6400] Add a /conf servlet to dump running configuration[HADOOP-6408] Change RPC layer to support SASL based mutual authentication[HADOOP-6419] Add AsyncDiskService that is used in both hdfs and mapreduce[HADOOP-6433] Prevent remote CSS attacks in Hostname and UTF-7.[HADOOP-6441] Hadoop wrapper script shouldn't ignore an existing JAVA_LIBRARY_PATH[HADOOP-6453] StringBuffer -> StringBuilder - conversion of references as necessary[HADOOP-6471] HttpServer sends wrong content-type for CSS files (and others)[HADOOP-6496] doAs for proxy user[HADOOP-6510] FsPermission:SetUMask not updated to use new-style umask setting.[HADOOP-6521] LocalDirAllocator should use whitespace trimming configuration getters[HADOOP-6534] Allow authentication-enabled RPC clients to connect to authentication-disabled RPC servers[HADOOP-6543] archive does not work with distcp -update[HADOOP-6558] Authorization for default servlets[HADOOP-6568] FsShell#cat should avoid calling unecessary getFileStatus before opening a file to read[HADOOP-6569] RPC responses may be out-of-order with respect to SASL[HADOOP-6572] IPC server response buffer reset threshold should be configurable[HADOOP-6577] Configuration should trim whitespace around a lot of value types[HADOOP-6578] Split RPC metrics into summary and detailed metrics[HADOOP-6599] Deadlock in DFSClient#getBlockLocations even with the security disabled[HADOOP-6609] RPC server should check for version mismatch first[HADOOP-6613] "Bad Connection to FS" message in FSShell should print message from the exception[HADOOP-6627] FileUtil.fullyDelete() should continue to delete other files despite failure at any level.[HADOOP-6631] AccessControlList uses full-principal names to verify acls causing queue-acls to fail[HADOOP-6634] Benchmark overhead of RPC session establishment[HADOOP-6637] FileSystem.get() does RPC retries within a static synchronized block[HADOOP-6640] util.Shell getGROUPS_FOR_USER_COMMAND method name - should use common naming convention[HADOOP-6644] login object in UGI should be inside the subject[HADOOP-6649] ShellBasedUnixGroupsMapping shouldn't have a cache[HADOOP-6652] NullPointerException in setupSaslConnection when browsing directories[HADOOP-6653] BlockDecompressorStream get EOF exception when decompressing the file compressed from empty file[HADOOP-6663] RPC.waitForProxy should retry through NoRouteToHostException[HADOOP-6667] zlib.compress.level ignored for DefaultCodec initialization[HADOOP-6669] UserGroupInformation doesn't support use in hash tables[HADOOP-6670] Performance Improvement in Secure RPC[HADOOP-6674] user object in the subject in UGI should be reused in case of a relogin.[HADOOP-6687] Incorrect exit codes for "dfs -chown", "dfs 
-chgrp"[HADOOP-6701] Relogin behavior for RPC clients could be improved[HADOOP-6706] Symbolic umask for file creation is not consistent with posix[HADOOP-6710] FsShell 'hadoop fs -text' does not support compression codecs[HADOOP-6714] Client does not close connection when an exception happens during SASL negotiation[HADOOP-6718] NetUtils.connect should check that it hasn't connected a socket to itself[HADOOP-6722] unchecked exceptions thrown in IPC Connection orphan clients[HADOOP-6723] IPC doesn't properly handle IOEs thrown by socket factory[HADOOP-6724] adding some java doc to Server.RpcMetrics, UGI[HADOOP-6745] NullPointerException for hadoop clients launched from streaming tasks[HADOOP-6757] WebServer shouldn't increase port number in case of negative port setting caused by Jetty's race[HADOOP-6760] exception while doing RPC I/O closes channel[HADOOP-6762] UserGroupInformation.createProxyUser's javadoc is broken[HADOOP-6776] Add a new newInstance method in FileSystem that takes a "user" as argument[HADOOP-6813] refreshSuperUserGroupsConfiguration should use server side configuration for the refresh[HADOOP-6815] Provide a JNI-based implementation of GroupMappingServiceProvider[HADOOP-6818] Provide a web server plugin that uses a static user for the web UI[HADOOP-6832] IPC leaks call parameters when exceptions thrown[HADOOP-6833] Introduce additional statistics to FileSystem[HADOOP-6859] Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of[HADOOP-6864]

GroupMappingServiceProvider) The efficient comparators aren't always used except for BytesWritable and Text[HADOOP-6881] RawLocalFileSystem#setWorkingDir() does not work for relative names[HADOOP-6899] Rpc client doesn't use the per-connection conf to figure out server's Kerberos principal[HADOOP-6907] BZip2Codec incorrectly implements read()[HADOOP-6925] Fix BooleanWritable comparator in 0.20[HADOOP-6928] The GroupMappingServiceProvider interface should be public[HADOOP-6943] Suggest that HADOOP_CLASSPATH should be preserved in hadoop-env.sh.template[HADOOP-6950]


Allow wildcards to be used in ProxyUsers configurations[HADOOP-6995] Configuration.writeXML should not hold lock while outputting[HADOOP-7082] UserGroupInformation.getCurrentUser() fails when called from non-Hadoop JAAS context[HADOOP-7101] Remove unnecessary DNS reverse lookups from RPC layer[HADOOP-7104] Implement chmod with JNI[HADOOP-7110] FsShell should dump all exceptions at DEBUG level[HADOOP-7114] Add a cache for getpwuid_r and getpwgid_r calls[HADOOP-7115] NPE in Configuration.writeXml[HADOOP-7118] Timed out shell commands leak Timer threads[HADOOP-7122] getpwuid_r is not thread-safe on RHEL6[HADOOP-7156] SecureIO should not check owner on non-secure clusters that have no native support[HADOOP-7172] Remove unused fstat() call from NativeIO[HADOOP-7173] WritableComparator.get should not cache comparator objects[HADOOP-7183] Remove deprecated local.cache.size from core-default.xml[HADOOP-7184]

MapReduce Patches

MapR 1.0 includes the following Apache MapReduce issues that are not included in the Apache Hadoop base version 0.20.2:

Reduce Input Records and Reduce Output Records counters are not being set when using the new Mapreduce reducer API [MAPREDUCE-112]
Job.getJobID() will always return null [MAPREDUCE-118]
TaskMemoryManager should log process-tree's status while killing tasks [MAPREDUCE-144]
Secure job submission [MAPREDUCE-181]
Provide a node health check script and run it periodically to check the node health status [MAPREDUCE-211]
Collecting cpu and memory usage for MapReduce tasks [MAPREDUCE-220]
TaskTracker could send an out-of-band heartbeat when the last running map/reduce completes [MAPREDUCE-270]
Job history counters should be available on the UI [MAPREDUCE-277]
JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to fail [MAPREDUCE-339]
Change org.apache.hadoop.examples.MultiFileWordCount to use new mapreduce api [MAPREDUCE-364]
Change org.apache.hadoop.mapred.lib.MultipleInputs to use new api [MAPREDUCE-369]
Change org.apache.hadoop.mapred.lib.MultipleOutputs to use new api [MAPREDUCE-370]
JobControl Job always has an unassigned name [MAPREDUCE-415]
Move the completed jobs' history files to a DONE subdirectory inside the configured history directory [MAPREDUCE-416]
Enable ServicePlugins for the JobTracker [MAPREDUCE-461]
The job setup and cleanup tasks should be optional [MAPREDUCE-463]
Collect information about number of tasks succeeded / total per time unit for a tasktracker [MAPREDUCE-467]
extend DistributedCache to work locally (LocalJobRunner) [MAPREDUCE-476]
separate jvm param for mapper and reducer [MAPREDUCE-478]
Fix the 'cluster drain' problem in the Capacity Scheduler wrt High RAM Jobs [MAPREDUCE-516]
The capacity-scheduler should assign multiple tasks per heartbeat [MAPREDUCE-517]
After JobTracker restart Capacity Scheduler does not schedule pending tasks from already running tasks [MAPREDUCE-521]
Allow admins of the Capacity Scheduler to set a hard-limit on the capacity of a queue [MAPREDUCE-532]
Add preemption to the fair scheduler [MAPREDUCE-551]
If #link is missing from uri format of -cacheArchive then streaming does not throw error [MAPREDUCE-572]
Change KeyValueLineRecordReader and KeyValueTextInputFormat to use new api [MAPREDUCE-655]
Existing diagnostic rules fail for MAP ONLY jobs [MAPREDUCE-676]
XML-based metrics as JSP servlet for JobTracker [MAPREDUCE-679]
Reuse of Writable objects is improperly handled by MRUnit [MAPREDUCE-680]
Reserved tasktrackers should be removed when a node is globally blacklisted [MAPREDUCE-682]
Conf files not moved to "done" subdirectory after JT restart [MAPREDUCE-693]
Per-pool task limits for the fair scheduler [MAPREDUCE-698]
Support for FIFO pools in the fair scheduler [MAPREDUCE-706]
Provide a jobconf property for explicitly assigning a job to a pool [MAPREDUCE-707]
node health check script does not display the correct message on timeout [MAPREDUCE-709]
JobConf.findContainingJar unescapes unnecessarily on Linux [MAPREDUCE-714]
org.apache.hadoop.mapred.lib.db.DBInputformat not working with oracle [MAPREDUCE-716]
More slots are getting reserved for HiRAM job tasks than required [MAPREDUCE-722]
node health check script should not log "UNHEALTHY" status for every heartbeat in INFO mode [MAPREDUCE-732]
java.util.ConcurrentModificationException observed in unreserving slots for HiRam Jobs [MAPREDUCE-734]
Allow relative paths to be created inside archives [MAPREDUCE-739]
Provide summary information per job once a job is finished [MAPREDUCE-740]
Support in DistributedCache to share cache files with other users after HADOOP-4493 [MAPREDUCE-744]
NPE in expiry thread when a TT is lost [MAPREDUCE-754]
TypedBytesInput's readRaw() does not preserve custom type codes [MAPREDUCE-764]
Configuration information should generate dump in a standard format [MAPREDUCE-768]
Setup and cleanup tasks remain in UNASSIGNED state for a long time on tasktrackers with long running high RAM tasks [MAPREDUCE-771]
Use PureJavaCrc32 in mapreduce spills [MAPREDUCE-782]
-files, -archives should honor user given symlink path [MAPREDUCE-787]
Job summary logs show status of completed jobs as RUNNING [MAPREDUCE-809]
Move completed Job history files to HDFS [MAPREDUCE-814]
Add a cache for retired jobs with minimal job info and provide a way to access history file url [MAPREDUCE-817]
JobClient completion poll interval of 5s causes slow tests in local mode [MAPREDUCE-825]
DBInputFormat leaves open transaction [MAPREDUCE-840]


Per-job local data on the TaskTracker node should have right access-control [MAPREDUCE-842]
Localized files from DistributedCache should have right access-control [MAPREDUCE-856]
Job/Task local files have incorrect group ownership set by LinuxTaskController binary [MAPREDUCE-871]
Make DBRecordReader execute queries lazily [MAPREDUCE-875]
More efficient SQL queries for DBInputFormat [MAPREDUCE-885]
After HADOOP-4491, the user who started mapred system is not able to run job [MAPREDUCE-890]
Users can set non-writable permissions on temporary files for TT and can abuse disk usage [MAPREDUCE-896]
When using LinuxTaskController, localized files may become accessible to unintended users if permissions are misconfigured [MAPREDUCE-899]
Cleanup of task-logs should happen in TaskTracker instead of the Child [MAPREDUCE-927]
OutputCommitter should have an abortJob method [MAPREDUCE-947]
Inaccurate values in jobSummary logs [MAPREDUCE-964]
TaskTracker does not need to fully unjar job jars [MAPREDUCE-967]
NPE in distcp encountered when placing _logs directory on S3FileSystem [MAPREDUCE-968]
distcp does not always remove distcp.tmp.dir [MAPREDUCE-971]
Cleanup tasks are scheduled using high memory configuration, leaving tasks in unassigned state [MAPREDUCE-1028]
Reduce tasks are getting starved in capacity scheduler [MAPREDUCE-1030]
Show total slot usage in cluster summary on jobtracker webui [MAPREDUCE-1048]
distcp can generate uneven map task assignments [MAPREDUCE-1059]
Use the user-to-groups mapping service in the JobTracker [MAPREDUCE-1083]
For tasks, "ulimit -v -1" is being run when user doesn't specify mapred.child.ulimit [MAPREDUCE-1085]
hadoop commands in streaming tasks are trying to write to tasktracker's log [MAPREDUCE-1086]
JobHistory files should have narrower 0600 perms [MAPREDUCE-1088]
Fair Scheduler preemption triggers NPE when tasks are scheduled but not running [MAPREDUCE-1089]
Modify log statement in Tasktracker log related to memory monitoring to include attempt id [MAPREDUCE-1090]
Incorrect synchronization in DistributedCache causes TaskTrackers to freeze up during localization of Cache for tasks [MAPREDUCE-1098]
User's task-logs filling up local disks on the TaskTrackers [MAPREDUCE-1100]
Additional JobTracker metrics [MAPREDUCE-1103]
CapacityScheduler: It should be possible to set queue hard-limit beyond its actual capacity [MAPREDUCE-1105]
Capacity Scheduler scheduling information is hard to read / should be tabular format [MAPREDUCE-1118]
Using profilers other than hprof can cause JobClient to report job failure [MAPREDUCE-1131]
Per cache-file refcount can become negative when tasks release distributed-cache files [MAPREDUCE-1140]
runningMapTasks counter is not properly decremented in case of failed Tasks [MAPREDUCE-1143]
Streaming tests swallow exceptions [MAPREDUCE-1155]
running_maps is not decremented when the tasks of a job is killed/failed [MAPREDUCE-1158]
Two log statements at INFO level fill up jobtracker logs [MAPREDUCE-1160]
Lots of fetch failures [MAPREDUCE-1171]
MultipleInputs fails with ClassCastException [MAPREDUCE-1178]
URL to JT webconsole for running job and job history should be the same [MAPREDUCE-1185]
While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir [MAPREDUCE-1186]
MAPREDUCE-947 incompatibly changed FileOutputCommitter [MAPREDUCE-1196]
Alternatively schedule different types of tasks in fair share scheduler [MAPREDUCE-1198]
TaskTrackers restart is very slow because it deletes distributed cache directory synchronously [MAPREDUCE-1213]
JobTracker Metrics causes undue load on JobTracker [MAPREDUCE-1219]
Kill tasks on a node if the free physical memory on that machine falls below a configured threshold [MAPREDUCE-1221]
Distcp is very slow [MAPREDUCE-1231]
Refactor job token to use a common token interface [MAPREDUCE-1250]
Fair scheduler event log not logging job info [MAPREDUCE-1258]
DistCp cannot handle -delete if destination is local filesystem [MAPREDUCE-1285]
DistributedCache localizes only once per cache URI [MAPREDUCE-1288]
AutoInputFormat doesn't work with non-default FileSystems [MAPREDUCE-1293]
TrackerDistributedCacheManager can delete file asynchronously [MAPREDUCE-1302]
Add counters for task time spent in GC [MAPREDUCE-1304]
Introduce the concept of Job Permissions [MAPREDUCE-1307]
NPE in FieldFormatter if escape character is set and field is null [MAPREDUCE-1313]
JobTracker holds stale references to retired jobs via unreported tasks [MAPREDUCE-1316]
Potential JT deadlock in faulty TT tracking [MAPREDUCE-1342]
Incremental enhancements to the JobTracker for better scalability [MAPREDUCE-1354]
ConcurrentModificationException in JobInProgress [MAPREDUCE-1372]
Args in job details links on jobhistory.jsp are not URL encoded [MAPREDUCE-1378]
MRAsyncDiscService should tolerate missing local.dir [MAPREDUCE-1382]
NullPointerException observed during task failures [MAPREDUCE-1397]
TaskLauncher remains stuck on tasks waiting for free nodes even if task is killed [MAPREDUCE-1398]
The archive command shows a null error message [MAPREDUCE-1399]
Save file-sizes of each of the artifacts in DistributedCache in the JobConf [MAPREDUCE-1403]
LinuxTaskController tests failing on trunk after the commit of MAPREDUCE-1385 [MAPREDUCE-1421]
Changing permissions of files/dirs under job-work-dir may be needed so that cleaning up of job-dir in all mapred-local-directories succeeds always [MAPREDUCE-1422]
Improve performance of CombineFileInputFormat when multiple pools are configured [MAPREDUCE-1423]
archive throws OutOfMemoryError [MAPREDUCE-1425]
symlinks in cwd of the task are not handled properly after MAPREDUCE-896 [MAPREDUCE-1435]
Deadlock in preemption code in fair scheduler [MAPREDUCE-1436]
MapReduce should use the short form of the user names [MAPREDUCE-1440]
Configuration of directory lists should trim whitespace [MAPREDUCE-1441]
StackOverflowError when JobHistory parses a really long line [MAPREDUCE-1442]


DBInputFormat can leak connections [MAPREDUCE-1443]
The servlets should quote server generated strings sent in the response [MAPREDUCE-1454]
Authorization for servlets [MAPREDUCE-1455]
For secure job execution, couple of more UserGroupInformation.doAs needs to be added [MAPREDUCE-1457]
In JobTokenIdentifier change method getUsername to getUser which returns UGI [MAPREDUCE-1464]
FileInputFormat should save #input-files in JobConf [MAPREDUCE-1466]
committer.needsTaskCommit should not be called for a task cleanup attempt [MAPREDUCE-1476]
CombineFileRecordReader does not properly initialize child RecordReader [MAPREDUCE-1480]
Authorization for job-history pages [MAPREDUCE-1493]
Push HADOOP-6551 into MapReduce [MAPREDUCE-1503]
Cluster class should create the rpc client only when needed [MAPREDUCE-1505]
Protection against incorrectly configured reduces [MAPREDUCE-1521]
FileInputFormat may change the file system of an input path [MAPREDUCE-1522]
Cache the job related information while submitting the job, this would avoid many RPC calls to JobTracker [MAPREDUCE-1526]
Reduce or remove usage of String.format() usage in CapacityTaskScheduler.updateQSIObjects and Counters.makeEscapedString() [MAPREDUCE-1533]
TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit [MAPREDUCE-1538]
Log messages of JobACLsManager should use security logging of HADOOP-6586 [MAPREDUCE-1543]
Add 'first-task-launched' to job-summary [MAPREDUCE-1545]
UGI.doAs should not be used for getting the history file of jobs [MAPREDUCE-1550]
Task diagnostic info would get missed sometimes [MAPREDUCE-1563]
Shuffle stage - Key and Group Comparators [MAPREDUCE-1570]
Task controller may not set permissions for a task cleanup attempt's log directory [MAPREDUCE-1607]
TaskTracker.localizeJob should not set permissions on job log directory recursively [MAPREDUCE-1609]
Refresh nodes and refresh queues doesn't work with service authorization enabled [MAPREDUCE-1611]
job conf file is not accessible from job history web page [MAPREDUCE-1612]
Streaming's TextOutputReader.getLastOutput throws NPE if it has never read any output [MAPREDUCE-1621]
ResourceEstimator does not work after MAPREDUCE-842 [MAPREDUCE-1635]
Job submission should fail if same uri is added for mapred.cache.files and mapred.cache.archives [MAPREDUCE-1641]
JobStory should provide queue info [MAPREDUCE-1656]
After task logs directory is deleted, tasklog servlet displays wrong error message about job ACLs [MAPREDUCE-1657]
Job Acls affect Queue Acls [MAPREDUCE-1664]
Add a metrics to track the number of heartbeats processed [MAPREDUCE-1680]
Tasks should not be scheduled after tip is killed/failed [MAPREDUCE-1682]
Remove JNI calls from ClusterStatus cstr [MAPREDUCE-1683]
JobHistory shouldn't be disabled for any reason [MAPREDUCE-1699]
TaskRunner can get NPE in getting ugi from TaskTracker [MAPREDUCE-1707]
Truncate logs of finished tasks to prevent node thrash due to excessive logging [MAPREDUCE-1716]
Authentication between pipes processes and java counterparts [MAPREDUCE-1733]
Un-deprecate the old MapReduce API in the 0.20 branch [MAPREDUCE-1734]
DistributedCache creates its own FileSytem instance when adding a file/archive to the path [MAPREDUCE-1744]
Replace mapred.persmissions.supergroup with an acl : mapreduce.cluster.administrators [MAPREDUCE-1754]
Exception message for unauthorized user doing killJob, killTask, setJobPriority needs to be improved [MAPREDUCE-1759]
CompletedJobStatusStore initialization should fail if mapred.job.tracker.persist.jobstatus.dir is unwritable [MAPREDUCE-1778]
IFile should check for null compressor [MAPREDUCE-1784]
Add streaming config option for not emitting the key [MAPREDUCE-1785]
Support for file sizes less than 1MB in DFSIO benchmark [MAPREDUCE-1832]
FairScheduler.tasksToPeempt() can return negative number [MAPREDUCE-1845]
Include job submit host information (name and ip) in jobconf and jobdetails display [MAPREDUCE-1850]
MultipleOutputs does not cache TaskAttemptContext [MAPREDUCE-1853]
Add read timeout on userlog pull [MAPREDUCE-1868]
Re-think (user|queue) limits on (tasks|jobs) in the CapacityScheduler [MAPREDUCE-1872]
MRAsyncDiskService does not properly absolutize volume root paths [MAPREDUCE-1887]
MapReduce daemons should close FileSystems that are not needed anymore [MAPREDUCE-1900]
TrackerDistributedCacheManager never cleans its input directories [MAPREDUCE-1914]
Ability for having user's classes take precedence over the system classes for tasks' classpath [MAPREDUCE-1938]
Limit the size of jobconf [MAPREDUCE-1960]
ConcurrentModificationException when shutting down Gridmix [MAPREDUCE-1961]
java.lang.ArrayIndexOutOfBoundsException in analysejobhistory.jsp of jobs with 0 maps [MAPREDUCE-1985]
TestDFSIO read test may not read specified bytes [MAPREDUCE-2023]
Race condition in writing the jobtoken password file when launching pipes jobs [MAPREDUCE-2082]
Secure local filesystem IO from symlink vulnerabilities [MAPREDUCE-2096]
task-controller shouldn't require o-r permissions [MAPREDUCE-2103]
safely handle InterruptedException and interrupted status in MR code [MAPREDUCE-2157]
Race condition in LinuxTaskController permissions handling [MAPREDUCE-2178]
JT should not try to remove mapred.system.dir during startup [MAPREDUCE-2219]
If Localizer can't create task log directory, it should fail on the spot [MAPREDUCE-2234]
JobTracker "over-synchronization" makes it hang up in certain cases [MAPREDUCE-2235]
LinuxTaskController doesn't properly escape environment variables [MAPREDUCE-2242]
Servlets should specify content type [MAPREDUCE-2253]
FairScheduler fairshare preemption from multiple pools may preempt all tasks from one pool causing that pool to go below fairshare [MAPREDUCE-2256]
Permissions race can make getStagingDir fail on local filesystem [MAPREDUCE-2289]
TT should fail to start on secure cluster when SecureIO isn't available [MAPREDUCE-2321]
Add metrics to the fair scheduler [MAPREDUCE-2323]


memory-related configurations missing from mapred-default.xml [MAPREDUCE-2328]
Improve error messages when MR dirs on local FS have bad ownership [MAPREDUCE-2332]
mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI [MAPREDUCE-2351]
Make the MR changes to reflect the API changes in SecureIO library [MAPREDUCE-2353]
A task succeeded even though there were errors on all attempts [MAPREDUCE-2356]
Shouldn't hold lock on rjob while localizing resources [MAPREDUCE-2364]
TaskTracker can't retrieve stdout and stderr from web UI [MAPREDUCE-2366]
TaskLogsTruncater does not need to check log ownership when running as Child [MAPREDUCE-2371]
TaskLogAppender mechanism shouldn't be set in log4j.properties [MAPREDUCE-2372]
When tasks exit with a nonzero exit status, task runner should log the stderr as well as stdout [MAPREDUCE-2373]
Should not use PrintWriter to write taskjvm.sh [MAPREDUCE-2374]
task-controller fails to parse configuration if it doesn't end in \n [MAPREDUCE-2377]
Distributed cache sizing configurations are missing from mapred-default.xml [MAPREDUCE-2379]


Beta Release Notes

General Information

New in This Release

Services Down Alarm Removed

The Services Down Alarm (NODE_ALARM_MISC_DOWN) has been removed.

Hoststats Service Down Alarm Added

The Hoststats Service Down Alarm (NODE_ALARM_SERVICE_HOSTSTATS_DOWN) has been added. This alarm indicates that the Hoststats service on the indicated node is not running.

Installation Directory Full Alarm Added

The Installation Directory Full Alarm (NODE_ALARM_OPT_MAPR_FULL) has been added. This alarm indicates that the /opt/mapr directory on the indicated node is approaching capacity.

Root Partition Full Alarm Added

The Root Partition Full Alarm (NODE_ALARM_ROOT_PARTITION_FULL) has been added. This alarm indicates that the / directory on the indicated node is approaching capacity.

Cores Present Alarm Added

The Cores Present Alarm (NODE_ALARM_CORE_PRESENT) has been added. This alarm indicates that a service on the indicated node has crashed, leaving a core dump file.

Global fsck scan

Global fsck automatically scans the entire MapR cluster for errors. If an error is found, contact MapR Support for assistance.

Volume Mirrors

A volume mirror is a full read-only copy of a volume that can be synced on a schedule to provide point-in-time recovery for critical data, or for higher-performance read concurrency. Creating a mirror requires the mir permission. See Managing Volumes.

Resolved Issues

(Issue 3724) Default Settings Must Be Changed
(Issue 3620) Can't Run MapReduce Jobs as Non-Root User
(Issue 2434) Mirroring Disabled in Alpha
(Issue 2282) fsck Not Present in Alpha

Known Issues

Removing Nodes

The MapR Beta release may experience problems when nodes are removed from the cluster. The problems are likely to be seen as inconsistencies in the GUI and can be corrected by stopping and restarting the CLDB process. This behavior will be corrected in the GA release.

(Issue 4068) Upgrading Red Hat

When upgrading MapR packages on nodes that run Red Hat, you should only upgrade packages if they appear on the following list:

mapr-core
mapr-flume-internal
mapr-hbase-internal
mapr-hive-internal
mapr-oozie-internal
mapr-pig-internal
mapr-sqoop-internal



mapr-zk-internal

Other installed packages should not be upgraded. If you accidentally upgrade other packages, you can restore the node to proper operation by forcing a reinstall of the latest versions of the packages using the following steps:

1. Log in as root (or use sudo for the following steps).
2. Stop the warden: /etc/init.d/mapr-warden stop
3. If ZooKeeper is installed and running, stop it: /etc/init.d/mapr-zookeeper stop
4. Force reinstall of the packages by running yum reinstall with a list of packages to be installed. Example: yum reinstall mapr-core mapr-zk-internal
5. If ZooKeeper is installed on the node, start it: /etc/init.d/mapr-zookeeper start
6. Start the warden: /etc/init.d/mapr-warden start
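The same sequence, collected into a single session for a node where only mapr-core and mapr-zk-internal were affected (a sketch; substitute the actual list of accidentally upgraded packages):

# Run as root, or prefix each command with sudo
/etc/init.d/mapr-warden stop
/etc/init.d/mapr-zookeeper stop      # only if ZooKeeper is installed and running
yum reinstall mapr-core mapr-zk-internal
/etc/init.d/mapr-zookeeper start     # only if ZooKeeper is installed on the node
/etc/init.d/mapr-warden start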

(Issue 4037) Starting Newly Added Services

After you install new services on a node, you can start them in two ways:

Use the MapR Control System, the API, or the command-line interface to start the services individually
Restart the warden to stop and start all services on the node

If you start the services individually, the node's memory will not be reconfigured to account for the newly installed services. This can cause memory paging, slowing or stopping the node. However, stopping and restarting the warden can take the node out of service.

For best results, choose a time when the cluster is not very busy if you need to install additional services on a node. If that is not possible, make sure to restart the warden as soon as it is practical to do so after installing new services.
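As a sketch of the first option, a newly installed TaskTracker could be started individually from the command line (the maprcli node services invocation and the hostname node01 are illustrative assumptions; note the memory caveat above):

# Start only the TaskTracker on node01
maprcli node services -tasktracker start -nodes node01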

(Issue 4024) Hadoop Copy Commands Do Not Handle Broken Symbolic Links

The hadoop fs -copyToLocal and hadoop fs -copyFromLocal commands attempt to resolve symbolic links in the source data set, to create physical copies of the files referred to by the links. If a broken symbolic link is encountered by either command, the copy operation fails at that point.

(Issue 4018)(HDFS-1768) fs -put crash that depends on source file name

Copying a file using the hadoop fs command generates a warning or exception if a corresponding .*.crc checksum file exists. If this error occurs, delete all local checksum files and try again. See http://www.mail-archive.com/[email protected]/msg15824.html
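A sketch of the cleanup, assuming the local source directory is /tmp/data (the path is illustrative):

# Remove hidden .crc checksum files left behind by an earlier copy
find /tmp/data -name '.*.crc' -type f -delete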

(Issue 3965) Volume Dump Restore Failure

The volume dump restore command can fail with error 22 (EINVAL) if nodes containing the volume dump are restarted during the restore operation. To fix the problem, run the command again after the nodes have restarted.
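A sketch of the retry, assuming a volume named vol1 and a dump file at /mapr/dumps/vol1.dump (both names illustrative):

# Re-run the restore after the nodes are back up
maprcli volume dump restore -name vol1 -dumpfile /mapr/dumps/vol1.dump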

(Issue 3984) HBase Upgrade

If you are using HBase and upgrading during the MapR beta, please contact MapR Support for assistance.

(Issue 3890) Sqoop Requires HBase

The Sqoop package requires HBase, but the package dependency is not set. If you install Sqoop, you must also explicitly install HBase.

(Issue 3817) Increasing File Handle Limits Requires Restarting PAM Session Management

If you're upgrading from the Apache distribution of Hadoop on Ubuntu 10.x, it is not sufficient to modify /etc/security/limits.conf to increase the file handle limits for all the new users. You must also modify your PAM configuration, by adding the following line to /etc/pam.d/common-session and then restarting the services:

session required pam_limits.so
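Taken together, the two changes might look like the following sketch (the nofile values are illustrative; pick limits appropriate for your workload):

# /etc/security/limits.conf -- raise the file handle limits
*    soft    nofile    64000
*    hard    nofile    64000

# /etc/pam.d/common-session -- make PAM apply limits.conf to new sessions
session required pam_limits.so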

(Issue 3560) Intermittent Scheduled Mirror Failure

Under certain conditions, a scheduled mirror ends prematurely. To work around the issue, re-start mirroring manually. This issue will be corrected in a post-beta code release.

(Issue 3524) Apache Port 80 Open

The MapR UI runs on Apache. By default, installation does not close port 80 (even though the MapR Control System is available over HTTPS on port 8443). If this would present a security risk to your datacenter, you should close port 80 manually on any nodes running the MapR Control System.
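One way to close the port, sketched with iptables (adapt the rule to your site's firewall conventions):

# Block inbound connections to the Apache listener on port 80
iptables -A INPUT -p tcp --dport 80 -j DROP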

(Issue 3488) Ubuntu IRQ Balancer Issue on Virtual Machines



In VM environments like EC2, VMware, and Xen, when running Ubuntu 10.10, problems can occur due to an Ubuntu bug unless the IRQ balancer is turned off. On all nodes, edit the file /etc/default/irqbalance and set ENABLED=0 to turn off the IRQ balancer (requires a reboot to take effect).
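A sketch of the change on one node (the sed invocation is illustrative; editing the file by hand works equally well):

# Turn off the IRQ balancer, then reboot for the change to take effect
sed -i 's/^ENABLED=.*/ENABLED=0/' /etc/default/irqbalance
reboot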

(Issue 3244) Volume Mirror Issue

If a volume dump restore command is interrupted before completion (killed by the user, node fails, etc.), then the volume remains in the "Mirroring in Progress" state. Before retrying the volume dump restore operation, you must issue the volume mirror stop command explicitly.
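A sketch of the recovery, assuming the stuck volume is named mirror1 and the dump file is at /mapr/dumps/mirror1.dump (names illustrative):

# Clear the stuck "Mirroring in Progress" state, then retry the restore
maprcli volume mirror stop -name mirror1
maprcli volume dump restore -name mirror1 -dumpfile /mapr/dumps/mirror1.dump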

(Issue 3122) Mirroring with fsck-repaired volume

If a source or mirror volume is repaired with fsck, then the source and mirror volumes can go out of sync. It is necessary to perform a full mirror operation with volume mirror start -full true to bring them back in sync. If a mirror operation is not feasible (due to bandwidth constraints, for example), then you should restore the mirror volume from a full dump file. When creating a dump file from a volume that has been repaired with fsck, use the volume dump create command without specifying -s to create a full volume dump.
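Sketches of both options, assuming a mirror volume mirror1 and its fsck-repaired source srcvol (names illustrative):

# Option 1: full mirror operation to bring source and mirror back in sync
maprcli volume mirror start -name mirror1 -full true

# Option 2: full volume dump of the repaired volume (omit -s so the dump is full)
maprcli volume dump create -name srcvol -dumpfile /mapr/dumps/srcvol.dump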

(Issue 3028) Changing the Time on a ZooKeeper Node

To avoid cluster downtime, use the following steps to set the time on any node running ZooKeeper:

1. Use the MapR Dashboard to check that all configured ZooKeeper services on the cluster are running. Start any non-running ZooKeeper instances.
2. Stop ZooKeeper on the node: /etc/init.d/mapr-zookeeper stop
3. Change the time on the node or sync the time to NTP.
4. Start ZooKeeper on the node: /etc/init.d/mapr-zookeeper start
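A sketch of step 3 using NTP (the server name is illustrative; setting the clock manually also works):

# Sync the clock while this node's ZooKeeper is stopped
ntpdate pool.ntp.org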

(Issue 2949) NFS Mounting Issue on Ubuntu

When mounting a cluster via NFS, you must include the vers=3 option, which specifies NFS protocol version 3. If no version is specified, NFS uses the highest version supported by the kernel and the mount command, which in most cases is version 4. Version 4 is not yet supported by MapR-FS NFS.
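A sketch of a conforming mount command, assuming an NFS node named nfsnode1 and a local mount point /mapr (both illustrative):

# vers=3 forces NFSv3; MapR-FS NFS does not support version 4
mount -o vers=3 nfsnode1:/mapr /mapr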

(Issue 2815) File Cleanup is Slow

After a MapReduce job is completed, cleanup of files and directories associated with the tasks can take a long time and tie up the TaskTracker node. If this happens on multiple nodes, it can cause a temporary cluster outage. If this happens, check the JobTracker View and make sure all TaskTrackers are back online before submitting additional jobs.

(Issue 2809) NFS Dependencies

If you are installing the MapR NFS service on a node that cannot connect to the standard apt-get or yum repositories, you should install the following packages by hand:

CentOS: iputils, portmap, glibc-common-2.5-49.el5_5.7
Red Hat: rpcbind, iputils
Ubuntu: nfs-common, iputils-arping
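Since these nodes cannot reach the standard repositories, the packages must be copied over and installed directly; a sketch (file names are illustrative):

# CentOS / Red Hat: copy the .rpm files to the node, then
rpm -ivh iputils-*.rpm portmap-*.rpm
# Ubuntu: copy the .deb files to the node, then
dpkg -i nfs-common_*.deb iputils-arping_*.deb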

(Issue 2495) NTP Requirement

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.
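A quick way to confirm that a node is synchronized (ntpq ships with the standard NTP distribution):

# List NTP peers; a leading '*' marks the peer currently used for sync
ntpq -p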


Alpha Release Notes

New in This Release

As this is the first release, there are no added or changed features.

Resolved Issues

As this is the first release, there are no issues resolved or carried over from a previous release.

Known Issues

(Issue 2495) NTP Requirement

To keep all cluster nodes time-synchronized, MapR requires NTP to be configured and running on every node. If server clocks in the cluster drift out of sync, serious problems will occur with HBase and other MapR services. MapR raises a Time Skew alarm on any out-of-sync nodes. See http://www.ntp.org/ for more information about obtaining and installing NTP. In the event that a large adjustment must be made to the time on a particular node, you should stop ZooKeeper on the node, then adjust the time, then restart ZooKeeper.

(Issue 2434) Mirroring Disabled in Alpha

Volume Mirroring is intentionally disabled in the MapR Alpha Release. User interface elements and API commands related to mirroring are non-functional.

(Issue 2282) fsck Not Present in Alpha

MapR cluster fsck is not present in the Alpha release.



MapR Control System

The MapR Control System main screen consists of a navigation pane to the left and a view to the right. Dialogs appear over the main screen to perform certain actions.

To log on to the MapR Control System

1. In a browser, navigate to the node that is running the mapr-webserver service:

https://<hostname>:8443

2. When prompted, enter the username and password of the administrative user.

The Dashboard

The Navigation pane to the left lets you choose which view to display on the right.

The main view groups are:

Cluster - information about the nodes in the cluster
MapR-FS - information about volumes, snapshots and schedules
NFS HA - NFS nodes and virtual IP addresses
Alarms - node and volume alarms
System Settings - configuration of alarm notifications, quotas, users, groups, SMTP, and HTTP

Some other views are separate from the main navigation tree:

CLDB View - information about the container location database
HBase View - information about HBase on the cluster
JobTracker View - information about the JobTracker
Nagios - generates a Nagios script
Terminal View - an ssh terminal for logging in to the cluster

Views

Views display information about the system. As you open views, tabs along the top let you switch between them quickly.

Clicking any column name in a view sorts the data in ascending or descending order by that column.

Most views contain the following controls:

a Filter toolbar that lets you sort data in the view, so you can quickly find the information you want

an info symbol that you can click for help

Some views contain collapsible panes that provide different types of detailed information. Each collapsible pane has a control at the top left that expands and collapses the pane. The control changes to show the state of the pane:

- pane is collapsed; click to expand

- pane is expanded; click to collapse

Views that contain many results provide the following controls:

First - navigates to the first screenful of results
Previous - navigates to the previous screenful of results
Next - navigates to the next screenful of results
Last - navigates to the last screenful of results
Refresh - refreshes the list of results

The Filter Toolbar

The Filter toolbar lets you build search expressions to provide sophisticated filtering capabilities for locating specific data on views that display a large number of nodes. Expressions are implicitly connected by the AND operator; any search results satisfy the criteria specified in all expressions.

There are three controls in the Filter toolbar:

The close control removes the expression.
The Add button adds a new expression.
The Filter Help button displays brief help about the Filter toolbar.

Expressions

Each expression specifies a semantic statement that consists of a field, an operator, and a value.

The first dropdown menu specifies the field to match.
The second dropdown menu specifies the type of match to perform.
The text field specifies a value to match or exclude in the field. You can use a wildcard to substitute for any part of the string.


Cluster

The Cluster view group provides the following views:

Dashboard - a summary of information about cluster health, activity, and usage
Nodes - information about nodes in the cluster
Node Heatmap - a summary of the health of nodes in the cluster

Dashboard

The Dashboard displays a summary of information about the cluster in six panes:

Cluster Heat Map - the alarms and health for each node, by rack
Alarms - a summary of alarms for the cluster
Cluster Utilization - CPU, Memory, and Disk Space usage
Services - the number of instances of each service
Volumes - the number of available, under-replicated, and unavailable volumes
MapReduce Jobs - the number of running and queued jobs, running tasks, and blacklisted nodes

Links in each pane provide shortcuts to more detailed information. The following sections provide information about each pane.

Cluster Heat Map

The Cluster Heat Map pane displays the health of the nodes in the cluster, by rack. Each node appears as a colored square to show its health at a glance.

The Show Legend/Hide Legend link above the heatmap shows or hides a key to the color-coded display.

The drop-down menu at the top right of the pane lets you filter the results to show the following criteria:

Health
(green): healthy; all services up, MapR-FS and all disks OK, and normal heartbeat
(orange): degraded; one or more services down, or no heartbeat for over 1 minute
(red): critical; MapR-FS Inactive/Dead/Replicate, or no heartbeat for over 5 minutes
(gray): maintenance
(purple): upgrade in process

CPU Utilization
(green): below 50%; (orange): 50% - 80%; (red): over 80%

Memory Utilization
(green): below 50%; (orange): 50% - 80%; (red): over 80%

Disk Space Utilization
(green): below 50%; (orange): 50% - 80%; (red): over 80% or all disks dead

Disk Failure(s) - status of the NODE_ALARM_DISK_FAILURE alarm
(red): raised; (green): cleared

Excessive Logging - status of the NODE_ALARM_DEBUG_LOGGING alarm
(red): raised; (green): cleared

Software Installation & Upgrades - status of the NODE_ALARM_VERSION_MISMATCH alarm
(red): raised; (green): cleared

Time Skew - status of the NODE_ALARM_TIME_SKEW alarm
(red): raised; (green): cleared

CLDB Service Down - status of the NODE_ALARM_SERVICE_CLDB_DOWN alarm
(red): raised; (green): cleared

FileServer Service Down - status of the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm
(red): raised; (green): cleared

JobTracker Service Down - status of the NODE_ALARM_SERVICE_JT_DOWN alarm
(red): raised; (green): cleared

TaskTracker Service Down - status of the NODE_ALARM_SERVICE_TT_DOWN alarm
(red): raised; (green): cleared

HBase Master Service Down - status of the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm
(red): raised; (green): cleared

HBase Regionserver Service Down - status of the NODE_ALARM_SERVICE_HBREGION_DOWN alarm
(red): raised; (green): cleared

NFS Service Down - status of the NODE_ALARM_SERVICE_NFS_DOWN alarm
(red): raised; (green): cleared

WebServer Service Down - status of the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm
(red): raised; (green): cleared

Hoststats Service Down - status of the NODE_ALARM_SERVICE_HOSTSTATS_DOWN alarm
(red): raised; (green): cleared

Root Partition Full - status of the NODE_ALARM_ROOT_PARTITION_FULL alarm
(red): raised; (green): cleared

Installation Directory Full - status of the NODE_ALARM_OPT_MAPR_FULL alarm
(red): raised; (green): cleared

Cores Present - status of the NODE_ALARM_CORE_PRESENT alarm
(red): raised; (green): cleared

Clicking a rack name navigates to the Nodes view, which provides more detailed information about the nodes in the rack.

Clicking a colored square navigates to the Node Properties View, which provides detailed information about the node.

Alarms

The Alarms pane displays the following information about alarms on the system:

Alarm - a list of alarms raised on the cluster
Last Raised - the most recent time each alarm state changed
Summary - how many nodes or volumes have raised each alarm

Clicking any column name sorts data in ascending or descending order by that column.

Cluster Utilization

The Cluster Utilization pane displays a summary of the total usage of the following resources:

CPU
Memory
Disk Space

For each resource type, the pane displays the percentage of cluster resources used, the amount used, and the total amount present in the system.


Services

The Services pane shows information about the services running on the cluster. For each service, the pane displays the following information:

Actv - the number of running instances of the service
Stby - the number of instances of the service that are configured and standing by to provide failover
Stop - the number of instances of the service that have been intentionally stopped
Fail - the number of instances of the service that have failed, indicated by a corresponding Service Down alarm
Total - the total number of instances of the service configured on the cluster

Clicking a service navigates to the Services view.

Volumes

The Volumes pane displays the total number of volumes, and the number of volumes that are mounted and unmounted. For each category, the Volumes pane displays the number, percent of the total, and total size.

Clicking mounted or unmounted navigates to the Volumes view.

MapReduce Jobs

The MapReduce Jobs pane shows information about MapReduce jobs:

Running Jobs - the number of MapReduce jobs currently running
Queued Jobs - the number of MapReduce jobs queued to run
Running Tasks - the number of MapReduce tasks currently running
Blacklisted Nodes - the number of nodes that have been eliminated from the MapReduce pool

Nodes

The Nodes view displays the nodes in the cluster, by rack. The Nodes view contains two panes: the Topology pane and the Nodes pane. The Topology pane shows the racks in the cluster. Selecting a rack displays that rack's nodes in the Nodes pane to the right. Selecting Cluster displays all the nodes in the cluster.

Clicking any column name sorts data in ascending or descending order by that column.

Selecting the checkboxes beside one or more nodes makes the following buttons available:

Manage Services - displays the Manage Node Services dialog, which lets you start and stop services on the node.
Remove - displays the Remove Node dialog, which lets you remove the node.
Change Topology - displays the Change Node Topology dialog, which lets you change the topology path for a node.

Selecting the checkbox beside a single node makes the following button available:

Properties - navigates to the Node Properties View, which displays detailed information about a single node.

The dropdown menu at the top left specifies the type of information to display:

Overview - general information about each node
Services - services running on each node
Machine Performance - information about memory, CPU, I/O and RPC performance on each node
Disks - information about disk usage, failed disks, and the MapR-FS heartbeat from each node
MapReduce - information about the JobTracker heartbeat and TaskTracker slots on each node
NFS Nodes - the IP addresses and Virtual IPs assigned to each NFS node
Alarm Status - the status of alarms on each node

Clicking a node's Hostname navigates to the Node Properties View, which provides detailed information about the node.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Overview

The Overview displays the following general information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Phys IP(s) - the IP address or addresses associated with each node
FS HB - time since each node's last heartbeat to the CLDB
JT HB - time since each node's last heartbeat to the JobTracker
Physical Topology - the rack path to each node

Services

The Services view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Services - a list of the services running on each node


Physical Topology - each node's physical topology

Machine Performance

The Machine Performance view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Memory - the percentage of memory used and the total memory
# CPUs - the number of CPUs present on each node
% CPU Idle - the percentage of CPU usage on each node
Bytes Received - the network input
Bytes Sent - the network output
# RPCs - the number of RPC calls
RPC In Bytes - the RPC input, in bytes
RPC Out Bytes - the RPC output, in bytes
# Disk Reads - the number of RPC disk reads
# Disk Writes - the number of RPC disk writes
Disk Read Bytes - the number of bytes read from disk
Disk Write Bytes - the number of bytes written to disk
# Disks - the number of disks present

Disks

The Disks view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
# bad Disks - the number of failed disks on each node
Usage - the amount of disk used and total disk capacity, in gigabytes

MapReduce

The MapReduce view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
JT HB - the time since each node's most recent JobTracker heartbeat
TT Map Slots - the number of map slots on each node
TT Map Slots Used - the number of map slots in use on each node
TT Reduce Slots - the number of reduce slots on each node
TT Reduce Slots Used - the number of reduce slots in use on each node

NFS Nodes

The NFS Nodes view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Phys IP(s) - the IP address or addresses associated with each node
VIP(s) - the virtual IP address or addresses assigned to each node

Alarm Status

The Alarm Status view displays the following information about nodes in the cluster:

Hlth - each node's health: healthy, degraded, or critical
Hostname - the hostname of each node
Version Alarm - whether the NODE_ALARM_VERSION_MISMATCH alarm is raised
Excess Logs Alarm - whether the NODE_ALARM_DEBUG_LOGGING alarm is raised
Disk Failure Alarm - whether the NODE_ALARM_DISK_FAILURE alarm is raised
Time Skew Alarm - whether the NODE_ALARM_TIME_SKEW alarm is raised
Root Partition Alarm - whether the NODE_ALARM_ROOT_PARTITION_FULL alarm is raised
Installation Directory Alarm - whether the NODE_ALARM_OPT_MAPR_FULL alarm is raised
Core Present Alarm - whether the NODE_ALARM_CORE_PRESENT alarm is raised
CLDB Alarm - whether the NODE_ALARM_SERVICE_CLDB_DOWN alarm is raised
FileServer Alarm - whether the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm is raised
JobTracker Alarm - whether the NODE_ALARM_SERVICE_JT_DOWN alarm is raised
TaskTracker Alarm - whether the NODE_ALARM_SERVICE_TT_DOWN alarm is raised
HBase Master Alarm - whether the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm is raised
HBase Region Alarm - whether the NODE_ALARM_SERVICE_HBREGION_DOWN alarm is raised
NFS Gateway Alarm - whether the NODE_ALARM_SERVICE_NFS_DOWN alarm is raised
WebServer Alarm - whether the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm is raised

Node Properties View

The Node Properties view displays detailed information about a single node in seven collapsible panes:

Alarms
Machine Performance
General Information
MapReduce
Manage Node Services
MapR-FS and Available Disks
System Disks

Buttons:

Remove Node - displays the Remove Node dialog

Alarms

The Alarms pane displays a list of alarms that have been raised on the system, and the following information about each alarm:

Alarm - the alarm name
Last Raised - the most recent time when the alarm was raised
Summary - a description of the alarm

Machine Performance

The Activity Since Last Heartbeat pane displays the following information about the node's performance and resource usage since it last reported to the CLDB:

Memory Used - the amount of memory in use on the node
Disk Used - the amount of disk space used on the node
CPU - the number of CPUs and the percentage of CPU used on the node
Network I/O - the input and output to the node per second
RPC I/O - the number of RPC calls on the node and the amount of RPC input and output
Disk I/O - the amount of data read to and written from the disk
# Operations - the number of disk reads and writes


General Information

The General Information pane displays the following general information about the node:

FS HB - the amount of time since the node performed a heartbeat to the CLDB
JT HB - the amount of time since the node performed a heartbeat to the JobTracker
Physical Topology - the rack path to the node

MapReduce

The MapReduce pane displays the number of map and reduce slots used, and the total number of map and reduce slots on the node.

MapR-FS and Available Disks

The MapR-FS and Available Disks pane displays the disks on the node, and the following information about each disk:

Mnt - whether the disk is mounted or unmounted
Disk - the disk name
File System - the file system on the disk
Used - the percentage used and total size of the disk

Clicking the checkbox next to a disk lets you select the disk for addition or removal.


Buttons:

Add Disks to MapR-FS - with one or more disks selected, adds the disks to the MapR-FS storage
Remove Disks from MapR-FS - with one or more disks selected, removes the disks from the MapR-FS storage

System Disks

The System Disks pane displays information about disks present and mounted on the node:

Mnt - whether the disk is mounted
Device - the device name of the disk
File System - the file system
Used - the percentage used and total capacity

Manage Node Services

The Manage Node Services pane displays the status of each service on the node:

Service - the name of each service
State:

0 - NOT_CONFIGURED: the package for the service is not installed and/or the service is not configured (configure.sh has not run)
2 - RUNNING: the service is installed, has been started by the warden, and is currently executing
3 - STOPPED: the service is installed and configure.sh has run, but the service is currently not executing

Log Path - the path where each service stores its logs


Buttons:

Start Service - starts the selected services
Stop Service - stops the selected services
Log Settings - displays the Trace Activity dialog

You can also start and stop services in the Manage Node Services dialog, by clicking Manage Services in the Nodes view.

Trace Activity

The Trace Activity dialog lets you set the log level of a specific service on a particular node.

The Log Level dropdown specifies the logging threshold for messages.

Buttons:

OK - save changes and exit
Close - exit without saving changes

Remove Node

The Remove Node dialog lets you remove the specified node.


The Remove Node dialog contains a radio button that lets you choose how to remove the node:

Shut down all services and then remove - shut down services before removing the node
Remove immediately (-force) - remove the node without shutting down services

Buttons:

Remove Node - removes the node
Cancel - returns to the Node Properties View without removing the node

Manage Node Services

The Manage Node Services dialog lets you start and stop services on the node.

The Service Changes section contains a dropdown menu for each service:

No change - leave the service running if it is running, or stopped if it is stopped
Start - start the service
Stop - stop the service

Buttons:

Change Node - start and stop the selected services as specified by the dropdown menus
Cancel - returns to the Node Properties View without starting or stopping any services


You can also start and stop services in the Manage Node Services pane of the Node Properties view.

Change Node Topology

The Change Node Topology dialog lets you change the rack or switch path for one or more nodes.

The Change Node Topology dialog consists of two panes:

Node(s) to move shows the node or nodes specified in the Nodes view.
New Path contains the following fields:

Path to Change - rack path or switch path
New Path - the new node topology path

The Change Node Topology dialog contains the following buttons:

Move Node - changes the node topology
Close - returns to the Nodes view without changing the node topology

Node Heatmap

The Node Heatmap view displays information about each node, by rack.

The dropdown menu above the heatmap lets you choose the type of information to display. See Cluster Heat Map.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.


MapR-FS

The MapR-FS group provides the following views:

Volumes - information about volumes in the cluster
Mirror Volumes - information about mirrors
User Disk Usage - cluster disk usage
Snapshots - information about volume snapshots
Schedules - information about schedules

Volumes

The Volumes view displays the following information about volumes in the cluster:

Mnt - whether the volume is mounted
Vol Name - the name of the volume
Mount Path - the path where the volume is mounted
Creator - the user or group that owns the volume
Quota - the volume quota
Vol Size - the size of the volume
Snap Size - the size of the volume snapshot
Total Size - the size of the volume and all its snapshots
Replication Factor - the number of copies of the volume
Physical Topology - the rack path to the volume

Clicking any column name sorts data in ascending or descending order by that column.

The Show Unmounted checkbox specifies whether to show unmounted volumes:

selected - show both mounted and unmounted volumes
unselected - show mounted volumes only

The Show System checkbox specifies whether to show system volumes:

selected - show both system and user volumes
unselected - show user volumes only

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Clicking New Volume displays the New Volume dialog.

Selecting one or more checkboxes next to volumes enables the following buttons:

Remove - displays the Remove Volume dialog
Properties - displays the Volume Properties dialog (becomes Edit X Volumes if more than one checkbox is selected)
Snapshots - displays the Snapshots for Volume dialog
New Snapshot - displays the Snapshot Name dialog


New Volume

The New Volume dialog lets you create a new volume.

For mirror volumes, the Replication & Snapshot Scheduling section is replaced with a section called Replication & Mirror Scheduling:

The Volume Setup section specifies basic information about the volume using the following fields:

Volume Type - a standard volume, or a local or remote mirror volume
Volume Name (required) - a name for the new volume
Mount Path - a path on which to mount the volume
Mounted - whether the volume is mounted at creation
Topology - the new volume's rack topology
Read-only - if checked, prevents writes to the volume

The Ownership & Permissions section lets you grant specific permissions on the volume to certain users or groups:

User/Group field - the user or group to which permissions are to be granted (one user or group per row)
Permissions field - the permissions to grant to the user or group (see the Permissions table below)
Delete button - deletes the current row
[ + Add Permission ] - adds a new row

Volume Permissions

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)
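These permission codes can also be granted from the command line. As a minimal sketch (the volume name and username are hypothetical; verify the exact flags against the maprcli reference):

maprcli acl edit -type volume -name myVolume -user jsmith:m,d

This grants the user jsmith permission to modify the volume (m) and delete it (d).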

The Usage Tracking section sets the accountable entity and quotas for the volume using the following fields:

Group/User - the group/user that is accountable for the volume
Quotas - the volume quotas:

Volume Advisory Quota - if selected, the advisory quota for the volume as an integer plus a single letter to represent the unit
Volume Quota - if selected, the quota for the volume as an integer plus a single letter to represent the unit

The Replication & Snapshot Scheduling section (normal volumes) contains the following fields:

Replication - the desired replication factor for the volume
Minimum Replication - the minimum replication factor for the volume. When the number of replicas drops to or below this number, the volume is aggressively re-replicated to bring it above the minimum replication factor.
Snapshot Schedule - determines when snapshots will be automatically created; select an existing schedule from the pop-up menu

The Replication & Mirror Scheduling section (mirror volumes) contains the following fields:

Replication Factor - the desired replication factor for the volume
Actual Replication - what percent of the volume data is replicated once (1x), twice (2x), and so on, respectively
Mirror Update Schedule - determines when mirrors will be automatically updated; select an existing schedule from the pop-up menu
Last Mirror Operation - the status of the most recent mirror operation

Buttons:

Save - creates the new volume
Close - exits without creating the volume

Remove Volume

The Remove Volume dialog prompts you for confirmation before removing the specified volume or volumes.


Buttons:

Remove Volume - removes the volume or volumes
Cancel - exits without removing the volume or volumes

Volume Properties

The Volume Properties dialog lets you view and edit volume properties.


For mirror volumes, the Replication & Snapshot Scheduling section is replaced with a section called Replication & Mirror Scheduling:


For information about the fields in the Volume Properties dialog, see New Volume.

Snapshots for Volume

The Snapshots for Volume dialog displays the following information about snapshots for the specified volume:

Snapshot Name - the name of the snapshot
Disk Used - the disk space occupied by the snapshot
Created - the date and time the snapshot was created
Expires - the snapshot expiration date and time

Buttons:

New Snapshot - displays the Snapshot Name dialog
Remove - when the checkboxes beside one or more snapshots are selected, displays the Remove Snapshots dialog
Preserve - when the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring
Close - closes the dialog

Snapshot Name

The Snapshot Name dialog lets you specify the name for a new snapshot you are creating.


The Snapshot Name dialog creates a new snapshot with the name specified in the following field:

Name For New Snapshot(s) - the new snapshot name

Buttons:

OK - creates a snapshot with the specified name
Cancel - exits without creating a snapshot

Remove Snapshots

The Remove Snapshots dialog prompts you for confirmation before removing the specified snapshot or snapshots.

Buttons

Yes - removes the snapshot or snapshots
No - exits without removing the snapshot or snapshots

Mirror Volumes

The Mirror Volumes pane displays information about mirror volumes in the cluster:

Mnt - whether the volume is mounted
Vol Name - the name of the volume
Src Vol - the source volume
Src Clu - the source cluster
Orig Vol - the originating volume for the data being mirrored
Orig Clu - the originating cluster for the data being mirrored
Last Mirrored - the time at which mirroring was most recently completed
Status - status of the last mirroring operation
% Done - progress of the mirroring operation
Error(s) - any errors that occurred during the last mirroring operation

User Disk Usage


The User Disk Usage view displays information about disk usage by cluster users:

Name - the username
Disk Usage - the total disk space used by the user
# Vols - the number of volumes
Hard Quota - the user's quota
Advisory Quota - the user's advisory quota
Email - the user's email address

Snapshots

The Snapshots view displays the following information about volume snapshots in the cluster:

Snapshot Name - the name of the snapshot
Volume Name - the name of the source volume for the snapshot
Disk Space Used - the disk space occupied by the snapshot
Created - the creation date and time of the snapshot
Expires - the expiration date and time of the snapshot

Clicking any column name sorts data in ascending or descending order by that column.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Buttons:

Remove Snapshot - when the checkboxes beside one or more snapshots are selected, displays the Remove Snapshots dialog
Preserve Snapshot - when the checkboxes beside one or more snapshots are selected, prevents the snapshots from expiring

Schedules

The Schedules view lets you view and edit schedules, which can then be attached to events to create occurrences. A schedule is a named group of rules that describe one or more points of time in the future at which an action can be specified to take place.


The left pane of the Schedules view lists the following information about the existing schedules:

Schedule Name - the name of the schedule; clicking a name displays the schedule details in the right pane for editing

In Use - indicates whether the schedule is in use, or attached to an action

The right pane provides the following tools for creating or editing schedules:

Schedule Name - the name of the schedule
Schedule Rules - specifies schedule rules with the following components:

A dropdown that specifies frequency (Once, Yearly, Monthly, Weekly, Daily, Hourly, Every X minutes)
Dropdowns that specify the time within the selected frequency
Retain For - the time for which the scheduled snapshot or mirror data is to be retained after creation

[ +Add Rule ] - adds another rule to the schedule

Navigating away from a schedule with unsaved changes displays the Save Schedule dialog.

Buttons:

New Schedule - starts editing a new schedule
Remove Schedule - displays the Remove Schedule dialog
Save Schedule - saves changes to the current schedule
Cancel - cancels changes to the current schedule

Remove Schedule

The Remove Schedule dialog prompts you for confirmation before removing the specified schedule.

Buttons

Yes - removes the schedule
No - exits without removing the schedule


NFS HA

The NFS view group provides the following views:

NFS Setup - information about NFS nodes in the cluster
VIP Assignments - information about virtual IP addresses (VIPs) in the cluster
NFS Nodes - information about NFS nodes in the cluster

NFS Setup

The NFS Setup view displays information about NFS nodes in the cluster and any VIPs assigned to them:

Starting VIP - the starting IP of the VIP range
Ending VIP - the ending IP of the VIP range
Node Name(s) - the names of the NFS nodes
IP Address(es) - the IP addresses of the NFS nodes
MAC Address(es) - the MAC addresses associated with the IP addresses

Buttons:

Start NFS - displays the Manage Node Services dialog
Add VIP - displays the Add Virtual IPs dialog
Edit - when one or more checkboxes are selected, edits the specified VIP ranges
Remove - when one or more checkboxes are selected, removes the specified VIP ranges
Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view)
VIP Assignments - displays the VIP Assignments view

VIP Assignments

The VIP Assignments view displays VIP assignments beside the nodes to which they are assigned:

Virtual IP Address - each VIP in the range
Node Name - the node to which the VIP is assigned
IP Address - the IP address of the node
MAC Address - the MAC address associated with the IP address

Buttons:

Start NFS - displays the Manage Node Services dialog
Add VIP - displays the Add Virtual IPs dialog
Unconfigured Nodes - displays nodes not running the NFS service (in the Nodes view)

NFS Nodes

The NFS Nodes view displays information about nodes running the NFS service:

Hlth - the health of the node
Hostname - the hostname of the node


Phys IP(s) - physical IP addresses associated with the node
VIP(s) - virtual IP addresses associated with the node

Buttons:

Properties - when one or more nodes are selected, navigates to the Node Properties View
Manage Services - navigates to the Manage Node Services dialog, which lets you start and stop services on the node
Remove - navigates to the Remove Node dialog, which lets you remove the node
Change Topology - navigates to the Change Node Topology dialog, which lets you change the rack or switch path for a node


Alarms

The Alarms view group provides the following views:

Node Alarms - information about node alarms in the cluster
Volume Alarms - information about volume alarms in the cluster
User/Group Alarms - information about users or groups that have exceeded quotas
Alarm Notifications - configure where notifications are sent when alarms are raised

Node Alarms

The Node Alarms view displays information about node alarms in the cluster.

Hlth - a color indicating the status of each node (see Cluster Heat Map)
Hostname - the hostname of the node
Version Alarm - last occurrence of the NODE_ALARM_VERSION_MISMATCH alarm
Excess Logs Alarm - last occurrence of the NODE_ALARM_DEBUG_LOGGING alarm
Disk Failure Alarm - last occurrence of the NODE_ALARM_DISK_FAILURE alarm
Time Skew Alarm - last occurrence of the NODE_ALARM_TIME_SKEW alarm
Root Partition Alarm - last occurrence of the NODE_ALARM_ROOT_PARTITION_FULL alarm
Installation Directory Alarm - last occurrence of the NODE_ALARM_OPT_MAPR_FULL alarm
Core Present Alarm - last occurrence of the NODE_ALARM_CORE_PRESENT alarm
CLDB Alarm - last occurrence of the NODE_ALARM_SERVICE_CLDB_DOWN alarm
FileServer Alarm - last occurrence of the NODE_ALARM_SERVICE_FILESERVER_DOWN alarm
JobTracker Alarm - last occurrence of the NODE_ALARM_SERVICE_JT_DOWN alarm
TaskTracker Alarm - last occurrence of the NODE_ALARM_SERVICE_TT_DOWN alarm
HBase Master Alarm - last occurrence of the NODE_ALARM_SERVICE_HBMASTER_DOWN alarm
HBase Regionserver Alarm - last occurrence of the NODE_ALARM_SERVICE_HBREGION_DOWN alarm
NFS Gateway Alarm - last occurrence of the NODE_ALARM_SERVICE_NFS_DOWN alarm
WebServer Alarm - last occurrence of the NODE_ALARM_SERVICE_WEBSERVER_DOWN alarm
Hoststats Alarm - last occurrence of the NODE_ALARM_SERVICE_HOSTSTATS_DOWN alarm

See Troubleshooting Alarms.

Clicking any column name sorts data in ascending or descending order by that column.

The left pane of the Node Alarms view displays the following information about the cluster:

Topology - the rack topology of the cluster

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Clicking a node's Hostname navigates to the Node Properties View, which provides detailed information about the node.

Buttons:

Properties - navigates to the Node Properties View
Remove - navigates to the Remove Node dialog, which lets you remove the node
Manage Services - navigates to the Manage Node Services dialog, which lets you start and stop services on the node


Change Topology - navigates to the Change Node Topology dialog, which lets you change the rack or switch path for a node

Volume Alarms

The Volume Alarms view displays information about volume alarms in the cluster:

Mnt - whether the volume is mounted
Vol Name - the name of the volume
Snapshot Alarm - last Snapshot Failed alarm
Mirror Alarm - last Mirror Failed alarm
Replication Alarm - last Data Under-Replicated alarm
Data Alarm - last Data Unavailable alarm
Vol Advisory Quota Alarm - last Volume Advisory Quota Exceeded alarm
Vol Quota Alarm - last Volume Quota Exceeded alarm

Clicking any column name sorts data in ascending or descending order by that column. Clicking a volume name displays the Volume Properties dialog.

Selecting the Show Unmounted checkbox shows unmounted volumes as well as mounted volumes.

Selecting the Filter checkbox displays the Filter toolbar, which provides additional data filtering options.

Buttons:

New Volume - displays the New Volume dialog
Properties - if the checkboxes beside one or more volumes are selected, displays the Volume Properties dialog
Mount (Unmount) - if an unmounted volume is selected, mounts it; if a mounted volume is selected, unmounts it
Remove - if the checkboxes beside one or more volumes are selected, displays the Remove Volume dialog
Start Mirroring - if a mirror volume is selected, starts the mirror sync process
Snapshots - if the checkboxes beside one or more volumes are selected, displays the Snapshots for Volume dialog
New Snapshot - if the checkboxes beside one or more volumes are selected, displays the Snapshot Name dialog

User/Group Alarms

The User/Group Alarms view displays information about user and group quota alarms in the cluster:

Name - the name of the user or group
User Advisory Quota Alarm - the last Advisory Quota Exceeded alarm
User Quota Alarm - the last Quota Exceeded alarm


Buttons:

Edit Properties

Alarm Notifications

The Configure Global Alarm Notifications dialog lets you specify where email notifications are sent when alarms are raised.

Fields:

Alarm Name - select the alarm to configure
Standard Notification - send notification to the default for the alarm type (the cluster administrator or volume creator, for example)
Additional Email Address - specify an additional custom email address to receive notifications for the alarm type

Buttons:

Save - save changes and exit
Close - exit without saving changes


System Settings

The System Settings view group provides the following views:

Email Addresses - specify MapR user email addresses
Permissions - give permissions to users
Quota Defaults - settings for default quotas in the cluster
SMTP - settings for sending email from MapR
HTTP - settings for accessing the MapR Control System via a browser
MapR Licenses - MapR license settings

Email Addresses

The Configure Email Addresses dialog lets you specify whether MapR gets user email addresses from an LDAP directory, or uses a company domain:

Use Company Domain - specify a domain to append after each username to determine each user's email address
Use LDAP - obtain each user's email address from an LDAP server

Buttons:

Save - save changes and exit
Close - exit without saving changes

Permissions

The Edit Permissions dialog lets you grant specific cluster permissions to particular users and groups.

User/Group field - the user or group to which permissions are to be granted (one user or group per row)
Permissions field - the permissions to grant to the user or group (see the Permissions table below)
Delete button - deletes the current row
[ + Add Permission ] - adds a new row

Cluster Permissions

Code - Allowed Action - Includes

login - Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes - includes cv

ss - Start/stop services

cv - Create volumes

a - Admin access - includes all permissions except fc

fc - Full control (administrative access and permission to change the cluster ACL) - includes a

Buttons:

OK - save changes and exit
Close - exit without saving changes

Quota Defaults

The Configure Quota Defaults dialog lets you set the default quotas that apply to users and groups.

The User Quota Defaults section contains the following fields:

Default User Advisory Quota - if selected, sets the advisory quota that applies to all users without an explicit advisory quota.
Default User Total Quota - if selected, sets the total quota that applies to all users without an explicit total quota.


The Group Quota Defaults section contains the following fields:

Default Group Advisory Quota - if selected, sets the advisory quota that applies to all groups without an explicit advisory quota.
Default Group Total Quota - if selected, sets the total quota that applies to all groups without an explicit total quota.

Buttons:

Save - saves the settings
Close - exits without saving the settings

SMTP

The Configure Sending Email dialog lets you configure the email account from which the MapR cluster sends alerts and other notifications.

The Configure Sending Email (SMTP) dialog contains the following fields:

Provider - selects Gmail or another email provider; if you select Gmail, the other fields are partially populated to help you with the configuration
SMTP Server - specifies the SMTP server to use when sending email
The server requires an encrypted connection (SSL) - use SSL when connecting to the SMTP server
SMTP Port - the port to use on the SMTP server
Full Name - the name used in the From field when the cluster sends an alert email
Email Address - the email address used in the From field when the cluster sends an alert email
Username - the username used to log onto the email account the cluster will use to send email
SMTP Password - the password to use when sending email

Buttons:

Save - saves the settings
Close - exits without saving the settings

HTTP

The Configure HTTP dialog lets you configure access to the MapR Control System via HTTP and HTTPS.


The sections in the Configure HTTP dialog let you enable HTTP and HTTPS access, and set the session timeout, respectively:

Enable HTTP Access - if selected, configure HTTP access with the following field:
HTTP Port - the port on which to connect to the MapR Control System via HTTP

Enable HTTPS Access - if selected, configure HTTPS access with the following fields:
HTTPS Port - the port on which to connect to the MapR Control System via HTTPS
HTTPS Keystore Path - a path to the HTTPS keystore
HTTPS Keystore Password - a password to access the HTTPS keystore
HTTPS Key Password - a password to access the HTTPS key

Session Timeout - the number of seconds before an idle session times out.

Buttons:

Save - saves the settings
Close - exits without saving the settings

MapR Licenses

The MapR License Management dialog lets you add and activate licenses for the cluster, and displays the Cluster ID and the following information about existing licenses:

Name - the name of each license
Issued - the date each license was issued
Expires - the expiration date of each license
Nodes - the nodes to which each license applies

If installing a new cluster, make sure to install the latest version of MapR software. If applying a new license to an existing MapR cluster, make sure to upgrade to the latest version of MapR first. If you are not sure, check the contents of the file MapRBuildVersion in the /opt/mapr directory. If the version is 1.0.0 and includes GA, then you must upgrade before applying a license. Example:

# cat /opt/mapr/MapRBuildVersion
1.0.0.10178GA-0v

For information about upgrading the cluster, see Cluster Upgrade.


Fields:

Cluster ID - the unique identifier needed for licensing the cluster

Buttons:

Add Licenses via Web - navigates to the MapR licensing form online
Add License via Upload - alternate licensing mechanism: upload via browser
Add License via Copy/Paste - alternate licensing mechanism: paste license key
Apply Licenses - validates the licenses and applies them to the cluster
Close - closes the dialog


Other Views

In addition to the MapR Control System views, there are views that display detailed information about the system:

CLDB View - information about the container location database
HBase View - information about HBase on the cluster
JobTracker View - information about the JobTracker
Nagios - generates a Nagios script
Terminal View - an ssh terminal for logging in to the cluster

With the exception of the MapR Launchpad, the above views include the following buttons:

Refresh Button - refreshes the view
Popout Button - opens the view in a new browser window


Hadoop Commands

All Hadoop commands are invoked by the bin/hadoop script.

Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]

Hadoop has an option parsing framework that handles generic options as well as running classes.

COMMAND_OPTION - Description

--config confdir - Overwrites the default configuration directory. Default is $HADOOP_HOME/conf.

COMMAND - Various commands with their options are described in the following sections.

GENERIC_OPTIONS - The common set of options supported by multiple commands.

COMMAND_OPTIONS - Various command options are described in the following sections.
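For example, to point the hadoop script at an alternate configuration directory (the path shown is illustrative):

$ hadoop --config /etc/hadoop/conf-staging fs -ls /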

Useful Information: Running the hadoop script without any arguments prints the description for all commands.

Commands

The following hadoop commands may be run on MapR:

Command - Description

archive -archiveName NAME <src>* <dest> - The hadoop archive command creates a Hadoop archive, a file that contains other files. A Hadoop archive always has a *.har extension.

classpath - The hadoop classpath command prints the class path needed to access the Hadoop JAR and the required libraries.

daemonlog - The hadoop daemonlog command may be used to get or set the log level of Hadoop daemons.

distcp <source> <destination> - The hadoop distcp command is a tool for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

fs - The hadoop fs command runs a generic filesystem user client that interacts with the MapR filesystem (MapR-FS).

jar <jar> - The hadoop jar command runs a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using this command.

job - Manipulates MapReduce jobs.

jobtracker - Runs the MapReduce jobtracker node.

mfs - The hadoop mfs command performs operations on directories in the cluster. The main purposes of hadoop mfs are to display directory information and contents, to create symbolic links, and to set compression and chunk size on a directory.

mradmin - Runs a MapReduce admin client.

pipes - Runs a pipes job.

queue - Gets information about job queues.

tasktracker - The hadoop tasktracker command runs a MapReduce tasktracker node.

version - The hadoop version command prints the Hadoop software version.


Useful Information: Most Hadoop commands print help when invoked without parameters.

Generic Options

Implement the Tool interface and the following generic Hadoop command-line options are available for many of the Hadoop commands.

Generic options are supported by the distcp, fs, job, mradmin, pipes, and queue Hadoop commands.

Generic Option - Description

-conf <filename1 filename2 ...> - Add the specified configuration files to the list of resources available in the configuration.

-D <property=value> - Set a value for the specified Hadoop configuration property.

-fs <local|filesystem URI> - Set the URI of the default filesystem.

-jt <local|jobtracker:port> - Specify a jobtracker for a given host and port. This command option is a shortcut for -D mapred.job.tracker=host:port.

-files <file1,file2,...> - Specify files to be copied to the MapReduce cluster.

-libjars <jar1,jar2,...> - Specify JAR files to be included in the classpath of the mapper and reducer tasks.

-archives <archive1,archive2,...> - Specify archive files (JAR, tar, tar.gz, ZIP) to be copied and unarchived on the task node.
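As a quick illustration of how generic options combine with the supported commands (the hostname darwin is a placeholder, not a real node):

$ hadoop job -jt darwin:50020 -list
$ hadoop fs -D fs.default.name=maprfs:/// -ls /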

CLASSNAME

The hadoop script can be used to invoke any class.

Usage: hadoop CLASSNAME

Runs the class named CLASSNAME.
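For example, the following invokes a utility class that ships with Hadoop (the class name is illustrative; any class with a main() method on the Hadoop classpath can be run this way):

$ hadoop org.apache.hadoop.util.PlatformName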


hadoop archive

The hadoop archive command creates a Hadoop archive, a file that contains other files. A Hadoop archive always has a *.har extension.

Syntax

hadoop [ Generic Options ] archive -archiveName <name> [-p <parent>] <source> <destination>

Parameters

Parameter - Description

-archiveName <name> - Name of the archive to be created.

-p <parent_path> - Specifies the relative path to which the files should be archived.

<source> - Filesystem pathnames which work as usual with regular expressions.

<destination> - Destination directory which will contain the archive.

Examples

Archive within a single directory

hadoop archive -archiveName myArchive.har -p /foo/bar /outputdir

The above command creates an archive of the /foo/bar directory in the /outputdir directory.

Archive to another directory

hadoop archive -archiveName myArchive.har -p /foo/bar a/b/c e/f/g

The above command creates an archive of the /foo/bar/a/b/c directory in the /foo/bar/e/f/g directory.


hadoop classpath

The hadoop classpath command prints the class path needed to access the Hadoop JAR and the required libraries.

Syntax

hadoop classpath

Output

$ hadoop classpath
/opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/lib/jvm/java-6-sun/lib/tools.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjtools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-cli-1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-codec-1.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-daemon-1.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hsqldb-1.8.0.10.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-core-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-mapper-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-compiler-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/logging-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-test-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mysql-connector-java-5.0.8-bin.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/oro-2.0.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlenc-0.52.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-api-2.1.jar


hadoop daemonlog

The hadoop daemonlog command gets and sets the log level for each daemon.

Hadoop daemons all produce logfiles that you can use to learn about what is happening on the system. You can use the hadoop daemonlog command to temporarily change the log level of a component when debugging the system.

Syntax

hadoop daemonlog -getlevel | -setlevel <host>:<port> <name> [ <level> ]

Parameters

The following command options are supported for the hadoop daemonlog command:

Parameter - Description

-getlevel <host:port> <name> - Prints the log level of the daemon running at the specified host and port, by querying http://<host>:<port>/logLevel?log=<name>

<host> - The host on which to get the log level.
<port> - The port by which to get the log level.
<name> - The daemon on which to get the log level; usually the fully qualified classname of the daemon doing the logging. For example, org.apache.hadoop.mapred.JobTracker for the JobTracker daemon.

-setlevel <host:port> <name> <level> - Sets the log level of the daemon running at the specified host and port, by querying http://<host>:<port>/logLevel?log=<name>

<host> - The host on which to set the log level.
<port> - The port by which to set the log level.
<name> - The daemon on which to set the log level.
<level> - The log level to set on the daemon.

Examples

Getting the log levels of a daemon

To get the log level for each daemon enter a command such as the following:

hadoop daemonlog -getlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker
Connecting to http://10.250.1.15:50030/logLevel?log=org.apache.hadoop.mapred.JobTracker
Submitted Log Name: org.apache.hadoop.mapred.JobTracker
Log Class: org.apache.commons.logging.impl.Log4JLogger
Effective level: ALL

Setting the log level of a daemon

To temporarily set the log level for a daemon enter a command such as the following:


hadoop daemonlog -setlevel 10.250.1.15:50030 org.apache.hadoop.mapred.JobTracker DEBUG
Connecting to http://10.250.1.15:50030/logLevel?log=org.apache.hadoop.mapred.JobTracker&level=DEBUG
Submitted Log Name: org.apache.hadoop.mapred.JobTracker
Log Class: org.apache.commons.logging.impl.Log4JLogger
Submitted Level: DEBUG
Setting Level to DEBUG ...
Effective level: DEBUG

Using this method, the log level is automatically reset when the daemon is restarted.

To make the change to log level of a daemon persistent, enter a command such as the following:

hadoop daemonlog -setlevel 10.250.1.15:50030 log4j.logger.org.apache.hadoop.mapred.JobTracker DEBUG


hadoop distcp

The hadoop distcp command is a tool used for large inter- and intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.

Syntax

hadoop [ Generic Options ] distcp
    <source> <destination>
    [-p [rbugp] ]
    [-i ]
    [-log <logdir> ]
    [-m <num_maps> ]
    [-overwrite ]
    [-update ]
    [-f <urilist_uri> ]
    [-filelimit <n> ]
    [-sizelimit <n> ]
    [-delete ]

Parameters

Command Options

The following command options are supported for the hadoop distcp command:

Parameter - Description

<source> - Specify the source URL.

<destination> - Specify the destination URL.

-p [rbugp] - Preserve: r (replication number), b (block size), u (user), g (group), p (permission). -p alone is equivalent to -prbugp. Modification times are not preserved. Also, when -update is specified, status updates will not be synchronized unless the file sizes also differ (i.e. unless the file is re-created).

-i - Ignore failures. As explained below, this option will keep more accurate statistics about the copy than the default case. It also preserves logs from failed copies, which can be valuable for debugging. Finally, a failing map will not cause the job to fail before all splits are attempted.

-log <logdir> - Write logs to <logdir>. The hadoop distcp command keeps logs of each file it attempts to copy as map output. If a map fails, the log output will not be retained if it is re-executed.

-m <num_maps> - Maximum number of simultaneous copies. Specify the number of maps to copy data. Note that more maps may not necessarily improve throughput. See Map Sizing.

-overwrite - Overwrite destination. If a map fails and -i is not specified, all the files in the split, not only those that failed, will be recopied. As discussed in Overwriting Files Between Clusters, it also changes the semantics for generating destination paths, so users should use this carefully.

-update - Overwrite if <source> size is different from <destination> size. As noted in the preceding, this is not a "sync" operation. The only criterion examined is the source and destination file sizes; if they differ, the source file replaces the destination file. See Updating Files Between Clusters.

-f <urilist_uri> - Use the list at <urilist_uri> as the source list. This is equivalent to listing each source on the command line. The urilist_uri list should be a fully qualified URI.

-filelimit <n> - Limit the total number of files to be <= n. See Symbolic Representations.

-sizelimit <n> - Limit the total size to be <= n bytes. See Symbolic Representations.

-delete - Delete the files existing in the <destination> but not in <source>. The deletion is done by FS Shell, so the trash will be used if it is enabled.

Generic Options

The following generic options are supported for the hadoop distcp command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.

Symbolic Representations

The <n> parameter in -filelimit and -sizelimit can be specified with symbolic representation. For example:

1230k = 1230 * 1024 = 1259520
891g = 891 * 1024^3 = 956703965184

Map Sizing

The hadoop distcp command makes a faint attempt to size each map comparably so that each copies roughly the same number of bytes. Note that files are the finest level of granularity, so increasing the number of simultaneous copiers (i.e. maps) may not always increase the number of simultaneous copies nor the overall throughput.

If -m is not specified, distcp will attempt to schedule work for min (total_bytes / bytes.per.map, 20 * num_task_trackers) maps, where bytes.per.map defaults to 256MB.
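As a worked illustration with hypothetical numbers: copying 100 GB with the default bytes.per.map of 256 MB on a cluster of 10 tasktrackers yields min(100 GB / 256 MB, 20 * 10) = min(400, 200) = 200 maps.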

Tuning the number of maps to the size of the source and destination clusters, the size of the copy, and the available bandwidth is recommendedfor long-running and regularly run jobs.

Examples

Basic Inter-cluster Copying

The hadoop distcp command is most often used to copy files between clusters:

hadoop distcp maprfs:///mapr/cluster1/foo \
maprfs:///mapr/cluster2/bar

The command in the example expands the namespace under /foo on cluster1 into a temporary file, partitions its contents among a set of map tasks, and starts a copy on each TaskTracker from cluster1 to cluster2. Note that the hadoop distcp command expects absolute paths.

Only those files that do not already exist in the destination are copied over from the source directory.

Updating Files Between Clusters

Use the hadoop distcp -update command to synchronize changes between clusters.

$ hadoop distcp -update maprfs:///mapr/cluster1/foo maprfs:///mapr/cluster2/bar/foo

Files in the /foo subtree are copied from cluster1 to cluster2 only if the size of the source file is different from the size of the destination file. Otherwise, the files are skipped over.

Note that using the -update option changes how distributed copy interprets the source and destination paths, making it necessary to add the trailing /foo subdirectory in the second cluster.

Overwriting Files Between Clusters

By default, distributed copy skips files that already exist in the destination directory, but you can overwrite those files using the -overwrite option. In this example, multiple source directories are specified:


$ hadoop distcp -overwrite maprfs:///mapr/cluster1/foo/a \
maprfs:///mapr/cluster1/foo/b \
maprfs:///mapr/cluster2/bar

As with the -update option, using -overwrite changes the way that the source and destination paths are interpreted by distributed copy: the contents of the source directories are compared to the contents of the destination directory. If a conflict exists (for example, if both sources map an entry to /bar/foo/ab at the destination), the distributed copy aborts.

Migrating Data from HDFS to MapR-FS

The hadoop distcp command can be used to migrate data from an HDFS cluster to MapR-FS where the HDFS cluster uses the same version of the RPC protocol as that used by MapR. For a discussion, see Copying Data from Apache Hadoop.

$ hadoop distcp namenode1:50070/foo maprfs:///bar

You must specify the IP address and HTTP port (usually 50070) for the namenode on the HDFS cluster.


hadoop fs

The hadoop fs command runs a generic filesystem user client that interacts with the MapR filesystem (MapR-FS).

Syntax

hadoop [ Generic Options ] fs
    [-cat <src>]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal <localsrc> ... <dst>]
    [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>]
    [-count[-q] <path>]
    [-cp <src> <dst>]
    [-df <path>]
    [-du <path>]
    [-dus <path>]
    [-expunge]
    [-get [-ignoreCrc] [-crc] <src> <localdst>]
    [-getmerge <src> <localdst> [addnl]]
    [-help [cmd]]
    [-ls <path>]
    [-lsr <path>]
    [-mkdir <path>]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> <dst>]
    [-put <localsrc> ... <dst>]
    [-rm [-skipTrash] <src>]
    [-rmr [-skipTrash] <src>]
    [-stat [format] <path>]
    [-tail [-f] <path>]
    [-test -[ezd] <path>]
    [-text <path>]
    [-touchz <path>]

Parameters

Command Options

The following command parameters are supported for hadoop fs:

Parameter - Description

-cat <src> - Fetch all files that match the file pattern defined by the <src> parameter and display their contents on stdout.

-fs [local | <file system URI>] - Specify the file system to use. If not specified, the current configuration is used, taken from the following, in increasing precedence: core-default.xml inside the hadoop jar file, and core-site.xml in $HADOOP_CONF_DIR. The local option means use the local file system as your DFS. <file system URI> specifies a particular file system to contact. This argument is optional but if used must appear first on the command line. Exactly one additional argument must be specified.

-ls <path> - List the contents that match the specified file pattern. If path is not specified, the contents of /user/<currentUser> will be listed. Directory entries are of the form dirName (full path) <dir>, and file entries are of the form fileName (full path) <r n> size, where n is the number of replicas specified for the file and size is the size of the file, in bytes.

-lsr <path> - Recursively list the contents that match the specified file pattern. Behaves very similarly to hadoop fs -ls, except that the data is shown for all the entries in the subtree.

-df [<path>] - Shows the capacity, free and used space of the filesystem. If the filesystem has multiple partitions, and no path to a particular partition is specified, then the status of the root partitions will be shown.

-du <path> - Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the Unix command du -sb <path>/* in case of a directory, and to du -b <path> in case of a file. The output is in the form name(full path) size (in bytes).

-dus <path> - Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the Unix command du -sb. The output is in the form name(full path) size (in bytes).

-mv <src> <dst> - Move files that match the specified file pattern <src> to a destination <dst>. When moving multiple files, the destination must be a directory.

-cp <src> <dst> - Copy files that match the file pattern <src> to a destination. When copying multiple files, the destination must be a directory.

-rm [-skipTrash] <src> - Delete all files that match the specified file pattern. Equivalent to the Unix command rm <src>. The -skipTrash option bypasses trash, if enabled, and immediately deletes <src>.

-rmr [-skipTrash] <src> - Remove all directories which match the specified file pattern. Equivalent to the Unix command rm -rf <src>. The -skipTrash option bypasses trash, if enabled, and immediately deletes <src>.

-put <localsrc> ... <dst> - Copy files from the local file system into fs.

-copyFromLocal <localsrc> ... <dst> - Identical to the -put command.

-moveFromLocal <localsrc> ... <dst> - Same as -put, except that the source is deleted after it's copied.

-get [-ignoreCrc] [-crc] <src> <localdst> - Copy files that match the file pattern <src> to the local name. <src> is kept. When copying multiple files, the destination must be a directory.

-getmerge <src> <localdst> - Get all the files in the directories that match the source file pattern and merge and sort them to only one file on the local fs. <src> is kept.

-copyToLocal [-ignoreCrc] [-crc] <src> <localdst> - Identical to the -get command.

-moveToLocal <src> <localdst> - Not implemented yet.

-mkdir <path> - Create a directory in the specified location.

-tail [-f] <file> - Show the last 1KB of the file. The -f option shows appended data as the file grows.

-touchz <path> - Write a timestamp in yyyy-MM-dd HH:mm:ss format in a file at <path>. An error is returned if the file exists with non-zero length.

-test -[ezd] <path> - If the file exists, has zero length, or is a directory, then return 0; else return 1.

-text <src> - Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.

-stat [format] <path> - Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH... - Changes permissions of a file. This works similarly to the shell's chmod with a few exceptions. -R modifies the files recursively; this is the only option currently supported. MODE is the same as the mode used for the shell's chmod command; the only letters recognized are rwxXt, e.g. +t, a+r, g-w, +rwx, o=r. OCTALMODE is specified in 3 or 4 digits. If 4 digits, the first may be 1 or 0 to turn the sticky bit on or off, respectively. Unlike the shell command, it is not possible to specify only part of the mode, e.g. 754 is the same as u=rwx,g=rx,o=r. If none of 'augo' is specified, 'a' is assumed, and unlike the shell command, no umask is applied.

-chown [-R] [OWNER][:[GROUP]] PATH... - Changes the owner and group of a file. This is similar to the shell's chown with a few exceptions. -R modifies the files recursively; this is the only option currently supported. If only the owner or group is specified, then only the owner or group is modified. The owner and group names may only consist of digits, alphabet, and any of '-.@/', i.e. [-.@/a-zA-Z0-9]. The names are case-sensitive. Warning: Avoid using '.' to separate user name and group even though Linux allows it. If user names have dots in them and you are using the local file system, you might see surprising results since the shell command chown is used for local files.

-chgrp [-R] GROUP PATH... - This is equivalent to -chown ... :GROUP ...

-count[-q] <path> - Count the number of directories, files and bytes under the paths that match the specified file pattern. Without -q, the output columns are DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME; with -q, they are QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME.

-help [cmd] - Displays help for the given command, or all commands if none is specified.

Generic Options

The following generic options are supported for the hadoop fs command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.
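As a quick illustration of common hadoop fs usage (the paths and file names are hypothetical):

$ hadoop fs -mkdir /myvolume/in
$ hadoop fs -put localfile.txt /myvolume/in
$ hadoop fs -ls /myvolume/in
$ hadoop fs -cat /myvolume/in/localfile.txt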


hadoop jar

The hadoop jar command runs a program contained in a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using this command.

Syntax

hadoop jar <jar> [<arguments>]

Parameters

The following command parameters are supported for hadoop jar:

Parameter - Description

<jar> - The JAR file.

<arguments> - Arguments to the program specified in the JAR file.

Examples

Streaming Jobs

Hadoop streaming jobs are run using the hadoop jar command. The Hadoop streaming utility enables you to create and run MapReduce jobs with any executable or script as the mapper and/or the reducer.

$ hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
    -reducer /bin/wc

The -input, -output, -mapper, and -reducer streaming command options are all required for streaming jobs. Either an executable or a Java class may be used for the mapper and the reducer. For more information about and examples of streaming jobs, see Streaming examples.

Word Count

The simple Word Count program is another example of a program that is run using the hadoop jar command. The Word Count program reads files from an input directory, counts the words, and writes the results of the job to files in an output directory.

$ hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /myvolume/in /myvolume/out


hadoop job

The hadoop job command enables you to manage MapReduce jobs.

Syntax

hadoop job [Generic Options]
    [-submit <job-file>]
    [-status <job-id>]
    [-counter <job-id> <group-name> <counter-name>]
    [-kill <job-id>]
    [-unblacklist <job-id> <hostname>]
    [-set-priority <job-id> <priority>]
    [-events <job-id> <from-event-#> <#-of-events>]
    [-history <jobOutputDir>]
    [-list [all]]
    [-list-active-trackers]
    [-list-blacklisted-trackers]
    [-list-attempt-ids <job-id> <task-type> <task-state>]
    [-kill-task <task-id>]
    [-fail-task <task-id>]
    [-blacklist-tasktracker <hostname>]

Parameters

Command Options

The following command options are supported for hadoop job:

Parameter Description

-submit <job-file> Submits the job.

-status <job-id> Prints the map and reduce completion percentage and all job counters.

-counter <job-id> <group-name><counter-name>

Prints the counter value.

-kill <job-id> Kills the job.

-unblacklist <job-id> <hostname> Removes a tasktracker job from the jobtracker's blacklist.

-set-priority <job-id><priority>

Changes the priority of the job. Valid priority values are , , , and VERY_HIGH HIGH, NORMAL LOW. VERY_LOW

The job scheduler uses this property to determine the order in which jobs are run.

-events <job-id> <from-event-#><#-of-events>

Prints the events' details received by jobtracker for the given range.

-history <jobOutputDir> Prints job details, failed and killed tip details.

-list [all] The option displays all jobs. The command without the option displays-list all -list allonly jobs which are yet to complete.

-list-active-trackers Prints all active tasktrackers.

-list-blackisted-trackers Prints blacklisted tasktrackers.

-list-attempt-ids<job-id><task-type>

Lists the IDs of task attempts.

-kill-task <task-id> Kills the task. Killed tasks are counted against failed attempts.not

-fail-task <task-id> Fails the task. Failed tasks are counted against failed attempts.

-blacklist-tasktracker <hostname> Pauses all current tasktracker jobs and prevents additional jobs from being scheduled on the tasktracker.

Generic Options

The following generic options are supported for the hadoop job command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.

Examples

Submitting Jobs

The hadoop job -submit command enables you to submit a job to the specified jobtracker.

$ hadoop job -jt darwin:50020 -submit job.xml

Stopping Jobs Gracefully

Use the hadoop job -kill command to stop a running or queued job.

$ hadoop job -kill <job-id>
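If you do not know the job ID, you can look it up first with -list; a minimal sketch, using a hypothetical job ID:

$ hadoop job -list
$ hadoop job -kill job_201203212110_0001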

Viewing Job History Logs

Run the hadoop job -history command to view a summary of the history logs in the specified directory.

$ hadoop job -history output-dir

This command will print job details, failed and killed tip details.

Additional details about the job, such as successful tasks and task attempts made for each task, can be viewed by adding the all option:

$ hadoop job -history all output-dir

Blacklisting Tasktrackers

The hadoop job command, when run as root or using sudo, can be used to manually blacklist tasktrackers:

hadoop job -blacklist-tasktracker <hostname>

Manually blacklisting a tasktracker pauses any running jobs and prevents additional jobs from being scheduled. For a detailed discussion, see TaskTracker Blacklisting.
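For example, a minimal sketch, assuming a hypothetical hostname:

$ sudo hadoop job -blacklist-tasktracker node14.example.com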


hadoop jobtracker

The hadoop jobtracker command runs the MapReduce jobtracker node.

Syntax

hadoop jobtracker [-dumpConfiguration]

Parameters

The hadoop jobtracker command supports the following command options:

Parameter Description

-dumpConfiguration Dumps the configuration used by the jobtracker, along with the queue configuration, in JSON format to standard output, and exits.
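For example, to inspect the jobtracker and queue configuration without starting the jobtracker service:

$ hadoop jobtracker -dumpConfiguration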


hadoop mfs

The hadoop mfs command performs operations on directories in the cluster. The main purposes of hadoop mfs are to display directory information and contents, to create symbolic links, and to set compression and chunk size on a directory.

Syntax

hadoop mfs
        [ -ln <target> <symlink> ]
        [ -ls <path> ]
        [ -lsd <path> ]
        [ -lsr <path> ]
        [ -lss <path> ]
        [ -setcompression on|off <dir> ]
        [ -setchunksize <size> <dir> ]
        [ -help <command> ]

Options

The normal command syntax is to specify a single option from the following table, along with its corresponding arguments. If compression and chunk size are not set explicitly for a given directory, the values are inherited from the parent directory.

Option Description

-ln Creates a symbolic link <symlink> that points to the target path <target>, similar to the standard Linux ln -s command.

-ls Lists files in the directory specified by <path>. The hadoop mfs -ls command corresponds to the standard hadoop fs -ls command, but provides the following additional information:

Blocks used for each file
Server where each block resides

-lsd Lists files in the directory specified by <path>, and also provides information about the specified directory itself:

Whether compression is enabled for the directory (indicated by z)
The configured chunk size (in bytes) for the directory

-lsr Lists files in the directory and subdirectories specified by <path>, recursively. The hadoop mfs -lsr command corresponds to the standard hadoop fs -lsr command, but provides the following additional information:

Blocks used for each file
Server where each block resides

-lss Lists files in the directory specified by <path>, with an additional column that displays the number of disk blocks per file. Disk blocks are 8192 bytes.

-setcompression Turns compression on or off on the specified directory.

-setchunksize Sets the chunk size in bytes for the specified directory. The <size> parameter must be a multiple of 65536.

-help Displays help for the hadoop mfs command.
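For example, a minimal sketch, assuming a hypothetical directory /myvolume/data (268435456 bytes is 256 MB, a multiple of 65536):

$ hadoop mfs -setcompression on /myvolume/data
$ hadoop mfs -setchunksize 268435456 /myvolume/data
$ hadoop mfs -lsd /myvolume/data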

Output

When used with -ls, -lsd, -lsr, or -lss, hadoop mfs displays information about files and directories. For each file or directory, hadoop mfs displays a line of basic information followed by lines listing the chunks that make up the file, in the following format:

mode compression replication owner group size date chunk size name
chunk fid host [host...]
chunk fid host [host...]
...

Volume links are displayed as follows:

mode compression replication owner group size date chunk size name
chunk target volume name writability fid -> fid [host...]

For volume links, the first fid is the chunk that stores the volume link itself; the fid after the arrow (->) is the first chunk in the target volume.

The following table describes the values:


mode A text string indicating the read, write, and execute permissions for the owner, group, and other permissions. See also Managing Permissions.

compression U - directory is not compressed; Z - directory is compressed

replication The replication factor of the file (directories display a dash instead)

owner The owner of the file or directory

group The group of the file or directory

size The size of the file or directory

date The date the file or directory was last modified

chunk size The chunk size of the file or directory

name The name of the file or directory

chunk The chunk number. The first chunk is a primary chunk labeled "p", a 64K chunk containing the root of the file. Subsequent chunks are numbered in order.

fid The chunk's file ID, which consists of three parts:

The ID of the container where the file is stored
The inode of the file within the container
An internal version number

host The host on which the chunk resides. When several hosts are listed, the first host is the first copy of the chunk and subsequent hosts are replicas.

target volume name The name of the volume pointed to by a volume link.

writability Displays whether the volume is writable.


hadoop mradmin

The hadoop mradmin command runs MapReduce administrative commands.

Syntax

hadoop [ Generic Options ] mradmin
        [-refreshServiceAcl]
        [-refreshQueues]
        [-refreshNodes]
        [-refreshUserToGroupsMappings]
        [-refreshSuperUserGroupsConfiguration]
        [-help [cmd]]

Parameters

The following command parameters are supported for hadoop mradmin:

Parameter Description

-refreshServiceAcl Reloads the service-level authorization policy file. The jobtracker will reload the authorization policy file.

-refreshQueues Reloads the queue ACLs and state. The JobTracker will reload the mapred-queues.xml file.

-refreshUserToGroupsMappings Refresh user-to-groups mappings.

-refreshSuperUserGroupsConfiguration Refresh superuser proxy groups mappings.

-refreshNodes Refresh the hosts information at the job tracker.

-help [cmd] Displays help for the given command, or all commands if none is specified.
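For example, a minimal sketch that refreshes the hosts information at the jobtracker (typically run after the host configuration changes):

$ hadoop mradmin -refreshNodes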

The following generic options are supported for :hadoop mradmin

Generic Option Description

-conf <configuration file> Specify an application configuration file.

-D <property=value> Use value for given property.

-fs <local|file system URI> Specify a file system.

-jt <local|jobtracker:port> Specify a job tracker.

-files <comma separated list of files> Specify comma separated files to be copied to the map reduce cluster.

-libjars <comma separated list of jars> Specify comma separated jar files to include in the classpath.

-archives <comma separated list of archives> Specify comma separated archives to be unarchived on the compute machines.


hadoop pipes

The hadoop pipes command runs a pipes job.

Hadoop Pipes is the C++ interface to Hadoop MapReduce. Hadoop Pipes uses sockets to enable tasktrackers to communicate with processes running the C++ map or reduce functions. See also Compiling Pipes Programs.

Syntax

hadoop [GENERIC OPTIONS] pipes
        [-output <path>]
        [-jar <jar file>]
        [-inputformat <class>]
        [-map <class>]
        [-partitioner <class>]
        [-reduce <class>]
        [-writer <class>]
        [-program <executable>]
        [-reduces <num>]

Parameters

Command Options

The following command parameters are supported for hadoop pipes:

Parameter Description

-output <path> Specify the output directory.

-jar <jar file> Specify the jar filename.

-inputformat <class> InputFormat class.

-map <class> Specify the Java Map class.

-partitioner <class> Specify the Java Partitioner.

-reduce <class> Specify the Java Reduce class.

-writer <class> Specify the Java RecordWriter.

-program <executable> Specify the URI of the executable.

-reduces <num> Specify the number of reduces.

Generic Options

The following generic options are supported for the hadoop pipes command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.
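For example, a minimal sketch of a pipes invocation using only options from the table above; the executable and paths are hypothetical (the C++ program would be built as described in Compiling Pipes Programs), and a real job may require additional configuration:

$ hadoop pipes -program /myvolume/bin/wc-pipes -output /myvolume/pipes-out -reduces 2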


hadoop queue

The hadoop queue command displays job queue information.

Syntax

hadoop [ Generic Options ] queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]

Parameters

Command Options

The hadoop queue command supports the following command options:

Parameter Description

-list Gets the list of job queues configured in the system, along with the scheduling information associated with the job queues.

-info <job-queue-name> [-showJobs] Displays the job queue information and associated scheduling information of the particular job queue. If the -showJobs option is present, a list of jobs submitted to the particular job queue is displayed.

-showacls Displays the queue name and associated queue operations allowed for the current user. The list consists of only those queues to which the user has access.
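For example, a minimal sketch (the queue name default assumes the stock Hadoop queue configuration):

$ hadoop queue -list
$ hadoop queue -info default -showJobs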

Generic Options

The following generic options are supported for the hadoop queue command: -conf <configuration file>, -D <property=value>, -fs <local|file system URI>, -jt <local|jobtracker:port>, -files <file1,file2,file3,...>, -libjars <libjar1,libjar2,libjar3,...>, and -archives <archive1,archive2,archive3,...>. For more information on generic options, see Generic Options.


hadoop tasktracker

The hadoop tasktracker command runs a MapReduce tasktracker node.

Syntax

hadoop tasktracker

Output

mapr@mapr-desktop:~$ hadoop tasktracker
12/03/21 21:19:56 INFO mapred.TaskTracker: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting TaskTracker
STARTUP_MSG:   host = mapr-desktop/127.0.1.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2-dev
STARTUP_MSG:   build =  -r ; compiled by 'root' on Thu Dec 8 22:43:13 PST 2011
************************************************************/
12/03/21 21:19:56 INFO mapred.TaskTracker:
/*-------------- TaskTracker System Properties ----------------
java.runtime.name: Java(TM) SE Runtime Environment
sun.boot.library.path: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/amd64
java.vm.version: 20.1-b02
hadoop.root.logger: INFO,console
java.vm.vendor: Sun Microsystems Inc.
java.vendor.url: http://java.sun.com/
path.separator: :
java.vm.name: Java HotSpot(TM) 64-Bit Server VM
file.encoding.pkg: sun.io
sun.java.launcher: SUN_STANDARD
user.country: US
sun.os.patch.level: unknown
java.vm.specification.name: Java Virtual Machine Specification
user.dir: /home/mapr
java.runtime.version: 1.6.0_26-b03
java.awt.graphicsenv: sun.awt.X11GraphicsEnvironment
java.endorsed.dirs: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/endorsed
os.arch: amd64
java.io.tmpdir: /tmp
line.separator:
hadoop.log.file: hadoop.log
java.vm.specification.vendor: Sun Microsystems Inc.
os.name: Linux
hadoop.id.str:
sun.jnu.encoding: UTF-8
java.library.path: /opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/native/Linux-amd64-64
hadoop.home.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/..
java.specification.name: Java Platform API Specification
java.class.version: 50.0
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
hadoop.pid.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/../pids
os.version: 2.6.32-33-generic
user.home: /home/mapr
user.timezone: America/Los_Angeles
java.awt.printerjob: sun.print.PSPrinterJob
file.encoding: UTF-8
java.specification.version: 1.6
java.class.path: /opt/mapr/hadoop/hadoop-0.20.2/bin/../conf:/usr/lib/jvm/java-6-sun-1.6.0.26/lib/tools.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/..:/opt/mapr/hadoop/hadoop-0.20.2/bin/../hadoop*core*.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjrt-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/aspectjtools-1.6.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-cli-1.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-codec-1.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-daemon-1.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-el-1.0.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-httpclient-3.0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-logging-api-1.0.4.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/commons-net-1.4.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/core-3.1.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/eval-0.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-capacity-scheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-core.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hadoop-0.20.2-dev-fairscheduler.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/hsqldb-1.8.0.10.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-core-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jackson-mapper-asl-1.5.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-compiler-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jasper-runtime-5.5.12.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jets3t-0.6.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-servlet-tester-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jetty-util-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/junit-4.5.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/kfs-0.2.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/log4j-1.2.15.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/logging-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/maprfs-test-0.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mockito-all-1.8.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/mysql-connector-java-5.0.8-bin.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/oro-2.0.8.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/servlet-api-2.5-6.1.14.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-api-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/slf4j-log4j12-1.4.3.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/xmlenc-0.52.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/zookeeper-3.3.2.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-2.1.jar:/opt/mapr/hadoop/hadoop-0.20.2/bin/../lib/jsp-2.1/jsp-api-2.1.jar
user.name: mapr
java.vm.specification.version: 1.0
sun.java.command: org.apache.hadoop.mapred.TaskTracker
java.home: /usr/lib/jvm/java-6-sun-1.6.0.26/jre
sun.arch.data.model: 64
user.language: en
java.specification.vendor: Sun Microsystems Inc.
hadoop.log.dir: /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs
java.vm.info: mixed mode
java.version: 1.6.0_26
java.ext.dirs: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/ext:/usr/java/packages/lib/ext
sun.boot.class.path: /usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/lib/modules/jdk.boot.jar:/usr/lib/jvm/java-6-sun-1.6.0.26/jre/classes
java.vendor: Sun Microsystems Inc.
file.separator: /
java.vendor.url.bug: http://java.sun.com/cgi-bin/bugreport.cgi
sun.io.unicode.encoding: UnicodeLittle
sun.cpu.endian: little
hadoop.policy.file: hadoop-policy.xml
sun.desktop: gnome
sun.cpu.isalist:
------------------------------------------------------------*/
12/03/21 21:19:57 INFO mapred.TaskTracker: /tmp is not tmpfs or ramfs. Java Hotspot Instrumentation will be disabled by default
12/03/21 21:19:57 INFO mapred.TaskTracker: Cleaning up config files from the job history folder
12/03/21 21:19:57 INFO mapred.TaskTracker: TT local config is /opt/mapr/hadoop/hadoop-0.20.2/conf/mapred-site.xml
12/03/21 21:19:57 INFO mapred.TaskTracker: Loading resource properties file : /opt/mapr//logs/cpu_mem_disk
12/03/21 21:19:57 INFO mapred.TaskTracker: Physical memory reserved for mapreduce tasks = 2105540608 bytes
12/03/21 21:19:57 INFO mapred.TaskTracker: CPUS: 1
12/03/21 21:19:57 INFO mapred.TaskTracker: Total MEM: 1.9610939GB
12/03/21 21:19:57 INFO mapred.TaskTracker: Reserved MEM: 2008MB
12/03/21 21:19:57 INFO mapred.TaskTracker: Reserved MEM for Ephemeral slots 0
12/03/21 21:19:57 INFO mapred.TaskTracker: DISKS: 2
12/03/21 21:19:57 INFO mapred.TaskTracker: Map slots 1, Default heapsize for map task 873 mb
12/03/21 21:19:57 INFO mapred.TaskTracker: Reduce slots 1, Default heapsize for reduce task 1135 mb
12/03/21 21:19:57 INFO mapred.TaskTracker: Ephemeral slots 0, memory given for each ephemeral slot 200 mb
12/03/21 21:19:57 INFO mapred.TaskTracker: Prefetch map slots 1
12/03/21 21:20:07 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog


12/03/21 21:20:08 INFO http.HttpServer: Added global filter safety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
12/03/21 21:20:08 WARN mapred.TaskTracker: Error while writing to TaskController config file
java.io.FileNotFoundException: /opt/mapr/hadoop/hadoop-0.20.2/bin/../conf/taskcontroller.cfg (Permission denied)
12/03/21 21:20:08 ERROR mapred.TaskTracker: Can not start TaskTracker because java.io.IOException: Cannot run program "/opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/Linux-amd64-64/bin/task-controller": java.io.IOException: error=13, Permission denied
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:267)
        at org.apache.hadoop.util.Shell.run(Shell.java:249)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:442)
        at org.apache.hadoop.mapred.LinuxTaskController.setup(LinuxTaskController.java:142)
        at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:2149)
        at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:5216)
Caused by: java.io.IOException: java.io.IOException: error=13, Permission denied
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:148)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
        ... 6 more

12/03/21 21:20:08 INFO mapred.TaskTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at mapr-desktop/127.0.1.1
************************************************************/


hadoop version

The hadoop version command prints the Hadoop software version.

Syntax

hadoop version

Output

mapr@mapr-desktop:~$ hadoop version
Hadoop 0.20.2-dev
Subversion  -r
Compiled by root on Thu Dec 8 22:43:13 PST 2011
From source with checksum 19fa44df0cb831c45ef984f21feb7110


API Reference

Overview

This guide provides information about the MapR command API. Most commands can be run on the command-line interface (CLI), or by making REST requests programmatically or in a browser. To run CLI commands, use a Client machine or an ssh connection to any node in the cluster. To use the REST interface, make HTTP requests to a node that is running the WebServer service.

Each command reference page includes the command syntax, a table that describes the parameters, and examples of command usage. In each parameter table, required parameters are in bold text. For output commands, the reference pages include tables that describe the output fields. Values that do not apply to particular combinations are marked NA.

REST API Syntax

MapR REST calls use the following format:

https://<host>:<port>/rest/<command>[/<subcommand>...]?<parameters>

Construct the <parameters> list from the required and optional parameters, in the format <parameter>=<value>, separated by the ampersand (&) character. Example:

https://r1n1.qa.sj.ca.us:8443/rest/volume/mount?name=test-volume&path=/test

Values in REST API calls must be URL-encoded. For readability, the values in this document are presented using the actual characters, rather than the URL-encoded versions.

Authentication

To make REST calls using or , provide the username and password.curl wget

Curl Syntax

curl -k -u <username>:<password> https://<host>:<port>/rest/<command>...

Wget Syntax

wget --no-check-certificate --user <username> --password <password> https://<host>:<port>/rest/<command>...
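For example, a minimal sketch that mounts a volume via the REST interface, assuming hypothetical credentials (mapr:mapr) and the example node used elsewhere in this guide; the URL is quoted so the shell does not interpret the ampersand:

$ curl -k -u mapr:mapr 'https://r1n1.qa.sj.ca.us:8443/rest/volume/mount?name=test-volume&path=/test'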

Command-Line Interface (CLI) Syntax

The MapR CLI commands are documented using the following conventions:

[Square brackets] indicate an optional parameter
<Angle brackets> indicate a value to enter

The following syntax example shows that the volume mount command requires the -name parameter, for which you must enter a list of volumes, and all other parameters are optional:

maprcli volume mount
    [ -cluster <cluster> ]
    -name <volume list>
    [ -path <path list> ]

For clarity, the syntax examples show each parameter on a separate line; in practical usage, the command and all parameters and options are typed on a single line. Example:

maprcli volume mount -name test-volume -path /test

Common Parameters

The following parameters are available for many commands in both the REST and command-line contexts.


Parameter Description

cluster The cluster on which to run the command. If this parameter is omitted, the command is run on the same cluster where it is issued. In multi-cluster contexts, you can use this parameter to specify a different cluster on which to run the command.

zkconnect A ZooKeeper connect string, which specifies a list of the hosts running ZooKeeper, and the port to use on each, in the format '<host>[:<port>][,<host>[:<port>]...]'. Default: 'localhost:5181'. In most cases the ZooKeeper connect string can be omitted, but it is useful in certain cases when the CLDB is not running.
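For example, a minimal sketch using hypothetical ZooKeeper hosts with the dashboard info command (which accepts -zkconnect):

maprcli dashboard info -zkconnect 'nodeA:5181,nodeB:5181,nodeC:5181'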

Common Options

The following options are available for most commands in the command-line context.

Option Description

-noheader When displaying tabular output from a command, omits the header row.

-long Shows the entire value. This is useful when the command response contains complex information. When -long is omitted, complex information is displayed as an ellipsis (...).

-json Displays command output in JSON format. When -json is omitted, the command output is displayed in tabular format.
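For example, to retrieve machine-readable output from any listing command (volume list is used here purely as an illustration):

maprcli volume list -json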

Filters

Some MapR CLI commands use filters, which let you specify large numbers of nodes or volumes by matching specified values in specified fields rather than by typing each name explicitly.

Filters use the following format:

[<field><operator>"<value>"]<and|or>[<field><operator>"<value>"] ...

field Field on which to filter. The field depends on the command with which the filter is used.

operator An operator for that field:

== - Exact match
!= - Does not match
> - Greater than
< - Less than
>= - Greater than or equal to
<= - Less than or equal to

value Value on which to filter. Wildcards (using *) are allowed for operators == and !=. There is a special value all that matches all values.

You can use the wildcard (*) for partial matches. For example, you can display all volumes whose owner is root and whose name begins with test as follows:

maprcli volume list -filter [n=="test*"]and[on=="root"]

Response

The commands return responses in JSON or in a tabular format. When you run commands from the command line, the response is returned in tabular format unless you specify JSON using the -json option; when you run commands through the REST interface, the response is returned in JSON.

Success

On a successful call, each command returns the error code zero (OK) and any data requested. When JSON output is specified, the data is returned as an array of records along with the status code and the total number of records. In the tabular format, the data is returned as a sequence of rows, each of which contains the fields in the record separated by tabs.


JSON "status":"OK", "total":<number of records>, "data":[ <record> ... ]

Tabular
status
0

Or

<heading> <heading> <heading> ...
<field> <field> <field> ...
...

Error

When an error occurs, the command returns the error code and descriptive message.

JSON "status":"ERROR", "errors":[ "id":<error code>, "desc":"<command>: <error message>" ]

Tabular
ERROR (<error code>) - <command>: <error message>


acl

The acl commands let you work with access control lists (ACLs):

acl edit - modifies a specific user's access to a cluster or volume
acl set - modifies the ACL for a cluster or volume
acl show - displays the ACL associated with a cluster or volume

In order to use the acl edit command, you must have full control (fc) permission on the cluster or volume for which you are running the command.

Specifying Permissions

Specify permissions for a user or group with a string that lists the permissions for that user or group. To specify permissions for multiple users orgroups, use a string for each, separated by spaces. The format is as follows:

Users - <user>:<action>[,<action>...][ <user>:<action>[,<action...]]
Groups - <group>:<action>[,<action>...][ <group>:<action>[,<action...]]
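For example, the string jsmith:dump,restore,d rjones:fc (hypothetical user names) grants jsmith the dump, restore, and delete permissions and gives rjones full control.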

The following tables list the permission codes used by the acl commands.

Cluster Permission Codes

Code Allowed Action Includes

login Log in to the MapR Control System, use the API and command-line interface, read access on cluster and volumes cv

ss Start/stop services

cv Create volumes

a Admin access All permissions except fc

fc Full control (administrative access and permission to change the cluster ACL) a

Volume Permission Codes

Code Allowed Action

dump Dump the volume

restore Mirror or restore the volume

m Modify volume properties, create and delete snapshots

d Delete a volume

fc Full control (admin access and permission to change volume ACL)


acl edit

The acl edit command grants one or more specific volume or cluster permissions to a user. To use the acl edit command, you must have full control (fc) permissions on the volume or cluster for which you are running the command.

The permissions are specified as a comma-separated list of permission codes. See acl. You must specify either a user or a group. When the type is volume, a volume name must be specified using the name parameter.

Syntax

CLI
maprcli acl edit
    [ -cluster <cluster name> ]
    [ -group <group> ]
    [ -name <name> ]
    -type cluster|volume
    [ -user <user> ]

REST
http[s]://<host:port>/rest/acl/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

group Groups and allowed actions for each group. See acl. Format: <group>:<action>[,<action>...][ <group>:<action>[,<action...]]

name The object name.

type The object type (cluster or volume).

user Users and allowed actions for each user. See acl. Format: <user>:<action>[,<action>...][ <user>:<action>[,<action...]]

Examples

Give the user jsmith dump, restore, and delete permissions for "test-volume":

CLI
maprcli acl edit -type volume -name test-volume -user jsmith:dump,restore,d


acl set

The acl set command specifies the entire ACL for a cluster or volume. Any previous permissions are overwritten by the new values, and any permissions omitted are removed. To use the acl set command, you must have full control (fc) permissions on the volume or cluster for which you are running the command.

The permissions are specified as a comma-separated list of permission codes. See acl. You must specify either a user or a group. When the type is volume, a volume name must be specified using the name parameter.

The acl set command removes any previous ACL values. If you wish to preserve some of the permissions, you should either use the acl edit command instead of acl set, or use acl show to list the values before overwriting them.

Syntax

CLI
maprcli acl set
    [ -cluster <cluster name> ]
    [ -group <group> ]
    [ -name <name> ]
    -type cluster|volume
    [ -user <user> ]

REST
http[s]://<host:port>/rest/acl/set?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

group Groups and allowed actions for each group. See acl. Format: <group>:<action>[,<action>...][ <group>:<action>[,<action...]]

name The object name.

type The object type (cluster or volume).

user Users and allowed actions for each user. See acl. Format: <user>:<action>[,<action>...][ <user>:<action>[,<action...]]

Examples

Give the user root full control of the cluster my.cluster.com and remove all permissions for all other users:

CLI
maprcli acl set -type cluster -cluster my.cluster.com -user root:fc

Usage Example


# maprcli acl show -type cluster
Principal Allowed actions
User root [login, ss, cv, a, fc]
User lfedotov [login, ss, cv, a, fc]
User mapr [login, ss, cv, a, fc]

# maprcli acl set -type cluster -cluster my.cluster.com -user root:fc
# maprcli acl show -type cluster
Principal Allowed actions
User root [login, ss, cv, a, fc]

Notice that the specified permissions have overwritten the existing ACL.

Give multiple users specific permissions for the volume test-volume and remove all permissions for all other users:

CLI
maprcli acl set -type volume -name test-volume -user jsmith:dump,restore,m rjones:fc


acl show

Displays the ACL associated with an object (cluster or a volume). An ACL contains the list of users who can perform specific actions.

Syntax

CLI
maprcli acl show
    [ -cluster <cluster> ]
    [ -group <group> ]
    [ -name <name> ]
    [ -output long|short|terse ]
    [ -perm ]
    -type cluster|volume
    [ -user <user> ]

REST
http[s]://<host:port>/rest/acl/show?<parameters>

Parameters

Parameter Description

cluster The name of the cluster on which to run the command

group The group for which to display permissions

name The cluster or volume name

output The output format:

long
short
terse

perm When this option is specified, acl show displays the permissions available for the object type specified in the type parameter.

type Cluster or volume.

user The user for which to display permissions

Output

The actions that each user or group is allowed to perform on the cluster or the specified volume. For information about each allowed action, see acl.

Principal Allowed actions
User root [r, ss, cv, a, fc]
Group root [r, ss, cv, a, fc]
All users [r]

Examples

Show the ACL for "test-volume":


CLI
maprcli acl show -type volume -name test-volume

Show the permissions that can be set on a cluster:

CLI
maprcli acl show -type cluster -perm


alarm

The alarm commands perform functions related to system alarms:

alarm clear - clears one or more alarms
alarm clearall - clears all alarms
alarm config load - displays the email addresses to which alarm notifications are to be sent
alarm config save - saves changes to the email addresses to which alarm notifications are to be sent
alarm list - displays alarms on the cluster
alarm names - displays all alarm names
alarm raise - raises a specified alarm

Alarm Notification Fields

The following fields specify the configuration of alarm notifications.

Field Description

alarm The named alarm.

individual Specifies whether individual alarm notifications are sent to the default email address for the alarm type.

0 - do not send notifications to the default email address for the alarm type
1 - send notifications to the default email address for the alarm type

email A custom email address for notifications about this alarm type. If specified, alarm notifications are sent to this email address, regardless of whether they are sent to the default email address.

Alarm Types

See .Troubleshooting Alarms

Alarm History

To see a history of alarms that have been raised, look at the file /opt/mapr/logs/cldb.log on the master CLDB node. Example:

grep ALARM /opt/mapr/logs/cldb.log


alarm clear

Clears one or more alarms. Permissions required: fc or a

Syntax

CLI
maprcli alarm clear
    -alarm <alarm>
    [ -cluster <cluster> ]
    [ -entity <host, volume, user, or group name> ]

REST
http[s]://<host>:<port>/rest/alarm/clear?<parameters>

Parameters

Parameter Description

alarm The named alarm to clear. See Alarm Types.

cluster The cluster on which to run the command.

entity The entity on which to clear the alarm.

Examples

Clear a specific alarm:

CLI
maprcli alarm clear -alarm NODE_ALARM_DEBUG_LOGGING

REST
https://r1n1.sj.us:8443/rest/alarm/clear?alarm=NODE_ALARM_DEBUG_LOGGING


alarm clearall

Clears all alarms. Permissions required: fc or a

Syntax

CLI
maprcli alarm clearall
    [ -cluster <cluster> ]

REST
http[s]://<host>:<port>/rest/alarm/clearall?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

Examples

Clear all alarms:

CLI
maprcli alarm clearall

REST
https://r1n1.sj.us:8443/rest/alarm/clearall


alarm config load

Displays the configuration of alarm notifications. Permissions required: fc or a

Syntax

CLI
maprcli alarm config load
    [ -cluster <cluster> ]
    [ -output terse|verbose ]

REST
http[s]://<host>:<port>/rest/alarm/config/load

Parameters

Parameter Description

cluster The cluster on which to run the command.

output Whether the output should be terse or verbose.

Output

A list of configuration values for alarm notifications.

Output Fields

See Alarm Notification Fields.

Sample output


alarm individual email
CLUSTER_ALARM_BLACKLIST_TTS 1
CLUSTER_ALARM_UPGRADE_IN_PROGRESS 1
CLUSTER_ALARM_UNASSIGNED_VIRTUAL_IPS 1
VOLUME_ALARM_SNAPSHOT_FAILURE 1
VOLUME_ALARM_MIRROR_FAILURE 1
VOLUME_ALARM_DATA_UNDER_REPLICATED 1
VOLUME_ALARM_DATA_UNAVAILABLE 1
VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED 1
VOLUME_ALARM_QUOTA_EXCEEDED 1
NODE_ALARM_CORE_PRESENT 1
NODE_ALARM_DEBUG_LOGGING 1
NODE_ALARM_DISK_FAILURE 1
NODE_ALARM_OPT_MAPR_FULL 1
NODE_ALARM_VERSION_MISMATCH 1
NODE_ALARM_TIME_SKEW 1
NODE_ALARM_SERVICE_CLDB_DOWN 1
NODE_ALARM_SERVICE_FILESERVER_DOWN 1
NODE_ALARM_SERVICE_JT_DOWN 1
NODE_ALARM_SERVICE_TT_DOWN 1
NODE_ALARM_SERVICE_HBMASTER_DOWN 1
NODE_ALARM_SERVICE_HBREGION_DOWN 1
NODE_ALARM_SERVICE_NFS_DOWN 1
NODE_ALARM_SERVICE_WEBSERVER_DOWN 1
NODE_ALARM_SERVICE_HOSTSTATS_DOWN 1
NODE_ALARM_ROOT_PARTITION_FULL 1
AE_ALARM_AEADVISORY_QUOTA_EXCEEDED 1
AE_ALARM_AEQUOTA_EXCEEDED 1

Examples

Display the alarm notification configuration:

CLI
maprcli alarm config load

REST
https://r1n1.sj.us:8443/rest/alarm/config/load


alarm config save

Sets notification preferences for alarms. Permissions required: fc or a

Alarm notifications can be sent to the default email address and a specific email address for each named alarm. If individual is set to 1 for a specific alarm, then notifications for that alarm are sent to the default email address for the alarm type. If a custom email address is provided, notifications are sent there regardless of whether they are also sent to the default email address.

Syntax

CLI
maprcli alarm config save
    [ -cluster <cluster> ]
    -values <values>

REST
http[s]://<host>:<port>/rest/alarm/config/save?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

values A comma-separated list of configuration values for one or more alarms, in the following format:

<alarm>,<individual>,<email>

See Alarm Notification Fields.

Examples

Send alert emails for the AE_ALARM_AEQUOTA_EXCEEDED alarm to the default email address and a custom email address:

CLI
maprcli alarm config save -values "AE_ALARM_AEQUOTA_EXCEEDED,1,[email protected]"

REST
https://r1n1.sj.us:8443/rest/alarm/config/save?values=AE_ALARM_AEQUOTA_EXCEEDED,1,[email protected]


alarm list

Lists alarms in the system. Permissions required: fc or a

You can list all alarms, alarms by type (Cluster, Node or Volume), or alarms on a particular node or volume. To retrieve a count of all alarm types, pass 1 in the summary parameter. You can specify the alarms to return by filtering on type and entity. Use start and limit to retrieve only a specified window of data.

Syntax

CLI
maprcli alarm list
    [ -alarm <alarm ID> ]
    [ -cluster <cluster> ]
    [ -entity <host or volume> ]
    [ -limit <limit> ]
    [ -output (terse|verbose) ]
    [ -start <offset> ]
    [ -summary (0|1) ]
    [ -type <alarm type> ]

REST
http[s]://<host>:<port>/rest/alarm/list?<parameters>

Parameters

Parameter Description

alarm The alarm type to return. See Alarm Types.

cluster The cluster on which to list alarms.

entity The name of the cluster, node, volume, user, or group to check for alarms.

limit The number of records to retrieve. Default: 2147483647

output Whether the output should be terse or verbose.

start The list offset at which to start.

summary Specifies the type of data to return:

1 = count by alarm type
0 = List of alarms

type The entity type:

cluster
node
volume
ae

Output

Information about one or more named alarms on the cluster, or for a specified node, volume, user, or group.

Output Fields


Field Description

alarm state State of the alarm:

0 = Clear
1 = Raised

description A description of the condition that raised the alarm

entity The name of the volume, node, user, or group.

alarm name The name of the alarm.

alarm statechange time The date and time the alarm was most recently raised.

Sample Output

alarm state description entity alarm name alarm statechange time
1 Volume desired replication is 1, current replication is 0 mapr.qa-node173.qa.prv.local.logs VOLUME_ALARM_DATA_UNDER_REPLICATED 1296707707872
1 Volume data unavailable mapr.qa-node173.qa.prv.local.logs VOLUME_ALARM_DATA_UNAVAILABLE 1296707707871
1 Volume desired replication is 1, current replication is 0 mapr.qa-node235.qa.prv.local.mapred VOLUME_ALARM_DATA_UNDER_REPLICATED 1296708283355
1 Volume data unavailable mapr.qa-node235.qa.prv.local.mapred VOLUME_ALARM_DATA_UNAVAILABLE 1296708283099
1 Volume desired replication is 1, current replication is 0 mapr.qa-node175.qa.prv.local.logs VOLUME_ALARM_DATA_UNDER_REPLICATED 1296706343256

Examples

List a summary of all alarms

CLI
maprcli alarm list -summary 1

REST
https://r1n1.sj.us:8443/rest/alarm/list?summary=1

List cluster alarms

CLI
maprcli alarm list -type cluster

REST
https://r1n1.sj.us:8443/rest/alarm/list?type=cluster


alarm names

Displays a list of alarm names. Permissions required: fc or a.

Syntax

CLI
maprcli alarm names

REST
http[s]://<host>:<port>/rest/alarm/names

Examples

Display all alarm names:

CLI
maprcli alarm names

REST
https://r1n1.sj.us:8443/rest/alarm/names


alarm raise

Raises a specified alarm or alarms. Permissions required: fc or a.

Syntax

CLI
maprcli alarm raise
    -alarm <alarm>
    [ -cluster <cluster> ]
    [ -description <description> ]
    [ -entity <cluster, entity, host, node, or volume> ]

REST
http[s]://<host>:<port>/rest/alarm/raise?<parameters>

Parameters

Parameter Description

alarm The alarm type to raise. See Alarm Types.

cluster The cluster on which to run the command.

description A brief description.

entity The entity on which to raise alarms.

Examples

Raise a specific alarm:

CLI
maprcli alarm raise -alarm NODE_ALARM_DEBUG_LOGGING

REST
https://r1n1.sj.us:8443/rest/alarm/raise?alarm=NODE_ALARM_DEBUG_LOGGING


config

The config commands let you work with configuration values for the MapR cluster:

config load displays the values
config save makes changes to the stored values

Configuration Fields

Field Default Value Description

cldb.balancer.disk.max.switches.in.nodes.percentage 10

cldb.balancer.disk.paused 0

cldb.balancer.disk.sleep.interval.sec 2 * 60

cldb.balancer.disk.threshold.percentage 70

cldb.balancer.logging 0

cldb.balancer.role.max.switches.in.nodes.percentage 10

cldb.balancer.role.paused 0

cldb.balancer.role.sleep.interval.sec 15 * 60

cldb.balancer.startup.interval.sec 30 * 60

cldb.cluster.almost.full.percentage 90 The percentage at which the CLUSTER_ALARM_CLUSTER_ALMOST_FULL alarm is triggered.

cldb.container.alloc.selector.algo 0

cldb.container.assign.buffer.sizemb 1 * 1024

cldb.container.create.diskfull.threshold 80

cldb.container.sizemb 16 * 1024

cldb.default.chunk.sizemb 256

cldb.default.volume.topology The default topology for new volumes.

cldb.dialhome.metrics.rotation.period 365

cldb.fileserver.activityreport.interval.hb.multiplier 3

cldb.fileserver.containerreport.interval.hb.multiplier 1800

cldb.fileserver.heartbeat.interval.sec 1

cldb.force.master.for.container.minutes 1

cldb.fs.mark.inactive.sec 5 * 60

cldb.fs.mark.rereplicate.sec 60 * 60 The number of seconds a node can fail to heartbeat before it is considered dead. Once a node is considered dead, the CLDB re-replicates any data contained on the node.

cldb.fs.workallocator.num.volume.workunits 20

cldb.fs.workallocator.num.workunits 80

cldb.ganglia.cldb.metrics 0

cldb.ganglia.fileserver.metrics 0

cldb.heartbeat.monitor.sleep.interval.sec 60

cldb.log.fileserver.timeskew.interval.mins 60

cldb.max.parallel.resyncs.star 2


cldb.min.containerid 1

cldb.min.fileservers 1 The minimum number of fileservers.

cldb.min.snap.containerid 1

cldb.min.snapid 1

cldb.replication.manager.start.mins 15 The delay between CLDB startup and replication manager startup, to allow all nodes to register and heartbeat.

cldb.replication.process.num.containers 60

cldb.replication.sleep.interval.sec 15

cldb.replication.tablescan.interval.sec 2 * 60

cldb.restart.wait.time.sec 180

cldb.snapshots.inprogress.cleanup.minutes 30

cldb.topology.almost.full.percentage 90

cldb.volume.default.replication The default replication for the CLDB volumes.

cldb.volume.epoch

cldb.volumes.default.min.replication 2

cldb.volumes.default.replication 3

mapr.domainname The domain name MapR uses to get operating system users and groups (in domain mode).

mapr.entityquerysource Sets MapR to get user information from LDAP (LDAP mode) or from the operating system of a domain (domain mode):

ldap
domain

mapr.eula.user

mapr.eula.time

mapr.fs.nocompression "bz2,gz,tgz,tbz2,zip,z,Z,mp3,jpg,jpeg,mpg,mpeg,avi,gif,png"

The file types that should not be compressed. See Not Compressed Extensions.

mapr.fs.permissions.supergroup The super group of the MapR-FS layer.

mapr.fs.permissions.superuser The super user of the MapR-FS layer.

mapr.ldap.attribute.group The LDAP server group attribute.

mapr.ldap.attribute.groupmembers The LDAP server groupmembers attribute.

mapr.ldap.attribute.mail The LDAP server mail attribute.

mapr.ldap.attribute.uid The LDAP server uid attribute.

mapr.ldap.basedn The LDAP server Base DN.

mapr.ldap.binddn The LDAP server Bind DN.

mapr.ldap.port The port MapR is to use on the LDAP server.

mapr.ldap.server The LDAP server MapR uses to get users and groups (in LDAPmode).

mapr.ldap.sslrequired Specifies whether the LDAP server requires SSL:

0 == no
1 == yes


mapr.license.exipry.notificationdays 30

mapr.quota.group.advisorydefault The default group advisory quota; see . Managing Quotas

mapr.quota.group.default The default group quota; see . Managing Quotas

mapr.quota.user.advisorydefault The default user advisory quota; see Managing Quotas.

mapr.quota.user.default The default user quota; see Managing Quotas.

mapr.smtp.port The port MapR uses on the SMTP server (1-65535).

mapr.smtp.sender.email The reply-to email address MapR uses when sending notifications.

mapr.smtp.sender.fullname The full name MapR uses in the Sender field when sending notifications.

mapr.smtp.sender.password The password MapR uses to log in to the SMTP server when sending notifications.

mapr.smtp.sender.username The username MapR uses to log in to the SMTP server when sending notifications.

mapr.smtp.server The SMTP server that MapR uses to send notifications.

mapr.smtp.sslrequired Specifies whether SSL is required when sending email:

0 == no
1 == yes

mapr.targetversion

mapr.webui.http.port The port MapR uses for the MapR Control System over HTTP (0-65535); if 0 is specified, disables HTTP access.

mapr.webui.https.certpath The HTTPS certificate path.

mapr.webui.https.keypath The HTTPS key path.

mapr.webui.https.port The port MapR uses for the MapR Control System over HTTPS (0-65535); if 0 is specified, disables HTTPS access.

mapr.webui.timeout The number of seconds the MapR Control System allows to elapse before timing out.

mapreduce.cluster.permissions.supergroup The super group of the MapReduce layer.

mapreduce.cluster.permissions.superuser The super user of the MapReduce layer.


config load

Displays information about the cluster configuration. You can use the keys parameter to specify which information to display.

Syntax

CLI
maprcli config load
    [ -cluster <cluster> ]
    -keys <keys>

REST
http[s]://<host>:<port>/rest/config/load?<parameters>

Parameters

Parameter Description

cluster The cluster for which to display values.

keys The fields for which to display values; see the Configuration Fields table.

Output

Information about the cluster configuration. See the Configuration Fields table.

Sample Output

: ,"status" "OK" :1,"total" :["data" : ,"mapr.webui.http.port" "8080" : ,"mapr.fs.permissions.superuser" "root" : ,"mapr.smtp.port" "25" :"mapr.fs.permissions.supergroup" "supergroup" ]

Examples

Display several keys:

CLI
maprcli config load -keys mapr.webui.http.port,mapr.webui.https.port,mapr.webui.https.keystorepath,mapr.webui.https.keystorepassword,mapr.webui.https.keypassword,mapr.webui.timeout


REST: https://r1n1.sj.us:8443/rest/config/load?keys=mapr.webui.http.port,mapr.webui.https.port,mapr.webui.https.keystorepath,mapr.webui.https.keystorepassword,mapr.webui.https.keypassword,mapr.webui.timeout
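The REST endpoint is convenient for scripting. The following sketch (Python, using the third-party requests library) fetches two fields; the host name and credentials are hypothetical placeholders, not values from this documentation:

# Sketch only: host, port, and credentials are placeholders.
import requests

resp = requests.get(
    "https://r1n1.sj.us:8443/rest/config/load",
    params={"keys": "mapr.webui.http.port,mapr.webui.timeout"},
    auth=("mapr", "mapr"),   # an account with login permission
    verify=False,            # MapR webservers commonly use self-signed certificates
)
fields = resp.json()["data"][0]
print(fields["mapr.webui.http.port"])   # e.g. "8080"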


config save

Saves configuration information, specified as key/value pairs. Permissions required: fc or a.

See the Configuration Fields table.

Syntax

CLI: maprcli config save [ -cluster <cluster> ] -values <values>

REST: http[s]://<host>:<port>/rest/config/save?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

values A JSON object containing configuration fields; see the Configuration Fields table.

Examples

Configure MapR SMTP settings:

CLI: maprcli config save -values '{"mapr.smtp.provider":"gmail","mapr.smtp.server":"smtp.gmail.com","mapr.smtp.sslrequired":"true","mapr.smtp.port":"465","mapr.smtp.sender.fullname":"Ab Cd","mapr.smtp.sender.email":"[email protected]","mapr.smtp.sender.username":"[email protected]","mapr.smtp.sender.password":"abc"}'

REST: https://r1n1.sj.us:8443/rest/config/save?values={"mapr.smtp.provider":"gmail","mapr.smtp.server":"smtp.gmail.com","mapr.smtp.sslrequired":"true","mapr.smtp.port":"465","mapr.smtp.sender.fullname":"Ab Cd","mapr.smtp.sender.email":"[email protected]","mapr.smtp.sender.username":"[email protected]","mapr.smtp.sender.password":"abc"}
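Because the -values argument is a JSON object, it is often easier to build it programmatically. A minimal sketch, again with placeholder host and credentials:

# Sketch only: builds the values object in code and submits it over REST.
import json
import requests

values = {
    "mapr.smtp.server": "smtp.gmail.com",
    "mapr.smtp.port": "465",
    "mapr.smtp.sslrequired": "true",
}
resp = requests.get(
    "https://r1n1.sj.us:8443/rest/config/save",   # hypothetical host
    params={"values": json.dumps(values)},
    auth=("mapr", "mapr"),                        # requires fc or a permission
    verify=False,
)
print(resp.json()["status"])                      # "OK" on success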


dashboard

The dashboard info command displays a summary of information about the cluster.


dashboard info

Displays a summary of information about the cluster. For best results, use the -json option when running dashboard info from the command line.

Syntax

CLI: maprcli dashboard info [ -cluster <cluster> ] [ -multi_cluster_info true|false. default: false ] [ -version true|false. default: false ] [ -zkconnect <ZooKeeper connect string> ]

REST: http[s]://<host>:<port>/rest/dashboard/info?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

multi_cluster_info Specifies whether to display cluster information from multiple clusters.

version Specifies whether to display the version.

zkconnect ZooKeeper Connect String

Output

A summary of information about the services, volumes, mapreduce jobs, health, and utilization of the cluster.

Output Fields

Field Description

Timestamp The time at which the dashboard info data was retrieved, expressed as a Unix epoch time.

Status The success status of the dashboard info command.

Total The number of clusters for which data was queried in the dashboard info command.

Version The MapR software version running on the cluster.

Cluster The following information about the cluster:

name — the cluster name
ip — the IP address of the active CLDB
id — the cluster ID


services The number of active, stopped, failed, and total installed services on the cluster:

CLDB
File server
Job tracker
Task tracker
HB master
HB region server

volumes The number and size (in GB) of volumes that are:

Mounted
Unmounted

mapreduce The following mapreduce information:

Queue time
Running jobs
Queued jobs
Running tasks
Blacklisted jobs

maintenance The following information about system health:

Failed disk nodes
Cluster alarms
Node alarms
Versions

utilization The following utilization information:

CPU
Memory
Disk space
Compression

Sample Output

# maprcli dashboard info -json
{
    "timestamp":1336760972531,
    "status":"OK",
    "total":1,
    "data":[
        {
            "version":"2.0.0",
            "cluster":{
                "name":"mega-cluster",
                "ip":"192.168.50.50",
                "id":"7140172612740778586"
            },
            "volumes":{
                "mounted":{
                    "total":76,
                    "size":88885376
                },
                "unmounted":{
                    "total":1,
                    "size":6
                }
            },
            "utilization":{
                "cpu":{
                    "util":14,
                    "total":528,
                    "active":75
                },
                "memory":{
                    "total":2128177,
                    "active":896194
                },
                "disk_space":{
                    "total":707537,
                    "active":226848
                },
                "compression":{
                    "compressed":86802,
                    "uncompressed":116655
                }
            },
            "services":{
                "fileserver":{
                    "active":22,
                    "stopped":0,
                    "failed":0,
                    "total":22
                },
                "nfs":{
                    "active":1,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "webserver":{
                    "active":1,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "cldb":{
                    "active":1,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "tasktracker":{
                    "active":21,
                    "stopped":0,
                    "failed":0,
                    "total":21
                },
                "jobtracker":{
                    "active":1,
                    "standby":0,
                    "stopped":0,
                    "failed":0,
                    "total":1
                },
                "hoststats":{
                    "active":22,
                    "stopped":0,
                    "failed":0,
                    "total":22
                }
            },
            "mapreduce":{
                "running_jobs":1,
                "queued_jobs":0,
                "running_tasks":537,
                "blacklisted":0
            }
        }
    ]
}

Examples

Display dashboard information:

CLI: maprcli dashboard info -json

REST: https://r1n1.sj.us:8443/rest/dashboard/info
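The JSON structure shown above lends itself to monitoring scripts. A sketch that reports CPU utilization, with placeholder host and credentials:

# Sketch only: poll dashboard info and report cluster CPU utilization.
import requests

resp = requests.get("https://r1n1.sj.us:8443/rest/dashboard/info",
                    auth=("mapr", "mapr"), verify=False)
cluster = resp.json()["data"][0]
cpu = cluster["utilization"]["cpu"]
print("CPU: %d%% (%d of %d cores active)" % (cpu["util"], cpu["active"], cpu["total"]))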


dialhome

The dialhome commands let you change the Dial Home status of your cluster:

dialhome ackdial - acknowledges a successful Dial Home transmission.
dialhome enable - enables or disables Dial Home.
dialhome lastdialed - displays the last Dial Home transmission.
dialhome metrics - displays the metrics collected by Dial Home.
dialhome status - displays the current Dial Home status.


dialhome ackdial

Acknowledges the most recent Dial Home on the cluster. Permissions required: fc or a.

Syntax

CLI: maprcli dialhome ackdial [ -forDay <date> ]

REST: http[s]://<host>:<port>/rest/dialhome/ackdial[?parameters]

Parameters

Parameter Description

forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in MM/DD/YY format. Default: yesterday

Examples

Acknowledge Dial Home:

CLI: maprcli dialhome ackdial

REST: https://r1n1.sj.us:8443/rest/dialhome/ackdial


dialhome enable

Enables or disables Dial Home on the cluster. Permissions required: fc or a.

Syntax

CLI: maprcli dialhome enable -enable 0|1

REST: http[s]://<host>:<port>/rest/dialhome/enable

Parameters

Parameter Description

enable Specifies whether to enable or disable Dial Home:

0 - Disable
1 - Enable

Output

A success or failure message.

Sample output

pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome enable -enable 1
Successfully enabled dialhome

pconrad@s1-r1-sanjose-ca-us:~$ maprcli dialhome status
Dial home status is: enabled

Examples

Enable Dial Home:

CLI: maprcli dialhome enable -enable 1

REST: https://r1n1.sj.us:8443/rest/dialhome/enable?enable=1


dialhome lastdialed

Displays the date of the last successful Dial Home call. Permissions required: fc or a.

Syntax

CLI: maprcli dialhome lastdialed

REST: http[s]://<host>:<port>/rest/dialhome/lastdialed

Output

The date of the last successful Dial Home call.

Sample output

$ maprcli dialhome lastdialed
date
1322438400000

Examples

Show the date of the most recent Dial Home:

CLI: maprcli dialhome lastdialed

REST: https://r1n1.sj.us:8443/rest/dialhome/lastdialed
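The value returned is a Unix epoch time in milliseconds; divide by 1000 before converting. For example, converting the sample value above in Python:

# The sample value above, converted to a UTC date.
import datetime
print(datetime.datetime.utcfromtimestamp(1322438400000 / 1000))
# 2011-11-28 00:00:00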


dialhome metrics

Returns a compressed metrics object. Permissions required: fc or a.

Syntax

CLI: maprcli dialhome metrics [ -forDay <date> ]

REST: http[s]://<host>:<port>/rest/dialhome/metrics

Parameters

Parameter Description

forDay Date for which the recorded metrics were successfully dialed home. Accepted values: UTC timestamp or a UTC date in MM/DD/YY format. Default: yesterday

Output

Sample output

$ maprcli dialhome metrics
metrics
[B@48067064

Examples

Show the Dial Home metrics:

CLI: maprcli dialhome metrics

REST: https://r1n1.sj.us:8443/rest/dialhome/metrics


dialhome status

Displays the Dial Home status. Permissions required: fc or a.

Syntax

CLI: maprcli dialhome status

REST: http[s]://<host>:<port>/rest/dialhome/status

Output

The current Dial Home status.

Sample output

$ maprcli dialhome status
enabled
1

Examples

Display the Dial Home status:

CLI: maprcli dialhome status

REST: https://r1n1.sj.us:8443/rest/dialhome/status


disk

The disk commands let you work with disks:

disk add adds a disk to a node
disk list lists disks
disk listall lists all disks
disk remove removes a disk from a node

Disk Fields

The following table shows the fields displayed in the output of the disk list and disk listall commands. You can choose which fields (columns) to display and sort in ascending or descending order by any single field.

Field Description

hn Hostname of the node that owns this disk/partition.

n Name of the disk or partition.

st Disk status:

0 = Good
1 = Bad disk

pst Disk power status:

0 = Active/idle (normal operation)
1 = Standby (low power mode)
2 = Sleeping (lowest power mode, drive is completely shut down)

mt Disk mount status:

0 = unmounted
1 = mounted

fs File system type

mn Model number

sn Serial number

fw Firmware version

ven Vendor name

dst Total disk space, in MB

dsu Disk space used, in MB

dsa Disk space available, in MB

err Disk error message, in English. Note that this will not be translated. Only sent if st == 1.

ft Disk failure time, MapR disks only. Only sent if st == 1.
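These fields make it straightforward to scan a cluster for failed disks. A sketch using the disk listall REST endpoint (placeholder host and credentials); it assumes st is returned as a number, as in the table above:

# Sketch only: report disks whose status field (st) indicates failure.
import requests

resp = requests.get("https://r1n1.sj.us:8443/rest/disk/listall",
                    params={"columns": "hn,n,st"},
                    auth=("mapr", "mapr"), verify=False)
for disk in resp.json().get("data", []):
    if disk.get("st") == 1:                      # 1 = bad disk
        print("failed disk %s on %s" % (disk.get("n"), disk.get("hn")))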



disk add

Adds one or more disks to the specified node. Permissions required: fc or a.

If you are running MapR 1.2.2 or earlier, do not use the disk add command or the MapR Control System to add disks to MapR-FS. You must either upgrade to MapR 1.2.3 before adding or replacing a disk, or use the following procedure (which avoids the disk add command):

1. Use the MapR Control System to remove the failed disk. All other disks in the same storage pool are removed at the same time. Make a note of which disks have been removed.
2. Create a text file /tmp/disks.txt containing a list of the disks you just removed. See Setting Up Disks for MapR.
3. Add the disks to MapR-FS by typing the following command (as root or with sudo):
   /opt/mapr/server/disksetup -F /tmp/disks.txt

Syntax

CLI: maprcli disk add [ -cluster <cluster> ] -disks <disk names> -host <host>

REST: http[s]://<host>:<port>/rest/disk/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to add disks.

disks A comma-separated list of disk names. Examples:

["disk"]
["disk","disk","disk"...]

host The hostname or IP address of the machine on which to add the disk.

Output

Output Fields

Field Description

ip The IP address of the machine that owns the disk(s).

disk The name of a disk or partition. Example: "sca" or "sca/sca1".

all The string all, meaning all unmounted disks for this node.


Examples

Add a disk:

CLI: maprcli disk add -disks ["/dev/sda1"] -host 10.250.1.79

REST: https://r1n1.sj.us:8443/rest/disk/add?disks=["/dev/sda1"]


disk list

The maprcli disk list command lists the disks on a node.

Syntax

CLI: maprcli disk list -host <host> [ -output terse|verbose ] [ -system 1|0 ]

REST: http[s]://<host>:<port>/rest/disk/list?<parameters>

Parameters

Parameter Description

host The node on which to list the disks.

output Whether the output should be terse or verbose.

system Show only operating system disks:

0 - shows only MapR-FS disks
1 - shows only operating system disks
Not specified - shows both MapR-FS and operating system disks

Output

Information about the specified disks. See the Disk Fields table.

Examples

List disks on a host:

CLI: maprcli disk list -host 10.10.100.22

REST: https://r1n1.sj.us:8443/rest/disk/list?host=10.10.100.22


disk listall

Lists all disks.

Syntax

CLI: maprcli disk listall [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <limit> ] [ -output terse|verbose ] [ -start <offset> ]

REST: http[s]://<host>:<port>/rest/disk/listall?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Disk Fields table.

filter A filter specifying which disks to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Always the string terse.

start The offset from the starting row according to sort. Default: 0

Output

Information about all disks. See the Disk Fields table.

Examples

List all disks:

CLI: maprcli disk listall

REST: https://r1n1.sj.us:8443/rest/disk/listall


disk remove

Removes a disk from MapR-FS. Permissions required: fc or a.

The disk remove command does not remove a disk containing unreplicated data unless forced. To force disk removal, specify -force with the value 1.

Only use the -force 1 option if you are sure that you do not need the data on the disk. This option removes the disk without regard to replication factor or other data protection mechanisms, and may result in permanent data loss.

Syntax

CLI: maprcli disk remove [ -cluster <cluster> ] -disks <disk names> [ -force 0|1 ] -host <host>

REST: http[s]://<host>:<port>/rest/disk/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

disks A list of disks in the form:

["disk"]or["disk","disk","disk"...]or[]

force Whether to force removal:

0 (default) - do not remove the disk or disks if there is unreplicated data on the disk
1 - remove the disk or disks regardless of data loss or other consequences

host The hostname or IP address of the node from which to remove the disk.

Output

Output Fields

Field Description

disk The name of a disk or partition. Example: sca or sca/sca1

all The string all, meaning all unmounted disks attached to the node.

disks A comma-separated list of disks that have unreplicated volumes. Example: "sca" or "sca/sca1,scb"


Examples

Remove a disk:

CLI: maprcli disk remove -disks ["sda1"]

REST: https://r1n1.sj.us:8443/rest/disk/remove?disks=["sda1"]


entity

The entity commands let you work with entities (users and groups):

entity info shows information about a specified user or group
entity list lists users and groups in the cluster
entity modify edits information about a specified user or group


entity info

Displays information about an entity.

Syntax

CLI: maprcli entity info [ -cluster <cluster> ] -name <entity name> [ -output terse|verbose ] -type <type>

REST: http[s]://<host>:<port>/rest/entity/info?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The entity name.

output Whether to display terse or verbose output.

type The entity type

Output

DiskUsage  EntityQuota  EntityType  EntityName  VolumeCount  EntityAdvisoryquota  EntityId
864415     0            0           root        208          0                    0

Output Fields

Field Description

DiskUsage Disk space used by the user or group

EntityQuota The user or group quota

EntityType The entity type

EntityName The entity name

VolumeCount The number of volumes associated with the user or group

EntityAdvisoryquota The user or group advisory quota

EntityId The ID of the user or group


Examples

Display information for the user 'root':

CLI: maprcli entity info -type 0 -name root

REST: https://r1n1.sj.us:8443/rest/entity/info?type=0&name=root


entity list

Lists users and groups in the cluster.

Syntax

CLI: maprcli entity list [ -alarmedentities true|false ] [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <rows> ] [ -output terse|verbose ] [ -start <start> ]

REST: http[s]://<host>:<port>/rest/entity/list?<parameters>

Parameters

Parameter Description

alarmedentities Specifies whether to list only entities that have exceeded a quota or advisory quota.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying entities to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Specifies whether output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

Output

Information about the users and groups.

Fields

Field Description

EntityType Entity type

0 = User
1 = Group

EntityName User or Group name

EntityId User or Group id

EntityQuota Quota, in MB. 0 = no quota.

EntityAdvisoryquota Advisory quota, in MB. 0 = no advisory quota.

VolumeCount The number of volumes this entity owns.


DiskUsage Disk space used by all of the entity's volumes, in MB.

Sample Output

DiskUsage  EntityQuota  EntityType  EntityName  VolumeCount  EntityAdvisoryquota  EntityId
5859220    0            0           root        209          0                    0

Examples

List all entities:

CLI: maprcli entity list

REST: https://r1n1.sj.us:8443/rest/entity/list
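Since quotas and disk usage are both reported in MB, over-quota entities can be found with a simple comparison. A sketch with placeholder host and credentials:

# Sketch only: flag entities whose disk usage exceeds their quota.
import requests

resp = requests.get("https://r1n1.sj.us:8443/rest/entity/list",
                    auth=("mapr", "mapr"), verify=False)
for e in resp.json().get("data", []):
    quota = e.get("EntityQuota", 0)              # 0 means no quota
    if quota and e.get("DiskUsage", 0) > quota:
        print("%s: %s MB used, %s MB allowed"
              % (e["EntityName"], e["DiskUsage"], quota))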


entity modify

Modifies a user or group quota or email address. Permissions required: fc or a.

Syntax

CLI: maprcli entity modify [ -advisoryquota <advisory quota> ] [ -cluster <cluster> ] [ -email <email> ] [ -entities <entities> ] -name <entity name> [ -quota <quota> ] -type <type>

REST: http[s]://<host>:<port>/rest/entity/modify?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota.

cluster The cluster on which to run the command.

email Email address.

entities A comma-separated list of entities, in the format <type>:<name>. Example: 0:<user1>,0:<user2>,1:<group1>,1:<group2>...

name The entity name.

quota The quota for the entity.

type The entity type:

0 = user
1 = group

Examples

Modify the email address for the user 'root':

CLI: maprcli entity modify -name root -type 0 -email [email protected]

REST: https://r1n1.sj.us:8443/rest/entity/modify?name=root&type=0&[email protected]


license

The license commands let you work with MapR licenses:

license add - adds a license
license addcrl - adds a certificate revocation list (CRL)
license apps - displays the features included in the current license
license list - lists licenses on the cluster
license listcrl - lists CRLs
license remove - removes a license
license showid - displays the cluster ID


license add

Adds a license. Permissions required: fc or a.

The license can be specified either by passing the license string itself to license add, or by specifying a file containing the license string.

Syntax

CLI: maprcli license add [ -cluster <cluster> ] [ -is_file true|false ] -license <license>

REST: http[s]://<host>:<port>/rest/license/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

is_file Specifies whether the license parameter specifies a file. If false, the license parameter contains a long license string.

license The license to add to the cluster. If -is_file is true, license specifies the filename of a license file. Otherwise, license contains the license string itself.

Examples

Adding a License from a File

Assuming a file /tmp/license.txt containing a license string, the following command adds the license to the cluster:

CLI: maprcli license add -is_file true -license /tmp/license.txt


license addcrl

Adds a certificate revocation list (CRL). Permissions required: fc or a.

Syntax

CLI: maprcli license addcrl [ -cluster <cluster> ] -crl <crl> [ -is_file true|false ]

REST: http[s]://<host>:<port>/rest/license/addcrl?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

crl The CRL to add to the cluster. If is_file is set, crl specifies the filename of a CRL file. Otherwise, crl contains the CRL string itself.

is_file Specifies whether the CRL is contained in a file.


license apps

Displays the features authorized for the current license. Permissions required: login

Syntax

CLI: maprcli license apps [ -cluster <cluster> ]

REST: http[s]://<host>:<port>/rest/license/apps?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


license list

Lists licenses on the cluster. Permissions required: login

Syntax

CLI: maprcli license list [ -cluster <cluster> ]

REST: http[s]://<host>:<port>/rest/license/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


license listcrl

Lists certificate revocation lists (CRLs) on the cluster. Permissions required: login

Syntax

CLI: maprcli license listcrl [ -cluster <cluster> ]

REST: http[s]://<host>:<port>/rest/license/listcrl?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


license remove

Removes a license. Permissions required: fc or a.

Syntax

CLI: maprcli license remove [ -cluster <cluster> ] -license_id <license>

REST: http[s]://<host>:<port>/rest/license/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

license_id The license to remove.


license showid

Displays the cluster ID for use when creating a new license. Permissions required: login

Syntax

CLI: maprcli license showid [ -cluster <cluster> ]

REST: http[s]://<host>:<port>/rest/license/showid?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.


nagios

The nagios generate command generates a topology script for Nagios.


nagios generate

Generates a Nagios Object Definition file that describes the cluster nodes and the services running on each.

Syntax

CLI: maprcli nagios generate [ -cluster <cluster> ]

REST: http[s]://<host>:<port>/rest/nagios/generate?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

Output

Sample Output


############# Commands #############

define command {
    command_name check_fileserver_proc
    command_line $USER1$/check_tcp -p 5660
}

define command {
    command_name check_cldb_proc
    command_line $USER1$/check_tcp -p 7222
}

define command {
    command_name check_jobtracker_proc
    command_line $USER1$/check_tcp -p 50030
}

define command {
    command_name check_tasktracker_proc
    command_line $USER1$/check_tcp -p 50060
}

define command {
    command_name check_nfs_proc
    command_line $USER1$/check_tcp -p 2049
}

define command {
    command_name check_hbmaster_proc
    command_line $USER1$/check_tcp -p 60000
}

define command {
    command_name check_hbregionserver_proc
    command_line $USER1$/check_tcp -p 60020
}

define command {
    command_name check_webserver_proc
    command_line $USER1$/check_tcp -p 8443
}

################# HOST: host1 ###############

define host {
    use linux-server
    host_name host1
    address 192.168.1.1
    check_command check-host-alive
}

################# HOST: host2 ###############

define host {
    use linux-server
    host_name host2
    address 192.168.1.2
    check_command check-host-alive
}

Examples

Generate a nagios configuration, specifying the cluster name:


CLI: maprcli nagios generate -cluster cluster-1

REST: https://host1:8443/rest/nagios/generate?cluster=cluster-1

Generate a nagios configuration and save it to the file nagios.conf:

CLI: maprcli nagios generate > nagios.conf


nfsmgmt

The nfsmgmt refreshexports command refreshes the NFS exports on the specified host and/or port.


nfsmgmt refreshexports

Refreshes the NFS exports. Permissions required: fc or a.

Syntax

CLI: maprcli nfsmgmt refreshexports [ -nfshost <host> ] [ -nfsport <port> ]

REST: http[s]://<host>:<port>/rest/nfsmgmt/refreshexports?<parameters>

Parameters

Parameter Description

nfshost The host on which to refresh NFS exports.

nfsport The port to use.


node

The node commands let you work with nodes in the cluster:

node heatmap
node list
node move
node path
node remove
node services
node topo


node heatmap

Displays a heatmap for the specified nodes.

Syntax

CLI: maprcli node heatmap [ -cluster <cluster> ] [ -filter <filter> ] [ -view <view> ]

REST: http[s]://<host>:<port>/rest/node/heatmap?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying nodes to display. See Filters for more information.

view Name of the heatmap view to show:

status = Node status:
0 = Healthy
1 = Needs attention
2 = Degraded
3 = Maintenance
4 = Critical

cpu = CPU utilization, as a percent from 0-100.
memory = Memory utilization, as a percent from 0-100.
diskspace = MapR-FS disk space utilization, as a percent from 0-100.
DISK_FAILURE = Status of the DISK_FAILURE alarm. 0 if clear, 1 if raised.
SERVICE_NOT_RUNNING = Status of the SERVICE_NOT_RUNNING alarm. 0 if clear, 1 if raised.
CONFIG_NOT_SYNCED = Status of the CONFIG_NOT_SYNCED alarm. 0 if clear, 1 if raised.

Output

Heatmap values for each node, grouped by rack topology, as shown in the sample output below.


status: ,"OK" data: [ : "rackTopology" : heatmapValue,"nodeName" : heatmapValue,"nodeName" : heatmapValue,"nodeName" ... , : "rackTopology" : heatmapValue,"nodeName" : heatmapValue,"nodeName" : heatmapValue,"nodeName" ... , ... ]

Output Fields

Field Description

rackTopology The topology for a particular rack.

nodeName The name of the node in question.

heatmapValue The value of the metric specified in the view parameter for this node, as an integer.

Examples

Display a heatmap for the default rack:

CLI: maprcli node heatmap

REST: https://r1n1.sj.us:8443/rest/node/heatmap

Display memory usage for the default rack:

CLI: maprcli node heatmap -view memory

REST: https://r1n1.sj.us:8443/rest/node/heatmap?view=memory
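A sketch that walks the structure shown above and prints one line per node (placeholder host and credentials; the exact nesting may vary by release):

# Sketch only: print the heatmap value for every node, grouped by rack.
import requests

resp = requests.get("https://r1n1.sj.us:8443/rest/node/heatmap",
                    params={"view": "status"},
                    auth=("mapr", "mapr"), verify=False)
for entry in resp.json().get("data", []):
    for topology, nodes in entry.items():
        for node, value in nodes.items():
            print("%s/%s: %s" % (topology, node, value))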


node list

Lists nodes in the cluster.

Syntax

CLI: maprcli node list [ -alarmednodes 1 ] [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <limit> ] [ -nfsnodes 1 ] [ -output terse|verbose ] [ -start <offset> ] [ -zkconnect <ZooKeeper Connect String> ]

REST: http[s]://<host>:<port>/rest/node/list?<parameters>

Parameters

Parameter Description

alarmednodes If set to 1, displays only nodes with raised alarms. Cannot be used if nfsnodes is set.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying nodes to display. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

nfsnodes If set to 1, displays only nodes running NFS. Cannot be used if alarmednodes is set.

output Specifies whether the output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

zkconnect ZooKeeper Connect String

Output

Information about the nodes. See the Fields table below.

Sample Output


Column headers:
bytesSent dreads davail TimeSkewAlarm servicesHoststatsDownAlarm ServiceHBMasterDownNotRunningAlarm ServiceNFSDownNotRunningAlarm ttmapUsed DiskFailureAlarm mused id mtotal cpus utilization rpcout ttReduceSlots ServiceFileserverDownNotRunningAlarm ServiceCLDBDownNotRunningAlarm dtotal jt-heartbeat ttReduceUsed dwriteK ServiceTTDownNotRunningAlarm ServiceJTDownNotRunningAlarm ttmapSlots dused uptime hostname health disks faileddisks fs-heartbeat rpcin ip dreadK dwrites ServiceWebserverDownNotRunningAlarm rpcs LogLevelAlarm ServiceHBRegionDownNotRunningAlarm bytesReceived service topo(rack) MapRfs disks ServiceMiscDownNotRunningAlarm VersionMismatchAlarm

Sample row (values in the same order):
8300 0 269 0 0 0 0 75 0 4058 6394230189818826805 7749 4 3 141 50 0 0 286 2 10 32 0 0 100 16 Thu Jan 15 16:58:57 PST 1970 whatsup 0 1 0 0 51 10.250.1.48 0 2 0 0 0 0 8236 /third/rack/whatsup 1 0 0

Fields

Field Description

bytesReceived Bytes received by the node since the last CLDB heartbeat.

bytesSent Bytes sent by the node since the last CLDB heartbeat.

corePresentAlarm Cores Present Alarm (NODE_ALARM_CORE_PRESENT):

0 = Clear
1 = Raised

cpus The total number of CPUs on the node.

davail Disk space available on the node.

DiskFailureAlarm Failed Disks alarm (DISK_FAILURE):

0 = Clear
1 = Raised

disks Total number of disks on the node.

dreadK Disk Kbytes read since the last heartbeat.

dreads Disk read operations since the last heartbeat.

dtotal Total disk space on the node.

dused Disk space used on the node.

dwriteK Disk Kbytes written since the last heartbeat.

dwrites Disk write ops since the last heartbeat.

faileddisks Number of failed MapR-FS disks on the node.

failedDisksAlarm Disk Failure Alarm (NODE_ALARM_DISK_FAILURE)

0 = Clear
1 = Raised

fs-heartbeat Time since the last heartbeat to the CLDB, in seconds.


health Overall node health, calculated from various alarm states:

0 = Healthy
1 = Needs attention
2 = Degraded
3 = Maintenance
4 = Critical

hostname The host name.

id The node ID.

ip A list of IP addresses associated with the node.

jt-heartbeat Time since the last heartbeat to the JobTracker, in seconds.

logLevelAlarm Excessive Logging Alarm (NODE_ALARM_DEBUG_LOGGING):

0 = Clear
1 = Raised

MapRfs disks

mtotal Total memory, in GB.

mused Memory used, in GB.

optMapRFullAlarm Installation Directory Full Alarm (NODE_ALARM_OPT_MAPR_FULL):

0 = Clear
1 = Raised

rootPartitionFullAlarm Root Partition Full Alarm (NODE_ALARM_ROOT_PARTITION_FULL):

0 = Clear
1 = Raised

rpcin RPC bytes received since the last heartbeat.

rpcout RPC bytes sent since the last heartbeat.

rpcs Number of RPCs since the last heartbeat.

service A comma-separated list of services running on the node:

cldb - CLDB
fileserver - MapR-FS
jobtracker - JobTracker
tasktracker - TaskTracker
hbmaster - HBase Master
hbregionserver - HBase RegionServer
nfs - NFS Gateway
Example: "cldb,fileserver,nfs"

ServiceCLDBDownAlarm CLDB Service Down Alarm (NODE_ALARM_SERVICE_CLDB_DOWN)

0 = Clear
1 = Raised

ServiceFileserverDownNotRunningAlarm Fileserver Service Down Alarm (NODE_ALARM_SERVICE_FILESERVER_DOWN)

0 = Clear
1 = Raised

serviceHBMasterDownAlarm HBase Master Service Down Alarm (NODE_ALARM_SERVICE_HBMASTER_DOWN)

0 = Clear
1 = Raised


serviceHBRegionDownAlarm HBase Regionserver Service Down Alarm (NODE_ALARM_SERVICE_HBREGION_DOWN)

0 = Clear
1 = Raised

servicesHoststatsDownAlarm Hoststats Service Down Alarm (NODE_ALARM_SERVICE_HOSTSTATS_DOWN)

0 = Clear
1 = Raised

serviceJTDownAlarm Jobtracker Service Down Alarm (NODE_ALARM_SERVICE_JT_DOWN)

0 = Clear
1 = Raised

ServiceMiscDownNotRunningAlarm
0 = Clear
1 = Raised

serviceNFSDownAlarm NFS Service Down Alarm (NODE_ALARM_SERVICE_NFS_DOWN):

0 = Clear
1 = Raised

serviceTTDownAlarm Tasktracker Service Down Alarm (NODE_ALARM_SERVICE_TT_DOWN):

0 = Clear
1 = Raised

servicesWebserverDownAlarm Webserver Service Down Alarm (NODE_ALARM_SERVICE_WEBSERVER_DOWN)

0 = Clear
1 = Raised

timeSkewAlarm Time Skew alarm (NODE_ALARM_TIME_SKEW):

0 = Clear
1 = Raised

topo(rack) The rack path.

ttmapSlots TaskTracker map slots.

ttmapUsed TaskTracker map slots used.

ttReduceSlots TaskTracker reduce slots.

ttReduceUsed TaskTracker reduce slots used.

uptime The number of seconds the machine has been up since the last restart.

utilization CPU use percentage since the last heartbeat.

versionMismatchAlarm Software Version Mismatch Alarm (NODE_ALARM_VERSION_MISMATCH):

0 = Clear
1 = Raised

Examples

List all nodes:


CLI: maprcli node list

REST: https://r1n1.sj.us:8443/rest/node/list
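Combining the columns parameter with the health field gives a quick cluster health report. A sketch with placeholder host and credentials:

# Sketch only: list nodes whose overall health is not 0 (healthy).
import requests

resp = requests.get("https://r1n1.sj.us:8443/rest/node/list",
                    params={"columns": "hostname,health,faileddisks"},
                    auth=("mapr", "mapr"), verify=False)
for node in resp.json().get("data", []):
    if node.get("health") != 0:
        print("%s: health=%s, failed disks=%s"
              % (node.get("hostname"), node.get("health"), node.get("faileddisks")))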


node move

Moves one or more nodes to a different topology. Permissions required: fc or a.

Syntax

CLI: maprcli node move [ -cluster <cluster> ] -serverids <server IDs> -topology <topology>

REST: http[s]://<host>:<port>/rest/node/move?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

serverids The server IDs of the nodes to move.

topology The new topology.


node path

Changes the path of the specified node or nodes. Permissions required: fc or a.

Syntax

CLI: maprcli node path [ -cluster <cluster> ] [ -filter <filter> ] [ -nodes <node names> ] -path <path> [ -which switch|rack|both ] [ -zkconnect <ZooKeeper Connect String> ]

REST: http[s]://<host>:<port>/rest/node/path?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying the nodes whose path to change. See Filters for more information.

nodes A list of node names, separated by spaces.

path The path to change.

which Which path to change: switch, rack or both. Default: rack

zkconnect ZooKeeper Connect String.


node remove

The node remove command removes one or more server nodes from the system. Permissions required: fc or a.

After issuing the node remove command, wait several minutes to ensure that the node has been properly and completely removed.

Syntax

CLI: maprcli node remove [ -filter <filter> ] [ -force true|false ] [ -nodes <node names> ] [ -zkconnect <ZooKeeper Connect String> ]

REST: http[s]://<host>:<port>/rest/node/remove?<parameters>

Parameters

Parameter Description

filter A filter specifying nodes to remove. See Filters for more information.

force Forces the service stop operations. Default: false

nodes A list of node names, separated by spaces.

zkconnect ZooKeeper Connect String. Example: 'host:port,host:port,host:port,...'. Default: localhost:5181


node services

Starts, stops, restarts, suspends, or resumes services on one or more server nodes. Permissions required: ss, fc, or a.

The same set of services applies to all specified nodes; to manipulate different groups of services differently, send multiple requests.

Note: the suspend and resume actions have not yet been implemented.

Syntax

CLI: maprcli node services [ -action restart|resume|start|stop|suspend ] [ -cldb restart|resume|start|stop|suspend ] [ -cluster <cluster> ] [ -fileserver restart|resume|start|stop|suspend ] [ -filter <filter> ] [ -hbmaster restart|resume|start|stop|suspend ] [ -hbregionserver restart|resume|start|stop|suspend ] [ -jobtracker restart|resume|start|stop|suspend ] [ -name <service> ] [ -nfs restart|resume|start|stop|suspend ] [ -nodes <node names> ] [ -tasktracker restart|resume|start|stop|suspend ] [ -zkconnect <ZooKeeper Connect String> ]

REST: http[s]://<host>:<port>/rest/node/services?<parameters>

Parameters

When used together, the action and name parameters specify an action to perform on a service. To start the JobTracker, for example, you can either specify start for the action and jobtracker for the name, or simply specify start on the jobtracker parameter.

Parameter Description

action An action to perform on a service specified in the name parameter: restart, resume, start, stop, or suspend

cldb Starts or stops the cldb service. Values: restart, resume, start, stop, or suspend

cluster The cluster on which to run the command.

fileserver Starts or stops the fileserver service. Values: restart, resume, start, stop, or suspend

filter A filter specifying nodes on which to start or stop services. See for more information.Filters

hbmaster Starts or stops the hbmaster service. Values: restart, resume, start, stop, or suspend

hbregionserver Starts or stops the hbregionserver service. Values: restart, resume, start, stop, or suspend

jobtracker Starts or stops the jobtracker service. Values: restart, resume, start, stop, or suspend

name A service on which to perform an action specified by the action parameter.

nfs Starts or stops the nfs service. Values: restart, resume, start, stop, or suspend

nodes A list of node names, separated by spaces.

tasktracker Starts or stops the tasktracker service. Values: restart, resume, start, stop, or suspend

zkconnect ZooKeeper Connect String
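For example, restarting the TaskTracker on two nodes can be scripted against the REST endpoint. A sketch; the host, node names, and credentials are hypothetical placeholders:

# Sketch only: restart the tasktracker service on two named nodes.
import requests

resp = requests.get(
    "https://r1n1.sj.us:8443/rest/node/services",
    params={"nodes": "node-34 node-35",       # node names are space-separated
            "tasktracker": "restart"},
    auth=("mapr", "mapr"),                    # requires ss, fc, or a permission
    verify=False,
)
print(resp.json().get("status"))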


node topo

Lists cluster topology information.

Lists internal nodes only (switches/racks/etc) and not leaf nodes (server nodes).

Syntax

CLI: maprcli node topo [ -cluster <cluster> ] [ -path <path> ]

REST: http[s]://<host>:<port>/rest/node/topo?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

path The path on which to list node topology.

Output

Node topology information.

Sample output

status: ,"OK" total:recordCount, data: [ path:'path', status:[errorChildCount,OKChildCount,configChildCount], , ...additional structures above each topology node...for ]

Output Fields

Field Description

path The physical topology path to the node.

errorChildCount The number of descendants of the node which have overall status 0.

OKChildCount The number of descendants of the node which have overall status 1.

configChildCount The number of descendants of the node which have overall status 2.


schedule

The schedule commands let you work with schedules:

schedule create creates a schedule
schedule list lists schedules
schedule modify modifies the name or rules of a schedule by ID
schedule remove removes a schedule by ID

A schedule is a JSON object that specifies a single or recurring time for volume snapshot creation or mirror syncing. For a schedule to be useful, it must be associated with at least one volume. See volume create and volume modify.

Schedule Fields

The schedule object contains the following fields:

Field Value

id The ID of the schedule.

name The name of the schedule.

inuse Indicates whether the schedule is associated with an action.

rules An array of JSON objects specifying how often the scheduled action occurs. See Rule Fields below.

Rule Fields

The following table shows the fields to use when creating a rules object.

Field Values

frequency How often to perform the action:

once - Once
yearly - Yearly
monthly - Monthly
weekly - Weekly
daily - Daily
hourly - Hourly
semihourly - Every 30 minutes
quarterhourly - Every 15 minutes
fiveminutes - Every 5 minutes
minute - Every minute

retain How long to retain the data resulting from the action. For example, if the schedule creates a snapshot, the retain field sets the snapshot's expiration. The retain field consists of an integer and one of the following units of time:

mi - minutes
h - hours
d - days
w - weeks
m - months
y - years

time The time of day to perform the action, in 24-hour format: HH

date The date on which to perform the action:

For single occurrences, specify month, day and year: MM/DD/YYYY
For yearly occurrences, specify the month and day: MM/DD
For monthly occurrences, specify the day: DD
Daily and hourly occurrences do not require the date field.


Example

The following example JSON shows a schedule called "snapshot," with three rules.

:8,"id" : ,"name" "snapshot" :0,"inuse" :["rules" : ,"frequency" "monthly" : ,"date" "8" :14,"time" :"retain" "1m" , : ,"frequency" "weekly" : ,"date" "sat" :14,"time" :"retain" "2w" , : ,"frequency" "hourly" :"retain" "1d" ]


schedule create

Creates a schedule. Permissions required: fc or a.

A schedule can be associated with a volume to automate mirror syncing and snapshot creation. See volume create and volume modify.

Syntax

CLI: maprcli schedule create [ -cluster <cluster> ] -schedule <JSON>

REST: http[s]://<host>:<port>/rest/schedule/create?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

schedule A JSON object describing the schedule. See Schedule Objects for more information.

Examples

Scheduling a Single Occurrence

CLI: maprcli schedule create -schedule '{"name":"Schedule-1","rules":[{"frequency":"once","retain":"1w","time":13,"date":"12/5/2010"}]}'

REST: https://r1n1.sj.us:8443/rest/schedule/create?schedule={"name":"Schedule-1","rules":[{"frequency":"once","retain":"1w","time":13,"date":"12/5/2010"}]}

A Schedule with Several Rules

CLI: maprcli schedule create -schedule '{"name":"Schedule-1","rules":[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"},{"frequency":"hourly","retain":"1w"},{"frequency":"yearly","date":"11/5","time":14,"retain":"1w"}]}'

REST: https://r1n1.sj.us:8443/rest/schedule/create?schedule={"name":"Schedule-1","rules":[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"},{"frequency":"hourly","retain":"1w"},{"frequency":"yearly","date":"11/5","time":14,"retain":"1w"}]}
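Building the schedule object in code avoids quoting mistakes in hand-written JSON. A sketch with placeholder host and credentials:

# Sketch only: construct a schedule object and create it over REST.
import json
import requests

schedule = {
    "name": "Schedule-1",
    "rules": [
        {"frequency": "weekly", "date": "sun", "time": 7, "retain": "2w"},
        {"frequency": "daily", "time": 14, "retain": "1w"},
    ],
}
resp = requests.get("https://r1n1.sj.us:8443/rest/schedule/create",
                    params={"schedule": json.dumps(schedule)},
                    auth=("mapr", "mapr"),    # requires fc or a permission
                    verify=False)
print(resp.json().get("status"))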


schedule list

Lists the schedules on the cluster.

Syntax

CLI: maprcli schedule list [ -cluster <cluster> ] [ -output terse|verbose ]

REST: http[s]://<host>:<port>/rest/schedule/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

output Specifies whether the output should be terse or verbose.

Output

A list of all schedules on the cluster. See Schedule Objects for more information.

Examples

List schedules:

CLI: maprcli schedule list

REST: https://r1n1.sj.us:8443/rest/schedule/list



schedule modify

Modifies an existing schedule, specified by ID. Permissions required: fc or a

To find a schedule's ID:

1. Use the schedule list command to list the schedules.
2. Select the schedule to modify.
3. Pass the selected schedule's ID in the -id parameter to the schedule modify command.

Syntax

CLI: maprcli schedule modify [ -cluster <cluster> ] -id <schedule ID> [ -name <schedule name> ] [ -rules <JSON> ]

REST: http[s]://<host>:<port>/rest/schedule/modify?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

id The ID of the schedule to modify.

name The new name of the schedule.

rules A JSON object describing the rules for the schedule. If specified, replaces the entire existing rules object in the schedule. For information about the fields to use in the JSON object, see Rule Fields.

Examples

Modify a schedule

CLI: maprcli schedule modify -id 0 -name Newname -rules '[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"}]'

REST: https://r1n1.sj.us:8443/rest/schedule/modify?id=0&name=Newname&rules=[{"frequency":"weekly","date":"sun","time":7,"retain":"2w"},{"frequency":"daily","time":14,"retain":"1w"}]


schedule remove

Removes a schedule.

A schedule can only be removed if it is not associated with any volumes. See volume modify.

Syntax

CLI: maprcli schedule remove [ -cluster <cluster> ] -id <schedule ID>

REST: http[s]://<host>:<port>/rest/schedule/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

id The ID of the schedule to remove.

Examples

Remove schedule with ID 0:

CLI: maprcli schedule remove -id 0

REST: https://r1n1.sj.us:8443/rest/schedule/remove?id=0


service list

Lists all services on the specified node, along with the state and log path for each service.

Syntax

CLI: maprcli service list -node <node name>

REST: http[s]://<host>:<port>/rest/service/list?<parameters>

Parameters

Parameter Description

node The node on which to list the services

Output

Information about services on the specified node. For each service, the status is reported numerically:

0 - NOT_CONFIGURED: the package for the service is not installed and/or the service is not configured (configure.sh has not run)
2 - RUNNING: the service is installed, has been started by the warden, and is currently executing
3 - STOPPED: the service is installed and configure.sh has run, but the service is currently not executing
5 - STAND_BY: the service is installed and is in standby mode, waiting to take over in case of failure of another instance (mainly used for JobTracker warm standby)
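Example

List the services on a node (the node name below is an illustrative placeholder; substitute one of your own nodes):

CLI: maprcli service list -node r1n1.sj.us

REST: https://r1n1.sj.us:8443/rest/service/list?node=r1n1.sj.us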


setloglevel

The setloglevel commands set the log level on individual services:

setloglevel cldb - Sets the log level for the CLDB.
setloglevel hbmaster - Sets the log level for the HBase Master.
setloglevel hbregionserver - Sets the log level for the HBase RegionServer.
setloglevel jobtracker - Sets the log level for the JobTracker.
setloglevel fileserver - Sets the log level for the FileServer.
setloglevel nfs - Sets the log level for the NFS service.
setloglevel tasktracker - Sets the log level for the TaskTracker.
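All of these commands share the same syntax, differing only in the service name and port. The following sketch sets the JobTracker log level to DEBUG; the node name is an illustrative placeholder, and the class name and port shown are the usual Hadoop JobTracker defaults, so substitute values from your own cluster:

CLI: maprcli setloglevel jobtracker -classname org.apache.hadoop.mapred.JobTracker -loglevel DEBUG -node r1n1.sj.us -port 50030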


setloglevel cldb

Sets the log level on the CLDB service. Permissions required: fc or a

Syntax

CLI: maprcli setloglevel cldb -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port>

REST: http[s]://<host>:<port>/rest/setloglevel/cldb?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

port The CLDB port


setloglevel fileserver

Sets the log level on the FileServer service. Permissions required: fc or a

Syntax

CLI: maprcli setloglevel fileserver -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port>

REST: http[s]://<host>:<port>/rest/setloglevel/fileserver?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

port The MapR-FS port


setloglevel hbmaster

Sets the log level on the HBase Master service. Permissions required: fc or a

Syntax

CLI: maprcli setloglevel hbmaster -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port>

REST: http[s]://<host>:<port>/rest/setloglevel/hbmaster?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

port The HBase Master webserver port


setloglevel hbregionserver

Sets the log level on the HBase RegionServer service. Permissions required: fc or a

Syntax

CLI: maprcli setloglevel hbregionserver -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port>

REST: http[s]://<host>:<port>/rest/setloglevel/hbregionserver?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

port The HBase RegionServer webserver port


setloglevel jobtracker

Sets the log level on the JobTracker service. Permissions required: fc or a

Syntax

CLI: maprcli setloglevel jobtracker -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port>

REST: http[s]://<host>:<port>/rest/setloglevel/jobtracker?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

port The JobTracker webserver port


setloglevel nfs

Sets the log level on the NFS service. Permissions required: fc or a

Syntax

CLI: maprcli setloglevel nfs -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port>

REST: http[s]://<host>:<port>/rest/setloglevel/nfs?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

port The NFS port


setloglevel tasktracker

Sets the log level on the TaskTracker service. Permissions required: fc or a

Syntax

CLI: maprcli setloglevel tasktracker -classname <class> -loglevel DEBUG|ERROR|FATAL|INFO|TRACE|WARN -node <node> -port <port>

REST: http[s]://<host>:<port>/rest/setloglevel/tasktracker?<parameters>

Parameters

Parameter Description

classname The name of the class for which to set the log level.

loglevel The log level to set:

DEBUG
ERROR
FATAL
INFO
TRACE
WARN

node The node on which to set the log level.

port The TaskTracker port


trace

The trace commands let you view and modify the trace buffer, and the trace levels for the system modules. The valid trace levels are:

DEBUG
INFO
ERROR
WARN
FATAL

The following pages provide information about the trace commands:

trace dump
trace info
trace print
trace reset
trace resize
trace setlevel
trace setmode


trace dump

Dumps the contents of the trace buffer into the MapR-FS log.

Syntax

CLI: maprcli trace dump [ -host <host> ] [ -port <port> ]

REST: None.

Parameters

Parameter Description

host The IP address of the node from which to dump the trace buffer. Default: localhost

port The port to use when dumping the trace buffer. Default: 5660

Examples

Dump the trace buffer to the MapR-FS log:

CLI: maprcli trace dump


trace info

Displays the trace level of each module.

Syntax

CLI: maprcli trace info [ -host <host> ] [ -port <port> ]

REST: None.

Parameters

Parameter Description

host The IP address of the node on which to display the trace level of each module. Default: localhost

port The port to use. Default: 5660

Output

A list of all modules and their trace levels.

Sample Output


RPC Client Initialize
**Trace is in DEFAULT mode.**
Allowed Trace Levels are:
FATAL
ERROR
WARN
INFO
DEBUG
**Trace buffer size: 2097152**
Modules and levels:
Global : INFO
RPC : ERROR
MessageQueue : ERROR
CacheMgr : INFO
IOMgr : INFO
Transaction : ERROR
Log : INFO
Cleaner : ERROR
Allocator : ERROR
BTreeMgr : ERROR
BTree : ERROR
BTreeDelete : ERROR
BTreeOwnership : INFO
MapServerFile : ERROR
MapServerDir : INFO
Container : INFO
Snapshot : INFO
Util : ERROR
Replication : INFO
PunchHole : ERROR
KvStore : ERROR
Truncate : ERROR
Orphanage : INFO
FileServer : INFO
Defer : ERROR
ServerCommand : INFO
NFSD : INFO
Cidcache : ERROR
Client : ERROR
Fidcache : ERROR
Fidmap : ERROR
Inode : ERROR
JniCommon : ERROR
Shmem : ERROR
Table : ERROR
Fctest : ERROR
DONE

Examples

Display trace info:

CLI: maprcli trace info


trace print

Manually dumps the trace buffer to stdout.

Syntax

CLI: maprcli trace print [ -host <host> ] [ -port <port> ] -size <size>

REST: None.

Parameters

Parameter Description

host The IP address of the node from which to dump the trace buffer to stdout. Default: localhost

port The port to use. Default: 5660

size The number of kilobytes of the trace buffer to print. Maximum: 64

Output

The most recent <size> kilobytes of the trace buffer.

-----------------------------------------------------
2010-10-04 13:59:31,0000 Program: mfs on Host: fakehost IP: 0.0.0.0, Port: 0, PID: 0
-----------------------------------------------------
DONE

Examples

Display the trace buffer:

CLI: maprcli trace print


trace reset

Resets the in-memory trace buffer.

Syntax

CLI: maprcli trace reset [ -host <host> ] [ -port <port> ]

REST: None.

Parameters

Parameter Description

host The IP address of the node on which to reset the trace parameters. Default: localhost

port The port to use. Default: 5660

Examples

Reset trace parameters:

CLI: maprcli trace reset


trace resize

Resizes the trace buffer.

Syntax

CLI: maprcli trace resize [ -host <host> ] [ -port <port> ] -size <size>

REST: None.

Parameters

Parameter Description

host The IP address of the node on which to resize the trace buffer. Default: localhost

port The port to use. Default: 5660

size The size of the trace buffer, in kilobytes. Default: 2097152. Minimum: 1

Examples

Resize the trace buffer to 1000 KB:

CLI: maprcli trace resize -size 1000


trace setlevel

Sets the trace level on one or more modules.

Syntax

CLI: maprcli trace setlevel [ -host <host> ] -level <trace level> -module <module name> [ -port <port> ]

REST: None.

Parameters

Parameter Description

host The node on which to set the trace level. Default: localhost

module The module on which to set the trace level. If set to all, sets the trace level on all modules.

level The new trace level. If set to default, sets the trace level to its default.

port The port to use. Default: 5660

Examples

Set the trace level of the log module to INFO:

CLI: maprcli trace setlevel -module log -level info

Set the trace levels of all modules to their defaults:

CLI: maprcli trace setlevel -module all -level default


trace setmode

Sets the trace mode. There are two modes:

Default
Continuous

In default mode, all trace messages are saved in a memory buffer. If there is an error, the buffer is dumped to stdout. In continuous mode, every allowed trace message is dumped to stdout in real time.

Syntax

CLI: maprcli trace setmode [ -host <host> ] -mode default|continuous [ -port <port> ]

REST: None.

Parameters

Parameter Description

host The IP address of the host on which to set the trace mode

mode The trace mode.

port The port to use.

Examples

Set the trace mode to continuous:

CLI: maprcli trace setmode -mode continuous


urls

The urls command displays the status page URL for the specified service.

Syntax

CLI: maprcli urls [ -cluster <cluster> ] -name <service name> [ -zkconnect <zookeeper connect string> ]

REST: http[s]://<host>:<port>/rest/urls/<name>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The name of the service for which to get the status page:

cldb
jobtracker
tasktracker

zkconnect ZooKeeper Connect String

Examples

Display the URL of the status page for the CLDB service:

CLI: maprcli urls -name cldb

REST: https://r1n1.sj.us:8443/rest/maprcli/urls/cldb


virtualip

The virtualip commands let you work with virtual IP addresses for NFS nodes:

virtualip add - adds a range of virtual IP addresses
virtualip edit - edits a range of virtual IP addresses
virtualip list - lists virtual IP addresses
virtualip remove - removes a range of virtual IP addresses

Virtual IP Fields

Field Description

macaddress The MAC address of the virtual IP.

netmask The netmask of the virtual IP.

virtualipend The virtual IP range end.


virtualip add

Adds a virtual IP address. Permissions required: fc or a

Syntax

CLI: maprcli virtualip add [ -cluster <cluster> ] [ -gateway <gateway> ] [ -macs <MAC address> ] -netmask <netmask> -virtualip <virtualip> [ -virtualipend <virtual IP range end> ]

REST: http[s]://<host>:<port>/rest/virtualip/add?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

gateway The NFS gateway IP or address

macs The MAC address of the virtual IP.

netmask The netmask of the virtual IP.

virtualip The virtual IP, or the start of the virtual IP range.

virtualipend The virtual IP range end.
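Example

Add a range of four virtual IP addresses (the addresses and netmask are illustrative placeholders):

CLI: maprcli virtualip add -virtualip 10.10.100.1 -virtualipend 10.10.100.4 -netmask 255.255.255.0

REST: https://r1n1.sj.us:8443/rest/virtualip/add?virtualip=10.10.100.1&virtualipend=10.10.100.4&netmask=255.255.255.0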


virtualip edit

Edits a virtual IP (VIP) range. Permissions required: fc or a

Syntax

CLI: maprcli virtualip edit [ -cluster <cluster> ] [ -macs <mac address(es)> ] -netmask <netmask> -virtualip <virtualip> [ -virtualipend <range end> ]

REST: http[s]://<host>:<port>/rest/virtualip/edit?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

macs The MAC address or addresses to associate with the VIP or VIP range.

netmask The netmask for the VIP or VIP range.

virtualip The start of the VIP range, or the VIP if only one VIP is used.

virtualipend The end of the VIP range if more than one VIP is used.
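Example

Reassign the MAC address associated with an existing VIP range (the addresses, netmask, and MAC address are illustrative placeholders):

CLI: maprcli virtualip edit -virtualip 10.10.100.1 -virtualipend 10.10.100.4 -netmask 255.255.255.0 -macs "0a:0b:0c:0d:0e:0f"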


virtualip list

Lists the virtual IP addresses in the cluster.

Syntax

CLI: maprcli virtualip list [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <limit> ] [ -nfsmacs <NFS macs> ] [ -output <output> ] [ -range <range> ] [ -start <start> ]

REST: http[s]://<host>:<port>/rest/virtualip/list?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

columns The columns to display.

filter A filter specifying VIPs to list. See Filters for more information.

limit The number of records to return.

nfsmacs The MAC addresses of servers running NFS.

output Whether the output should be terse or verbose.

range The VIP range.

start The index of the first record to return.


virtualip remove

Removes a virtual IP (VIP) or a VIP range. Permissions required: fc or a

Syntax

CLI: maprcli virtualip remove [ -cluster <cluster> ] -virtualip <virtual IP> [ -virtualipend <virtual IP range end> ]

REST: http[s]://<host>:<port>/rest/virtualip/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

virtualip The virtual IP or the start of the VIP range to remove.

virtualipend The end of the VIP range to remove.
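Example

Remove a VIP range (the addresses are illustrative placeholders):

CLI: maprcli virtualip remove -virtualip 10.10.100.1 -virtualipend 10.10.100.4

REST: https://r1n1.sj.us:8443/rest/virtualip/remove?virtualip=10.10.100.1&virtualipend=10.10.100.4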


volume

The volume commands let you work with volumes, snapshots and mirrors:

volume create - creates a volume
volume dump create - creates a volume dump
volume dump restore - restores a volume from a volume dump
volume fixmountpath - corrects the mount path of a volume
volume info - displays information about a volume
volume link create - creates a symbolic link
volume link remove - removes a symbolic link
volume list - lists volumes in the cluster
volume mirror push - pushes a volume's changes to its local mirrors
volume mirror start - starts mirroring a volume
volume mirror stop - stops mirroring a volume
volume modify - modifies a volume
volume mount - mounts a volume
volume move - moves a volume
volume remove - removes a volume
volume rename - renames a volume
volume snapshot create - creates a volume snapshot
volume snapshot list - lists volume snapshots
volume snapshot preserve - prevents a volume snapshot from expiring
volume snapshot remove - removes a volume snapshot
volume unmount - unmounts a volume


volume create

Creates a volume. Permissions required: cv, fc, or a

Syntax

CLI:
maprcli volume create
    [ -advisoryquota <advisory quota> ]
    [ -ae <accounting entity> ]
    [ -aetype <accounting entity type> ]
    [ -canBackup <list of users and groups> ]
    [ -canDeleteAcl <list of users and groups> ]
    [ -canDeleteVolume <list of users and groups> ]
    [ -canEditconfig <list of users and groups> ]
    [ -canMirror <list of users and groups> ]
    [ -canMount <list of users and groups> ]
    [ -canSnapshot <list of users and groups> ]
    [ -canViewConfig <list of users and groups> ]
    [ -cluster <cluster> ]
    [ -groupperm <group:allowMask list> ]
    [ -localvolumehost <localvolumehost> ]
    [ -localvolumeport <localvolumeport> ]
    [ -minreplication <minimum replication factor> ]
    [ -mount 0|1 ]
    -name <volume name>
    [ -path <mount path> ]
    [ -quota <quota> ]
    [ -readonly <read-only status> ]
    [ -replication <replication factor> ]
    [ -rereplicationtimeoutsec <seconds> ]
    [ -rootdirperms <root directory permissions> ]
    [ -schedule <ID> ]
    [ -source <source> ]
    [ -topology <topology> ]
    [ -type 0|1 ]
    [ -user <user:allowMask list> ]

REST: http[s]://<host>:<port>/rest/volume/create?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

ae The accounting entity that owns the volume.

aetype The type of accounting entity:

0 = user
1 = group

canBackup The list of users and groups who can back up the volume.

canDeleteAcl The list of users and groups who can delete the volume access control list (ACL)

canDeleteVolume The list of users and groups who can delete the volume.

canEditconfig The list of users and groups who can edit the volume properties.

canMirror The list of users and groups who can mirror the volume.


canMount The list of users and groups who can mount the volume.

canSnapshot The list of users and groups who can create a snapshot of the volume.

canViewConfig The list of users and groups who can view the volume properties.

cluster The cluster on which to create the volume.

groupperm List of permissions in the format group:allowMask

localvolumehost The local volume host.

localvolumeport The local volume port. Default: 5660

minreplication The minimum replication level. Default: 0

mount Specifies whether the volume is mounted at creation time.

name The name of the volume to create.

path The path at which to mount the volume.

quota The quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

readonly Specifies whether the volume is read-only:

0 = read/write
1 = read-only

replication The desired replication level. Default: 0

rereplicationtimeoutsec The re-replication timeout, in seconds.

rootdirperms Permissions on the volume root directory.

schedule The ID of a schedule. If a schedule ID is provided, then the volume will automatically create snapshots (normal volume) or sync with its source volume (mirror volume) on the specified schedule.

source For mirror volumes, the source volume to mirror, in the format <source volume>@<cluster> (required when creating a mirror volume).

topology The rack path to the volume.

type The type of volume to create:

0 - standard volume
1 - mirror volume

user List of permissions in the format user:allowMask.

Examples

Create the volume "test-volume" mounted at "/test/test-volume":

CLI: maprcli volume create -name test-volume -path /test/test-volume

REST: https://r1n1.sj.us:8443/rest/volume/create?name=test-volume&path=/test/test-volume

Create Volume with a Quota and an Advisory Quota

This example creates a volume with the following parameters:

advisoryquota: 100M
name: volumename
path: /volumepath
quota: 500M
replication: 3
schedule: 2
topology: /East Coast
type: 0

CLI: maprcli volume create -name volumename -path /volumepath -advisoryquota 100M -quota 500M -replication 3 -schedule 2 -topology "/East Coast" -type 0

REST: https://r1n1.sj.us:8443/rest/volume/create?advisoryquota=100M&name=volumename&path=/volumepath&quota=500M&replication=3&schedule=2&topology=/East%20Coast&type=0

Create the mirror volume "test-volume.mirror" from source volume "test-volume" and mount at "/test/test-volume-mirror":

CLI: maprcli volume create -name test-volume.mirror -source test-volume@src-cluster-name -path /test/test-volume-mirror

REST: https://r1n1.sj.us:8443/rest/volume/create?name=test-volume.mirror&source=test-volume@src-cluster-name&path=/test/test-volume-mirror



volume dump create

The volume dump create command creates a volume dump file containing data from a volume, for distribution or restoration. Permissions required: dump, fc, or a

You can use volume dump create to create two types of files:

full dump files containing all data in a volume
incremental dump files that contain changes to a volume between two points in time

A full dump file is useful for restoring a volume from scratch. An incremental dump file contains the changes necessary to take an existing (or restored) volume from one point in time to another. Along with the dump file, a full or incremental dump operation can produce a state file (specified by the -e parameter) that contains a table of the version number of every container in the volume at the time the dump file was created. This represents the end point of the dump file, which is used as the start point of the next incremental dump. The main difference between creating a full dump and creating an incremental dump is whether the -s parameter is specified; if -s is not specified, the volume dump create command includes all volume data and creates a full dump file. If you create a full dump followed by a series of incremental dumps, the result is a sequence of dump files and their accompanying state files:

dumpfile1 statefile1

dumpfile2 statefile2

dumpfile3 statefile3

...

To maintain an up-to-date dump of a volume:

1. Create a full dump file. Example:

maprcli volume dump create -name cli-created -dumpfile fulldump1 -e statefile1

2. Periodically, add an incremental dump file. Examples:

maprcli volume dump create -s statefile1 -e statefile2 -name cli-created -dumpfile incrdump1
maprcli volume dump create -s statefile2 -e statefile3 -name cli-created -dumpfile incrdump2
maprcli volume dump create -s statefile3 -e statefile4 -name cli-created -dumpfile incrdump3

...and so on.

You can restore the volume from scratch, using the volume dump restore command with the full dump file, followed by each dump file in sequence.

Syntax

CLI: maprcli volume dump create [ -cluster <cluster> ] -dumpfile <dump file> [ -e <end state file> ] -name <volume name> [ -o ] [ -s <start state file> ]

REST: None.

Parameters

Parameter Description


cluster The cluster on which to run the command.

dumpfile The name of the dump file (ignored if -o is used).

e The name of the state file to create for the end point of the dump.

name A volume name.

o This option dumps the volume to stdout instead of to a file.

s The start point for an incremental dump.

Examples

Create a full dump:

CLI: maprcli volume dump create -e statefile1 -dumpfile fulldump1 -name volume

Create an incremental dump:

CLI: maprcli volume dump create -s statefile1 -e statefile2 -name volume -dumpfile incrdump1



volume dump restore

The volume dump restore command restores or updates a volume from a dump file. Permissions required: dump, fc, or a

There are two ways to use volume dump restore:

With a full dump file, volume dump restore recreates a volume from scratch from volume data stored in the dump file.
With an incremental dump file, volume dump restore updates a volume using incremental changes stored in the dump file.

The volume that results from a volume dump restore operation is a mirror volume whose source is the volume from which the dump was created. After the operation, this volume can perform mirroring from the source volume.

When you are updating a volume from an incremental dump file, you must specify an existing volume and an incremental dump file. To restore from a sequence of previous dump files would involve first restoring from the volume's full dump file, then applying each subsequent incremental dump file.

A restored volume may contain mount points that represent volumes that were mounted under the original source volume from which the dump was created. In the restored volume, these mount points have no meaning and do not provide access to any volumes that were mounted under the source volume. If the source volume still exists, then the mount points in the restored volume will work if the restored volume is associated with the source volume as a mirror.

To restore from a full dump plus a sequence of incremental dumps:

1. Restore from the full dump file, using the -n option to create a new mirror volume and the -name option to specify the name. Example:

maprcli volume dump restore -dumpfile fulldump1 -name restore1 -n

2. Restore from each incremental dump file in order, specifying the same volume name. Examples:

maprcli volume dump restore -dumpfile incrdump1 -name restore1
maprcli volume dump restore -dumpfile incrdump2 -name restore1
maprcli volume dump restore -dumpfile incrdump3 -name restore1

...and so on.

Syntax

CLI: maprcli volume dump restore [ -cluster <cluster> ] -dumpfile <dump file name> [ -i ] [ -n ] -name <volume name>

REST: None.

Parameters

Parameter Description

cluster The cluster on which to run the command.

dumpfile The name of the dump file (ignored if -i is used).

i This option reads the dump file from stdin.

n This option creates a new volume if it doesn't exist.

name A volume name, in the form volumename


Examples

Restore a volume from a full dump file:

CLI: maprcli volume dump restore -name volume -dumpfile fulldump1

Apply an incremental dump file to a volume:

CLI: maprcli volume dump restore -name volume -dumpfile incrdump1


volume fixmountpath

Corrects the mount path of a volume. Permissions required: fc or a

The CLDB maintains information about the mount path of every volume. If a directory in a volume's path is renamed (by a hadoop fs command, for example), the information in the CLDB will be out of date. The volume fixmountpath command does a reverse path walk from the volume and corrects the mount path information in the CLDB.

Syntax

CLI: maprcli volume fixmountpath -name <name>

REST: http[s]://<host>:<port>/rest/volume/fixmountpath?<parameters>

Parameters

Parameter Description

name The volume name.

Examples

Fix the mount path of volume v1:

CLI: maprcli volume fixmountpath -name v1

REST: https://r1n1.sj.us:8443/rest/volume/fixmountpath?name=v1


volume info

Displays information about the specified volume.

Syntax

CLI: maprcli volume info [ -cluster <cluster> ] [ -name <volume name> ] [ -output terse|verbose ] [ -path <path> ]

REST: http[s]://<host>:<port>/rest/volume/info?<parameters>

Parameters

You must specify either name or path.

Parameter Description

cluster The cluster on which to run the command.

name The volume for which to retrieve information.

output Whether the output should be terse or verbose.

path The mount path of the volume for which to retrieve information.
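Examples

Display information about a volume, by name or by mount path (the volume name and path are illustrative placeholders):

CLI: maprcli volume info -name test-volume

CLI: maprcli volume info -path /test/test-volume

REST: https://r1n1.sj.us:8443/rest/volume/info?name=test-volume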


volume link create

Creates a link to a volume. Permissions required: fc or a

Syntax

CLI: maprcli volume link create -path <path> -type <type> -volume <volume>

REST: http[s]://<host>:<port>/rest/volume/link/create?<parameters>

Parameters

Parameter Description

path The path parameter specifies the link path and other information, using the following syntax:

/link/[maprfs::][volume::]<volume type>::<volume name>

link - the link path
maprfs - a keyword to indicate a special MapR-FS link
volume - a keyword to indicate a link to a volume
volume type - writeable or mirror
volume name - the name of the volume

Example:

/abc/maprfs::mirror::abc

type The volume type: writeable or mirror.

volume The volume name.

Examples

Create a link to v1 at the path /v1.mirror:

CLI: maprcli volume link create -volume v1 -type mirror -path /v1.mirror

REST: https://r1n1.sj.us:8443/rest/volume/link/create?path=/v1.mirror&type=mirror&volume=v1


volume link remove

Removes the specified symbolic link. Permissions required: fc or a

Syntax

CLI: maprcli volume link remove -path <path>

REST: http[s]://<host>:<port>/rest/volume/link/remove?<parameters>

Parameters

Parameter Description

path The symbolic link to remove. The path parameter specifies the link path and other information about the symbolic link, using the following syntax:

/link/[maprfs::][volume::]<volume type>::<volume name>

link - the symbolic link path
maprfs - a keyword to indicate a special MapR-FS link
volume - a keyword to indicate a link to a volume
volume type - writeable or mirror
volume name - the name of the volume

Example:

/abc/maprfs::mirror::abc

Examples

Remove the link /abc:

CLI: maprcli volume link remove -path /abc/maprfs::mirror::abc

REST: https://r1n1.sj.us:8443/rest/volume/link/remove?path=/abc/maprfs::mirror::abc


volume list

Lists information about volumes specified by name, path, or filter.

Syntax

CLI: maprcli volume list [ -alarmedvolumes 1 ] [ -cluster <cluster> ] [ -columns <columns> ] [ -filter <filter> ] [ -limit <limit> ] [ -nodes <nodes> ] [ -output terse|verbose ] [ -start <offset> ]

REST: http[s]://<host>:<port>/rest/volume/list?<parameters>

Parameters

Parameter Description

alarmedvolumes Specifies whether to list alarmed volumes only.

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below.

filter A filter specifying volumes to list. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

nodes A list of nodes. If specified, volume list only lists volumes on the specified nodes.

output Specifies whether the output should be terse or verbose.

start The offset from the starting row according to sort. Default: 0

Field Description

volumeid Unique volume ID.

volumetype Volume type:

0 = normal volume
1 = mirror volume

volumename Unique volume name.

mountdir Unique volume path (may be null if the volume is unmounted).


mounted Volume mount status:

0 = unmounted
1 = mounted

rackpath Rack path.

creator Username of the volume creator.

aename Accountable entity name.

aetype Accountable entity type:

0 = user
1 = group

uacl Users ACL (comma-separated list of user names).

gacl Group ACL (comma-separated list of group names).

quota Quota, in MB; 0 = no quota.

advisoryquota Advisory quota, in MB; 0 = no advisory quota.

used Disk space used, in MB, not including snapshots.

snapshotused Disk space used for all snapshots, in MB.

totalused Total space used for volume and snapshots, in MB.

readonly Read-only status:

0 = read/write
1 = read-only

numreplicas Desired replication factor (number of replications).

minreplicas Minimum replication factor (number of replications).

actualreplication The actual current replication factor by percentage of the volume, as a zero-based array of integers from 0 to 100. For each position in the array, the value is the percentage of the volume that is replicated index number of times. Example: arf=[5,10,85] means that 5% is not replicated, 10% is replicated once, and 85% is replicated twice.

schedulename The name of the schedule associated with the volume.

scheduleid The ID of the schedule associated with the volume.

mirrorSrcVolumeId Source volume ID (mirror volumes only).

mirrorSrcVolume Source volume name (mirror volumes only).

mirrorSrcCluster The cluster where the source volume resides (mirror volumes only).

lastSuccessfulMirrorTime Last successful Mirror Time, milliseconds since 1970 (mirror volumes only).

mirrorstatus Mirror Status (mirror volumes only):

0 = success
1 = running
2 = error

mirror-percent-complete Percent completion of last/current mirror (mirror volumes only).

snapshotcount Snapshot count.

SnapshotFailureAlarm Status of SNAPSHOT_FAILURE alarm:

0 = Clear
1 = Raised

AdvisoryQuotaExceededAlarm Status of VOLUME_ALARM_ADVISORY_QUOTA_EXCEEDED alarm:

0 = Clear
1 = Raised

QuotaExceededAlarm Status of VOLUME_QUOTA_EXCEEDED alarm:

0 = Clear
1 = Raised

MirrorFailureAlarm Status of MIRROR_FAILURE alarm:

0 = Clear
1 = Raised

DataUnderReplicatedAlarm Status of DATA_UNDER_REPLICATED alarm:

0 = Clear
1 = Raised

DataUnavailableAlarm Status of DATA_UNAVAILABLE alarm:

0 = Clear
1 = Raised

Output

Information about the specified volumes.


(Sample output abridged: one row per volume, showing the fields described above, for volumes such as ATS-Run-2011-01-31-160018, mapr.cluster.internal, mapr.cluster.root, mapr.jobtracker.volume, and mapr.kvstore.table.)

Output Fields

See the Fields table above.


volume mirror push

Pushes the changes in a volume to all of its mirror volumes in the same cluster, and waits for each mirroring operation to complete. Use this command when you need to push recent changes.

Syntax

CLI: maprcli volume mirror push [ -cluster <cluster> ] -name <volume name> [ -verbose true|false ]

REST: None.

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume to push.

verbose Specifies whether the command output should be verbose. Default: true

Output

Sample Output

Starting mirroring of volume mirror1
Mirroring complete for volume mirror1
Successfully completed mirror push to all local mirrors of volume volume1

Examples

Push changes from the volume "volume1" to its local mirror volumes:

CLI: maprcli volume mirror push -name volume1 -cluster mycluster


volume mirror start

Starts mirroring on the specified volume from its source volume. License required: M5. Permissions required: fc or a

When a mirror is started, the mirror volume is synchronized from a hidden internal snapshot so that the mirroring process is not affected by any concurrent changes to the source volume. The volume mirror start command does not wait for mirror completion, but returns immediately. The changes to the mirror volume occur atomically at the end of the mirroring process; deltas transmitted from the source volume do not appear until mirroring is complete.

To provide rollback capability for the mirror volume, the mirroring process creates a snapshot of the mirror volume before starting the mirror, with the following naming format: <volume>.mirrorsnap.<date>.<time>.

Normally, the mirroring operation transfers only deltas from the last successful mirror. Under certain conditions (mirroring a volume repaired by fsck, for example), the source and mirror volumes can become out of sync. In such cases, it is impossible to transfer deltas, because the state is not the same for both volumes. Use the -full option to force the mirroring operation to transfer all data to bring the volumes back in sync.

Syntax

CLI: maprcli volume mirror start [ -cluster <cluster> ] [ -full true|false ] -name <volume name>

REST: http[s]://<host>:<port>/rest/volume/mirror/start?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

full Specifies whether to perform a full copy of all data. If false, only the deltas are copied.

name The volume for which to start the mirror.

Output

Sample Output

messages Started mirror operation for volumes 'test-mirror'

Examples

Start mirroring the mirror volume "test-mirror":

CLI: maprcli volume mirror start -name test-mirror


volume mirror stop

Stops mirroring on the specified volume. License required: M5. Permissions required: fc or a

The volume mirror stop command lets you stop mirroring (for example, during a network outage). You can use the volume mirror start command to resume mirroring.

Syntax

CLI: maprcli volume mirror stop [ -cluster <cluster> ] -name <volume name>

REST: http[s]://<host>:<port>/rest/volume/mirror/stop?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume for which to stop the mirror.

Output

Sample Output

messages Stopped mirror operation for volumes 'test-mirror'

Examples

Stop mirroring the mirror volume "test-mirror":

CLI: maprcli volume mirror stop -name test-mirror


volume modify

Modifies an existing volume. Permissions required: m, fc, or a

An error occurs if the name or path refers to a non-existent volume, or cannot be resolved.

Syntax

CLI:
maprcli volume modify
    [ -advisoryquota <advisory quota> ]
    [ -ae <accounting entity> ]
    [ -aetype <aetype> ]
    [ -canBackup <list of users and groups> ]
    [ -canDeleteAcl <list of users and groups> ]
    [ -canDeleteVolume <list of users and groups> ]
    [ -canEditconfig <list of users and groups> ]
    [ -canMirror <list of users and groups> ]
    [ -canMount <list of users and groups> ]
    [ -canSnapshot <list of users and groups> ]
    [ -canViewConfig <list of users and groups> ]
    [ -cluster <cluster> ]
    [ -groupperm <list of group:allowMask> ]
    [ -minreplication <minimum replication> ]
    -name <volume name>
    [ -quota <quota> ]
    [ -readonly <readonly> ]
    [ -replication <replication> ]
    [ -schedule <schedule ID> ]
    [ -source <source> ]
    [ -userperm <list of user:allowMask> ]

REST: http[s]://<host>:<port>/rest/volume/modify?<parameters>

Parameters

Parameter Description

advisoryquota The advisory quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

ae The accounting entity that owns the volume.

aetype The type of accounting entity:

0 = user
1 = group

canBackup The list of users and groups who can back up the volume.

canDeleteAcl The list of users and groups who can delete the volume access control list (ACL).

canDeleteVolume The list of users and groups who can delete the volume.

canEditconfig The list of users and groups who can edit the volume properties.

canMirror The list of users and groups who can mirror the volume.

canMount The list of users and groups who can mount the volume.

canSnapshot The list of users and groups who can create a snapshot of the volume.

canViewConfig The list of users and groups who can view the volume properties.


cluster The cluster on which to run the command.

groupperm A list of permissions in the format group:allowMask

minreplication The minimum replication level. Default: 0

name The name of the volume to modify.

quota The quota for the volume as integer plus unit. Example: quota=500G; Units: B, K, M, G, T, P

readonly Specifies whether the volume is read-only.

0 = read/write
1 = read-only

replication The desired replication level. Default: 0

schedule A schedule ID. If a schedule ID is provided, then the volume will automatically create snapshots (normal volume) or sync with its source volume (mirror volume) on the specified schedule.

source (Mirror volumes only) The source volume from which a mirror volume receives updates, specified in the format <volume>@<cluster>.

userperm List of permissions in the format user:allowMask.

Examples

Change the source volume of the mirror "test-mirror":

CLI: maprcli volume modify -name test-mirror -source volume-2@my-cluster

REST: https://r1n1.sj.us:8443/rest/volume/modify?name=test-mirror&source=volume-2@my-cluster


volume mount

Mounts one or more specified volumes. Permissions required: mnt, fc, or a

Syntax

CLI: maprcli volume mount [ -cluster <cluster> ] -name <volume list> [ -path <path list> ]

REST: http[s]://<host>:<port>/rest/volume/mount?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The name of the volume to mount.

path The path at which to mount the volume.

Examples

Mount the volume "test-volume" at the path "/test":

CLI: maprcli volume mount -name test-volume -path /test

REST: https://r1n1.sj.us:8443/rest/volume/mount?name=test-volume&path=/test


volume move

Moves the specified volume or mirror to a different topology. Permissions required: m, fc, or a

Syntax

CLI
maprcli volume move [ -cluster <cluster> ] -name <volume name> -topology <path>

REST
http[s]://<host>:<port>/rest/volume/move?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume name.

topology The new rack path to the volume.
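
Examples

Move the volume "test-volume" to the rack path "/data/rack1" (an illustrative sketch; the volume name and rack path are hypothetical):

CLI
maprcli volume move -name test-volume -topology /data/rack1

REST
https://r1n1.sj.us:8443/rest/volume/move?name=test-volume&topology=/data/rack1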


volume remove

Removes the specified volume or mirror. Permissions required: d, fc, or a

Syntax

CLI
maprcli volume remove [ -cluster <cluster> ] [ -force ] -name <volume name>

REST
http[s]://<host>:<port>/rest/volume/remove?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

force Forces the removal of the volume, even if it would otherwise be prevented.

name The volume name.
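
Examples

Remove the volume "test-volume" (an illustrative sketch; the volume name is hypothetical):

CLI
maprcli volume remove -name test-volume

REST
https://r1n1.sj.us:8443/rest/volume/remove?name=test-volume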


volume rename

Renames the specified volume or mirror. Permissions required: m, fc, or a

Syntax

CLI
maprcli volume rename [ -cluster <cluster> ] -name <volume name> -newname <new volume name>

REST
http[s]://<host>:<port>/rest/volume/rename?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

name The volume name.

newname The new volume name.
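
Examples

Rename the volume "test-volume" to "test-volume-archive" (an illustrative sketch; both names are hypothetical):

CLI
maprcli volume rename -name test-volume -newname test-volume-archive

REST
https://r1n1.sj.us:8443/rest/volume/rename?name=test-volume&newname=test-volume-archive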


volume snapshot create

Creates a snapshot of the specified volume, using the specified snapshot name. License required: M5. Permissions required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot create [ -cluster <cluster> ] -snapshotname <snapshot> -volume <volume>

REST
http[s]://<host>:<port>/rest/volume/snapshot/create?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

snapshotname The name of the snapshot to create.

volume The volume for which to create a snapshot.

Examples

Create a snapshot called "test-snapshot" for volume "test-volume":

CLI
maprcli volume snapshot create -snapshotname test-snapshot -volume test-volume

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/create?volume=test-volume&snapshotname=test-snapshot


volume snapshot list

Displays info about a set of snapshots. You can specify the snapshots by volumes or paths, or by specifying a filter to select volumes with certain characteristics.

Syntax

CLI
maprcli volume snapshot list
    [ -cluster <cluster> ]
    [ -columns <fields> ]
    ( -filter <filter> | -path <volume path list> | -volume <volume list> )
    [ -limit <rows> ]
    [ -output terse|verbose ]
    [ -start <offset> ]

REST
http[s]://<host>:<port>/rest/volume/snapshot/list?<parameters>

Parameters

Specify exactly one of the following parameters: volume, path, or filter.

Parameter Description

cluster The cluster on which to run the command.

columns A comma-separated list of fields to return in the query. See the Fields table below. Default: none

filter A filter specifying which snapshots to list. See Filters for more information.

limit The number of rows to return, beginning at start. Default: 0

output Specifies whether the output should be terse or verbose. Default: verbose

path A comma-separated list of paths for which to list snapshots.

start The offset from the starting row according to sort. Default: 0

volume A comma-separated list of volumes for which to list snapshots.

Fields

The following table lists the fields used in the sort and columns parameters, and returned as output.

Field Description

snapshotid Unique snapshot ID.

snapshotname Snapshot name.

volumeid ID of the volume associated with the snapshot.

volumename Name of the volume associated with the snapshot.

volumepath Path to the volume associated with the snapshot.

ownername Owner (user or group) associated with the volume.


ownertype Owner type for the owner of the volume: 0 = user, 1 = group

dsu Disk space used for the snapshot, in MB.

creationtime Snapshot creation time, in milliseconds since 1970.

expirytime Snapshot expiration time, in milliseconds since 1970; 0 = never expires.

Output

The specified columns about the specified snapshots.

Sample Output

creationtime  ownername  snapshotid  snapshotname  expirytime  diskspaceused  volumeid  volumename  ownertype  volumepath
1296788400768  dummy  363  ATS-Run-2011-01-31-160018.2011-02-03.19-00-00  1296792000001  1063191  362  ATS-Run-2011-01-31-160018  1  /dummy
1296789308786  dummy  364  ATS-Run-2011-01-31-160018.2011-02-03.19-15-02  1296792902057  753010  362  ATS-Run-2011-01-31-160018  1  /dummy
1296790200677  dummy  365  ATS-Run-2011-01-31-160018.2011-02-03.19-30-00  1296793800001  0  362  ATS-Run-2011-01-31-160018  1  /dummy
1289152800001  dummy  102  test-volume-2.2010-11-07.10:00:00  1289239200001  0  14  test-volume-2  1  /dummy

Output Fields

See the Fields table above.

Examples

List all snapshots:

CLI
maprcli volume snapshot list

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/list
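
List only the snapshot name and expiration time for snapshots of one volume (an illustrative sketch combining the volume and columns parameters documented above; the volume name is hypothetical):

CLI
maprcli volume snapshot list -volume test-volume -columns snapshotname,expirytime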


volume snapshot preserve

Preserves one or more snapshots from expiration. Specify the snapshots by volumes, paths, filter, or IDs. License required: M5. Permissions required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot preserve
    [ -cluster <cluster> ]
    ( -filter <filter> | -path <volume path list> | -snapshots <snapshot list> | -volume <volume list> )

REST
http[s]://<host>:<port>/rest/volume/snapshot/preserve?<parameters>

Parameters

Specify exactly one of the following parameters: volume, path, filter, or snapshots.

Parameter Description

cluster The cluster on which to run the command.

filter A filter specifying snapshots to preserve. See Filters for more information.

path A comma-separated list of paths for which to preserve snapshots.

snapshots A comma-separated list of snapshot IDs to preserve.

volume A comma-separated list of volumes for which to preserve snapshots.

Examples

Preserve two snapshots by ID:

First, use volume snapshot list to get the IDs of the snapshots you wish to preserve. Example:

# maprcli volume snapshot list
creationtime  ownername  snapshotid  snapshotname  expirytime  diskspaceused  volumeid  volumename  ownertype  volumepath
1296788400768  dummy  363  ATS-Run-2011-01-31-160018.2011-02-03.19-00-00  1296792000001  1063191  362  ATS-Run-2011-01-31-160018  1  /dummy
1296789308786  dummy  364  ATS-Run-2011-01-31-160018.2011-02-03.19-15-02  1296792902057  753010  362  ATS-Run-2011-01-31-160018  1  /dummy
1296790200677  dummy  365  ATS-Run-2011-01-31-160018.2011-02-03.19-30-00  1296793800001  0  362  ATS-Run-2011-01-31-160018  1  /dummy
1289152800001  dummy  102  test-volume-2.2010-11-07.10:00:00  1289239200001  0  14  test-volume-2  1  /dummy

Use the IDs in the volume snapshot preserve command. For example, to preserve the first two snapshots in the above list, run the commands as follows:

CLI
maprcli volume snapshot preserve -snapshots 363,364


REST
https://r1n1.sj.us:8443/rest/volume/snapshot/preserve?snapshots=363,364


volume snapshot remove

Removes one or more snapshots. License required: M5. Permissions required: snap, fc, or a

Syntax

CLI
maprcli volume snapshot remove
    [ -cluster <cluster> ]
    ( -snapshotname <snapshot name> | -snapshots <snapshots> | -volume <volume name> )

REST
http[s]://<host>:<port>/rest/volume/snapshot/remove?<parameters>

Parameters

Specify exactly one of the following parameters: snapshotname, snapshots, or volume.

Parameter Description

cluster The cluster on which to run the command.

snapshotname The name of the snapshot to remove.

snapshots A comma-separated list of IDs of snapshots to remove.

volume The name of the volume from which to remove the snapshot.

Examples

Remove the snapshot "test-snapshot":

CLI
maprcli volume snapshot remove -snapshotname test-snapshot

REST
https://10.250.1.79:8443/api/volume/snapshot/remove?snapshotname=test-snapshot

Remove two snapshots by ID:

First, use volume snapshot list to get the IDs of the snapshots you wish to remove. Example:

# maprcli volume snapshot list
creationtime  ownername  snapshotid  snapshotname  expirytime  diskspaceused  volumeid  volumename  ownertype  volumepath
1296788400768  dummy  363  ATS-Run-2011-01-31-160018.2011-02-03.19-00-00  1296792000001  1063191  362  ATS-Run-2011-01-31-160018  1  /dummy
1296789308786  dummy  364  ATS-Run-2011-01-31-160018.2011-02-03.19-15-02  1296792902057  753010  362  ATS-Run-2011-01-31-160018  1  /dummy
1296790200677  dummy  365  ATS-Run-2011-01-31-160018.2011-02-03.19-30-00  1296793800001  0  362  ATS-Run-2011-01-31-160018  1  /dummy
1289152800001  dummy  102  test-volume-2.2010-11-07.10:00:00  1289239200001  0  14  test-volume-2  1  /dummy


Use the IDs in the volume snapshot remove command. For example, to remove the first two snapshots in the above list, run the commands as follows:

CLI
maprcli volume snapshot remove -snapshots 363,364

REST
https://r1n1.sj.us:8443/rest/volume/snapshot/remove?snapshots=363,364


volume unmount

Unmounts one or more mounted volumes. Permissions required: mnt, fc, or a

Syntax

CLI
maprcli volume unmount [ -cluster <cluster> ] [ -force 1 ] -name <volume name>

REST
http[s]://<host>:<port>/rest/volume/unmount?<parameters>

Parameters

Parameter Description

cluster The cluster on which to run the command.

force Specifies whether to force the volume to unmount.

name The name of the volume to unmount.

Examples

Unmount the volume "test-volume":

CLI
maprcli volume unmount -name test-volume

REST
https://r1n1.sj.us:8443/rest/volume/unmount?name=test-volume
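
Force-unmount the same volume if it will not unmount cleanly (an illustrative sketch using the force parameter from the syntax above):

CLI
maprcli volume unmount -name test-volume -force 1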


Utilities

This section contains information about the following scripts and commands:

configure.sh - configures a node or client to work with the cluster
disksetup - sets up disks for use by MapR storage
mapr-support-collect.sh - collects cluster information for use by MapR Support
rollingupgrade.sh - upgrades software on a MapR cluster


configure.sh

Sets up a MapR cluster or client, creates /opt/mapr/conf/mapr-clusters.conf, and updates the corresponding *.conf and *.xml files.

The normal use of configure.sh is to set up a MapR cluster, or to set up a MapR client for communication with one or more clusters.

To set up a cluster, run configure.sh on all nodes, specifying the cluster's CLDB and ZooKeeper nodes, and a cluster name if desired. If setting up a cluster on virtual machines, use the -isvm parameter.
To set up a client, run configure.sh on the client machine, specifying the CLDB and ZooKeeper nodes of the cluster or clusters. On a client, use both the -c and -C parameters.
To change services (other than the CLDB and ZooKeeper) running on a node, run configure.sh with the -R option. If you change the location or number of CLDB or ZooKeeper services in a cluster, run configure.sh and specify the new lineup of CLDB and ZooKeeper nodes.
To rename a cluster, run configure.sh on all nodes with the -N option (and the -R option, if you have previously specified the CLDB and ZooKeeper nodes).

On a Windows client, the script is named configure.bat but otherwise works in a similar way to configure.sh.

Syntax

/opt/mapr/server/configure.sh
    -C <host>[:<port>][,<host>[:<port>]...]
    -Z <host>[:<port>][,<host>[:<port>]...]
    [ -c ]
    [ --isvm ]
    [ -J <CLDB JMX port> ]
    [ -L <log file> ]
    [ -N <cluster name> ]
    [ -R ]

Parameters

Parameter Description

-C A list of the CLDB nodes in the cluster.

-Z A list of the ZooKeeper nodes in the cluster. The -Z option is required unless -c (lowercase) is specified.

--isvm Specifies virtual machine setup. Required when configure.sh is run on a virtual machine.

-c Specifies client setup. See Setting Up the Client.

-J Specifies the JMX port for the CLDB. Default: 7220

-L Specifies a log file. If not specified, configure.sh logs errors to /opt/mapr/logs/configure.log.

-N Specifies the cluster name, to prevent ambiguity in multiple-cluster environments.

-R After initial node configuration, specifies that configure.sh should use the previously configured ZooKeeper and CLDB nodes. When -R is specified, the CLDB credentials are read from mapr-clusters.conf and the ZooKeeper credentials are read from warden.conf. The -R option is useful for making changes to the services configured on a node without changing the CLDB and ZooKeeper nodes, or for renaming a cluster. The -C and -Z parameters are not required when -R is specified. If the credentials cannot be extracted from those files, configure.sh reports an error.

Examples

Add a node (not CLDB or ZooKeeper) to a cluster that is running the CLDB and ZooKeeper on three nodes:

On the new node, run the following command:

/opt/mapr/server/configure.sh -C 10.10.100.1,10.10.100.2,10.10.100.3 -Z 10.10.100.1,10.10.100.2,10.10.100.3


Configure a client to work with MyCluster, which has one CLDB at 10.10.100.1:

On a Linux client, run the following command:

/opt/mapr/server/configure.sh -N MyCluster -c -C 10.10.100.1:7222

On a Windows 7 client, run the following command:

/opt/mapr/server/configure.bat -N MyCluster -c -C 10.10.100.1:7222

Rename the cluster to Cluster1 without changing the specified CLDB and ZooKeeper nodes:

On all nodes, run the following command:

/opt/mapr/server/configure.sh -N Cluster1 -R


disksetup

Formats specified disks for use by MapR storage, and adds them to the disktab file.

For information about when and how to use disksetup, see Setting Up Disks for MapR.

To specify disks:

Create a text file /tmp/disks.txt listing disks and partitions for use by MapR. Each line lists either a single disk, or partitions on a single disk. Example:

/dev/sdb
/dev/sdc1 /dev/sdc2 /dev/sdc4
/dev/sdd

Later, when you run disksetup to format the disks, specify the disks and partitions file. Example:

disksetup -F /tmp/disks.txt

To test without formatting physical disks:

If you do not have physical partitions or disks available for reformatting, you can test MapR by creating a flat file and including a path to the file in the disk list file. Create a file of at least 16 GB.

The following example creates a 20 GB flat file (bs=1G specifies a block size of 1 gigabyte; count=20 writes 20 such blocks):

$ dd if=/dev/zero of=/root/storagefile bs=1G count=20

Using the above example, you would add the following to /tmp/disks.txt:

/root/storagefile

Syntax

/opt/mapr/server/disksetup <disk list file> [-F] [-G] [-M] [-W <stripe_width>]

Parameters

Parameter Description

-F Forces formatting of all specified disks. If not specified, disksetup does not re-format disks that have already been formatted for MapR.

-G Generates disktab contents from the input disk list, but does not format disks. This option is useful if disk names change after a reboot, or if the disktab file is damaged.

-M Uses the maximum available number of disks per storage pool.

-W Specifies the number of disks per storage pool.


Examples

Set up disks specified in the file /tmp/disks.txt:

/opt/mapr/server/disksetup -F /tmp/disks.txt
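
Group disks into storage pools of a fixed width while formatting (an illustrative sketch; the stripe width of 3 is hypothetical):

/opt/mapr/server/disksetup -F -W 3 /tmp/disks.txt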


mapr-support-collect.sh

Collects information about a cluster's recent activity, to help MapR Support diagnose problems.

The "mini-dump" option limits the size of the support output. When the or option is specified along with a size, -m --mini-dump support-dump collects only a head and tail, each limited to the specified size, from any log file that is larger than twice the specified size. The total size of.sh

the output is therefore limited to approximately 2 * size * number of logs. The size can be specified in bytes, or using the following suffixes:

b - bytesk - kilobytes (1024 bytes)m - megabytes (1024 kilobytes)
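
For example, a run with -m 10m against a node that has 20 oversized log files would cap the collected log data at roughly 2 * 10 MB * 20 = 400 MB (an illustrative calculation; the size and file count are hypothetical).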

Syntax

/opt/mapr/support/tools/mapr-support-collect.sh
    [ -h|--hosts <host file> ]
    [ -H|--host <host entry> ]
    [ -Q|--no-cldb ]
    [ -n|--name <name> ]
    [ -d|--output-dir <path> ]
    [ -l|--no-logs ]
    [ -s|--no-statistics ]
    [ -c|--no-conf ]
    [ -i|--no-sysinfo ]
    [ -x|--exclude-cluster ]
    [ -u|--user <user> ]
    [ -m|--mini-dump <size> ]
    [ -O|--online ]
    [ -p|--par <par> ]
    [ -t|--dump-timeout <dump timeout> ]
    [ -T|--scp-timeout <SCP timeout> ]
    [ -C|--cluster-timeout <cluster timeout> ]
    [ -y|--yes ]
    [ -S|--scp-port <SCP port> ]
    [ --collect-cores ]
    [ --move-cores ]
    [ --port <port> ]
    [ -?|--help ]

Parameters

Parameter Description

-h or --hosts A file containing a list of hosts. Each line contains one host entry, in the format [user@]host[:port]

-H or --host One or more hosts in the format [user@]host[:port]

-Q or --no-cldb If specified, the command does not query the CLDB for a list of nodes

-n or --name Specifies the name of the output file. If not specified, the default is a date-named file in the format YYYY-MM-DD-hh-mm-ss.tar

-d or --output-dir The absolute path to the output directory. If not specified, the default is /opt/mapr/support/collect/

-l or --no-logs If specified, the command output does not include log files

-s or --no-statistics If specified, the command output does not include statistics

-c or --no-conf If specified, the command output does not include configurations

-i or --no-sysinfo If specified, the command output does not include system information

-x or --exclude-cluster If specified, the command does not collect cluster diagnostics


-u or --user The username for ssh connections

-m or --mini-dump <size> For any log file greater than 2 * <size>, collects only a head and tail, each of the specified size. The <size> may have a suffix specifying units: b - blocks (512 bytes), k - kilobytes (1024 bytes), m - megabytes (1024 kilobytes)

-O or --online Specifies a space-separated list of nodes from which to gather support output, and uses the warden instead of ssh for transmitting the support data.

-p or --par The maximum number of nodes from which support dumps will be gathered concurrently (default: 10)

-t or --dump-timeout The timeout for execution of the mapr-support-dump command on a node (default: 120 seconds; 0 = no limit)

-T or --scp-timeout The timeout for copying support dump output from a remote node to the local file system (default: 120 seconds; 0 = no limit)

-C or --cluster-timeout The timeout for collection of cluster diagnostics (default: 300 seconds; 0 = no limit)

-y or --yes If specified, the command does not require acknowledgement of the number of nodes that will be affected

-S or --scp-port The local port to which remote nodes will establish an SCP session

--collect-cores If specified, the command collects cores of running mfs processes from all nodes (off by default)

--move-cores If specified, the command moves mfs and nfs cores out of /opt/cores on all nodes (off by default)

--port The port number used by FileServer (default: 5660)

-? or --help Displays usage help text

Examples

Collect support information and dump it to the file /opt/mapr/support/collect/mysupport-output.tar:

/opt/mapr/support/tools/mapr-support-collect.sh -n mysupport-output


rollingupgrade.sh

Upgrades a MapR cluster to a specified version of the MapR software, or to a specific set of MapR packages.

By default, any node on which the upgrade fails is rolled back to the previous version. To disable rollback, use the -n option. To force installation regardless of the existing version on each node, use the -r option.

For more information about using rollingupgrade.sh, see Cluster Upgrade.

Syntax

/opt/upgrade-mapr/rollingupgrade.sh [-c <cluster name>] [-d] [-h] [-i <identity file>] [-n] [-p <directory>] [-r] [-s] [-u <username>] [-v <version>] [-x]

Parameters

Parameter Description

-c Cluster name.

-d If specified, performs a dry run without upgrading the cluster.

-h Displays help text.

-i Specifies an identity file for SSH. See the SSH man page.

-n Specifies that the node should not be rolled back to the previous version if upgrade fails.

-p Specifies a directory containing the upgrade packages.

-r Specifies reinstallation of packages even on nodes that are already at the target version.

-s Specifies SSH to upgrade nodes.

-u A username for SSH.

-v The target upgrade version, using the format x.y.z to specify the major, minor, and revision numbers. Example: 1.2.0

-x Specifies that packages should be copied to nodes via SCP.
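
Examples

Upgrade a cluster over SSH from a local directory of packages (an illustrative sketch; the cluster name, username, package directory, and version are hypothetical):

/opt/upgrade-mapr/rollingupgrade.sh -c my.cluster.com -s -u root -p /root/mapr-packages -v 1.2.10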


Environment Variables

The following table describes environment variables specific to MapR.

Variable Example Values Description

JAVA_HOME /usr/lib/jvm/java-6-sun The directory where the correct version of Java is installed.

MAPR_HOME /opt/mapr The directory in which MapR is installed.

MAPR_SUBNETS 1.2.3.4/12, 5.6/24 If you do not want MapR to use all NICs on each node, use the MAPR_SUBNETS environment variable to restrict MapR traffic to specific NICs. Set MAPR_SUBNETS to a comma-separated list of up to four subnets in CIDR notation with no spaces. If MAPR_SUBNETS is not set, MapR uses all NICs present on the node. When MAPR_SUBNETS is set, make sure the node can reach all nodes in the cluster (servers and clients) using the specified subnets.
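
For example, to restrict MapR traffic to two subnets, you might export the variable in the environment from which the MapR services start (an illustrative sketch; the subnet values are hypothetical, and where to set the variable depends on your installation):

export MAPR_SUBNETS=10.10.15.0/24,10.10.16.0/24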


Configuration Files

This guide contains reference information about the following configuration files:

.dfs_attributes - Controls compression and chunk size for each directory
cldb.conf - Specifies configuration parameters for the CLDB and cluster topology
core-site.xml - Specifies the default filesystem
disktab - Lists the disks in use by MapR-FS
hadoop-metrics.properties - Specifies where to output service metric reports
mapr-clusters.conf - Specifies the CLDB nodes for one or more clusters that can be reached from the node or client
mapred-default.xml - Contains MapReduce default settings that can be overridden using mapred-site.xml. Not to be edited directly by users.
mapred-site.xml - Core MapReduce settings
mfs.conf - Specifies parameters about the MapR-FS server on each node
taskcontroller.cfg - Specifies TaskTracker configuration parameters
warden.conf - Specifies parameters related to MapR services and the warden. Not to be edited directly by users.


.dfs_attributes

Each directory in MapR storage contains a hidden file called .dfs_attributes that controls compression and chunk size. To change these attributes, change the corresponding values in the file.

Example:

# lines beginning with # are treated as comments
Compression=lz4
ChunkSize=268435456

Valid values:

Compression: lz4, lzf, zlib, or false
Chunk size (in bytes): a multiple of 65536 (64 K) or zero (no chunks). Example: 131072

You can also set compression and chunk size using the hadoop mfs command.
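
For example, assuming this release's hadoop mfs supports the -setcompression and -setchunksize options, the following would set lz4 compression and a 256 MB chunk size on a directory (the directory path is hypothetical):

hadoop mfs -setcompression lz4 /myvolume/mydir
hadoop mfs -setchunksize 268435456 /myvolume/mydir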


cldb.conf

The /opt/mapr/conf/cldb.conf file specifies configuration parameters for the CLDB and for cluster topology.

Field Value Description

cldb.min.fileservers 1 Number of fileservers that must register with the CLDB before the root volume is created

cldb.port 7222 The port on which the CLDB listens.

cldb.numthreads 10 The number of threads reserved for use by the CLDB.

cldb.web.port 7221 The port the CLDB uses for the webserver.

cldb.containers.cache.entries 1000000 The maximum number of read/write containers available in the CLDB cache.

net.topology.script.file.name The path to a script that associates IP addresses with physical topology paths. The script takes the IP address of a single node as input and returns the physical topology that should be associated with the specified node.

net.topology.table.file.name The path to a text file that associates IP addresses with physical topology paths. Each line of the text file contains the IP address or hostname of one node, followed by the topology path that should be associated with the node.

cldb.zookeeper.servers The nodes that are running ZooKeeper, in the format <host>:<port>.

hadoop.version The version of Hadoop supported by the cluster.

cldb.jmxremote.port 7220 The CLDB JMX remote port

Example cldb.conf file


#
# CLDB Config file.
# Properties defined in this file are loaded during startup
# and are valid only for CLDB which loaded the config.
# These parameters are not persisted anywhere else.
#
# Wait until minimum number of fileservers register with
# CLDB before creating Root Volume
cldb.min.fileservers=1
# CLDB listening port
cldb.port=7222
# Number of worker threads
cldb.numthreads=10
# CLDB webport
cldb.web.port=7221
# Number of RW containers in cache
#cldb.containers.cache.entries=1000000
#
# Topology script to be used to determine
# Rack topology of node
# Script should take an IP address as input and print rack path
# on STDOUT. eg
# $>/home/mapr/topo.pl 10.10.10.10
# $>/mapr-rack1
# $>/home/mapr/topo.pl 10.10.10.20
# $>/mapr-rack2
#net.topology.script.file.name=/home/mapr/topo.pl
#
# Topology mapping file used to determine
# Rack topology of node
# File is of a 2 column format (space separated)
# 1st column is an IP address or hostname
# 2nd column is the rack path
# Line starting with '#' is a comment
# Example file contents
# 10.10.10.10 /mapr-rack1
# 10.10.10.20 /mapr-rack2
# host.foo.com /mapr-rack3
#net.topology.table.file.name=/home/mapr/topo.txt
#
# ZooKeeper address
cldb.zookeeper.servers=zoink:5181
# Hadoop metrics jar version
hadoop.version=0.20.2
# CLDB JMX remote port
cldb.jmxremote.port=7220


core-site.xml

The /opt/mapr/hadoop/hadoop-<version>/conf/core-site.xml file specifies the default filesystem.

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<!-- Replace 'maprfs' by 'hdfs' to use HDFS.
     Replace localhost by an ip address for namenode/cldb. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>maprfs:///</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>


disktab

On each node, the /opt/mapr/conf/disktab file lists all of the physical drives and partitions that have been added to MapR-FS. The disktab file is created by disksetup and automatically updated when disks are added or removed (either using the MapR Control System, or with the disk add and disk remove commands).

Sample disktab file

# MapR Disks Mon Nov 28 11:46:16 2011

/dev/sdb 47E4CCDA-3536-E767-CD18-0CB7E4D34E00
/dev/sdc 7B6A3E66-6AF0-AF60-AE39-01B8E4D34E00
/dev/sdd 27A59ED3-DFD4-C692-68F8-04B8E4D34E00
/dev/sde F0BB5FB1-F2AC-CC01-275B-08B8E4D34E00
/dev/sdf 678FCF40-926F-0D04-49AC-0BB8E4D34E00
/dev/sdg 46823852-E45B-A7ED-8417-02B9E4D34E00
/dev/sdh 60A99B96-4CEE-7C46-A749-05B9E4D34E00
/dev/sdi 66533D4D-49F9-3CC4-0DF9-08B9E4D34E00
/dev/sdj 44CA818A-9320-6BBB-3751-0CB9E4D34E00
/dev/sdk 587E658F-EC8B-A3DF-4D74-00BAE4D34E00
/dev/sdl 11384F8D-1DA2-E0F3-E6E5-03BAE4D34E00


hadoop-metrics.properties

The hadoop-metrics.properties files direct MapR where to output service metric reports: to an output file (FileContext) or to Ganglia 3.1 (MapRGangliaContext31). A third context, NullContext, disables metrics. To direct metrics to an output file, comment out the lines pertaining to Ganglia and the NullContext for the chosen service; to direct metrics to Ganglia, comment out the lines pertaining to the metrics file and the NullContext. See Service Metrics.

There are two hadoop-metrics.properties files:

/opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties specifies output for standard Hadoop services
/opt/mapr/conf/hadoop-metrics.properties specifies output for MapR-specific services

The following table describes the parameters for each service in the hadoop-metrics.properties files.

Parameter Example Values Description

<service>.class org.apache.hadoop.metrics.spi.NullContextWithUpdateThread, org.apache.hadoop.metrics.file.FileContext, or com.mapr.fs.cldb.counters.MapRGangliaContext31 The class that implements the interface responsible for sending the service metrics to the appropriate handler. When implementing a class that sends metrics to Ganglia, set this property to the class name.

<service>.period 10, 60 The interval between two exports of service metrics data to the appropriate interface. This is independent of how often the metrics are updated in the framework.

<service>.fileName /tmp/cldbmetrics.log The path to the file where service metrics are exported when the cldb.class property is set to FileContext.

<service>.servers localhost:8649 The location of the gmon or gmeta that is aggregating metrics for this instance of the service, when the cldb.class property is set to GangliaContext.

<service>.spoof 1 Specifies whether the metrics being sent out from the server should be spoofed as coming from another server. All our fileserver metrics are also on the CLDB, but to make it appear to end users as if these properties were emitted by the fileserver host, we spoof the metrics to Ganglia using this property. Currently only used for the FileServer service.

Examples

The hadoop-metrics.properties files are organized into sections for each service that provides metrics. Each section is divided into subsections for the three contexts.

/opt/mapr/hadoop/hadoop-<version>/conf/hadoop-metrics.properties


# Configuration of the "dfs" context for null
dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "dfs" context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the "dfs" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# dfs.period=10
# dfs.servers=localhost:8649

# Configuration of the "mapred" context for null
mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "mapred" context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the "mapred" context for ganglia
# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)
# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
# mapred.period=10
# mapred.servers=localhost:8649

# Configuration of the "jvm" context for null
#jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "jvm" context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the "jvm" context for ganglia
# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# jvm.period=10
# jvm.servers=localhost:8649

# Configuration of the "ugi" context for null
ugi.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "fairscheduler" context for null
#fairscheduler.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the "fairscheduler" context for file
#fairscheduler.class=org.apache.hadoop.metrics.file.FileContext
#fairscheduler.period=10
#fairscheduler.fileName=/tmp/fairschedulermetrics.log

# Configuration of the "fairscheduler" context for ganglia
# fairscheduler.class=org.apache.hadoop.metrics.ganglia.GangliaContext
# fairscheduler.period=10
# fairscheduler.servers=localhost:8649
#

/opt/mapr/conf/hadoop-metrics.properties


###########################################################################
# hadoop-metrics.properties
###########################################################################

# CLDB metrics config - Pick one out of null, file or ganglia.
# Uncomment all properties in null, file or ganglia context, to send cldb metrics to that context

# Configuration of the "cldb" context for null
#cldb.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#cldb.period=10

# Configuration of the "cldb" context for file
#cldb.class=org.apache.hadoop.metrics.file.FileContext
#cldb.period=60
#cldb.fileName=/tmp/cldbmetrics.log

# Configuration of the "cldb" context for ganglia
cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
cldb.period=10
cldb.servers=localhost:8649
cldb.spoof=1

# FileServer metrics config - Pick one out of null, file or ganglia.
# Uncomment all properties in null, file or ganglia context, to send fileserver metrics to that context

# Configuration of the "fileserver" context for null
#fileserver.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
#fileserver.period=10

# Configuration of the "fileserver" context for file
#fileserver.class=org.apache.hadoop.metrics.file.FileContext
#fileserver.period=60
#fileserver.fileName=/tmp/fsmetrics.log

# Configuration of the "fileserver" context for ganglia
fileserver.class=com.mapr.fs.cldb.counters.MapRGangliaContext31
fileserver.period=37
fileserver.servers=localhost:8649
fileserver.spoof=1

###########################################################################


mapr-clusters.conf

The /opt/mapr/conf/mapr-clusters.conf configuration file specifies the CLDB nodes for one or more clusters that can be reached from the node or client on which it is installed.

Format:

clustername1 <CLDB> <CLDB> <CLDB>
[ clustername2 <CLDB> <CLDB> <CLDB> ]
[ ... ]

The <CLDB> string format is one of the following:

host,ip:port - Host, IP, and port (uses DNS to resolve hostnames, or provided IP if DNS is down)
host:port - Hostname and port (uses DNS to resolve host, specifies port)
ip:port - IP and port (avoids using DNS to resolve hosts, specifies port)
host - Hostname only (default, uses DNS to resolve host, uses default port)
ip - IP only (avoids using DNS to resolve hosts, uses default port)
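
An illustrative mapr-clusters.conf describing two clusters (all cluster names, hostnames, IPs, and ports are hypothetical):

my.cluster.com node1,10.10.10.1:7222 node2:7222 10.10.10.3:7222
backup.cluster.com 10.20.30.1:7222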


mapred-default.xml

The mapred-default.xml configuration file provides defaults that can be overridden using mapred-site.xml, and is located in the Hadoop core JAR file (/opt/mapr/hadoop/hadoop-<version>/lib/hadoop-<version>-dev-core.jar).

Do not modify mapred-default.xml directly. Instead, copy parameters into mapred-site.xml and modify them there. If mapred-site.xml does not already exist, create it.

The format for a parameter in both mapred-default.xml and mapred-site.xml is:

<property>
  <name>io.sort.spill.percent</name>
  <value>0.99</value>
  <description>The soft limit in either the buffer or record collection
  buffers. Once reached, a thread will begin to spill the contents to disk
  in the background. Note that this does not imply any chunking of data to
  the spill. A value less than 0.5 is not recommended.</description>
</property>

The <name> element contains the parameter name, the <value> element contains the parameter value, and the optional <description> element contains the parameter description. You can create XML for any parameter from the table below, using the example above as a guide.
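
For example, a minimal mapred-site.xml that overrides one default from the table (the chosen parameter and value are illustrative):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>io.sort.mb</name>
    <value>200</value>
  </property>
</configuration>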

Parameter Value Description

hadoop.job.history.location If the job tracker is static, the history files are stored in this single well-known place on the local filesystem. If no value is set here, by default it is in the local file system at $<hadoop.log.dir>/history. History files are moved to mapred.jobtracker.history.completed.location, which is on the MapRFS JobTracker volume.

hadoop.job.history.user.location User can specify a location to store the history files of a particular job. If nothing is specified, the logs are stored in the output directory; the files are stored in "_logs/history/" in that directory. User can stop logging by giving the value "none".

hadoop.rpc.socket.factory.class.JobSubmissionProtocol SocketFactory to use to connect to a Map/Reduce master (JobTracker). If null or empty, then use hadoop.rpc.socket.class.default.

io.map.index.skip 0 Number of index entries to skip between each entry. Zero by default. Setting this to values larger than zero can facilitate opening large map files using less memory.

io.sort.factor 256 The number of streams to merge at once while sorting files. This determines the number of open file handles.

io.sort.mb 100 Buffer used to hold map outputs in memory before writing final map outputs. Setting this value very low may cause spills. If the average input to a map is "MapIn" bytes, then typically the value of io.sort.mb should be 1.25 times MapIn bytes.

io.sort.record.percent 0.17 The percentage of io.sort.mb dedicated to tracking record boundaries. Let this value be r, and io.sort.mb be x. The maximum number of records collected before the collection thread must block is equal to (r * x) / 4

io.sort.spill.percent 0.99 The soft limit in either the buffer or record collection buffers. Once reached, a thread will begin to spill the contents to disk in the background. Note that this does not imply any chunking of data to the spill. A value less than 0.5 is not recommended.

job.end.notification.url http://localhost:8080/jobstatus.php?jobId=$jobId&jobStatus=$jobStatus Indicates the URL which will be called on completion of a job to inform the end status of the job. User can give at most 2 variables with the URI: $jobId and $jobStatus. If they are present in the URI, they will be replaced by their respective values.


job.end.retry.attempts 0 Indicates how many times hadoop should attempt to contact the notification URL.

job.end.retry.interval 30000 Indicates time in milliseconds between notification URL retry calls

jobclient.completion.poll.interval 5000 The interval (in milliseconds) at which the JobClient polls the JobTracker for updates about job status. You may want to set this to a lower value to make tests run faster on a single node system. Adjusting this value in production may lead to unwanted client-server traffic.

jobclient.output.filter FAILED The filter for controlling the output of the task's userlogs sent to the console of the JobClient. The permissible options are: NONE, KILLED, FAILED, SUCCEEDED and ALL.

jobclient.progress.monitor.poll.interval 1000 The interval (in milliseconds) at which the JobClient reports status to the console and checks for job completion. You may want to set this to a lower value to make tests run faster on a single node system. Adjusting this value in production may lead to unwanted client-server traffic.

map.sort.class org.apache.hadoop.util.QuickSort The default sort class for sorting keys.

mapr.localoutput.dir output The path for local output

mapr.localspill.dir spill The path for local spill

mapr.localvolumes.path /var/mapr/local The path for local volumes

mapred.acls.enabled false Specifies whether ACLs should be checked for authorization of users for doing various queue and job level operations. ACLs are disabled by default. If enabled, access control checks are made by JobTracker and TaskTracker when requests are made by users for queue operations, like submitting a job to a queue and killing a job in the queue, and job operations, like viewing the job-details (see mapreduce.job.acl-view-job) or modifying the job (see mapreduce.job.acl-modify-job), using Map/Reduce APIs, RPCs or via the console and web user interfaces.

mapred.child.env User added environment variables for the task tracker child processes. Example: 1) A=foo sets the env variable A to foo. 2) B=$B:c inherits the tasktracker's B env variable.

mapred.child.java.opts Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.

mapred.child.oom_adj 10 Increase the OOM adjust for the oom killer (linux specific). We only allow increasing the adj value. (valid values: 0-15)

mapred.child.renice 10 Nice value to run the job in. On linux the range is from -20 (most favorable) to 19 (least favorable). We only allow reducing the priority. (valid values: 0-19)

mapred.child.taskset true Run the job in a taskset. man taskset (linux specific) 1-4 CPUs: no taskset; 5-8 CPUs: taskset 1- (processor 0 reserved for infrastructure processes); 9-n CPUs: taskset 2- (processors 0,1 reserved for infrastructure processes)

mapred.child.tmp ./tmp To set the value of the tmp directory for map and reduce tasks. If the value is an absolute path, it is directly assigned. Otherwise, it is prepended with the task's working directory. The java tasks are executed with the option -Djava.io.tmpdir='the absolute path of the tmp dir'. Pipes and streaming are set with the environment variable TMPDIR='the absolute path of the tmp dir'


mapred.child.ulimit The maximum virtual memory, in KB, of a process launched by the Map-Reduce framework. This can be used to control both the Mapper/Reducer tasks and applications using Hadoop Pipes, Hadoop Streaming etc. By default it is left unspecified to let cluster admins control it via limits.conf and other such relevant mechanisms. Note: mapred.child.ulimit must be greater than or equal to the -Xmx passed to the JavaVM, else the VM might not start.

mapred.cluster.map.memory.mb -1 The size, in terms of virtual memory, of a single map slot in the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, up to the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.cluster.max.map.memory.mb -1 The maximum size, in terms of virtual memory, of a single map task launched by the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single map task via mapred.job.map.memory.mb, up to the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.cluster.max.reduce.memory.mb -1 The maximum size, in terms of virtual memory, of a single reduce task launched by the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, up to the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.cluster.reduce.memory.mb -1 The size, in terms of virtual memory, of a single reduce slot in the Map-Reduce framework, used by the scheduler. A job can ask for multiple slots for a single reduce task via mapred.job.reduce.memory.mb, up to the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off.

mapred.compress.map.output false Should the outputs of the maps be compressed before being sent across the network. Uses SequenceFile compression.

mapred.healthChecker.interval 60000 Frequency of the node health script to be run, in milliseconds

mapred.healthChecker.script.args List of arguments to be passed to the node health script when it is launched, comma separated.

mapred.healthChecker.script.path Absolute path to the script which is periodically run by the node health monitoring service to determine if the node is healthy or not. If the value of this key is empty or the file does not exist in the location configured here, the node health monitoring service is not started.

mapred.healthChecker.script.timeout 600000 Time after which the node health script will be killed if unresponsive, and considered to have failed.

mapred.hosts.exclude Names a file that contains the list of hosts that should be excluded by the jobtracker. If the value is empty, no hosts are excluded.

mapred.hosts Names a file that contains the list of nodes that may connect to the jobtracker. If the value is empty, all hosts are permitted.

mapred.inmem.merge.threshold 1000 The threshold, in terms of the number of files, for the in-memory merge process. When we accumulate the threshold number of files we initiate the in-memory merge and spill to disk. A value of 0 or less indicates no threshold; the merge then depends only on the ramfs's memory consumption to trigger.


mapred.job.map.memory.mb -1 The size, in terms of virtual memory, of a single map task for the job. A job can ask for multiple slots for a single map task, rounded up to the next multiple of mapred.cluster.map.memory.mb and up to the limit specified by mapred.cluster.max.map.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off iff mapred.cluster.map.memory.mb is also turned off (-1).

mapred.job.map.memory.physical.mb Maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt will be FAILED.

mapred.job.queue.name default Queue to which a job is submitted. This must match one of the queues defined in mapred.queue.names for the system. Also, the ACL setup for the queue must allow the current user to submit a job to the queue. Before specifying a queue, ensure that the system is configured with the queue, and access is allowed for submitting jobs to the queue.

mapred.job.reduce.input.buffer.percent 0.0 The percentage of memory, relative to the maximum heap size, to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.

mapred.job.reduce.memory.mb -1 The size, in terms of virtual memory, of a single reduce task for the job. A job can ask for multiple slots for a single reduce task, rounded up to the next multiple of mapred.cluster.reduce.memory.mb and up to the limit specified by mapred.cluster.max.reduce.memory.mb, if the scheduler supports the feature. The value of -1 indicates that this feature is turned off iff mapred.cluster.reduce.memory.mb is also turned off (-1).

mapred.job.reduce.memory.physical.mb Maximum physical memory limit for a reduce task of this job. If the limit is exceeded, the task attempt will be FAILED.

mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per jvm. If set to -1, there is no limit.

mapred.job.shuffle.input.buffer.percent 0.70 The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.

mapred.job.shuffle.merge.percent 0.66 The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapred.job.shuffle.input.buffer.percent.

mapred.job.tracker.handler.count 10 The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes.

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done The completed job history files are stored at this single well-known location. If nothing is specified, the files are stored at $<hadoop.job.history.location>/done in the local filesystem.

mapred.job.tracker.http.address 0.0.0.0:50030 The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port.

mapred.job.tracker.persist.jobstatus.active false Indicates if persistency of job status information is active or not.

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where the job status information is persisted in a filesystem to be available after it drops of the memory queue andbetween jobtracker restarts.

mapred.job.tracker.persist.jobstatus.hours 0 The number of hours job status information is persisted in DFS.The job status information will be available after it drops of thememory queue and between jobtracker restarts. With a zerovalue the job status information is not persisted at all in DFS.

mapred.job.tracker localhost:9001 jobTracker address ip:port or use uri maprfs:/// for default clusteror maprfs:///mapr/san_jose_cluster1 to connect'san_jose_cluster1' cluster.

mapred.jobtracker.completeuserjobs.maximum 100 The maximum number of complete jobs per user to keep aroundbefore delegating them to the job history.

mapred.jobtracker.instrumentation org.apache.hadoop.mapred.JobTrackerMetricsInst Expert: The instrumentation class to associate with eachJobTracker.


mapred.jobtracker.job.history.block.size 3145728 The block size of the job history file. Since job recovery uses job history, it's important to dump job history to disk as soon as possible. Note that this is an expert-level parameter. The default value is set to 3 MB.

mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.

mapred.jobtracker.maxtasks.per.job -1 The maximum number of tasks for a single job. A value of -1 indicates that there is no maximum.

mapred.jobtracker.plugins Comma-separated list of jobtracker plug-ins to be activated.

mapred.jobtracker.port 9001 Port on which JobTracker listens.

mapred.jobtracker.restart.recover true "true" to enable (job) recovery upon restart, "false" to start afresh.

mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job statuses to keep in the cache.

mapred.jobtracker.taskScheduler.maxRunningTasksPerJob The maximum number of running tasks for a job before it gets preempted. No limit if undefined.

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.JobQueueTaskScheduler The class responsible for scheduling the tasks.

mapred.line.input.format.linespermap 1 Number of lines per split in NLineInputFormat.

mapred.local.dir.minspacekill 0 If the space in mapred.local.dir drops under this, do not ask for more tasks until all the current ones have finished and cleaned up. Also, to save the rest of the tasks we have running, kill one of them to clean up some space. Start with the reduce tasks, then go with the ones that have finished the least. Value in bytes.

mapred.local.dir.minspacestart 0 If the space in mapred.local.dir drops under this, do not ask for more tasks. Value in bytes.

mapred.local.dir $<hadoop.tmp.dir>/mapred/local The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk I/O. Directories that do not exist are ignored.

mapred.map.child.env User-added environment variables for the task tracker child processes. Examples: 1) A=foo sets the env variable A to foo; 2) B=$B:c inherits the tasktracker's B env variable.

mapred.map.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error%p.log Java opts for the map tasks. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose GC logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc. The configuration variable mapred.<map/reduce>.child.ulimit can be used to control the maximum virtual memory of the child processes. MapR: Default heap size (-Xmx) is determined by memory reserved for MapReduce at the tasktracker. A reduce task is given more memory than a map task. Default memory for a map task = (Total Memory reserved for mapreduce) * (#mapslots / (#mapslots + 1.3*#reduceslots))

mapred.map.child.ulimit The maximum virtual memory, in KB, of a process launched by the Map-Reduce framework. This can be used to control both the Mapper/Reducer tasks and applications using Hadoop Pipes, Hadoop Streaming, etc. By default it is left unspecified to let cluster admins control it via limits.conf and other such relevant mechanisms. Note: mapred.<map/reduce>.child.ulimit must be greater than or equal to the -Xmx passed to the JavaVM, else the VM might not start.

mapred.map.max.attempts 4 Expert: The maximum number of attempts per map task. In other words, the framework will try to execute a map task this many times before giving up on it.

mapred.map.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec If the map outputs are compressed, how should they be compressed?


mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel.

mapred.map.tasks 2 The default number of map tasks per job. Ignored when mapred.job.tracker is "local".

mapred.max.tracker.blacklists 4 The number of blacklists for a taskTracker by various jobs after which the task tracker could be blacklisted across all jobs. The tracker will be given tasks again later (after a day). The tracker will become a healthy tracker after a restart.

mapred.max.tracker.failures 4 The number of task failures on a tasktracker of a given job after which new tasks of that job aren't assigned to it.

mapred.merge.recordsBeforeProgress 10000 The number of records to process during merge before sending a progress notification to the TaskTracker.

mapred.min.split.size 0 The minimum size chunk that map input should be split into. Note that some file formats may have minimum split sizes that take priority over this setting.

mapred.output.compress false Should the job outputs be compressed?

mapred.output.compression.codec org.apache.hadoop.io.compress.DefaultCodec If the job outputs are compressed, how should they be compressed?

mapred.output.compression.type RECORD If the job outputs are to be compressed as SequenceFiles, how should they be compressed? Should be one of NONE, RECORD or BLOCK.

mapred.queue.default.state RUNNING This value defines the state the default queue is in. The value can be either "STOPPED" or "RUNNING", and can be changed at runtime.

mapred.queue.names default Comma-separated list of queues configured for this jobtracker. Jobs are added to queues and schedulers can configure different scheduling properties for the various queues. To configure a property for a queue, the name of the queue must match the name specified in this value. Queue properties that are common to all schedulers are configured here with the naming convention mapred.queue.$QUEUE-NAME.$PROPERTY-NAME, e.g. mapred.queue.default.submit-job-acl. The number of queues configured in this parameter could depend on the type of scheduler being used, as specified in mapred.jobtracker.taskScheduler. For example, the JobQueueTaskScheduler supports only a single queue, which is the default configured here. Before adding more queues, ensure that the scheduler you've configured supports multiple queues.

mapred.reduce.child.env

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/hadoop/java_error%p.log Java opts for the reduce tasks. MapR: Default heap size (-Xmx) is determined by memory reserved for MapReduce at the tasktracker. A reduce task is given more memory than a map task. Default memory for a reduce task = (Total Memory reserved for mapreduce) * (1.3*#reduceslots / (#mapslots + 1.3*#reduceslots))

mapred.reduce.child.ulimit

mapred.reduce.copy.backoff 300 The maximum amount of time (in seconds) a reducer spends on fetching one map output before declaring it as failed.

mapred.reduce.max.attempts 4 Expert: The maximum number of attempts per reduce task. In other words, the framework will try to execute a reduce task this many times before giving up on it.

mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy (shuffle) phase.

mapred.reduce.slowstart.completed.maps 0.95 Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.

mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel.


mapred.reduce.tasks 1 The default number of reduce tasks per job. Typically set to 99% of the cluster's reduce capacity, so that if a node fails the reduces can still be executed in a single wave. Ignored when mapred.job.tracker is "local".

mapred.skip.attempts.to.start.skipping 2 The number of task attempts AFTER which skip mode will be kicked off. When skip mode is kicked off, the task reports the range of records which it will process next to the TaskTracker, so that on failures the tasktracker knows which ones are possibly the bad records. On further executions, those are skipped.

mapred.skip.map.auto.incr.proc.count true The flag which, if set to true, causes SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS to be incremented by MapRunner after invoking the map function. This value must be set to false for applications which process the records asynchronously or buffer the input records, for example streaming. In such cases applications should increment this counter on their own.

mapred.skip.map.max.skip.records 0 The number of acceptable skip records surrounding the bad record PER bad record in the mapper. The number includes the bad record as well. To turn off the feature of detection/skipping of bad records, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that the framework need not try to narrow down. Whatever records (depends on the application) get skipped are acceptable.

mapred.skip.out.dir If no value is specified here, the skipped records are written to the output directory at _logs/skip. The user can stop the writing of skipped records by giving the value "none".

mapred.skip.reduce.auto.incr.proc.count true The flag which, if set to true, causes SkipBadRecords.COUNTER_REDUCE_PROCESSED_GROUPS to be incremented by the framework after invoking the reduce function. This value must be set to false for applications which process the records asynchronously or buffer the input records, for example streaming. In such cases applications should increment this counter on their own.

mapred.skip.reduce.max.skip.groups 0 The number of acceptable skip groups surrounding the bad group PER bad group in the reducer. The number includes the bad group as well. To turn off the feature of detection/skipping of bad groups, set the value to 0. The framework tries to narrow down the skipped range by retrying until this threshold is met OR all attempts get exhausted for this task. Set the value to Long.MAX_VALUE to indicate that the framework need not try to narrow down. Whatever groups (depends on the application) get skipped are acceptable.

mapred.submit.replication 10 The replication level for submitted job files. This should be around the square root of the number of nodes.

mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReduce stores control files.

mapred.task.cache.levels 2 This is the max level of the task cache. For example, if the level is 2, the tasks cached are at the host level and at the rack level.

mapred.task.profile.maps 0-2 To set the ranges of map tasks to profile. mapred.task.profile has to be set to true for the value to be accounted.

mapred.task.profile.reduces 0-2 To set the ranges of reduce tasks to profile. mapred.task.profile has to be set to true for the value to be accounted.

mapred.task.profile false Whether the system should collect profiler information for some of the tasks in this job. The information is stored in the user log directory. The value is "true" if task profiling is enabled.

mapred.task.timeout 600000 The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string.

mapred.task.tracker.http.address 0.0.0.0:50060 The task tracker HTTP server address and port. If the port is 0, the server will start on a free port.


mapred.task.tracker.report.address 127.0.0.1:0 The interface and port that the task tracker server listens on. Since it is only connected to by the tasks, it uses the local interface. EXPERT ONLY. Should only be changed if your host does not have the loopback interface.

mapred.task.tracker.task-controller org.apache.hadoop.mapred.DefaultTaskController The TaskController which is used to launch and manage task execution.

mapred.tasktracker.dns.interface default The name of the network interface from which a task tracker should report its IP address.

mapred.tasktracker.dns.nameserver default The host name or IP address of the name server (DNS) which a TaskTracker should use to determine the host name used by the JobTracker for communication and display purposes.

mapred.tasktracker.expiry.interval 600000 Expert: The time interval, in milliseconds, after which a tasktracker is declared 'lost' if it doesn't send heartbeats.

mapred.tasktracker.indexcache.mb 10 The maximum memory that a task tracker allows for the index cache that is used when serving map outputs to reducers.

mapred.tasktracker.instrumentation org.apache.hadoop.mapred.TaskTrackerMetricsInst Expert: The instrumentation class to associate with each TaskTracker.

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 The maximum number of map tasks that will be run simultaneously by a task tracker.

mapred.tasktracker.memory_calculator_plugin Name of the class whose instance will be used to query memory information on the tasktracker. The class must be an instance of org.apache.hadoop.util.MemoryCalculatorPlugin. If the value is null, the tasktracker attempts to use a class appropriate to the platform. Currently, the only platform supported is Linux.

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50) : 1 The maximum number of reduce tasks that will be run simultaneously by a task tracker.

mapred.tasktracker.taskmemorymanager.monitoring-interval 5000 The interval, in milliseconds, for which the tasktracker waits between two cycles of monitoring its tasks' memory usage. Used only if tasks' memory management is enabled via mapred.tasktracker.tasks.maxmemory.

mapred.tasktracker.tasks.sleeptime-before-sigkill 5000 The time, in milliseconds, the tasktracker waits before sending a SIGKILL to a process, after it has been sent a SIGTERM.

mapred.temp.dir $<hadoop.tmp.dir>/mapred/temp A shared directory for temporary files.

mapred.user.jobconf.limit 5242880 The maximum allowed size of the user jobconf. The default is set to 5 MB.

mapred.userlog.limit.kb 0 The maximum size of user logs of each task, in KB. 0 disables the cap.

mapred.userlog.retain.hours 24 The maximum time, in hours, for which the user logs are to be retained after the job completion.

mapreduce.heartbeat.10 300 Heartbeat in milliseconds for a small cluster (10 nodes or fewer).

mapreduce.heartbeat.100 1000 Heartbeat in milliseconds for a medium cluster (11-100 nodes). Scales linearly between 300ms and 1s.

mapreduce.heartbeat.1000 10000 Heartbeat in milliseconds for a large cluster (101-1000 nodes). Scales linearly between 1s and 10s.

mapreduce.heartbeat.10000 100000 Heartbeat in milliseconds for a very large cluster (1001-10000 nodes). Scales linearly between 10s and 100s.


mapreduce.job.acl-modify-job Job-specific access-control list for 'modifying' the job. It is only used if authorization is enabled in Map/Reduce by setting the configuration property mapred.acls.enabled to true. This specifies the list of users and/or groups who can do modification operations on the job. For specifying a list of users and groups, the format to use is "user1,user2 group1,group2". If set to '*', it allows all users/groups to modify this job. If set to ' ' (i.e. a space), it allows none. This configuration is used to guard all the modifications with respect to this job and takes care of all the following operations: killing this job; killing a task of this job, or failing a task of this job; and setting the priority of this job. Each of these operations is also protected by the per-queue level ACL "acl-administer-jobs" configured via mapred-queues.xml, so a caller should have the authorization to satisfy either the queue-level ACL or the job-level ACL. Irrespective of this ACL configuration, the job owner, the user who started the cluster, cluster administrators configured via mapreduce.cluster.administrators, and queue administrators of the queue to which this job is submitted (configured via mapred.queue.queue-name.acl-administer-jobs in mapred-queue-acls.xml) can do all the modification operations on a job. By default, nobody else besides the job owner, the user who started the cluster, cluster administrators, and queue administrators can perform modification operations on a job.

mapreduce.job.acl-view-job Job-specific access-control list for 'viewing' the job. It is only used if authorization is enabled in Map/Reduce by setting the configuration property mapred.acls.enabled to true. This specifies the list of users and/or groups who can view private details about the job. For specifying a list of users and groups, the format to use is "user1,user2 group1,group2". If set to '*', it allows all users/groups to view this job. If set to ' ' (i.e. a space), it allows none. This configuration is used to guard some of the job views, and at present only protects APIs that can return possibly sensitive information of the job owner, such as job-level counters, task-level counters, tasks' diagnostic information, task logs displayed on the TaskTracker web UI, and job.xml shown by the JobTracker's web UI. Every other piece of information about jobs is still accessible by any other user, e.g. JobStatus, JobProfile, the list of jobs in the queue, etc. Irrespective of this ACL configuration, the job owner, the user who started the cluster, cluster administrators configured via mapreduce.cluster.administrators, and queue administrators of the queue to which this job is submitted (configured via mapred.queue.queue-name.acl-administer-jobs in mapred-queue-acls.xml) can do all the view operations on a job. By default, nobody else besides the job owner, the user who started the cluster, cluster administrators, and queue administrators can perform view operations on a job.

mapreduce.job.complete.cancel.delegation.tokens true If false, do not unregister/cancel delegation tokens from renewal, because the same tokens may be used by spawned jobs.

mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limit if set to -1.

mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery Recovery directory.

mapreduce.jobtracker.recovery.job.initialization.maxtime Maximum time in seconds the JobTracker will wait for initializing jobs before starting recovery. By default it is the same as mapreduce.jobtracker.recovery.maxtime.

mapreduce.jobtracker.recovery.maxtime 480 Maximum time in seconds the JobTracker should stay in recovery mode. The JobTracker recovers a job after talking to all running tasktrackers. On a large cluster, if many jobs are to be recovered, mapreduce.jobtracker.recovery.maxtime should be increased.

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' job files. In practice, this should be the directory where users' home directories are located (usually /user).

mapreduce.maprfs.use.checksum true Deprecated; checksums are always used.

mapreduce.maprfs.use.compression true If true, then MapReduce uses compression.


mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job is failed. A value of -1 means that there is no limit set.

mapreduce.task.classpath.user.precedence false Set to true if the user wants to set a different classpath.

mapreduce.tasktracker.group Expert: Group to which the TaskTracker belongs. If LinuxTaskController is configured via mapreduce.tasktracker.taskcontroller, the group owner of the task-controller binary should be the same as this group.

mapreduce.tasktracker.heapbased.memory.management false Expert only: If the admin wants to prevent swapping by not launching too many tasks, use this option. A task's memory usage is based on the max Java heap size (-Xmx). By default, -Xmx will be computed by the tasktracker based on slots and memory reserved for MapReduce tasks. See mapred.map.child.java.opts/mapred.reduce.child.java.opts.

mapreduce.tasktracker.jvm.idle.time 10000 If a JVM is idle for more than mapreduce.tasktracker.jvm.idle.time (milliseconds), the tasktracker will kill it.

mapreduce.tasktracker.outofband.heartbeat false Expert: Set this to true to let the tasktracker send an out-of-band heartbeat on task completion for better latency.

mapreduce.tasktracker.prefetch.maptasks 1.0 How many map tasks should be scheduled in advance on a tasktracker, given as a percentage of map slots. The default is 1.0, which means the number of tasks overscheduled = total map slots on the tasktracker.

mapreduce.tasktracker.reserved.physicalmemory.mb Maximum physical memory the tasktracker should reserve for MapReduce tasks. If tasks use more than the limit, the task using the maximum memory will be killed. Expert only: Set this value if and only if the tasktracker should use a certain amount of memory for MapReduce tasks. In the MapR distribution, the warden figures this number based on the services configured on a node. Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1 will disable physical memory accounting and task management.

mapreduce.tasktracker.volume.healthcheck.interval 60000 How often the tasktracker should check for the MapReduce volume at $<mapr.localvolumes.path>/mapred/. Value is in milliseconds.

mapreduce.use.fastreduce false Expert only. The reducer won't be able to tolerate failures.

mapreduce.use.maprfs true If true, then MapReduce uses MapR-FS to store task-related data.

keep.failed.task.files false Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed.

keep.task.files.pattern .*_m_123456_0 Keep all files from tasks whose task names match the given regular expression. Defaults to none.

tasktracker.http.threads 2 The number of worker threads for the HTTP server. This is used for map output fetching.


mapred-site.xml

The file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml specifies MapReduce formulas and parameters.

Each parameter in the local configuration file overrides the corresponding parameter in the cluster-wide configuration unless the cluster-wide copy of the parameter includes <final>true</final>. In general, only job-specific parameters should be set in the local copy of mapred-site.xml.
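For illustration, here is a sketch of such a locked entry in the cluster-wide copy (the parameter and value are arbitrary examples, not recommendations); because it includes the final element, it cannot be overridden by the local copy or by job settings:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
  <final>true</final>
</property>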

There are three parts to mapred-site.xml:

JobTracker configuration
TaskTracker configuration
Job configuration

Jobtracker Configuration

These parameters should be changed only by the administrator. When changing any parameter in this section, a JobTracker restart is required.

Parameter Value Description

mapred.job.tracker maprfs:/// JobTracker address (ip:port), or use the URI maprfs:/// for the default cluster, or maprfs:///mapr/san_jose_cluster1 to connect to the 'san_jose_cluster1' cluster. Replace localhost with one or more IP addresses for the jobtracker.

mapred.jobtracker.port 9001 Port on which the JobTracker listens. Read by the JobTracker to start the RPC server.

mapreduce.tasktracker.outofband.heartbeat false Expert: Set this to true to let the tasktracker send an out-of-band heartbeat on task completion for better latency.

webinterface.private.actions If set to true, jobs can be killed from the JobTracker's web interface. Enable this option only if the interfaces are reachable solely by those who have the right authorization.

mapreduce.jobtracker.node.labels.file The file that specifies the labels to apply to the nodes in the cluster.

mapreduce.jobtracker.node.labels.monitor.interval Specifies how often to poll the node labels file for changes.

mapred.queue.<queue-name>.label Specifies a label for the queue named in the <queue-name> placeholder.

mapred.queue.<queue-name>.label.policy Specifies a policy for the label applied to the queue named in the <queue-name> placeholder. The policy controls the interaction between the queue label and the job label:

PREFER_QUEUE — always use label set on queue
PREFER_JOB — always use label set on job
AND (default) — job label AND node label
OR — job label OR node label
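For example, the following mapred-site.xml entries are an illustrative sketch (the queue name fastqueue and the label fast are hypothetical) that labels a queue and always prefers the queue's label over the job's:

<property>
  <name>mapred.queue.fastqueue.label</name>
  <value>fast</value>
</property>
<property>
  <name>mapred.queue.fastqueue.label.policy</name>
  <value>PREFER_QUEUE</value>
</property>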

Jobtracker Directories

When changing any parameters in this section, a JobTracker restart is required.

Volume path = mapred.system.dir/../

Parameter Value Description

mapred.system.dir /var/mapr/cluster/mapred/jobTracker/system The shared directory where MapReduce stores control files.

mapred.job.tracker.persist.jobstatus.dir /var/mapr/cluster/mapred/jobTracker/jobsInfo The directory where the job status information is persisted in a file system, to be available after it drops out of the memory queue and between jobtracker restarts.

mapreduce.jobtracker.staging.root.dir /var/mapr/cluster/mapred/jobTracker/staging The root of the staging area for users' job files. In practice, this should be the directory where users' home directories are located (usually /user).


mapreduce.job.split.metainfo.maxsize 10000000 The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limit if set to -1.

mapred.jobtracker.retiredjobs.cache.size 1000 The number of retired job statuses to keep in the cache.

mapred.job.tracker.history.completed.location /var/mapr/cluster/mapred/jobTracker/history/done The completed job history files are stored at this single well-known location. If nothing is specified, the files are stored at $hadoop.job.history.location/done in the local filesystem.

hadoop.job.history.location If the job tracker is static, the history files are stored in this single well-known place on the local filesystem. If no value is set here, by default it is in the local file system at $hadoop.log.dir/history. History files are moved to mapred.jobtracker.history.completed.location, which is on the MapR-FS JobTracker volume.

mapred.jobtracker.jobhistory.lru.cache.size 5 The number of job history files loaded in memory. The jobs are loaded when they are first accessed. The cache is cleared based on LRU.

JobTracker Recovery

When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description

mapreduce.jobtracker.recovery.dir /var/mapr/cluster/mapred/jobTracker/recovery Recovery directory. Stores the list of known TaskTrackers.

mapreduce.jobtracker.recovery.maxtime 120 Maximum time in seconds the JobTracker should stay in recovery mode.

mapred.jobtracker.restart.recover true "true" to enable (job) recovery upon restart, "false" to start afresh.

Enable Fair Scheduler

When changing any parameters in this section, a JobTracker restart is required.

Parameter Value Description

mapred.fairscheduler.allocation.file conf/pools.xml

mapred.jobtracker.taskScheduler org.apache.hadoop.mapred.FairScheduler

mapred.fairscheduler.assignmultiple true

mapred.fairscheduler.eventlog.enabled false Enable scheduler logging in $HADOOP_LOG_DIR/fairscheduler/.

mapred.fairscheduler.smalljob.schedule.enable true Enable small-job fast scheduling inside the fair scheduler. TaskTrackers should reserve a slot, called an ephemeral slot, which is used for small jobs when the cluster is busy.

mapred.fairscheduler.smalljob.max.maps 10 Small job definition. Max number of maps allowed in a small job.

mapred.fairscheduler.smalljob.max.reducers 10 Small job definition. Max number of reducers allowed in a small job.

mapred.fairscheduler.smalljob.max.inputsize 10737418240 Small job definition. Max input size in bytes allowed for a small job. Default is 10GB.


mapred.fairscheduler.smalljob.max.reducer.inputsize 1073741824 Small job definition. Max estimated input size for a reducer allowed in a small job. Default is 1GB per reducer.

mapred.cluster.ephemeral.tasks.memory.limit.mb 200 Small job definition. Max memory in megabytes reserved for an ephemeral slot. Default is 200MB. This value must be the same on JobTracker and TaskTracker nodes.
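As a sketch of what the allocation file can contain (the pool name and limits are hypothetical, and the elements assume the standard Hadoop fair scheduler allocation format), conf/pools.xml might look like:

<?xml version="1.0"?>
<allocations>
  <pool name="reports">
    <minMaps>10</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>3</maxRunningJobs>
  </pool>
</allocations>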

TaskTracker Configuration

When changing any parameters in this section, a TaskTracker restart is required.

These parameters should be changed only by the administrator.

Parameter Value Description

mapred.tasktracker.map.tasks.maximum (CPUS > 2) ? (CPUS * 0.75) : 1 The maximum number of map tasks that will be run simultaneously by a task tracker.

mapreduce.tasktracker.prefetch.maptasks 1.0 How many map tasks should be scheduled in advance on a tasktracker, given as a percentage of map slots. The default is 1.0, which means the number of tasks overscheduled = total map slots on the TaskTracker.

mapred.tasktracker.reduce.tasks.maximum (CPUS > 2) ? (CPUS * 0.50) : 1 The maximum number of reduce tasks that will be run simultaneously by a task tracker.

mapred.tasktracker.ephemeral.tasks.maximum 1 Reserved slot for small-job scheduling.

mapred.tasktracker.ephemeral.tasks.timeout 10000 Maximum time, in milliseconds, a task is allowed to occupy an ephemeral slot.

mapred.tasktracker.ephemeral.tasks.ulimit 4294967296 Ulimit (in bytes) on all tasks scheduled on an ephemeral slot.

mapreduce.tasktracker.reserved.physicalmemory.mb Maximum physical memory the tasktracker should reserve for MapReduce tasks. If tasks use more than the limit, the task using the maximum memory will be killed. Expert only: Set this value if and only if the tasktracker should use a certain amount of memory for MapReduce tasks. In the MapR distribution, the warden figures this number based on the services configured on a node. Setting mapreduce.tasktracker.reserved.physicalmemory.mb to -1 will disable physical memory accounting and task management.

mapreduce.tasktracker.heapbased.memory.management false Expert only: If the admin wants to prevent swapping by not launching too many tasks, use this option. A task's memory usage is based on the max Java heap size (-Xmx). By default, -Xmx will be computed by the tasktracker based on slots and memory reserved for MapReduce tasks. See mapred.map.child.java.opts/mapred.reduce.child.java.opts.

mapreduce.tasktracker.jvm.idle.time 10000 If a JVM is idle for more than mapreduce.tasktracker.jvm.idle.time (milliseconds), the tasktracker will kill it.

mapred.max.tracker.failures 4 The number of task failures on a tasktracker of a given job after which new tasks of that job aren't assigned to it.

mapred.max.tracker.blacklists 4 The number of blacklists for a taskTracker by various jobs after which the task tracker could be blacklisted across all jobs. The tracker will be given tasks again later (after a day). The tracker will become a healthy tracker after a restart.

Job Configuration

Users should set these values on the node from which they plan to submit jobs, before submitting the jobs. If you are using the Hadoop examples, you can set these parameters from the command line. Example:



hadoop jar hadoop-examples.jar terasort -Dmapred.map.child.java.opts="-Xmx1000m"

When you submit a job, the JobClient creates job.xml by reading parameters from the following files in the following order:

1. mapred-default.xml
2. The local mapred-site.xml - overrides identical parameters in mapred-default.xml
3. Any settings in the job code itself - overrides identical parameters in mapred-site.xml

Parameter Value Description

keep.failed.task.files false Should the files for failed tasks be kept. This should only be used on jobs that are failing, because the storage is never reclaimed. It also prevents the map outputs from being erased from the reduce directory as they are consumed.

mapred.job.reuse.jvm.num.tasks -1 How many tasks to run per JVM. If set to -1, there is no limit.

mapred.map.tasks.speculative.execution true If true, then multiple instances of some map tasks may be executed in parallel.

mapred.reduce.tasks.speculative.execution true If true, then multiple instances of some reduce tasks may be executed in parallel.

mapred.job.map.memory.physical.mb Maximum physical memory limit for a map task of this job. If the limit is exceeded, the task attempt will be FAILED.

mapred.job.reduce.memory.physical.mb Maximum physical memory limit for a reduce task of this job. If the limit is exceeded, the task attempt will be FAILED.

mapreduce.task.classpath.user.precedence false Set to true if the user wants to set a different classpath.

mapred.max.maps.per.node -1 Per-node limit on running map tasks for the job. A value of -1 signifies no limit.

mapred.max.reduces.per.node -1 Per-node limit on running reduce tasks for the job. A value of -1 signifies no limit.

mapred.running.map.limit -1 Cluster-wide limit on running map tasks for the job. A value of -1 signifies no limit.

mapred.running.reduce.limit -1 Cluster-wide limit on running reduce tasks for the job. A value of -1 signifies no limit.

mapred.reduce.child.java.opts -XX:ErrorFile=/opt/cores/mapreduce_java_error%p.log Java opts for the reduce tasks. MapR: Default heap size (-Xmx) is determined by memory reserved for MapReduce at the tasktracker. A reduce task is given more memory than a map task. Default memory for a reduce task = (Total Memory reserved for mapreduce) * (2*#reduceslots / (#mapslots + 2*#reduceslots))

mapred.reduce.child.ulimit

io.sort.mb Buffer used to hold map outputs in memory before writing final map outputs. Setting this value very low may cause spills. By default, if left empty, the value is set to 50% of the heap size for the map. If the average input to a map is "MapIn" bytes, then typically the value of io.sort.mb should be 1.25 times MapIn bytes; for example, for an average map input of 400 MB, io.sort.mb would be roughly 500 MB.


io.sort.factor 256 The number of streams to merge at once while sorting files. This determines the number of open file handles.

io.sort.record.percent 0.17 The percentage of io.sort.mb dedicated to tracking record boundaries. Let this value be r and io.sort.mb be x. The maximum number of records collected before the collection thread must block is equal to (r * x) / 4.

mapred.reduce.slowstart.completed.maps 0.95 Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.

mapreduce.reduce.input.limit -1 The limit on the input size of the reduce. If the estimated input size of the reduce is greater than this value, the job is failed. A value of -1 means that there is no limit set.

mapred.reduce.parallel.copies 12 The default number of parallel transfers run by reduce during the copy (shuffle) phase.

Oozie

Parameter Value Description

hadoop.proxyuser.root.hosts * Comma-separated list of IPs/hostnames running the Oozie server.

hadoop.proxyuser.mapr.groups mapr,staff

hadoop.proxyuser.root.groups root


mfs.conf

The configuration file /opt/mapr/conf/mfs.conf specifies the following parameters about the MapR-FS server on each node:

Parameter Value Description

mfs.server.ip 192.168.10.10 IP address of the FileServer

mfs.server.port 5660 Port used for communication with the server

mfs.cache.lru.sizes inode:6:log:6:meta:10:dir:40:small:15 LRU cache configuration

mfs.on.virtual.machine 0 Specifies whether MapR-FS is running on a virtual machine

mfs.io.disk.timeout 60 Timeout, in seconds, after which a disk is considered failed and taken offline. This parameter can be increased to tolerate slow disks.

mfs.max.disks 48 Maximum number of disks supported on a single node.

mfs.subnets.whitelist A list of subnets that are allowed to make requests to the FileServer service and access data on the cluster.

Example

mfs.server.ip=192.168.10.10
mfs.server.port=5660
mfs.cache.lru.sizes=inode:6:log:6:meta:10:dir:40:small:15
mfs.on.virtual.machine=0
mfs.io.disk.timeout=60
mfs.max.disks=48


taskcontroller.cfg

The file /opt/mapr/hadoop/hadoop-<version>/conf/taskcontroller.cfg specifies TaskTracker configuration parameters. The parameters should be set the same on all TaskTracker nodes. See also Secured TaskTracker.

Parameter Value Description

mapred.local.dir /tmp/mapr-hadoop/mapred/local The local MapReduce directory.

hadoop.log.dir /opt/mapr/hadoop/hadoop-0.20.2/bin/../logs The Hadoop log directory.

mapreduce.tasktracker.group root The group that is allowed to submit jobs.

min.user.id -1 The minimum user ID for submitting jobs. Set to 0 to disallow root from submitting jobs; set to 1000 to disallow all superusers from submitting jobs.

banned.users (not present by default) Add this parameter with a comma-separated list of usernames to ban certain users from submitting jobs.
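For reference, a minimal sketch of taskcontroller.cfg using the default values from the table above (adjust hadoop.log.dir to your Hadoop version):

mapred.local.dir=/tmp/mapr-hadoop/mapred/local
hadoop.log.dir=/opt/mapr/hadoop/hadoop-0.20.2/bin/../logs
mapreduce.tasktracker.group=root
min.user.id=-1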


warden.conf

The file /opt/mapr/conf/warden.conf controls parameters related to MapR services and the warden. Most of the parameters are not intended to be edited directly by users. The following table shows the parameters of interest:

Parameter Value Description

service.command.jt.heapsize.percent 10 The percentage of heap space reserved for the JobTracker.

service.command.jt.heapsize.max 5000 The maximum heap space that can be used by the JobTracker.

service.command.jt.heapsize.min 256 The minimum heap space for use by the JobTracker.

service.command.tt.heapsize.percent 2 The percentage of heap space reserved for the TaskTracker.

service.command.tt.heapsize.max 325 The maximum heap space that can be used by the TaskTracker.

service.command.tt.heapsize.min 64 The minimum heap space for use by the TaskTracker.

service.command.hbmaster.heapsize.percent 4 The percentage of heap space reserved for the HBase Master.

service.command.hbmaster.heapsize.max 512 The maximum heap space that can be used by the HBase Master.

service.command.hbmaster.heapsize.min 64 The minimum heap space for use by the HBase Master.

service.command.hbregion.heapsize.percent 25 The percentage of heap space reserved for the HBase Region Server.

service.command.hbregion.heapsize.max 4000 The maximum heap space that can be used by the HBase Region Server.

service.command.hbregion.heapsize.min 1000 The minimum heap space for use by the HBase Region Server.

service.command.cldb.heapsize.percent 8 The percentage of heap space reserved for the CLDB.

service.command.cldb.heapsize.max 4000 The maximum heap space that can be used by the CLDB.

service.command.cldb.heapsize.min 256 The minimum heap space for use by the CLDB.

service.command.mfs.heapsize.percent 20 The percentage of heap space reserved for the MapR-FS FileServer.

service.command.mfs.heapsize.min 512 The minimum heap space for use by the MapR-FS FileServer.

service.command.webserver.heapsize.percent 3 The percentage of heap space reserved for the MapR Control System.

service.command.webserver.heapsize.max 750 The maximum heap space that can be used by the MapR Control System.

service.command.webserver.heapsize.min 512 The minimum heap space for use by the MapR Control System.

service.command.os.heapsize.percent 3 The percentage of heap space reserved for the operating system.

service.command.os.heapsize.max 750 The maximum heap space that can be used by the operating system.

service.command.os.heapsize.min 256 The minimum heap space for use by the operating system.

service.nice.value -10 The nice priority under which all services will run.

zookeeper.servers 10.250.1.61:5181 The list of ZooKeeper servers.

services.retries 3 The number of times the Warden tries to restart a service that fails.

services.retryinterval.time.sec 1800 The number of seconds after which the warden will again attempt several times to start a failed service. The number of attempts after each interval is specified by the services.retries parameter.

cldb.port 7222 The port for communicating with the CLDB.

mfs.port 5660 The port for communicating with the FileServer.

hbmaster.port 60000 The port for communicating with the HBase Master.

hoststats.port 5660 The port for communicating with the HostStats service.

jt.port 9001 The port for communicating with the JobTracker.

kvstore.port 5660 The port for communicating with the Key/Value Store.


mapr.home.dir /opt/mapr The directory where MapR is installed.

Example

services=webserver:all:cldb;jobtracker:1:cldb;tasktracker:all:jobtracker;nfs:all:cldb;kvstore:all;cldb:all:kvstore;hoststats:all:kvstore
service.command.jt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start jobtracker
service.command.tt.start=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh start tasktracker
service.command.hbmaster.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start master
service.command.hbregion.start=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh start regionserver
service.command.cldb.start=/etc/init.d/mapr-cldb start
service.command.kvstore.start=/etc/init.d/mapr-mfs start
service.command.mfs.start=/etc/init.d/mapr-mfs start
service.command.nfs.start=/etc/init.d/mapr-nfsserver start
service.command.hoststats.start=/etc/init.d/mapr-hoststats start
service.command.webserver.start=/opt/mapr/adminuiapp/webserver start
service.command.jt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop jobtracker
service.command.tt.stop=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh stop tasktracker
service.command.hbmaster.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop master
service.command.hbregion.stop=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh stop regionserver
service.command.cldb.stop=/etc/init.d/mapr-cldb stop
service.command.kvstore.stop=/etc/init.d/mapr-mfs stop
service.command.mfs.stop=/etc/init.d/mapr-mfs stop
service.command.nfs.stop=/etc/init.d/mapr-nfsserver stop
service.command.hoststats.stop=/etc/init.d/mapr-hoststats stop
service.command.webserver.stop=/opt/mapr/adminuiapp/webserver stop
service.command.jt.type=BACKGROUND
service.command.tt.type=BACKGROUND
service.command.hbmaster.type=BACKGROUND
service.command.hbregion.type=BACKGROUND
service.command.cldb.type=BACKGROUND
service.command.kvstore.type=BACKGROUND
service.command.mfs.type=BACKGROUND
service.command.nfs.type=BACKGROUND
service.command.hoststats.type=BACKGROUND
service.command.webserver.type=BACKGROUND
service.command.jt.monitor=org.apache.hadoop.mapred.JobTracker
service.command.tt.monitor=org.apache.hadoop.mapred.TaskTracker
service.command.hbmaster.monitor=org.apache.hadoop.hbase.master.HMaster start
service.command.hbregion.monitor=org.apache.hadoop.hbase.regionserver.HRegionServer start
service.command.cldb.monitor=com.mapr.fs.cldb.CLDB
service.command.kvstore.monitor=server/mfs
service.command.mfs.monitor=server/mfs
service.command.nfs.monitor=server/nfsserver
service.command.jt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status jobtracker
service.command.tt.monitorcommand=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh status tasktracker
service.command.hbmaster.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh status master
service.command.hbregion.monitorcommand=/opt/mapr/hbase/hbase-0.90.2/bin/hbase-daemon.sh status regionserver
service.command.cldb.monitorcommand=/etc/init.d/mapr-cldb status
service.command.kvstore.monitorcommand=/etc/init.d/mapr-mfs status
service.command.mfs.monitorcommand=/etc/init.d/mapr-mfs status
service.command.nfs.monitorcommand=/etc/init.d/mapr-nfsserver status
service.command.hoststats.monitorcommand=/etc/init.d/mapr-hoststats status
service.command.webserver.monitorcommand=/opt/mapr/adminuiapp/webserver status
service.command.jt.heapsize.percent=10
service.command.jt.heapsize.max=5000
service.command.jt.heapsize.min=256
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
service.command.hbmaster.heapsize.percent=4
service.command.hbmaster.heapsize.max=512
service.command.hbmaster.heapsize.min=64
service.command.hbregion.heapsize.percent=25
service.command.hbregion.heapsize.max=4000
service.command.hbregion.heapsize.min=1000
service.command.cldb.heapsize.percent=8
service.command.cldb.heapsize.max=4000
service.command.cldb.heapsize.min=256
service.command.mfs.heapsize.percent=20
service.command.mfs.heapsize.min=512
service.command.webserver.heapsize.percent=3
service.command.webserver.heapsize.max=750
service.command.webserver.heapsize.min=512
service.command.os.heapsize.percent=3
service.command.os.heapsize.max=750
service.command.os.heapsize.min=256
service.nice.value=-10
zookeeper.servers=10.250.1.61:5181
nodes.mincount=1
services.retries=3
cldb.port=7222
mfs.port=5660
hbmaster.port=60000
hoststats.port=5660
jt.port=9001
kvstore.port=5660
mapr.home.dir=/opt/mapr


Ports Used by MapR

Service Port

SSH 22

NFS 2049

MFS server 5660

ZooKeeper 5181

CLDB web port 7221

CLDB 7222

Web UI HTTP 8080 (set by user)

Web UI HTTPS 8443 (set by user)

JobTracker 9001

NFS monitor (for HA) 9997

NFS management 9998

JobTracker web 50030

TaskTracker web 50060

HBase Master 60000

LDAP Set by user

SMTP Set by user
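To check which of these ports a node is actually listening on, standard tools work; as an illustrative sketch (option support varies by distribution):

netstat -tlnp | grep -E '5660|7222|9001'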


Best Practices

File Balancing

MapR distributes volumes to balance files across the cluster. Each volume has a name container that is restricted to one storage pool. The greater the number of volumes, the more evenly MapR can distribute files. For best results, the number of volumes should be greater than the total number of storage pools in the cluster. To accommodate a very large number of files, you can use disksetup with the -W option when installing or re-formatting nodes, to create storage pools larger than the default of three disks each.

Disk Setup

It is not necessary to set up RAID on disks used by MapR-FS. MapR uses a script called disksetup to set up storage pools. In most cases, you should let MapR calculate storage pools using the default stripe width of two or three disks. If you anticipate a high volume of random-access I/O, you can use the -W option with disksetup to specify larger storage pools of up to 8 disks each.
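For example, the following sketch formats the disks listed in a file with a stripe width of 6 (the disk list path is hypothetical; verify the options accepted by your version of disksetup before running):

/opt/mapr/server/disksetup -W 6 -F /tmp/disks.txt

where /tmp/disks.txt contains the disk device names, one per line (for example, /dev/sdb).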

Setting Up NFS

The mapr-nfs service lets you access data on a licensed MapR cluster via the NFS protocol:

M3 license: one NFS node
M5 license: multiple NFS nodes with VIPs for failover and load balancing

You can mount the MapR cluster via NFS and use standard shell scripting to read and write live data in the cluster. NFS access to cluster data can be faster than accessing the same data with the hadoop commands. To mount the cluster via NFS from a client machine, see Setting Up the Client.

NFS Setup Tips

Before using the MapR NFS Gateway, here are a few helpful tips:

Ensure the stock Linux NFS service is stopped, as Linux NFS and MapR NFS will conflict with each other.
Ensure portmapper is running (Example: ps a | grep portmap).
Make sure you have installed the mapr-nfs package. If you followed the Quick Start - Single Node or Quick Start - Small Cluster instructions, then it is installed. You can check by listing the /opt/mapr/roles directory and checking for nfs in the list.
Make sure you have applied an M3 license or an M5 (paid or trial) license to the cluster. See Adding a License.
Make sure the NFS service is started (see Services).
For information about mounting the cluster via NFS, see Setting Up the Client.
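As an illustrative sketch (the node name and mount point are hypothetical), a client might mount the cluster like this:

mount -o nolock <NFS node>:/mapr /mapr

after which cluster volumes appear as ordinary directories under /mapr.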

NFS on an M3 Cluster

At installation time, choose one node on which to run the NFS gateway. NFS is lightweight and can be run on a node running services such as CLDB or ZooKeeper. To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the node where you would like to run NFS.

NFS on an M5 Cluster

At cluster installation time, plan which nodes should provide NFS access according to your anticipated traffic. For instance, if you need 5Gbps of write throughput and 5Gbps of read throughput, here are a few ways to set up NFS:

12 NFS nodes, each of which has a single 1Gbe connection
6 NFS nodes, each of which has a dual 1Gbe connection
4 NFS nodes, each of which has a quad 1Gbe connection

You can also set up NFS on all file server nodes, so each node can NFS-mount itself and native applications can run as tasks, or on one or more dedicated gateways outside the cluster (using round-robin DNS or behind a hardware load balancer) to allow controlled access.

You can set up virtual IP addresses (VIPs) for NFS nodes in an M5-licensed MapR cluster, for load balancing or failover. VIPs provide multiple addresses that can be leveraged for round-robin DNS, allowing client connections to be distributed among a pool of NFS nodes. VIPs also make high availability (HA) NFS possible; in the event an NFS node fails, data requests are satisfied by other NFS nodes in the pool. You should use a minimum of one VIP per NFS node per NIC that clients will use to connect to the NFS server. If you have four nodes with four NICs each, with each NIC connected to an individual IP subnet, use a minimum of 16 VIPs and direct clients to the VIPs in round-robin fashion. The VIPs should be in the same IP subnet as the interfaces to which they will be assigned.

Here are a few tips:


Set up NFS on at least three nodes if possible.
All NFS nodes must be accessible over the network from the machines where you want to mount them.
To serve a large number of clients, set up dedicated NFS nodes and load-balance between them. If the cluster is behind a firewall, you can provide access through the firewall via a load balancer instead of direct access to each NFS node. You can run NFS on all nodes in the cluster, if needed.
To provide maximum bandwidth to a specific client, install the NFS service directly on the client machine. The NFS gateway on the client manages how data is sent in or read back from the cluster, using all its network interfaces (that are on the same subnet as the cluster nodes) to transfer data via MapR APIs, balancing operations among nodes as needed.
Use VIPs to provide High Availability (HA) and failover. See Setting Up NFS HA for more information.

To add the NFS service to a running cluster, use the instructions in Adding Roles to install the mapr-nfs package on the nodes where you would like to run NFS.

NFS Memory Settings

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. You can adjust the minimum and maximum memory used for NFS, as well as the percentage of the heap that it tries to use, by setting the percent, max, and min parameters in the warden.conf file on each NFS node. Example:

...
service.command.nfs.heapsize.percent=3
service.command.nfs.heapsize.max=1000
service.command.nfs.heapsize.min=64
...

The percentages need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services unless you see specific memory-related problems occurring.
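To review the values currently in effect on a node before changing them, you can simply grep the warden configuration:

grep '^service.command.nfs.heapsize' /opt/mapr/conf/warden.conf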

Running NFS on a Non-standard Port

To run NFS on an arbitrary port, modify the following line in warden.conf:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start

Add -p <portnumber> to the end of the line, as in the following example:

service.command.nfs.start=/etc/init.d/mapr-nfsserver start -p 12345

After modifying warden.conf, restart the MapR NFS server by issuing the following command:

maprcli node services -nodes <nodename> -nfs restart

You can verify the port change with the rpcinfo -p localhost command.
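Note that clients must then specify the non-standard port when mounting. A hedged sketch using the port from the example above (the hostname is illustrative):

# NFS clients pass the gateway's port via the port= mount option
sudo mount -o port=12345,nolock node01:/mapr /mapr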

NIC Configuration

For high-performance clusters, use more than one network interface card (NIC) per node. MapR can detect multiple IP addresses on each node and load-balance throughput automatically.

Isolating CLDB Nodes

In a large cluster (100 nodes or more), create CLDB-only nodes to ensure high performance. This configuration also provides additional control over the placement of the CLDB data, for load balancing, fault tolerance, or high availability (HA). Setting up CLDB-only nodes involves restricting the CLDB volume to its own topology and making sure all other volumes are on a separate topology. Unless you specify a default volume topology, new volumes have no topology when they are created, and reside at the root topology path: "/". Because both the CLDB-only topology and the non-CLDB topology are children of the root topology path, new non-CLDB volumes are not guaranteed to stay off the CLDB-only nodes. To avoid this problem, set a default volume topology. See Setting Default Volume Topology.

To set up a CLDB-only node:

1. SET UP the node as usual:
   - PREPARE the node, making sure it meets the requirements.
   - ADD the MapR Repository.
2. INSTALL only the following packages:
   - mapr-cldb
   - mapr-webserver
   - mapr-core
   - mapr-fileserver
3. RUN configure.sh.
4. FORMAT the disks.
5. START the warden:

   /etc/init.d/mapr-warden start

To restrict the CLDB volume to specific nodes:

1. Move all CLDB nodes to a CLDB-only topology (e.g. /cldbonly) using the MapR Control System or the following command:

   maprcli node move -serverids <CLDB nodes> -topology /cldbonly

2. Restrict the CLDB volume to the CLDB-only topology using the MapR Control System or the following command:

   maprcli volume move -name mapr.cldb.internal -topology /cldbonly

3. If the CLDB volume is present on nodes not in /cldbonly, increase the replication factor of mapr.cldb.internal to create enough copies in /cldbonly using the MapR Control System or the following command:

   maprcli volume modify -name mapr.cldb.internal -replication <replication factor>

4. Once the volume has sufficient copies, remove the extra replicas by reducing the replication factor to the desired value, using the MapR Control System or the command used in the previous step.

To move all other volumes to a topology separate from the CLDB-only nodes:

1. Move all non-CLDB nodes to a non-CLDB topology (e.g. /defaultRack) using the MapR Control System or the following command:

   maprcli node move -serverids <all non-CLDB nodes> -topology /defaultRack

2. Restrict all existing volumes to the /defaultRack topology using the MapR Control System or the following command:

   maprcli volume move -name <volume> -topology /defaultRack

All volumes except mapr.cluster.root are re-replicated to the changed topology automatically.

To prevent subsequently created volumes from encroaching on the CLDB-only nodes, set a default topology that excludes the CLDB-only topology.
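One way to set that default from the command line is via the cldb.default.volume.topology configuration value; a sketch, assuming the /defaultRack topology from the steps above (see Setting Default Volume Topology for the full procedure):

# Make /defaultRack the default topology for newly created volumes
maprcli config save -values '{"cldb.default.volume.topology":"/defaultRack"}'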

Isolating ZooKeeper Nodes

For large clusters (100 nodes or more), isolate the ZooKeeper service on nodes that do not perform any other function, so that ZooKeeper does not compete for resources with other processes. Installing a ZooKeeper-only node is similar to any typical node installation, but with a specific subset of packages. Importantly, do not install the FileServer package, so that MapR does not use the ZooKeeper-only node for data storage.

To set up a ZooKeeper-only node:

1. SET UP the node as usual:
   - PREPARE the node, making sure it meets the requirements.
   - ADD the MapR Repository.
2. INSTALL only the following packages:
   - mapr-zookeeper
   - mapr-zk-internal
   - mapr-core
3. RUN configure.sh.
4. FORMAT the disks.
5. START ZooKeeper (as root or using sudo):

   /etc/init.d/mapr-zookeeper start

Do not start the warden.
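To confirm that the isolated ZooKeeper is healthy without starting any other services, you can use ZooKeeper's standard four-letter-word check against MapR's default ZooKeeper port (5181):

# Prints "imok" if the ZooKeeper server is up and healthy
echo ruok | nc localhost 5181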

Setting Up RAID on the Operating System Partition

You can set up RAID on the operating system partition(s) or drive(s) at installation time, to provide higher operating system performance (RAID 0), disk mirroring for failover (RAID 1), or both (RAID 10), for example. See the following instructions from the operating system websites:

- CentOS
- Red Hat
- Ubuntu

Tuning MapReduce

MapR automatically tunes the cluster for most purposes. A service called the warden determines machine resources on nodes configured to run the TaskTracker service, and sets MapReduce parameters accordingly.

On nodes with multiple CPUs, MapR uses taskset to reserve CPUs for MapR services:

- On nodes with five to eight CPUs, CPU 0 is reserved for MapR services.
- On nodes with nine or more CPUs, CPU 0 and CPU 1 are reserved for MapR services.
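You can confirm the affinity actually applied to a running MapR process with taskset; a sketch (the pgrep pattern is illustrative):

# Show the CPU affinity list of the warden process, for example
taskset -pc $(pgrep -f warden | head -n 1)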

In certain circumstances, you might wish to manually tune MapR to provide higher performance. For example, when running a job consisting of unusually large tasks, it is helpful to reduce the number of slots on each TaskTracker and adjust the Java heap size. The following sections provide MapReduce tuning tips. If you change any settings in mapred-site.xml, restart the TaskTracker.

Memory Settings

Memory for MapR Services

The memory allocated to each MapR service is specified in the /opt/mapr/conf/warden.conf file, which MapR automatically configures based on the physical memory available on the node. For example, you can adjust the minimum and maximum memory used for the TaskTracker, as well as the percentage of the heap that the TaskTracker tries to use, by setting the appropriate percent, max, and min parameters in the warden.conf file:

...
service.command.tt.heapsize.percent=2
service.command.tt.heapsize.max=325
service.command.tt.heapsize.min=64
...

The percentages of memory used by the services need not add up to 100; in fact, you can use less than the full heap by setting the heapsize.percent parameters for all services to add up to less than 100% of the heap size. In general, you should not need to adjust the memory settings for individual services unless you see specific memory-related problems occurring.

MapReduce Memory

The memory allocated for MapReduce tasks normally equals the total system memory minus the total memory allocated for MapR services. If necessary, you can use the mapreduce.tasktracker.reserved.physicalmemory.mb parameter to set the maximum physical memory reserved by MapReduce tasks, or you can set it to -1 to disable physical memory accounting and task management.

If the node runs out of memory, MapReduce tasks are killed by the OOM-killer to free memory. You can use the mapred.child.oom_adj parameter (copy it from mapred-default.xml) to adjust the oom_adj score for MapReduce tasks. The possible values of oom_adj range from -17 to +15; the higher the score, the more likely the associated process is to be killed by the OOM-killer.
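For example, to make MapReduce child tasks more likely to be chosen by the OOM-killer than system processes, you might add the property to mapred-site.xml with a positive score (the value here is illustrative):

<!-- mapred-site.xml: a higher oom_adj makes tasks more likely to be killed under memory pressure -->
<property>
  <name>mapred.child.oom_adj</name>
  <value>10</value>
</property>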

Job Configuration

Map Tasks

Map tasks use memory mainly in two ways:

- The MapReduce framework uses an intermediate buffer to hold serialized (key, value) pairs.
- The application consumes memory to run the map function.

MapReduce framework memory is controlled by io.sort.mb. If io.sort.mb is smaller than the data emitted from the mapper, the task ends up spilling data to disk. If io.sort.mb is too large, the task can run out of memory or waste allocated memory. By default, io.sort.mb is 100 MB. It should be approximately 1.25 times the number of data bytes emitted from the mapper. If you cannot resolve memory problems by adjusting io.sort.mb, then try to re-write the application to use less memory in its map function.
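As a worked example: if each map task emits roughly 96 MB of intermediate data, 1.25 x 96 is approximately 120, suggesting a setting along these lines (the figure is illustrative):

<!-- mapred-site.xml: sized at ~1.25x the mapper's emitted bytes -->
<property>
  <name>io.sort.mb</name>
  <value>120</value>
</property>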

Compression

To turn off MapR compression for map outputs, set mapreduce.maprfs.use.compression=false.

To turn on LZO or any other compression, set mapreduce.maprfs.use.compression=false and mapred.compress.map.output=true.
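A hedged mapred-site.xml sketch of those settings; the codec class shown assumes the separately installed hadoop-lzo library:

<!-- Disable MapR compression and enable LZO for map output -->
<property>
  <name>mapreduce.maprfs.use.compression</name>
  <value>false</value>
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>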

Reduce Tasks

If tasks fail because of an Out of Heap Space error, increase the heap space (the -Xmx option in mapred.reduce.child.java.opts) to give more memory to the tasks. If map tasks are failing, you can also try reducing io.sort.mb (see mapred.map.child.java.opts in mapred-site.xml).
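For instance, raising the reduce-task heap to 2 GB might look like this in mapred-site.xml (the figure is an example, not a recommendation):

<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>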

TaskTracker Configuration

MapR sets up map and reduce slots on each TaskTracker node using formulas based on the number of CPUs present on the node. The default formulas are stored in the following parameters in mapred-site.xml:

- mapred.tasktracker.map.tasks.maximum: (CPUS > 2) ? (CPUS * 0.75) : 1 (at least one map slot, up to 0.75 times the number of CPUs)
- mapred.tasktracker.reduce.tasks.maximum: (CPUS > 2) ? (CPUS * 0.50) : 1 (at least one reduce slot, up to 0.50 times the number of CPUs)

You can adjust the maximum number of map and reduce slots by editing the formulas used in mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum. The following variables are used in the formulas:

- CPUS: number of CPUs present on the node
- DISKS: number of disks present on the node
- MEM: memory reserved for MapReduce tasks

Ideally, the number of map and reduce slots should be decided based on the needs of the application. Map slots should be based on how many map tasks can fit in memory, and reduce slots should be based on the number of CPUs. If each task in a MapReduce job takes 3 GB, and each node has 9 GB reserved for MapReduce tasks, then the total number of map slots should be 3. The amount of data each map task must process also affects how many map slots should be configured. If each map task processes 256 MB (the default chunk size in MapR), then each map task should have 800 MB of memory. If there are 4 GB reserved for map tasks, then the number of map slots should be 4000 MB/800 MB, or 5 slots.
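Continuing the arithmetic above, one way to pin a node at five map slots is to replace the default formula with a fixed value in mapred-site.xml (a sketch for that specific 4 GB / 800 MB scenario):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>5</value>
</property>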

MapR allows the JobTracker to over-schedule tasks on TaskTracker nodes in advance of the availability of slots, creating a pipeline. This optimization allows the TaskTracker to launch each map task as soon as the previous running map task finishes. The number of tasks to over-schedule should be about 25-50% of the total number of map slots. You can adjust this number with the mapreduce.tasktracker.prefetch.maptasks parameter.
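A sketch, assuming the parameter is expressed as a fraction of the node's map slots, with 0.25 corresponding to the low end of the 25-50% guidance (verify the expected value format for your release):

<property>
  <name>mapreduce.tasktracker.prefetch.maptasks</name>
  <value>0.25</value>
</property>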

Troubleshooting Out-of-Memory Errors

When the aggregated memory used by MapReduce tasks exceeds the memory reserve on a TaskTracker node, tasks can fail or be killed. MapR attempts to prevent out-of-memory exceptions by killing MapReduce tasks when memory becomes scarce. If you allocate too little Java heap for the expected memory requirements of your tasks, an exception can occur. The following steps can help configure MapR to avoid these problems:

1. If a particular job encounters out-of-memory conditions, the simplest way to solve the problem might be to reduce the memory footprint of the map and reduce functions, and to ensure that the partitioner distributes map output to reducers evenly.

2. If it is not possible to reduce the memory footprint of the application, try increasing the Java heap size (-Xmx) in the client-side MapReduce configuration.

3. If many jobs encounter out-of-memory conditions, or if jobs tend to fail on specific nodes, it may be that those nodes are advertising too many TaskTracker slots. In this case, the cluster administrator should reduce the number of slots on the affected nodes.

To reduce the number of slots on a node:

1. Stop the TaskTracker service on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker stop

2. Edit the file /opt/mapr/hadoop/hadoop-<version>/conf/mapred-site.xml:
   - Reduce the number of map slots by lowering mapred.tasktracker.map.tasks.maximum.
   - Reduce the number of reduce slots by lowering mapred.tasktracker.reduce.tasks.maximum.

3. Start the TaskTracker on the node:

$ sudo maprcli node services -nodes <node name> -tasktracker start

ExpressLane

MapR provides an express path for small MapReduce jobs to run when all slots are occupied by long tasks. Small jobs are only given this special treatment when the cluster is busy, and only if they meet the criteria specified by the following parameters in mapred-site.xml:

mapred.fairscheduler.smalljob.schedule.enable (default: true)
  Enables small-job fast scheduling inside the fair scheduler. TaskTrackers reserve a slot, called an ephemeral slot, which is used for small jobs when the cluster is busy.

mapred.fairscheduler.smalljob.max.maps (default: 10)
  Small job definition: the maximum number of map tasks allowed in a small job.

mapred.fairscheduler.smalljob.max.reducers (default: 10)
  Small job definition: the maximum number of reduce tasks allowed in a small job.

mapred.fairscheduler.smalljob.max.inputsize (default: 10737418240)
  Small job definition: the maximum input size, in bytes, allowed for a small job (10 GB by default).

mapred.fairscheduler.smalljob.max.reducer.inputsize (default: 1073741824)
  Small job definition: the maximum estimated input size allowed per reducer in a small job (1 GB per reducer by default).

mapred.cluster.ephemeral.tasks.memory.limit.mb (default: 200)
  Small job definition: the maximum memory, in megabytes, reserved for an ephemeral slot (200 MB by default). This value must be the same on the JobTracker and TaskTracker nodes.

MapReduce jobs that appear to fit the small job definition but are in fact larger than anticipated are killed and re-queued for normal execution.

HBase

The HBase write-ahead log (WAL) writes many tiny records, and compressing it would cause massive CPU load. Before using HBase, turn off MapR compression for directories in the HBase volume (normally mounted at /hbase). Example:

hadoop mfs -setcompression off /hbase

You can check whether compression is turned off in a directory or mounted volume by using hadoop mfs to list the file contents. Example:

hadoop mfs -ls /hbase

The letter Z in the output indicates compression is turned on; the letter U indicates compression is turned off. See hadoop mfs for more information.

On any node where you plan to run both HBase and MapReduce, give more memory to the FileServer than to the RegionServer so that the node can handle high throughput. For example, on a node with 24 GB of physical memory, it might be desirable to limit the RegionServer to 4 GB, give 10 GB to MapR-FS, and give the remainder to the TaskTracker. To change the memory allocated to each service, edit the /opt/mapr/conf/warden.conf file. See Tuning MapReduce for more information.
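A hedged warden.conf sketch for that 24 GB example, following the service.command.<service>.heapsize pattern shown earlier; the mfs and hbregion key names are assumptions to verify against your own warden.conf before editing:

# Hypothetical entries: cap the RegionServer at 4 GB and give MapR-FS 10 GB
service.command.hbregion.heapsize.max=4000
service.command.mfs.heapsize.max=10000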

Glossary

.dfs_attributes: A special file in every directory, for controlling the compression and chunk size used for the directory and its subdirectories.

.rw: A special mount point in the root-level volume (or read-only mirror) that points to the writable original copy of the volume.

.snapshot: A special directory in the top level of each volume, containing all the snapshots for that volume.

access control list: A list of permissions attached to an object. An access control list (ACL) specifies users or system processes that can perform specific actions on an object.

accounting entity: A clearly defined economic unit that is accounted for separately.

ACL: See access control list.

advisory quota: An advisory disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the advisory quota, an alert is sent.

AE: See accounting entity.

bitmask: A binary number in which each bit controls a single toggle.

CLDB: See container location database.

container: The unit of sharded storage in a MapR cluster.

container location database: A service, running on one or more MapR nodes, that maintains the locations of services, containers, and other cluster information.

desired replication factor: The number of copies of a volume, not including the original, that should be maintained by the MapR cluster for normal operation.

disktab: A file on each node, containing a list of the node's disks that have been configured for use by MapR-FS.

dump file: A file containing data from a volume for distribution or restoration. There are two types of dump files: full dump files, which contain all data in a volume, and incremental dump files, which contain changes to a volume between two points in time.

entity: A user or group. Users and groups can represent accounting entities.

full dump file: See dump file.

HBase: A distributed storage system, designed to scale to a very large size, for managing massive amounts of structured data.

heartbeat: A signal sent by each FileServer and NFS node every second to provide information to the CLDB about the node's health and resource usage.

incremental dump file: See dump file.

JobTracker: The process responsible for submitting and tracking MapReduce jobs. The JobTracker sends individual tasks to TaskTrackers on nodes in the cluster.

MapR-FS: The NFS-mountable, distributed, high-performance MapR data storage system.

minimum replication factor: The minimum number of copies of a volume, not including the original, that should be maintained by the MapR cluster for normal operation. When the replication factor falls below this minimum, writes to the volume are disabled.

mirror: A read-only physical copy of a volume.

name container: A container that holds a volume's namespace information and file chunk locations, and the first 64 KB of each file in the volume.

Network File System: A protocol that allows a user on a client computer to access files over a network as though they were stored locally.

NFS: See Network File System.

node: An individual server (physical or virtual machine) in a cluster.

quota: A disk capacity limit that can be set for a volume, user, or group. When disk usage exceeds the quota, no more data can be written.

recovery point objective: The maximum allowable data loss, expressed as a point in time. If the recovery point objective is 2 hours, then the maximum allowable amount of data loss is 2 hours of work.

recovery time objective: The maximum allowable time to recovery after data loss. If the recovery time objective is 5 hours, then it must be possible to restore data up to the recovery point objective within 5 hours. See also recovery point objective.

replication factor: The number of copies of a volume, not including the original.

RPO: See recovery point objective.

RTO: See recovery time objective.

schedule: A group of rules that specify recurring points in time at which certain actions are determined to occur.

snapshot: A read-only logical image of a volume at a specific point in time.

storage pool: A unit of storage made up of one or more disks. By default, MapR storage pools contain two or three disks. For high-volume reads and writes, you can create larger storage pools when initially formatting storage during cluster creation.

stripe width: The number of disks in a storage pool.

super group: The group that has administrative access to the MapR cluster.

super user: The user that has administrative access to the MapR cluster.

TaskTracker: The process that starts and tracks MapReduce tasks on a node. The TaskTracker receives task assignments from the JobTracker and reports the results of each task back to the JobTracker on completion.

volume: A tree of files, directories, and other volumes, grouped for the purpose of applying a policy or set of policies to all of them at once.

warden: A MapR process that coordinates the starting and stopping of configured services on a node.

ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.