How to Install and Configure Big Data Edition for MapR 4.0.2

© 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.

Abstract

Install and configure Big Data Edition to run mappings on a Hadoop cluster on MapR 4.0.2. After you install Big Data Edition, you must enable mappings to run on MapR. You must also configure the Big Data Edition Client files to communicate with the Hadoop cluster.

Supported Versions

• Big Data Edition 9.6.1 HotFix 2 Update 1

• Big Data Edition 9.6.1 HotFix 3 Update 2

Table of Contents

Installation and Configuration Overview
Pre-Installation Tasks
Install and Configure PowerCenter
Install and Configure PowerExchange Adapters
Install and Configure Data Replication
Pre-Installation Tasks for a Single Node Environment
Pre-Installation Tasks for a Cluster Environment
Informatica Big Data Edition Installation
Installing in a Single Node Environment
Installing in a Cluster Environment
Installing in a Single Node Environment
Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol
Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol
Installing in a Cluster Environment from any Machine
Reference Data Requirements
Configure the Hadoop Cluster
Verify the Cluster Details
Update Hadoop Cluster Configuration Parameters on the Hadoop Cluster
Configure hive-site.xml on Every Node in the Hadoop Cluster for MapReduce 1
Configure yarn-site.xml on Every Node in the Cluster for MapReduce 2
Configure the Heap Space for the MapR-FS
Enable Hive Pushdown for HBase
Copy Teradata JDBC Jars to Hadoop Nodes
Configure the Informatica Domain
Configure Hadoop Pushdown Properties for the Data Integration Service
Configure hive-site.xml on the Data Integration Service Machine for MapReduce 1
Configure yarn-site.xml on the Data Integration Service Machine for MapReduce 2
Configure MapR Distribution Variables for Mappings in a Hive Environment
Library Path and Path Variables for Mappings in a Hive Environment
Hadoop Environment Properties File
Update Hadoop Cluster Configuration Parameters on the Informatica Domain
Hive Variables for Mappings in a Hive Environment
Copy Teradata JDBC Jars to the Data Integration Service Machine
Update the Repository Plug-in
Configure the Client Machine
Informatica Developer Files and Variables
Copy MapR Distribution Files for PowerCenter Mappings in the Native Environment
Configure the PowerCenter Integration Service
Copy Teradata JDBC Jars to the Client Machine
Configure High Availability
Configuring a Highly Available MapR Cluster
Connections
HDFS Connection Properties
HBase Connection Properties
Hive Connection Properties
Creating a Connection
Informatica Big Data Edition Uninstallation
Uninstalling Big Data Edition

Installation and Configuration Overview

Informatica Big Data Edition supports MapR 4.0.2 clusters that use MapReduce v1 or v2. You can allocate resources using the Capacity Scheduler or the Fair Scheduler. The Big Data Edition installation is distributed as a Red Hat Package Manager (RPM) installation package. After you install Big Data Edition, you must enable mappings to run on a Hadoop cluster on MapR. You must also configure the Big Data Edition client files to communicate with the Hadoop cluster.

To install Big Data Edition and enable Informatica mappings to run on MapR, perform the following tasks:

1. Complete the pre-installation tasks.

2. Install Big Data Edition.

3. Configure the Hadoop cluster.

4. Configure the Informatica domain.

5. Configure the client machine.

Pre-Installation Tasks

Before you begin the installation, install the Informatica components and PowerExchange adapters, and perform the pre-installation tasks.

Install and Configure PowerCenter

Before you install Big Data Edition, install and configure Informatica PowerCenter.

You can install the following PowerCenter editions:

• PowerCenter Advanced Edition

• PowerCenter Standard Edition

• PowerCenter Real Time Edition

You must install the Informatica services and clients. Run the Informatica services installation to configure the PowerCenter domain and create the Informatica services. Run the Informatica client installation to create the PowerCenter Client.

Install and Configure PowerExchange Adapters

Based on your business needs, install and configure PowerExchange adapters. Use Big Data Edition with PowerCenter and Informatica adapters for access to sources and targets.

To run Informatica mappings in a Hive environment you must install and configure PowerExchange for Hive. For more information, see the Informatica PowerExchange for Hive User Guide.

PowerCenter Adapters

Use PowerCenter adapters, such as PowerExchange for Hadoop, to define sources and targets in PowerCenter mappings.

For more information about installing and configuring PowerCenter adapters, see the PowerExchange adapter documentation.

Informatica Adapters

You can use the following Informatica adapters as part of PowerCenter Big Data Edition:

• PowerExchange for DataSift

• PowerExchange for Facebook

• PowerExchange for HBase

• PowerExchange for HDFS

• PowerExchange for Hive

• PowerExchange for LinkedIn

• PowerExchange for Teradata Parallel Transporter API

• PowerExchange for Twitter

• PowerExchange for Web Content-Kapow Katalyst

For more information, see the PowerExchange adapter documentation.

Install and Configure Data Replication

To migrate data with minimal downtime and perform auditing and operational reporting functions, install and configure Data Replication. For more information, see the Informatica Data Replication User Guide.

Pre-Installation Tasks for a Single Node Environment

Before you begin the Big Data Edition installation in a single node environment, perform the pre-installation tasks.

• Verify that Hadoop is installed with Hadoop File System (HDFS) and MapReduce. The Hadoop installation should include a Hive data warehouse that is configured to use a non-embedded database as the MetaStore. For more information, see the Apache website here: http://hadoop.apache.org.

• To perform both read and write operations in native mode, install the required third-party client software. For example, install the Oracle client to connect to the Oracle database.

  • Verify that the Big Data Edition administrator user can run sudo commands or has root user privileges.

• Verify that the temporary folder on the local node has at least 700 MB of disk space.

• Download the following file to the temporary folder: InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz

• Extract the following file to the local node where you want to run the Big Data Edition installation: InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz

Pre-Installation Tasks for a Cluster Environment

Before you begin the Big Data Edition installation in a cluster environment, perform the following tasks:

• Install third-party software.

• Verify the distribution method.

• Verify system requirements.

• Verify connection requirements.

• Download the RPM.

Install Third-Party Software

Verify that the following third-party software is installed:

Hadoop with Hadoop Distributed File System (HDFS) and MapReduce

Hadoop must be installed on every node within the cluster. The Hadoop installation must include a Hive data warehouse that is configured to use a MySQL database as the MetaStore. You can configure Hive to use a local or remote MetaStore server. For more information, see the Apache website here: http://hadoop.apache.org/.

Note: Informatica does not support embedded MetaStore server setups.

Database client software to perform read and write operations in native mode

Install the client software for the database. Informatica requires the client software to run MapReduce jobs. For example, install the Oracle client to connect to the Oracle database. Install the database client software on all the nodes within the Hadoop cluster.

Verify the Distribution Method

You can distribute the RPM package with one of the following protocols:

• File Transfer Protocol (FTP)

• Hypertext Transfer Protocol (HTTP)

• Network File System (NFS) protocol

• Secure Copy (SCP) protocol

To verify that you can distribute the RPM package with one of the protocols, perform the following tasks:

1. Ensure that the server or service for your distribution method is running.

2. In the config file on the machine where you want to run the Big Data Edition installation, set the DISTRIBUTOR_NODE parameter to the following setting:

• FTP: Set DISTRIBUTOR_NODE=ftp://<Distributor Node IP Address>/pub

• HTTP: Set DISTRIBUTOR_NODE=http://<Distributor Node IP Address>

• NFS: Set DISTRIBUTOR_NODE=<Shared file location on the node.>

The file location must be accessible to all nodes in the cluster.
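
For example, if you distribute the RPM package over FTP, the entry might look like the following line (the IP address is illustrative):

DISTRIBUTOR_NODE=ftp://10.25.20.10/pub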

Verify System Requirements

Verify the following system requirements:

• The Big Data Edition administrator can run sudo commands or has root user privileges.

• The temporary folder in each of the nodes on which Big Data Edition will be installed has at least 700 MB of disk space.

Verify Connection Requirements

Verify the connection to the Hadoop cluster nodes.

Big Data Edition requires a Secure Shell (SSH) connection without a password between the machine where you want to run the Big Data Edition installation and all the nodes in the Hadoop cluster.
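
If passwordless SSH is not yet configured, a minimal sketch of one common approach is to generate a key pair on the installation machine and copy the public key to every cluster node. The host names and key type here are illustrative; your environment may use a different key distribution mechanism:

# Generate a key pair without a passphrase on the installation machine
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Copy the public key to each node in the Hadoop cluster
for node in node1.example.com node2.example.com node3.example.com; do
    ssh-copy-id root@$node
done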

Download the RPM

Download the following file to a temporary folder:

InformaticaHadoop-<InformaticaForHadoopVersion>.tar.gz

Extract the file to the machine from where you want to distribute the RPM package and run the Big Data Edition installation.

Copy the following package to a shared directory based on the transfer protocol you are using: InformaticaHadoop-<InformaticaForHadoopVersion>.rpm.

For example,

• HTTP: /var/www/html

• FTP: /var/ftp/pub

• NFS: <Shared location on the node>

The file location must be accessible by all the nodes in the cluster.

Note: The RPM package must be stored on a local disk and not on HDFS.

Informatica Big Data Edition Installation

You can install Big Data Edition in a single node environment. You can also install Big Data Edition in a cluster environment from the primary NameNode or from any machine.

Install Big Data Edition in a single node environment or cluster environment:

• Install Big Data Edition in a single node environment.

• Install Big Data Edition in a cluster environment from the primary NameNode using SCP protocol.

• Install Big Data Edition in a cluster environment from the primary NameNode using FTP, HTTP, or NFS protocol.

• Install Big Data Edition in a cluster environment from any machine.

Install Big Data Edition from a shell command line.

Installing in a Single Node Environment

You can install Big Data Edition in a single node environment.

1. Extract the Big Data Edition tar.gz file to the machine.

2. Install Big Data Edition by running the installation shell script in a Linux environment.

Installing in a Cluster Environment

You can install Big Data Edition in a cluster environment.

1. Extract the Big Data Edition tar.gz file to a machine.

2. Distribute the RPM package to all of the nodes within the Hadoop cluster. You can distribute the RPM package using any of the following protocols: File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP), Network File System (NFS), or Secure Copy Protocol (SCP).

3. Install Big Data Edition by running the installation shell script in a Linux environment. You can install Big Data Edition from the primary NameNode or from any machine using the HadoopDataNodes file.

• Install from the primary NameNode. You can install Big Data Edition using FTP, HTTP, NFS or SCP protocol. During the installation, the installer shell script picks up all of the DataNodes from the following file: $HADOOP_HOME/conf/slaves. Then, it copies the Big Data Edition binary files to the following directory on each of the DataNodes: /<BigDataEditionInstallationDirectory>/Informatica. You can perform this step only if you are deploying Hadoop from the primary NameNode.

• Install from any machine. Add the IP addresses or machine host names, one for each line, for each of the nodes in the Hadoop cluster in the HadoopDataNodes file. During the Big Data Edition installation, the installation shell script picks up all of the nodes from the HadoopDataNodes file and copies the Big Data Edition binary files to the /<BigDataEditionInstallationDirectory>/Informatica directory on each of the nodes.
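
For example, the HadoopDataNodes file for a hypothetical three-node cluster might contain the following entries (the host names are illustrative):

node1.cluster.example.com
node2.cluster.example.com
node3.cluster.example.com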

Installing in a Single Node Environment

You can install Big Data Edition in a single node environment.

1. Log in to the machine.

2. Run the following command from the Big Data Edition root directory to start the installation in console mode: bash InformaticaHadoopInstall.sh

3. Press y to accept the Big Data Edition terms of agreement.

4. Press Enter.

5. Press 1 to install Big Data Edition in a single node environment.

6. Press Enter.

7. Type the absolute path for the Big Data Edition installation directory and press Enter.

Start the path with a slash. The directory names in the path must not contain spaces or the following special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of the nodes during the installation. Default is /opt.

8. Press Enter.

The installer creates the /<BigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package.

To get more information about the tasks performed by the installer, you can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file.

Installing in a Cluster Environment from the Primary NameNode Using SCP Protocol

You can install Big Data Edition in a cluster environment from the primary NameNode using SCP protocol.

1. Log in to the primary NameNode.

2. Run the following command to start the Big Data Edition installation in console mode: bash InformaticaHadoopInstall.sh

3. Press y to accept the Big Data Edition terms of agreement.

4. Press Enter.

5. Press 2 to install Big Data Edition in a cluster environment.

6. Press Enter.

7. Type the absolute path for the Big Data Edition installation directory.

Start the path with a slash. The directory names in the path must not contain spaces or the following special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of the nodes during the installation. Default is /opt.

8. Press Enter.

9. Press 1 to install Big Data Edition from the primary NameNode.

10. Press Enter.

11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

12. Press Enter.

13. Type y.

14. Press Enter.

The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the DataNodes, the installer creates the Informatica directory and populates all of the file systems with the contents of the RPM package. The Informatica directory is located here: /<BigDataEditionInstallationDirectory>/Informatica

You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more information about the tasks performed by the installer.

Installing in a Cluster Environment from the Primary NameNode Using FTP, HTTP, or NFS Protocol

You can install Big Data Edition in a cluster environment from the primary NameNode using FTP, HTTP, or NFS protocol.

1. Log in to the primary NameNode.

2. Run the following command to start the Big Data Edition installation in console mode: bash InformaticaHadoopInstall.sh

3. Press y to accept the Big Data Edition terms of agreement.

4. Press Enter.

5. Press 2 to install Big Data Edition in a cluster environment.

6. Press Enter.

7. Type the absolute path for the Big Data Edition installation directory.

Start the path with a slash. The directory names in the path must not contain spaces or the following special characters: { } ! @ # $ % ^ & * ( ) : ; | ' ` < > , ? + [ ] \

If you type a directory path that does not exist, the installer creates the entire directory path on each of the nodes during the installation. Default is /opt.

8. Press Enter.

9. Press 1 to install Big Data Edition from the primary NameNode.

10. Press Enter.

11. Type the absolute path for the Hadoop installation directory. Start the path with a slash.

12. Press Enter.

13. Type n.

14. Press Enter.

15. Type y.

16. Press Enter.

The installer retrieves a list of DataNodes from the $HADOOP_HOME/conf/slaves file. On each of the DataNodes, the installer creates the /<BigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package.

You can view the informatica-hadoop-install.<DateTimeStamp>.log installation log file to get more information about the tasks performed by the installer.

Installing in a Cluster Environment from any Machine

You can install Big Data Edition in a cluster environment from any machine.

1. Verify that the Big Data Edition administrator has user root privileges on the node that will be running the Big Data Edition installation.

2. Log in to the machine as the root user.

3. In the HadoopDataNodes file, add the IP addresses or machine host names of the nodes in the Hadoop cluster on which you want to install Big Data Edition. The HadoopDataNodes file is located on the node from where you want to launch the Big Data Edition installation. Add one IP address or machine host name per line.

4. Run the following command to start the Big Data Edition installation in console mode: bash InformaticaHadoopInstall.sh

5. Press y to accept the Big Data Edition terms of agreement.

6. Press Enter.

7. Press 2 to install Big Data Edition in a cluster environment.

8. Press Enter.

9. Type the absolute path for the Big Data Edition installation directory and press Enter. Start the path with a slash. Default is /opt.

10. Press Enter.

11. Press 2 to install Big Data Edition using the HadoopDataNodes file.

12. Press Enter.

The installer creates the /<BigDataEditionInstallationDirectory>/Informatica directory and populates all of the file systems with the contents of the RPM package on the first node that appears in the HadoopDataNodes file. The installer repeats the process for each node in the HadoopDataNodes file.

Reference Data Requirements

If you have a Data Quality product license, you can push a mapping that contains data quality transformations to a Hadoop cluster. Data quality transformations can use reference data to verify that data values are accurate and correctly formatted.

When you apply a pushdown operation to a mapping that contains data quality transformations, the operation can copy the reference data that the mapping uses. The pushdown operation copies reference table data, content set data, and identity population data to the Hadoop cluster. After the mapping runs, the cluster deletes the reference data that the pushdown operation copied with the mapping.

Note: The pushdown operation does not copy address validation reference data. If you push a mapping that performs address validation, you must install the address validation reference data files on each DataNode that runs the mapping. The cluster does not delete the address validation reference data files after the address validation mapping runs.

Address validation mappings validate and enhance the accuracy of postal address records. You can buy address reference data files from Informatica on a subscription basis. You can download the current address reference data files from Informatica at any time during the subscription period.

Installing the Address Reference Data Files

To install the address reference data files on each DataNode in the cluster, create an automation script.

1. Browse to the address reference data files that you downloaded from Informatica.

2. Extract the compressed address reference data files.

3. Stage the files to the NameNode machine or to another machine that can write to the DataNodes.

4. Create an automation script to copy the files to each DataNode.

The default directory for the address reference data files in the Hadoop environment is /reference_data.

• If you staged the files on the NameNode, use the slaves file for the Hadoop cluster to identify the DataNodes.

• If you staged the files on another machine, use the Hadoop_Nodes.txt file to identify the DataNodes. You find this file in the Big Data Edition installation package.

5. Run the script.

The script copies the address reference data files to the DataNodes.
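
The following is a minimal sketch of such a script, assuming the files are staged on the NameNode, the DataNode host names are listed in $HADOOP_HOME/conf/slaves, and the default /reference_data directory is used. The staging directory is illustrative:

#!/bin/bash
# Copy the extracted address reference data files to every DataNode.
STAGE_DIR=/tmp/address_reference_data
TARGET_DIR=/reference_data
for node in $(cat $HADOOP_HOME/conf/slaves); do
    ssh root@$node "mkdir -p $TARGET_DIR"
    scp -r $STAGE_DIR/* root@$node:$TARGET_DIR/
done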

Configure the Hadoop Cluster

After you install Big Data Edition, you must configure the Hadoop cluster to run mappings on MapR.

To enable mappings to run on a MapR cluster that uses MapReduce 1, perform the following tasks:

1. Verify the cluster details.

2. Update Hadoop cluster configuration parameters on the Hadoop cluster.

3. Configure hive-site.xml on every node in the Hadoop cluster for MapReduce 1.

4. Configure the heap space for the MapR-FS.

5. Enable Hive pushdown for HBase.

6. Copy Teradata JDBC jars to Hadoop nodes.

To enable mappings to run on a MapR cluster that uses MapReduce 2, perform the following tasks:

1. Verify the cluster details.

2. Update Hadoop cluster configuration parameters on the Hadoop cluster.

3. Configure yarn-site.xml on every node in the Hadoop cluster for MapReduce 2.

4. Configure the heap space for the MapR-FS.

5. Copy Teradata JDBC jars to Hadoop nodes.

Verify the Cluster Details

Verify the following settings for the MapR cluster:

MapReduce Version

Verify that the cluster is configured for the correct version of MapReduce. You can use the MapR Control System (MCS) to change the MapReduce version. Then, restart the cluster.

MapR User Details

Verify that the MapR user exists on each Hadoop cluster node and that the following properties match:

• User ID (uid)

• Group ID (gid)

• Groups

For example, the MapR user might have the following properties:

• uid=2000(mapr)

• gid=2000(mapr)

• groups=2000(mapr)

Data Integration Service User Details

Verify that the user who runs the Data Integration Service is assigned the same gid as the MapR user and belongs to the same group.

For example, a Data Integration Service user named testuser might have the following properties:

• uid=30103(testuser)

• gid=2000(mapr)

• groups=2000(mapr)

After you verify the Data Integration Service user details, perform the following steps:

1. Create a user that has the same user ID and name as the Data Integration Service user.

2. Add this user to all the nodes in the Hadoop cluster and assign it to the mapr group.

3. Verify that the user you created has read and write permissions for the following directory: /opt/mapr/hive/hive-0.13/logs.

A directory corresponding to the user will be created at this location.

4. Verify that the user you created has permissions for the Hive warehouse directory.

The Hive warehouse directory is set in the following file: /opt/mapr/hive/hive-0.13/conf/hive-site.xml.

For example, if the warehouse directory is /user/hive/warehouse, run the following command to grant the user permissions for the directory:

hadoop fs -chmod -R 777 /user/hive/warehouse
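
For steps 1 and 2, a minimal sketch of the commands on a cluster node, using the illustrative user name, uid, and gid from the examples above:

# Create a user that matches the Data Integration Service user and assign it to the mapr group
useradd -u 30103 -g mapr testuser
# Confirm the uid, gid, and groups
id testuser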

Update Hadoop Cluster Configuration Parameters on the Hadoop Cluster

Hadoop cluster configuration parameters that set the Java library path in mapred-site.xml can override the paths set in hadoopEnv.properties. Update the mapred-site.xml cluster configuration file on all the cluster nodes to remove Java options that set the Java library path.

The following cluster configuration parameters in mapred-site.xml can override the Java library path set in hadoopEnv.properties:

• mapreduce.admin.map.child.java.opts

• mapreduce.admin.reduce.child.java.opts

If the Data Integration Service cannot access the native libraries set in hadoopEnv.properties, the mappings can fail to run in a Hive environment.

After you install, update the cluster configuration file mapred-site.xml to remove the Java option -Djava.library.path from the property configuration.

Example to Update mapred-site.xml on Cluster Nodes

If the mapred-site.xml file sets the following configuration for mapreduce.admin.map.child.java.opts parameter:

<property>
  <name>mapreduce.admin.map.child.java.opts</name>
  <value>-server -XX:NewRatio=8 -Djava.library.path=/usr/lib/hadoop/lib/native/:/mylib/ -Djava.net.preferIPv4Stack=true</value>
  <final>true</final>
</property>

The path to Hadoop libraries in mapreduce.admin.map.child.java.opts overrides the following path set in the hadoopEnv.properties file:

infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64 -Djava.security.egd=file:/dev/./urandom

Configure hive-site.xml on Every Node in the Hadoop Cluster for MapReduce 1

If the MapR cluster uses MapReduce 1, you must configure the Hive metastore property in hive-site.xml that grants permissions to the Data Integration Service to perform operations on the Hive metastore. You must configure the property in hive-site.xml on the Hadoop cluster nodes.

hive-site.xml is located in the following directory on every Hadoop cluster node: <Hadoop_NODE_INFA_HOME>/services/shared/hadoop/mapr_<version>_classic.

In hive-site.xml, configure the following property:

hive.metastore.execute.setugi

Enables the Hive metastore server to use the client's user and group permissions. Set the value to true.

The following sample code shows the property you can configure in hive-site.xml:

<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>

Configure yarn-site.xml on Every Node in the Cluster for MapReduce 2

Configure Hadoop cluster properties in the yarn-site.xml file on every node in the Hadoop cluster.

yarn-site.xml is located in the following directory on the Hadoop cluster nodes: /opt/mapr/hadoop/hadoop-<version>/etc/hadoop.

In yarn-site.xml, configure the following properties:

Note: If a property does not exist in yarn-site.xml, add it to the file.

yarn.nodemanager.resource.memory-mb

Amount of physical memory, in megabytes, that can be allocated for containers.

Use "24000" for the value.

yarn.scheduler.minimum-allocation-mb

The minimum allocation for every container request at the RM, in megabytes. Memory requests lower than this do not take effect, and the specified value will get allocated.

Use "2048" for the value.

yarn.scheduler.maximum-allocation-mb

The maximum allocation for every container request at the RM, in megabytes. Memory requests higher than this do not take effect and are capped at this value.

Use "24000" for the value.

yarn.app.mapreduce.am.resource.mb

The amount of memory the MR AppMaster needs.

Use "2048" for the value.

yarn.nodemanager.resource.cpu-vcores

Number of CPU cores that can be allocated for containers.

Use "8" for the value.

The following sample code shows the properties you can configure in yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
  <value>24000</value>
</property>

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <description>The minimum allocation for every container request at the RM, in MBs. Memory requests lower than this won't take effect, and the specified value will get allocated at minimum.</description>
  <value>2048</value>
</property>

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <description>The maximum allocation for every container request at the RM, in MBs. Memory requests higher than this won't take effect, and will get capped to this value.</description>
  <value>24000</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <description>The amount of memory the MR AppMaster needs.</description>
  <value>2048</value>
</property>

<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <description>Number of CPU cores that can be allocated for containers.</description>
  <value>8</value>
</property>

Configure the Heap Space for the MapR-FS

You must configure the heap space reserved for the MapR-FS on every node in the cluster.

Perform the following steps:

1. Navigate to the following directory: /opt/mapr/conf.

2. Edit the warden.conf file.

3. Set the value for the service.command.mfs.heapsize.percent property to 20.

4. Save and close the file.

5. Repeat steps 1 through 4 for every node in the cluster.

6. Restart the cluster.
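
After you complete step 3, the edited entry in the warden.conf file should read as follows:

service.command.mfs.heapsize.percent=20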

Enable Hive Pushdown for HBase

If the MapR cluster uses MapReduce 1, you must add the hbase-protocol*.jar to the Hadoop classpath on every node to enable Hive pushdown for HBase. If the cluster uses MapReduce 2, no action is required.

If the MapR cluster uses MapReduce 1, perform the following steps:

1. Add hbase-protocol-0.98.7-mapr-1501.jar to the Hadoop classpath on every node of the Hadoop cluster.

2. Restart the Node Manager for each node.
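
One way to perform step 1 is to extend HADOOP_CLASSPATH in hadoop-env.sh on each node. The following sketch assumes the JAR is located in the HBase lib directory of the MapR installation; verify the actual path on your cluster:

# Append the HBase protocol JAR (location is illustrative) to the Hadoop classpath
export HADOOP_CLASSPATH=/opt/mapr/hbase/hbase-0.98.7/lib/hbase-protocol-0.98.7-mapr-1501.jar:$HADOOP_CLASSPATH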

Copy Teradata JDBC Jars to Hadoop Nodes

To use Lookup transformations with a Teradata data object in Hive pushdown mode, you must copy the Teradata JDBC drivers to the Informatica installation directory.

You can download the Teradata JDBC drivers from Teradata. For more information about the drivers, see the following Teradata website: http://downloads.teradata.com/download/connectivity/jdbc-driver.

The software available for download at the referenced links belongs to a third party or third parties, not Informatica Corporation. The download links are subject to the possibility of errors, omissions or change. Informatica assumes no responsibility for such links and/or such software, disclaims all warranties, either express or implied, including but not limited to, implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and disclaims all liability relating thereto.

Copy tdgssconfig.jar and terajdbc4.jar from the Teradata JDBC drivers to the following directory on every node in the Hadoop cluster: <Informatica installation directory>/externaljdbcjars
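
For example, a minimal sketch that copies the drivers to every node listed in the HadoopDataNodes file. The installation path is illustrative; use the externaljdbcjars directory of your Big Data Edition installation:

for node in $(cat HadoopDataNodes); do
    scp tdgssconfig.jar terajdbc4.jar root@$node:/opt/Informatica/externaljdbcjars/
done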

Configure the Informatica Domain

After you install Big Data Edition, you must configure the Informatica domain to run mappings in a Hive environment.

To configure the Informatica Domain to run mappings on a MapR cluster that uses MapReduce 1, complete the following tasks:

1. Configure the Hadoop pushdown properties for the Data Integration Service.

2. Configure hive-site.xml on the Data Integration Service Machine for MapReduce 1.

3. Configure MapR distribution variables for mappings in a Hive environment.

4. Configure the library path and path environment variables.

5. Optionally, to add environment variables or extend existing ones, edit the Hadoop environment properties file.

6. Update Hadoop cluster configuration parameters on the Informatica domain.

7. Configure Hive environment variables.

8. Copy Teradata JDBC jars to the Data Integration Service machine.

9. Update the repository plug-in.

To configure the Informatica Domain to run mappings on a MapR cluster that uses MapReduce 2, complete the following tasks:

1. Configure the Hadoop pushdown properties for the Data Integration Service.

2. Configure yarn-site.xml on the Data Integration Service Machine for MapReduce 2.

3. Configure MapR distribution variables for mappings in a Hive environment.

4. Configure the library path and path environment variables.

5. Optionally, to add environment variables or extend existing ones, edit the Hadoop environment properties file.

6. Update Hadoop cluster configuration parameters on the Informatica domain.

7. Configure Hive environment variables.

8. Copy Teradata JDBC jars to the Data Integration Service machine.

9. Update the repository plug-in.

Configure Hadoop Pushdown Properties for the Data Integration Service

Configure Hadoop pushdown properties for the Data Integration Service to run mappings in a Hive environment.

You can configure Hadoop pushdown properties for the Data Integration Service in the Administrator tool.

The Data Integration Service has the following Hadoop pushdown properties:

Informatica Home Directory on Hadoop

The Big Data Edition home directory on every data node created by the Hadoop RPM install. Type /<BigDataEditionInstallationDirectory>/Informatica.

Hadoop Distribution Directory

The directory containing a collection of Hive and Hadoop JAR files on the cluster from the RPM install locations. The directory contains the minimum set of JAR files required to process Informatica mappings in a Hadoop environment. Type /<BigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/[Hadoop_distribution_name].

Data Integration Service Hadoop Distribution Directory

The Hadoop distribution directory on the Data Integration Service node. The contents of the Data Integration Service Hadoop distribution directory must be identical to the Hadoop distribution directory on the data nodes.

Hadoop Distribution Directory

You can modify the Hadoop distribution directory on the data nodes.

When you modify the Hadoop distribution directory, you must copy the minimum set of Hive and Hadoop JARS, and the Snappy libraries required to process Informatica mappings in a Hive environment from your Hadoop install location. The actual Hive and Hadoop JARS can vary depending on the Hadoop distribution and version.

The Hadoop RPM installs the Hadoop distribution directories in the following path: <BigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop.

Configure hive-site.xml on the Data Integration Service Machine for MapReduce 1

If the MapR cluster uses MapReduce 1, you must configure the cluster properties in hive-site.xml on the machine where the Data Integration Service runs.

hive-site.xml is located in the following directory on the machine on which the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/<hadoop distribution_name>/conf/.

In hive-site.xml, configure the following properties:

hive.metastore.execute.setugi

Enables the Hive metastore server to use the client's user and group permissions. Set the value to true.

The following sample code shows the property you can configure in hive-site.xml:

<property>
  <name>hive.metastore.execute.setugi</name>
  <value>true</value>
</property>

hive.cache.expr.evaluation

Controls whether Hive caches the evaluation results of deterministic expressions.

The value must be set to false due to a bug in the optimization feature for hive-0.13. For more information, see the following JIRA entry: https://issues.apache.org/jira/browse/HIVE-7314.

The following sample code shows the property you can configure in hive-site.xml:

<property>
  <name>hive.cache.expr.evaluation</name>
  <value>false</value>
  <description>Enables caching of the evaluation results of deterministic expressions.</description>
</property>

Configure yarn-site.xml on the Data Integration Service Machine for MapReduce 2

If the MapR cluster uses MapReduce 2, you must configure the cluster properties in yarn-site.xml on the machine where the Data Integration Service runs.

yarn-site.xml is located in the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/services/shared/hadoop/mapr_<version>_yarn/conf/.

In yarn-site.xml, configure the following properties:

mapreduce.jobhistory.address

Location of the MapReduce JobHistory Server. The default value is 10020.

Use the value in the following file: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/mapred-site.xml

mapreduce.jobhistory.webapp.address

Web address of the MapReduce JobHistory Server. The default value is 19888.

Use the value in the following file: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/mapred-site.xml

yarn.resourcemanager.scheduler.address

Scheduler interface address. The default value is 8030.

Use the value in the following file: /opt/mapr/hadoop/hadoop-2.5.1/etc/hadoop/yarn-site.xml

The following sample code describes the properties you can set in yarn-site.xml:

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hostname:port</value>
  <description>MapReduce JobHistory Server IPC host:port</description>
</property>

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hostname:port</value>
  <description>MapReduce JobHistory Server Web UI host:port</description>
</property>

<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>hostname:port</value>
  <description>The address of the scheduler interface</description>
</property>

Configure MapR Distribution Variables for Mappings in a Hive Environment

When you use the MapR distribution to run mappings in a Hive environment, you must configure MapR distribution variables.

Configure the following MapR variables:

• Add MAPR_HOME to the environment variables in the Data Integration Service Process properties. Set MAPR_HOME to the following path: <BigDataEditionInstallationDirectory>/services/shared/hadoop/mapr_<version>.

• Add -Dmapr.library.flatclass to the custom properties in the Data Integration Service Process properties. For example, add JVMOption1=-Dmapr.library.flatclass

• Add -Dmapr.library.flatclass to the Data Integration Service advanced property JVM Command Line Options.

• Set the MapR Container Location Database name variable CLDB in the following file: <BigDataEditionInstallationDirectory>/services/shared/hadoop/mapr_<version>/conf/mapr-clusters.conf. For example, add the following property:

INFAMAPR402 secure=false <master_node_name>:7222

Library Path and Path Variables for Mappings in a Hive Environment

To run mappings in a Hive environment, configure the library path and path environment variables in the hadoopEnv.properties file.

Configure the following library path and path environment variables:

• If the Data Integration Service runs on a machine that uses SUSE, verify that the following entries are set to a valid value that is not POSIX:

- infapdo.env.entry.lc_all=LC_ALL

- infapdo.env.entry.lang=LANG

For example, you can use US.UTF-8.

• When you run mappings in a Hive environment, configure the ODBC library path before the Teradata library path. For example, infapdo.env.entry.ld_library_path=LD_LIBRARY_PATH=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_INFA_HOME/ODBC7.0/lib/:/opt/teradata/client/13.10/tbuild/lib64:/opt/teradata/client/13.10/odbc_64/lib:/databases/oracle11.2.0_64BIT/lib:/databases/db2v9.5_64BIT/lib64/:$HADOOP_NODE_INFA_HOME/DataTransformation/bin:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:$LD_LIBRARY_PATH .

• When you use the MapR distribution on the Linux operating system, change the environment variable LD_LIBRARY_PATH to include the following path: <BigDataEditionInstallationDirectory>/services/shared/hadoop/mapr_<version>/lib/native/Linux-amd64-64.

• When you use the MapR distribution on the Linux operating system, change the environment variable MAPR_HOME to include the following path: <BigDataEditionInstallationDirectory>/services/shared/hadoop/mapr_<version>.

Hadoop Environment Properties File

To add environment variables or to extend existing ones, use the Hadoop environment properties file, hadoopEnv.properties.

You can optionally add third-party environment variables or extend the existing PATH environment variable in hadoopEnv.properties.

1. Go to the following location: <InformaticaInstallationDir>/services/shared/hadoop/<Hadoop_distribution_name>/infaConf

2. Find the file named hadoopEnv.properties.

3. Back up the file before you modify it.

4. Use a text editor to open the file and modify the properties.

5. Save the properties file with the name hadoopEnv.properties.
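
For example, a hypothetical entry that extends the PATH variable with a third-party client directory. The property name follows the infapdo.env.entry pattern used elsewhere in the file, and the added directory is illustrative; append it to the existing value rather than replacing it:

infapdo.env.entry.path=PATH=/opt/thirdparty/client/bin:$PATH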

Update Hadoop Cluster Configuration Parameters on the Informatica Domain

Hadoop cluster configuration parameters that set the Java library path in mapred-site.xml can override the paths set in hadoopEnv.properties.

If the Data Integration Service cannot access the native libraries set in hadoopEnv.properties, the mappings can fail to run in a Hive environment.

After you install Big Data Edition, edit hadoopEnv.properties to include the user Hadoop libraries in the Java Library path.

Note: Before you perform this task, update mapred-site.xml on all the cluster nodes to remove Java options that set the Java library path. For more information, see “Update Hadoop Cluster Configuration Parameters on the Hadoop Cluster”.

To run mappings in a Hive environment, change hadoopEnv.properties to add the Hadoop library paths /usr/lib/hadoop/lib/native/ and /mylib/ to the Java library path, with the following syntax:

infapdo.java.opts=-Xmx512M -XX:GCTimeRatio=34 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:NewRatio=2 -Djava.library.path=$HADOOP_NODE_INFA_HOME/services/shared/bin:$HADOOP_NODE_HADOOP_DIST/lib/*:$HADOOP_NODE_HADOOP_DIST/lib/native/Linux-amd64-64:/usr/lib/hadoop/lib/native/:/mylib/ -Djava.security.egd=file:/dev/./urandom

Hive Variables for Mappings in a Hive Environment

To run mappings in a Hive environment, configure Hive environment variables.

You can configure Hive environment variables in the file /<BigDataEditionInstallationDirectory>/Informatica/services/shared/hadoop/<Hadoop_distribution_name>/conf/hive-site.xml.

Configure the following Hive environment variables:

• hive.exec.dynamic.partition=true and hive.exec.dynamic.partition.mode=nonstrict. Configure if you want to use Hive dynamic partitioned tables.

• hive.optimize.ppd = false. Disable predicate pushdown optimization to get accurate results for mappings with Hive version 0.9.0. You cannot use predicate pushdown optimization for a Hive query that uses multiple insert statements. The default Hadoop RPM installation sets hive.optimize.ppd to false.

Copy Teradata JDBC Jars to the Data Integration Service Machine

To use Lookup transformations with a Teradata data object in Hive pushdown mode, you must copy the Teradata JDBC drivers to the Informatica installation directory.

You can download the Teradata JDBC drivers from Teradata. For more information about the drivers, see the following Teradata website: http://downloads.teradata.com/download/connectivity/jdbc-driver.

The software available for download at the referenced links belongs to a third party or third parties, not Informatica Corporation. The download links are subject to the possibility of errors, omissions or change. Informatica assumes no responsibility for such links and/or such software, disclaims all warranties, either express or implied, including but not limited to, implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and disclaims all liability relating thereto.

Copy tdgssconfig.jar and terajdbc4.jar from the Teradata JDBC drivers to the following directory on the machine where the Data Integration Service runs: <Informatica installation directory>/externaljdbcjars

Update the Repository Plug-in

If you upgraded an existing repository, you must update the repository plug-in to enable PowerExchange for HDFS to run on the Hadoop distribution. If you created a new repository, skip this task.

1. Ensure that the Repository service is running in exclusive mode.

2. On the server machine, open the command console.

3. Run cd <Informatica installation directory>/server/bin

4. Run ./pmrep connect -r <repo_name> -d <domain_name> -n <username> -x <password>

5. Run ./pmrep registerplugin -i native/pmhdfs.xml -e -N true

6. Set the Repository service to normal mode.

7. Open the PowerCenter Workflow Manager on the client machine.

The distribution appears in the Connection Object menu.
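
For example, a session for a hypothetical repository and domain might look like the following (the installation path, repository, domain, user, and password values are placeholders):

cd /opt/Informatica/9.6.1/server/bin
./pmrep connect -r BDE_Repo -d Domain_BDE -n Administrator -x MyPassword
./pmrep registerplugin -i native/pmhdfs.xml -e -N true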

Configure the Client Machine

You must configure the Big Data Edition Client files to communicate with the Hadoop cluster. The Big Data Edition client includes the Developer tool client.

To enable these files to communicate with the Hadoop cluster, perform the following tasks:

1. Configure the Big Data Edition client files.

2. Copy MapR distribution files for PowerCenter mappings in a native environment.

3. Configure the PowerCenter Integration Service.

4. Copy Teradata JDBC jars to client machine.

Informatica Developer Files and Variables

Edit developerCore.ini to enable the Developer tool to communicate with the Hadoop cluster on a particular Hadoop distribution. After you edit the file, you must click run.bat to launch the Developer tool client again. If you use the MapR distribution, you must also set the MAPR_HOME environment variable to run MapR mappings in a Hive environment.

developerCore.ini is located in the following directory: <InformaticaClientInstallationDirectory>\<version>\clients\DeveloperClient

Add the following property to developerCore.ini:

• -DINFA_HADOOP_DIST_DIR=hadoop\<HadoopDistributionName>

For a Hadoop cluster that runs MapR, you must perform the following additional tasks:

• Add the following properties to developerCore.ini:

- -Djava.library.path=hadoop\mapr_<version>\lib\native\Win32;bin;..\DT\bin

- -Dmapr.library.flatclass

  • Edit run.bat to set the MAPR_HOME environment variable and the -clean setting. For example, include the following lines:

MAPR_HOME=<InformaticaClientInstallationDirectory>\<version>\clients\DeveloperClient\hadoop\mapr_<version>
developerCore.exe -clean

  • Copy mapr-cluster.conf to the following directory on the machine where the Developer tool runs: <Informatica installation directory>\<version>\clients\DeveloperClient\hadoop\mapr_<version>\conf. You can find mapr-cluster.conf in the following directory on any node in the Hadoop cluster: <MapR installation directory>/conf

Copy MapR Distribution Files for PowerCenter Mappings in the Native Environment

When you use the MapR distribution to run mappings in a native environment, you must copy MapR files to the machine on which you install Big Data Edition.

Perform the following steps:

1. Go to the following directory on any node in the cluster: <MapR installation directory>/conf

For example, go to the following directory: /opt/mapr/conf.

2. Find the following files:

• mapr-cluster.conf

• mapr.login.conf

3. Copy the files to the following directory on the machine on which the PowerCenter Integration Service runs: <Informatica installation directory>/server/bin/javalib/hadoop/mapr<version>/conf

4. Log in to the Administrator tool.

5. In the Domain Navigator, select the PowerCenter Integration Service.

6. Recycle the service. Click Actions > Recycle Service.

Configure the PowerCenter Integration Service

To enable support for MapR, configure the PowerCenter Integration Service.

Perform the following steps:

1. Log in to the Administrator tool.

2. In the Domain Navigator, select the PowerCenter Integration Service.

3. Click the Processes view.

4. Add the following environment variable:

MAPR_HOME

Use the following value: <Informatica installation directory>/server/bin/javalib/hadoop/mapr<version>

5. Add the following custom property:

JVMClassPath

Use the following value: <Informatica installation directory>/server/bin/javalib/hadoop/mapr<version>/*:<Informatica installation directory>/server/bin/javalib/hadoop/*

6. Recycle the service.

Click Actions > Recycle Service.

Copy Teradata JDBC Jars to the Client Machine

To use Lookup transformations with a Teradata data object in Hive pushdown mode, you must copy the Teradata JDBC drivers to the Informatica installation directory.

You can download the Teradata JDBC drivers from Teradata. For more information about the drivers, see the following Teradata website: http://downloads.teradata.com/download/connectivity/jdbc-driver.

The software available for download at the referenced links belongs to a third party or third parties, not Informatica Corporation. The download links are subject to the possibility of errors, omissions or change. Informatica assumes no responsibility for such links and/or such software, disclaims all warranties, either express or implied, including but not limited to, implied warranties of merchantability, fitness for a particular purpose, title and non-infringement, and disclaims all liability relating thereto.

Copy tdgssconfig.jar and terajdbc4.jar to the following directory on the machine where the Developer tool runs: <Informatica installation directory>\clients\externaljdbcjars.
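
For example, assuming the drivers were downloaded to C:\Downloads and the client is installed under C:\Informatica\9.6.1 (both paths are placeholders), the copy might look like this:

copy C:\Downloads\tdgssconfig.jar C:\Informatica\9.6.1\clients\externaljdbcjars
copy C:\Downloads\terajdbc4.jar C:\Informatica\9.6.1\clients\externaljdbcjars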

Configure High Availability

You can configure the Data Integration Service and the Developer tool to read from and write to a highly available Hadoop cluster.

A highly available Hadoop cluster can provide uninterrupted access to the JobTracker, NameNode, and ResourceManager in the cluster. The JobTracker is the service within Hadoop that assigns MapReduce jobs on the cluster. The NameNode tracks file data across the cluster. The ResourceManager tracks resources and schedules applications in the cluster.

Configuring a Highly Available MapR Cluster

You can enable the Data Integration Service and the Developer tool to read from and write to a highly available MapR cluster. The MapR cluster on MRv1 provides a highly available NameNode and JobTracker.

1. Go to the following directory on the NameNode of the cluster:

/opt/mapr/conf

2. Locate the mapr-cluster.conf file.

3. Copy the file to the machine on which the Data Integration Service runs and the machine on which the Developer tool client runs:

On the machine on which the Data Integration Service runs, copy the file to the following directory:

<Informatica installation directory>/services/shared/hadoop/mapr_<version>/conf

On the machine on which the Developer tool runs, copy the file to the following directory:

<Informatica installation directory>/clients/DeveloperClient/hadoop/mapr_<version>/conf

4. Open the Developer tool.

5. Click Window > Preferences.

6. Select Informatica > Connections.

7. Expand the domain.

8. Expand File Systems and select the HDFS connection.


9. Edit the HDFS connection and configure the following property in the Details tab:

NameNode URI

Use the value of the dfs.nameservices property.

You can get the value of the dfs.nameservices property from hdfs-site.xml from the following location on the NameNode of the cluster: /etc/hadoop/conf
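
For example, a hypothetical hdfs-site.xml excerpt might define the nameservice like this; the value shown is only an illustration of what you would copy into the NameNode URI property:

<property>
  <name>dfs.nameservices</name>
  <value>my.cluster.com</value>
</property>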

Connections

Define the connections you want to use to access data in Hive or HDFS.

You can create the following types of connections:

HDFS connection

Create an HDFS connection to read data from or write data to the Hadoop cluster.

HBase connection

Create an HBase connection to access HBase. The HBase connection is a NoSQL connection.

Hive connection

Create a Hive connection to access Hive data or run Informatica mappings in the Hadoop cluster. Create a Hive connection in the following connection modes:

• Use the Hive connection to access Hive as a source or target. If you want to use Hive as a target, you need to have the same connection or another Hive connection that is enabled to run mappings in the Hadoop cluster. You can access Hive as a source if the mapping is enabled for the native or Hive environment. You can access Hive as a target only if the mapping is run in the Hadoop cluster.

• Use the Hive connection to validate or run an Informatica mapping in the Hadoop cluster. Before you run mappings in the Hadoop cluster, review the information in this guide about rules and guidelines for mappings that you can run in the Hadoop cluster.

You can create the connections using the Developer tool, Administrator tool, and infacmd.

Note: For information about creating connections to other sources or targets such as social media web sites or Teradata, see the respective PowerExchange adapter user guide.

HDFS Connection Properties

Use a Hadoop File System (HDFS) connection to access data in the Hadoop cluster. The HDFS connection is a file system type connection. You can create and manage an HDFS connection in the Administrator tool, Analyst tool, or the Developer tool. HDFS connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes HDFS connection properties:

Property Description

Name Name of the connection. The name is not case sensitive and must be unique within the domain. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.


Description The description of the connection. The description cannot exceed 765 characters.

Location The domain where you want to create the connection. Not valid for the Analyst tool.

Type The connection type. Default is Hadoop File System.

User Name User name to access HDFS.

NameNode URI

Use one of the following formats to specify the NameNode URI in the MapR distribution:
- maprfs:///
- maprfs:///mapr/my.cluster.com/
where my.cluster.com is the cluster name that you specify in the mapr-clusters.conf file.

HBase Connection Properties

Use an HBase connection to access HBase. The HBase connection is a NoSQL connection. You can create and manage an HBase connection in the Administrator tool or the Developer tool. HBase connection properties are case sensitive unless otherwise noted.

The following table describes HBase connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select HBase.

ZooKeeper Host(s) Name of the machine that hosts the ZooKeeper server. The name is case sensitive. When ZooKeeper runs in replicated mode, specify a comma-separated list of the servers in the ZooKeeper quorum. If the TCP connection to a server breaks, the client connects to a different server in the quorum.

ZooKeeper Port Port number of the machine that hosts the ZooKeeper server.

Enable Kerberos Connection Enables the Informatica domain to communicate with the HBase master server or region server that uses Kerberos authentication.


HBase Master Principal Service Principal Name (SPN) of the HBase master server. Enables the ZooKeeper server to communicate with an HBase master server that uses Kerberos authentication. Enter a string in the following format:
hbase/<domain.name>@<YOUR-REALM>
where:
- domain.name is the domain name of the machine that hosts the HBase master server.
- YOUR-REALM is the Kerberos realm.

HBase Region Server Principal Service Principal Name (SPN) of the HBase region server. Enables the ZooKeeper server to communicate with an HBase region server that uses Kerberos authentication. Enter a string in the following format:
hbase_rs/<domain.name>@<YOUR-REALM>
where:
- domain.name is the domain name of the machine that hosts the HBase region server.
- YOUR-REALM is the Kerberos realm.
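
As an illustration only, a Kerberos-enabled HBase connection might use values like the following. The host names, realm, and port are placeholders; confirm the ZooKeeper port against your cluster configuration (MapR clusters commonly use 5181):

ZooKeeper Host(s): zk1.example.com,zk2.example.com,zk3.example.com
ZooKeeper Port: 5181
HBase Master Principal: hbase/cluster.example.com@EXAMPLE.COM
HBase Region Server Principal: hbase_rs/cluster.example.com@EXAMPLE.COM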

Hive Connection Properties

Use the Hive connection to access Hive data. A Hive connection is a database type connection. You can create and manage a Hive connection in the Administrator tool, Analyst tool, or the Developer tool. Hive connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes Hive connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change this property after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 4000 characters.

Location The domain where you want to create the connection. Not valid for the Analyst tool.

Type The connection type. Select Hive.


Connection Modes Hive connection mode. Select at least one of the following options:
- Access Hive as a source or target. Select this option if you want to use the connection to access the Hive data warehouse. If you want to use Hive as a target, you must enable the same connection or another Hive connection to run mappings in the Hadoop cluster.
- Use Hive to run mappings in Hadoop cluster. Select this option if you want to use the connection to run mappings in the Hadoop cluster.
You can select both options. Default is Access Hive as a source or target.

User Name User name of the user that the Data Integration Service impersonates to run mappings on a Hadoop cluster. The user name depends on the JDBC connection string that you specify in the Metadata Connection String or Data Access Connection String for the native environment.
If the Hadoop cluster uses Kerberos authentication, the principal name for the JDBC connection string and the user name must be the same. Otherwise, the user name depends on the behavior of the JDBC driver. With the Hive JDBC driver, you can specify a user name in many ways, and the user name can become part of the JDBC URL.
If the Hadoop cluster does not use Kerberos authentication, the user name depends on the behavior of the JDBC driver.
If you do not specify a user name, the Hadoop cluster authenticates jobs based on the following criteria:
- The Hadoop cluster does not use Kerberos authentication. It authenticates jobs based on the operating system profile user name of the machine that runs the Data Integration Service.
- The Hadoop cluster uses Kerberos authentication. It authenticates jobs based on the SPN of the Data Integration Service.

Common Attributes to Both Modes: Environment SQL
SQL commands to set the Hadoop environment. In the native environment, the Data Integration Service executes the environment SQL each time it creates a connection to a Hive metastore. If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes the environment SQL at the beginning of each Hive session.
The following rules and guidelines apply to the use of environment SQL in both connection modes:
- Use the environment SQL to specify Hive queries.
- Use the environment SQL to set the classpath for Hive user-defined functions and then use environment SQL or PreSQL to specify the Hive user-defined functions. You cannot use PreSQL in the data object properties to specify the classpath. The path must be the fully qualified path to the JAR files used for user-defined functions. Set the parameter hive.aux.jars.path with all the entries in infapdo.aux.jars.path and the path to the JAR files for user-defined functions.
- You can use environment SQL to define Hadoop or Hive parameters that you want to use in the PreSQL commands or in custom queries.
If you use the Hive connection to run mappings in the Hadoop cluster, the Data Integration Service executes only the environment SQL of the Hive connection. If the Hive sources and targets are on different clusters, the Data Integration Service does not execute the different environment SQL commands for the connections of the Hive source or target.
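
As a hypothetical illustration of the second guideline, the environment SQL for a connection might register a user-defined function like this. The JAR path, function name, and class name are invented for the example; keep all entries from infapdo.aux.jars.path in the hive.aux.jars.path value:

set hive.aux.jars.path=<entries from infapdo.aux.jars.path>,file:///opt/udfs/my_udfs.jar;
create temporary function my_upper as 'com.example.hive.udf.MyUpper';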


Properties to Access Hive as Source or Target

The following table describes the connection properties that you configure to access Hive as a source or target:

Property Description

Metadata Connection String

The JDBC connection URI used to access the metadata from the Hadoop server. You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service.
To connect to HiveServer, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>
where:
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database name to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.
To connect to HiveServer2, use the connection string format that Apache Hive implements for that specific Hadoop distribution. For more information about Apache Hive connection string formats, see the Apache Hive documentation.

Bypass Hive JDBC Server

JDBC driver mode. Select the check box to use the embedded JDBC driver mode.
To use the JDBC embedded mode, perform the following tasks:
- Verify that the Hive client and Informatica services are installed on the same machine.
- Configure the Hive connection properties to run mappings in the Hadoop cluster.
If you choose the non-embedded mode, you must configure the Data Access Connection String. Informatica recommends that you use the JDBC embedded mode.

Data Access Connection String

The connection string to access data from the Hadoop data store.
To connect to HiveServer, specify the non-embedded JDBC mode connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>
where:
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.
To connect to HiveServer2, use the connection string format that Apache Hive implements for the specific Hadoop distribution. For more information about Apache Hive connection string formats, see the Apache Hive documentation.
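
For example, a connection string for a HiveServer2 instance listening on the common default port 10000 and using the default database might look like the following; the host name is a placeholder:

jdbc:hive2://hiveserver.example.com:10000/default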


Properties to Run Mappings in Hadoop Cluster

The following table describes the Hive connection properties that you configure when you want to use the Hive connection to run Informatica mappings in the Hadoop cluster:

Property Description

Database Name Namespace for tables. Use the name default for tables that do not have a specified database name.

Default FS URI The URI to access the default Hadoop Distributed File System. Use the following connection URI:
hdfs://<node name>:<port>
where:
- <node name> is the host name or IP address of the NameNode.
- <port> is the port on which the NameNode listens for remote procedure calls (RPC).

JobTracker/Yarn Resource Manager URI The service within Hadoop that submits the MapReduce tasks to specific nodes in the cluster. Use the following format:
<hostname>:<port>
where:
- <hostname> is the host name or IP address of the JobTracker or Yarn resource manager.
- <port> is the port on which the JobTracker or Yarn resource manager listens for remote procedure calls (RPC).
Note: The MapR distribution supports a highly available JobTracker. If you are using the MapR distribution, define the JobTracker URI in the following format: maprfs:///

Hive Warehouse Directory on HDFS

The absolute HDFS file path of the default database for the warehouse that is local to the cluster. For example, the following file path specifies a local warehouse: /user/hive/warehouse

Advanced Hive/Hadoop Properties

Configures or overrides Hive or Hadoop cluster properties in hive-site.xml on the machine on which the Data Integration Service runs. You can specify multiple properties.
Use the following format:
<property1>=<value>
where:
- <property1> is a Hive or Hadoop property in hive-site.xml.
- <value> is the value of the Hive or Hadoop property.
To specify multiple properties, use &: as the property separator. The maximum length for the format is 1 MB.
If you enter a required property for a Hive connection, it overrides the property that you configure in the Advanced Hive/Hadoop Properties.
The Data Integration Service adds or sets these properties for each map-reduce job. You can verify these properties in the JobConf of each mapper and reducer job. Access the JobConf of each job from the JobTracker URL under each map-reduce job.
The Data Integration Service writes messages for these properties to the Data Integration Service logs. The Data Integration Service must have the log tracing level set to log each row or have the log tracing level set to verbose initialization tracing.
For example, specify the following properties to control and limit the number of reducers to run a mapping job:
mapred.reduce.tasks=2&:hive.exec.reducers.max=10


Temporary Table Compression Codec

Hadoop compression library for a compression codec class name.

Codec Class Name Codec class name that enables data compression and improves performance on temporary staging tables.

Metastore Execution Mode Controls whether to connect to a remote metastore or a local metastore. By default, local is selected. For a local metastore, you must specify the Metastore Database URI, Driver, Username, and Password. For a remote metastore, you must specify only the Remote Metastore URI.

Metastore Database URI The JDBC connection URI used to access the data store in a local metastore setup. Use the following connection URI:
jdbc:<datastore type>://<node name>:<port>/<database name>
where:
- <node name> is the host name or IP address of the data store.
- <data store type> is the type of the data store.
- <port> is the port on which the data store listens for remote procedure calls (RPC).
- <database name> is the name of the database.
For example, the following URI specifies a local metastore that uses MySQL as a data store:
jdbc:mysql://hostname23:3306/metastore

Metastore Database Driver Driver class name for the JDBC data store. For example, the following class name specifies a MySQL driver: com.mysql.jdbc.Driver

Metastore Database Username The metastore database user name.

Metastore Database Password The password for the metastore user name.

Remote Metastore URI The metastore URI used to access metadata in a remote metastore setup. For a remote metastore, you must specify the Thrift server details. Use the following connection URI:
thrift://<hostname>:<port>
where:
- <hostname> is the name or IP address of the Thrift metastore server.
- <port> is the port on which the Thrift server listens.
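
For example, a remote metastore URI might look like the following; the host name is a placeholder, and 9083 is only the common default Thrift metastore port, so verify the port against your cluster:

thrift://metastore.example.com:9083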

Creating a Connection

Create a connection before you import data objects, preview data, profile data, and run mappings.

1. Click Window > Preferences.

2. Select Informatica > Connections.

3. Expand the domain in the Available Connections list.

4. Select the type of connection that you want to create:

• To select a Hive connection, select Database > Hive.

• To select an HDFS connection, select File Systems > Hadoop File System.

5. Click Add.


6. Enter a connection name and optional description.

7. Click Next.

8. Configure the connection properties. For a Hive connection, you must choose the Hive connection mode and specify the commands for environment SQL. The SQL commands apply to both connection modes. Select at least one of the following connection modes:

Option Description

Access Hive as a source or target

Use the connection to access Hive data. If you select this option and click Next, the Properties to Access Hive as a source or target page appears. Configure the connection strings.

Run mappings in a Hadoop cluster.

Use the Hive connection to validate and run Informatica mappings in the Hadoop cluster. If you select this option and click Next, the Properties used to Run Mappings in the Hadoop Cluster page appears. Configure the properties.

9. Click Test Connection to verify the connection.

You can test a Hive connection that is configured to access Hive data. You cannot test a Hive connection that is configured to run Informatica mappings in the Hadoop cluster.

10. Click Finish.

Informatica Big Data Edition Uninstallation

The Big Data Edition uninstallation deletes the Big Data Edition binary files from all of the DataNodes within the Hadoop cluster. Uninstall Big Data Edition by running a shell command.

Uninstalling Big Data Edition

To uninstall Big Data Edition in a single node or cluster environment:

1. Verify that the Big Data Edition administrator can run sudo commands.

2. If you are uninstalling Big Data Edition in a cluster environment, set up a password-less Secure Shell (SSH) connection between the machine from which you want to run the Big Data Edition uninstallation and all of the nodes from which Big Data Edition will be uninstalled.

3. If you are uninstalling Big Data Edition in a cluster environment using the HadoopDataNodes file, verify that the HadoopDataNodes file contains the IP addresses or machine host names of each of the nodes in the Hadoop cluster from which you want to uninstall Big Data Edition. The HadoopDataNodes file is located on the node from which you want to launch the Big Data Edition installation. Add the IP address or machine host name of one node in the Hadoop cluster on each line of the file. An example file appears after this procedure.

4. Log in to the machine. The machine you log into depends on the Big Data Edition environment and uninstallation method:

• If you are uninstalling Big Data Edition in a single node environment, log in to the machine on which Big Data Edition is installed.

• If you are uninstalling Big Data Edition in a cluster environment using the HADOOP_HOME environment variable, log in to the primary NameNode.

• If you are uninstalling Big Data Edition in a cluster environment using the HadoopDataNodes file, log in to any node.


5. Run the following command to start the Big Data Edition uninstallation in console mode: bash InformaticaHadoopInstall.sh

6. Press y to accept the Big Data Edition terms of agreement.

7. Press Enter.

8. Select 3 to uninstall Big Data Edition.

9. Press Enter.

10. Select the uninstallation option, depending on the Big Data Edition environment:

• Select 1 to uninstall Big Data Edition in a single node environment.

• Select 2 to uninstall Big Data Edition in a cluster environment.

11. Press Enter.

12. If you are uninstalling Big Data Edition in a cluster environment, select the uninstallation option, depending on the uninstallation method:

• Select 1 to uninstall Big Data Edition from the primary NameNode.

• Select 2 to uninstall Big Data Edition using the HadoopDataNodes file.

13. Press Enter.

14. If you are uninstalling Big Data Edition in a cluster environment from the primary NameNode, type the absolute path for the Hadoop installation directory. Start the path with a slash.

The uninstaller deletes all of the Big Data Edition binary files from the /<BigDataEditionInstallationDirectory>/Informatica directory. In a cluster environment, the uninstaller deletes the binary files from all of the nodes within the Hadoop cluster.
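
The HadoopDataNodes file referenced in step 3 lists one cluster node per line. A hypothetical example with invented host names:

node01.example.com
node02.example.com
node03.example.com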

Authors

Laura Caton
Technical Writer
