Teradata Data Archival Strategy with Hadoop and Hive
September 2014
Teradata to Hadoop Data Migration
By Seshu Kanuri
Enterprise Data Architect
2
Agenda
The Data Archival proof of concept is currently underway under the direction and guidance of the Business Insurance (BI) Teradata 14.10 Upgrade Program.
This high-level proof-of-concept design focuses on various techniques and practices for archiving and retrieving BI Warehouse data between the Teradata and Hadoop environments.
3
Use Cases for POC
Case 1: Copy an existing Teradata base table and all its data to Hadoop DFS and verify that the data is structurally similar. Move the data back to a Teradata database as a relational table and verify that the structure and data are exactly the same.
Case 2: Copy an existing Teradata SCD table and all its data to Hadoop DFS and verify that the data is structurally similar. Capture the CDC values from the Teradata table, apply these changes to the HDFS table, and verify that the CDC values are reflected in the HDFS table. Move the HDFS table back to Teradata and verify that the structure and data are the same between Teradata and HDFS.
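The Case 1 round trip can be sketched as two Sqoop invocations. This is a minimal sketch, assuming hypothetical host, credential, table, and path names; the commands are assembled into strings so the options are easy to inspect, and would be run (for example with eval) on a node where Sqoop and the Teradata JDBC driver are installed:

```shell
# All names below are hypothetical placeholders for the POC environment.
TD_URL="jdbc:teradata://tdhost/DATABASE=bi_wh"
PWD_FILE=/user/poc/.td_pwd

# Step 1: copy the Teradata base table and its data into HDFS.
IMPORT_CMD="sqoop import \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table CLAIMS_BASE --target-dir /poc/claims_base --num-mappers 4"

# Step 2: move the HDFS data back into an empty Teradata staging table;
# row counts and spot checks on both sides then verify structural parity.
EXPORT_CMD="sqoop export \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table CLAIMS_BASE_RESTORED --export-dir /poc/claims_base"

echo "$IMPORT_CMD"
echo "$EXPORT_CMD"
```

Comparing SELECT COUNT(*) results and column definitions on both sides is the simplest verification step for the "structurally similar" check the use case calls for.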
4
Scope Definition – Out of Scope
• Ab Initio-based ETL from TD to HDFS (extracts to the Landing Zone may be considered)
• Non-Apache Hadoop drivers and connectors
• Cluster hardware/software configurations
• Security layer implementation (data encryption, masking, etc.)
• Performance tuning and benchmarking
• High availability and DR
5
CDH Stack Components

| Component | Installed Version | Desired Version | Features in Desired Version |
|---|---|---|---|
| DataFu | pig-udf-datafu-0.0.4+11 | | |
| Apache Flume | flume-ng-1.4.0+96 | | |
| Apache Hadoop | hadoop-2.0.0+1554 | | |
| Apache HBase | hbase-0.94.15+86 | | |
| Apache Hive | hive-0.10.0+237 | 0.14 | Truncate, more data types |
| Hue | hue-2.5.0+217 | | |
| Apache Mahout | mahout-0.7+15 | | |
| Apache Oozie | oozie-3.3.2+100 | | |
| Parquet | parquet-1.2.5+7 | | |
| Apache Pig | pig-0.11.0+42 | | |
| Apache Sentry | sentry-1.1.0+20 | | |
| Apache Sqoop | sqoop-1.4.3+92 | | |
| Apache Sqoop2 | sqoop2-1.99.2+99 | | |
| Apache Whirr | whirr-0.8.2+15 | | |
| Apache ZooKeeper | zookeeper-3.4.5+25 | | |
6
Solution Architecture – Option A (Sqoop with Hive)
[Architecture diagram] The Source Layer (Teradata) connects to the Storage Layer over Import/Export paths provided by three transfer options: a custom MapReduce JDBC utility, the Cloudera Sqoop connector powered by Teradata, and the Teradata Connector for Hadoop (TDCH) CLI utility. The Storage Layer holds HDFS and HBase (with TDCH and Sqoop as the transfer components), topped by the Hadoop ecosystem services: scripting (Pig), SQL query (Hive), MapReduce, and Oozie, accessible over JDBC/ODBC, all under Cloudera Management & Monitoring Services.
7
Solution Architecture Description – Option A
# Solution Component Description
1 Source Layer – Teradata
• Contains Teradata tables that need to be migrated to Hadoop storage.
• Tables could be Full Refresh tables or SCD tables.
2 Storage Layer – CDH 4.x
• Cloudera Distribution with Cloudera Manager for management and monitoring.
• Hadoop stack includes Hive, Pig, HBase, Oozie, and Sqoop.
3 Sqoop Connector for Teradata / Teradata Connector for Hadoop (CLI)
• Cloudera connector for Sqoop powered by Teradata, developed by Cloudera and Teradata.
• Supports importing data split by AMP/VALUE/PARTITION/HASH.
• Supports exporting data via batch.insert, multiple.fastload, or internal.fastload.
• Supports importing and exporting data in Text/Sequence/Avro file formats.
• Cloudera recommendation: use the Cloudera connector powered by Teradata rather than the older Cloudera connector for Teradata.
• TDCH is a command-line utility provided by Teradata, built on the Teradata Java SDK (TeradataImportTool/TeradataExportTool), for data transfer between Hadoop and Teradata.
4 Hadoop/Other (Processing Layer)
• HDFS will be used to store the files and process them.
• Sqoop-imported files could also be imported directly into Hive or loaded into HBase through a custom loading utility.
• SCD – Load Fact tables into Hive. Load
5 Oozie
• Hadoop processing can be scheduled in a workflow through Oozie.
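The split-by and export-method options described for the connector surface as extra arguments after Sqoop's `--` separator. A minimal sketch, assuming hypothetical host, table, and path names (the `--input-method` and `--output-method` values are the ones documented for the Cloudera connector powered by Teradata; commands are built into strings for inspection and would be run via eval on the cluster):

```shell
# Hypothetical POC placeholders.
TD_URL="jdbc:teradata://tdhost/DATABASE=bi_wh"
PWD_FILE=/user/poc/.td_pwd

# Import split across Teradata AMPs; alternatives for --input-method are
# split.by.value, split.by.partition, and split.by.hash.
IMPORT_CMD="sqoop import \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table POLICY_FACT --target-dir /poc/policy_fact \
  -- --input-method split.by.amp"

# Export using FastLoad through the connector; alternatives for
# --output-method are batch.insert and multiple.fastload.
EXPORT_CMD="sqoop export \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table POLICY_FACT_RESTORED --export-dir /poc/policy_fact \
  -- --output-method internal.fastload"

echo "$IMPORT_CMD"
echo "$EXPORT_CMD"
```

internal.fastload is typically the fastest export path for large volumes, while batch.insert avoids FastLoad's empty-table restriction; the POC can compare both.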
8
Solution Architecture – Option B (Hive with Sqoop and Teradata Utilities)
[Architecture diagram] The Source/ETL Layer (Teradata with Ab Initio) exchanges flat files with a File Landing Zone using Teradata utilities on the Extract/Load paths; files then move between the landing zone and HDFS via HDFS file copy. The Storage Layer holds HDFS and HBase (with TDCH and Sqoop available), topped by the Hadoop ecosystem services: scripting (Pig), SQL query (Hive), MapReduce, and Oozie, accessible over JDBC/ODBC, all under Cloudera Management & Monitoring Services.
9
Solution Architecture Description – Option B
# Solution Component Description
1 Source Layer – Teradata
• Contains Teradata tables that need to be migrated to Hadoop storage.
• Tables could be Full Refresh tables or SCD tables.
2 Source Layer – Ab Initio / File Landing Zone
• Leverage Ab Initio to extract data into a flat file when importing data to HDFS.
• Leverage Ab Initio to load data into Teradata tables from files exported by HDFS.
• Files would be copied to and from Ab Initio and HDFS in a designated File Landing Zone.
3 Storage Layer – CDH 4.x
• Cloudera Distribution with Cloudera Manager for management and monitoring.
• Hadoop stack includes Hive, Pig, HBase, Oozie, and Sqoop.
4 Hadoop/Other (Processing Layer)
• HDFS will be used to store the files and process them.
• Files could also be imported directly into Hive or loaded into HBase through a custom program.
• SCD – TBA
5 Oozie / Autosys
• Hadoop processing can be scheduled in a workflow through Oozie.
• Ab Initio processing can be scheduled via Autosys.
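The landing-zone hand-off described above comes down to two HDFS shell copies, one per direction. A minimal sketch, assuming hypothetical landing-zone and HDFS paths and file names (Ab Initio reads and writes the flat files on the landing zone; these commands move them across the HDFS boundary):

```shell
# Hypothetical placeholder paths for the POC.
LANDING=/landing/bi_wh
HDFS_DIR=/poc/claims_base

# Import direction: Ab Initio extract file -> HDFS.
PUT_CMD="hdfs dfs -put $LANDING/claims_base.dat $HDFS_DIR/"

# Export direction: HDFS output files -> landing zone for the Ab Initio load.
GET_CMD="hdfs dfs -get $HDFS_DIR/part-* $LANDING/export/"

echo "$PUT_CMD"
echo "$GET_CMD"
```

In an Oozie/Autosys schedule these copies would run as the boundary steps between the Ab Initio graph and the Hadoop workflow.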
10
Known Limitations for POC
# Solution Component Description
1 Cloudera Connector Powered by Teradata (latest version – 1.2c5)
• Does not support HCatalog.
• Does not support import into HBase.
• Does not support upsert functionality (parameter --update-mode allowinsert).
• Does not support the --boundary-query option.
2 Cloudera Connector for Teradata (older version)
• Does not support HCatalog.
• Does not support import into HBase.
• Does not support the Avro format.
• Does not support import-all-tables.
• Does not support upsert functionality (parameter --update-mode allowinsert).
• Does not support imports from views.
• Does not support the data types INTERVAL, PERIOD, and DATE/TIME/TIMESTAMP WITH TIME ZONE.
• The optional query band is not set for queries executed by the Teradata JDBC driver itself (namely BT and SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SR).
3 Hive
• Hive does not provide record-level update, insert, or delete.
• Hive does not provide transactions.
• Compared to an OLTP database, Hive queries have higher latency due to the start-up overhead of MapReduce jobs.
4 Sqoop
• Each execution requires a password. The password can be passed on the command line, as standard input, or from a password file; a password file is the more secure way to automate a Sqoop workflow.
• Encoding of NULL values during import/export needs to be considered.
• Incremental updates would need to use the Sqoop metastore to preserve the last value.
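The last two Sqoop points (password file, metastore-backed incremental updates) can be addressed together with a saved Sqoop job. A minimal sketch, assuming hypothetical host, metastore, table, and path names; the commands are assembled into strings for inspection and would be run via eval on the cluster:

```shell
# Permission-restricted password file in HDFS: the more secure alternative
# to passing the password on the command line or via standard input.
PWD_FILE=/user/poc/.td_pwd
CHMOD_CMD="hdfs dfs -chmod 400 $PWD_FILE"

# Saved job in a shared metastore; the metastore persists --last-value
# between runs, so each execution appends only new rows.
JOB_CMD="sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop \
  --create claims_incr \
  -- import \
  --connect jdbc:teradata://tdhost/DATABASE=bi_wh \
  --username poc_user --password-file $PWD_FILE \
  --table CLAIMS_BASE --target-dir /poc/claims_base \
  --incremental append --check-column LOAD_TS --last-value 0"

# Each scheduled run then just executes the saved job.
RUN_CMD="sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --exec claims_incr"

echo "$JOB_CMD"
echo "$RUN_CMD"
```

The NULL-encoding concern is handled separately with Sqoop's --null-string / --null-non-string (import) and --input-null-string / --input-null-non-string (export) options, which should be set consistently in both directions.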