Teradata Data Archival Strategy with Hadoop and Hive
September 2014
Teradata to Hadoop Data Migration
By Seshu Kanuri
Enterprise Data Architect
2
Agenda
The Data Archival proof of concept is currently underway under the direction and guidance of the Business Insurance (BI) Teradata 14.10 Upgrade Program.
This high-level proof-of-concept design focuses on various techniques and practices for archiving and retrieving BI Warehouse data between the Teradata and Hadoop environments.
3
Use Cases for POC
Case 1: Copy an existing Teradata base table and all its data to Hadoop DFS and verify that the data is structurally similar. Move the data back to a Teradata database as a relational table and verify that the structure and data are exactly the same.
Case 2: Copy an existing Teradata SCD table and all its data to Hadoop DFS and verify that the data is structurally similar. Capture the CDC values from the Teradata table, apply these changes to the HDFS table, and verify that the CDC values are reflected in the HDFS table. Move the HDFS table back to Teradata and verify that the structure and data are the same between Teradata and HDFS.
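The Case 1 round trip can be sketched as two Sqoop invocations. This is a minimal sketch, assuming hypothetical host, credential, table, and path names; the commands are assembled into strings so the options are easy to inspect, and would be run (for example with eval) on a node where Sqoop and the Teradata JDBC driver are installed:

```shell
# All names below are hypothetical placeholders for the POC environment.
TD_URL="jdbc:teradata://tdhost/DATABASE=bi_wh"
PWD_FILE=/user/poc/.td_pwd

# Step 1: copy the Teradata base table and its data into HDFS.
IMPORT_CMD="sqoop import \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table CLAIMS_BASE --target-dir /poc/claims_base --num-mappers 4"

# Step 2: move the HDFS data back into an empty Teradata staging table;
# row counts and spot checks on both sides then verify structural parity.
EXPORT_CMD="sqoop export \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table CLAIMS_BASE_RESTORED --export-dir /poc/claims_base"

echo "$IMPORT_CMD"
echo "$EXPORT_CMD"
```

Comparing SELECT COUNT(*) results and column definitions on both sides is the simplest verification step for the "structurally similar" check the use case calls for.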
4
Scope Definition – Out of Scope
• Ab Initio-based ETL from TD to HDFS (extracts to the Landing Zone may be considered)
• Non-Apache Hadoop drivers and connectors
• Cluster hardware/software configurations
• Security layer implementation (data encryption, masking, etc.)
• Performance tuning and benchmarking
• High availability and DR
5
CDH Stack Components

| Component | Installed Version | Desired Version | Features in Desired Version |
|---|---|---|---|
| DataFu | pig-udf-datafu-0.0.4+11 | | |
| Apache Flume | flume-ng-1.4.0+96 | | |
| Apache Hadoop | hadoop-2.0.0+1554 | | |
| Apache HBase | hbase-0.94.15+86 | | |
| Apache Hive | hive-0.10.0+237 | 0.14 | Truncate, more data types |
| Hue | hue-2.5.0+217 | | |
| Apache Mahout | mahout-0.7+15 | | |
| Apache Oozie | oozie-3.3.2+100 | | |
| Parquet | parquet-1.2.5+7 | | |
| Apache Pig | pig-0.11.0+42 | | |
| Apache Sentry | sentry-1.1.0+20 | | |
| Apache Sqoop | sqoop-1.4.3+92 | | |
| Apache Sqoop2 | sqoop2-1.99.2+99 | | |
| Apache Whirr | whirr-0.8.2+15 | | |
| Apache ZooKeeper | zookeeper-3.4.5+25 | | |
6
Solution Architecture – Option A (Sqoop with Hive)
[Architecture diagram] The Source Layer (Teradata) connects to the Storage Layer over Import/Export paths provided by three transfer options: a custom MapReduce JDBC utility, the Cloudera Sqoop connector powered by Teradata, and the Teradata Connector for Hadoop (TDCH) CLI utility. The Storage Layer holds HDFS and HBase (with TDCH and Sqoop as the transfer components), topped by the Hadoop ecosystem services: scripting (Pig), SQL query (Hive), MapReduce, and Oozie, accessible over JDBC/ODBC, all under Cloudera Management & Monitoring Services.
7
Solution Architecture Description – Option A
# Solution Component Description
1 Source Layer – Teradata
• Contains Teradata tables that need to be migrated to Hadoop storage.
• Tables could be Full Refresh tables or SCD tables.
2 Storage Layer – CDH 4.x
• Cloudera Distribution with Cloudera Manager for management and monitoring.
• Hadoop stack includes Hive, Pig, HBase, Oozie, and Sqoop.
3 Sqoop Connector for Teradata / Teradata Connector for Hadoop (CLI)
• Cloudera connector for Sqoop powered by Teradata, developed by Cloudera and Teradata.
• Supports importing data split by AMP/VALUE/PARTITION/HASH.
• Supports exporting data via batch.insert, multiple.fastload, or internal.fastload.
• Supports importing and exporting data in Text/Sequence/Avro file formats.
• Cloudera recommendation: use the Cloudera connector powered by Teradata rather than the older Cloudera connector for Teradata.
• TDCH is a command-line utility provided by Teradata, built on the Teradata Java SDK (TeradataImportTool/TeradataExportTool), for data transfer between Hadoop and Teradata.
4 Hadoop/Other (Processing Layer)
• HDFS will be used to store the files and process them.
• Sqoop-imported files could also be imported directly into Hive or loaded into HBase through a custom loading utility.
• SCD – Load Fact tables into Hive. Load
5 Oozie
• Hadoop processing can be scheduled in a workflow through Oozie.
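The split-by and export-method options described for the connector surface as extra arguments after Sqoop's `--` separator. A minimal sketch, assuming hypothetical host, table, and path names (the `--input-method` and `--output-method` values are the ones documented for the Cloudera connector powered by Teradata; commands are built into strings for inspection and would be run via eval on the cluster):

```shell
# Hypothetical POC placeholders.
TD_URL="jdbc:teradata://tdhost/DATABASE=bi_wh"
PWD_FILE=/user/poc/.td_pwd

# Import split across Teradata AMPs; alternatives for --input-method are
# split.by.value, split.by.partition, and split.by.hash.
IMPORT_CMD="sqoop import \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table POLICY_FACT --target-dir /poc/policy_fact \
  -- --input-method split.by.amp"

# Export using FastLoad through the connector; alternatives for
# --output-method are batch.insert and multiple.fastload.
EXPORT_CMD="sqoop export \
  --connect $TD_URL --username poc_user --password-file $PWD_FILE \
  --table POLICY_FACT_RESTORED --export-dir /poc/policy_fact \
  -- --output-method internal.fastload"

echo "$IMPORT_CMD"
echo "$EXPORT_CMD"
```

internal.fastload is typically the fastest export path for large volumes, while batch.insert avoids FastLoad's empty-table restriction; the POC can compare both.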
8
Solution Architecture – Option B (Hive with Sqoop and Teradata Utilities)
[Architecture diagram] The Source/ETL Layer (Teradata with Ab Initio) exchanges flat files with a File Landing Zone using Teradata utilities on the Extract/Load paths; files then move between the landing zone and HDFS via HDFS file copy. The Storage Layer holds HDFS and HBase (with TDCH and Sqoop available), topped by the Hadoop ecosystem services: scripting (Pig), SQL query (Hive), MapReduce, and Oozie, accessible over JDBC/ODBC, all under Cloudera Management & Monitoring Services.
9
Solution Architecture Description – Option B
# Solution Component Description
1 Source Layer – Teradata
• Contains Teradata tables that need to be migrated to Hadoop storage.
• Tables could be Full Refresh tables or SCD tables.
2 Source Layer – Ab Initio / File Landing Zone
• Leverage Ab Initio to extract data into a flat file when importing data to HDFS.
• Leverage Ab Initio to load data into Teradata tables from files exported by HDFS.
• Files would be copied to and from Ab Initio and HDFS in a designated File Landing Zone.
3 Storage Layer – CDH 4.x
• Cloudera Distribution with Cloudera Manager for management and monitoring.
• Hadoop stack includes Hive, Pig, HBase, Oozie, and Sqoop.
4 Hadoop/Other (Processing Layer)
• HDFS will be used to store the files and process them.
• Files could also be imported directly into Hive or loaded into HBase through a custom program.
• SCD – TBA
5 Oozie / Autosys
• Hadoop processing can be scheduled in a workflow through Oozie.
• Ab Initio processing can be scheduled via Autosys.
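The landing-zone hand-off described above comes down to two HDFS shell copies, one per direction. A minimal sketch, assuming hypothetical landing-zone and HDFS paths and file names (Ab Initio reads and writes the flat files on the landing zone; these commands move them across the HDFS boundary):

```shell
# Hypothetical placeholder paths for the POC.
LANDING=/landing/bi_wh
HDFS_DIR=/poc/claims_base

# Import direction: Ab Initio extract file -> HDFS.
PUT_CMD="hdfs dfs -put $LANDING/claims_base.dat $HDFS_DIR/"

# Export direction: HDFS output files -> landing zone for the Ab Initio load.
GET_CMD="hdfs dfs -get $HDFS_DIR/part-* $LANDING/export/"

echo "$PUT_CMD"
echo "$GET_CMD"
```

In an Oozie/Autosys schedule these copies would run as the boundary steps between the Ab Initio graph and the Hadoop workflow.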
10
Known Limitations for POC
# Solution Component Description
1 Cloudera Connector Powered by Teradata (latest version – 1.2c5)
• Does not support HCatalog.
• Does not support import into HBase.
• Does not support upsert functionality (parameter --update-mode allowinsert).
• Does not support the --boundary-query option.
2 Cloudera Connector for Teradata (older version)
• Does not support HCatalog.
• Does not support import into HBase.
• Does not support the Avro format.
• Does not support import-all-tables.
• Does not support upsert functionality (parameter --update-mode allowinsert).
• Does not support imports from views.
• Does not support the data types INTERVAL, PERIOD, and DATE/TIME/TIMESTAMP WITH TIME ZONE.
• The optional query band is not set for queries executed by the Teradata JDBC driver itself (namely BT and SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SR).
3 Hive
• Hive does not provide record-level update, insert, or delete.
• Hive does not provide transactions.
• Compared to an OLTP database, Hive queries have higher latency due to the start-up overhead of MapReduce jobs.
4 Sqoop
• Each execution requires a password. The password can be passed on the command line, as standard input, or from a password file; a password file is the more secure way to automate a Sqoop workflow.
• Encoding of NULL values during import/export needs to be considered.
• Incremental updates would need to use the Sqoop metastore to preserve the last value.
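The last two Sqoop points (password file, metastore-backed incremental updates) can be addressed together with a saved Sqoop job. A minimal sketch, assuming hypothetical host, metastore, table, and path names; the commands are assembled into strings for inspection and would be run via eval on the cluster:

```shell
# Permission-restricted password file in HDFS: the more secure alternative
# to passing the password on the command line or via standard input.
PWD_FILE=/user/poc/.td_pwd
CHMOD_CMD="hdfs dfs -chmod 400 $PWD_FILE"

# Saved job in a shared metastore; the metastore persists --last-value
# between runs, so each execution appends only new rows.
JOB_CMD="sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop \
  --create claims_incr \
  -- import \
  --connect jdbc:teradata://tdhost/DATABASE=bi_wh \
  --username poc_user --password-file $PWD_FILE \
  --table CLAIMS_BASE --target-dir /poc/claims_base \
  --incremental append --check-column LOAD_TS --last-value 0"

# Each scheduled run then just executes the saved job.
RUN_CMD="sqoop job --meta-connect jdbc:hsqldb:hsql://metastore-host:16000/sqoop --exec claims_incr"

echo "$JOB_CMD"
echo "$RUN_CMD"
```

The NULL-encoding concern is handled separately with Sqoop's --null-string / --null-non-string (import) and --input-null-string / --input-null-non-string (export) options, which should be set consistently in both directions.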