© 2012 DataStreams Corp. All Rights Reserved.

Tera stream ETL

Page 1: Tera stream ETL

Page 2: Tera stream ETL

Contents

Data Integration as Data Infrastructure
TeraStream™ for Data Integration
Case Studies
Appendix
Q & A

Page 3: Tera stream ETL

Data Integration as Data Infrastructure

Page 4: Tera stream ETL

Data Integration Landscape: Business Challenges

✓ Inaccurate data leads to bad or no decisions
✓ More than 30% of IT budgets is typically spent on data integration
✓ Inconsistent enterprise and application architecture for integration

Factors
• Disparate data
• Inaccurate data
• Incomplete data
• Untimely data
• Fragmented integration approach

Impact
• Multiple versions of the "truth"
• Wasted time and resources aggregating information
• Difficult-to-use data
• Delayed decision making
• Uninformed management

Result
• Bad decisions
• Lost revenue
• Lost productivity
• Lost market opportunity
• Bad citizen relationships

Data integration takes more than 30 percent of corporate IT budgets, which is why data integrity is emphasized here.

Page 5: Tera stream ETL

Complete Enterprise Data Management Suite

DataStreams solution suite enables complex data integration projects with minimal implementation effort while producing high-quality Business Intelligence output.

[Figure: Data Governance Architecture.
Data Integration (DI Solutions): deliver real time, changed data capture, near real-time data processing.
Enterprise Data Warehouse: Source System → ETL → Integrated ODS/DW (ODS model, 1:1; DW model, ER) → ETL → Report Mart, Multidimensional Mart, Summary Table.
DQ Solutions: Meta Data, Data Quality, Impact Analysis, Master Data Management (analyze application and data, assure high quality, manage metadata).
Base layer labels: System Architecture, Operating, DB.]

Page 6: Tera stream ETL

Possession of Key Technology

[Table: vendors compared by technology; the per-cell marks are not recoverable from the extracted text.
Columns: Company, ETL, Real-Time Data Integration, Changed Data Extraction, High-Speed Sorting, Enterprise Meta Data Mgt., Data Quality, Impact Analysis, Master Data Mgt., Integrated Repository.
Domestic companies: DataStreams, GTONE, WISE, EnCore, BTL.
Global companies: Informatica, IBM, SAP, Oracle, SAS.
Legend: Possession, Processing, Not yet.]

Page 7: Tera stream ETL

TeraStream™ for Data Integration

Page 8: Tera stream ETL

TeraStream™ for Data Integration

TeraStream™ is a high-performance ETL solution with a user-friendly GUI, proven for its reliability in a variety of enterprises for over a decade.

Performance
➢ Powerful performance (TeraSort™)
➢ High-speed extraction (FACT™)
➢ Reuse of data (EBH)

Experience
➢ Over 200 customers
➢ Serving multiple industries including banking, government, and retail
➢ Over a decade of experience

User Friendly
➢ Intuitive GUI
➢ Easy to operate
➢ Easy to maintain

High Value
➢ Fast implementation
➢ Easy customization
➢ Low resource use

Page 9: Tera stream ETL

TeraStream™ Approach

✓ Transports a variety of data types and formats from source to target as needed.
✓ Covers enterprise-wide data flow from operational systems to subject data marts.
✓ Also applies to high-volume batch processing and near real-time data integration.

EXTRACT (sources: files, databases)
High-speed data extraction from various commercial DBMSs.

TRANSFORM (transform / cleansing, conversion, reformat, sort, join, aggregation)
The high-performance SORT engine removes the time bottleneck caused by transforming large data sets.

LOAD (targets: files, databases, new systems)
Automatically generated scripts can be used for loading into various DBMSs.

Page 10: Tera stream ETL

Excellent performance using novel method

Using its SORT engine, TeraStream™ ran about 3 times faster than its competitor while using about 30% of the CPU resource. (Data migration at Shinhan Bank, Korea)

TeraStream™ (FILE → FILE, FILE → DB)
• Max threads for sort = 3
• File manipulation: 35% CPU usage
• Load: 80% of peak CPU usage
Conclusion
➢ Elapsed time: 20 minutes
➢ Wasted system resource: 800 (40% avg. CPU usage × 20 mins)

IBM DataStage (FILE → DB, DB → DB)
• Parallel = 4
• File manipulation: 58% CPU usage
• Load: 58% of peak CPU usage
Conclusion
➢ Elapsed time: 59 minutes
➢ Wasted system resource: 3000 (50% avg. CPU usage × 60 mins)
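The resource figures on this slide can be reproduced with simple arithmetic. The sketch below assumes the "wasted system resource" number is average CPU usage (in percent) multiplied by elapsed minutes, as the parenthesized formulas suggest; the function name is illustrative only.

```python
# Resource score = average CPU usage (%) x elapsed minutes, as on the slide.
def cpu_minutes(avg_cpu_percent, minutes):
    return avg_cpu_percent * minutes


terastream = cpu_minutes(40, 20)  # TeraStream: 40% for 20 minutes
datastage = cpu_minutes(50, 60)   # Competitor: the slide rounds 59 minutes up to 60

resource_ratio = terastream / datastage  # share of the competitor's resource use
speed_ratio = 59 / 20                    # elapsed-time ratio, competitor vs. TeraStream

print(terastream, datastage)
print(f"resource ratio: {resource_ratio:.0%}, speed ratio: {speed_ratio:.2f}x")
```

Under that assumption, the ratios come out close to the headline claim of roughly 3 times the speed at roughly 30% of the CPU resource.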

Page 11: Tera stream ETL

Superior performance in NRT Implementation

Transports up to 1 million records per minute: flat files received through EAI are read, split per table, cleared of duplicated business days, and loaded into Sybase IQ.

[Chart: elapsed time in minutes (0 to 70) versus number of records in thousands (100; 1,000; 5,000; 10,000; 20,000), compared with IBM DataStage. Callout: 3×.]

At 10 million records, expect more than a 3-times performance improvement.

[Shinhan Bank DW benchmark, August 2006] See Appendix 2 for additional information on NRT performance.

Page 12: Tera stream ETL

Exceptional Performance in Batch Jobs

TeraStream™'s excellent performance applies not only to ETL but also to daily batch jobs.

[Batch job of POST Insurance Service Company, 2007]

No. of Records    Oracle (SQL)    TeraStream
400,000           1m 32s          28s
1,000,000         5m 01s          41s
2,500,000         12m 21s         59s

[Chart: time versus number of records, Oracle compared with TeraStream™]

➢ High performance
➢ Effective use of resources
➢ Convenience
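The speed-up implied by the batch job table can be worked out directly from the listed times. This is a small arithmetic sketch using only the figures on the slide; the helper name is illustrative.

```python
def to_seconds(minutes, seconds):
    return minutes * 60 + seconds


# (records, Oracle SQL seconds, TeraStream seconds), taken from the table
runs = [
    (400_000, to_seconds(1, 32), to_seconds(0, 28)),
    (1_000_000, to_seconds(5, 1), to_seconds(0, 41)),
    (2_500_000, to_seconds(12, 21), to_seconds(0, 59)),
]

# How many times faster TeraStream is for each record count
speedups = [round(oracle / tera, 1) for _, oracle, tera in runs]

for (records, _, _), factor in zip(runs, speedups):
    print(f"{records:>9,} records: {factor}x faster")
```

The ratio grows with the record count, which is consistent with the slide's point that the advantage is largest on high-volume jobs.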

Page 13: Tera stream ETL

Over 56% improvement in ETL performance

✓ Using EBH, TeraStream™ shortens the data path from legacy systems to the data mart, saving ETL time and resource usage.
✓ The massive volume of files extracted from legacy systems is stored in EBH for reuse in the next step.
✓ ETL time is reduced by 56% on average (at LG Telecom, from D-3 to D-1).

[Figure: system configuration.
Legacy: CSM/AR Billing (Oracle 8.0.6); CCS/MPS/ERP/CTI/PPS/NMS; SRDF.
ETL Server: ODS (Oracle 8i), ETL, EBH, Informatica.
EDW Server: IBM p690, NCR 10-node Teradata; Customer/Call/Billing, Connection, PPS/BSS; D-1.
MART Server: Sybase IQ/ASE; OLAP, Mining Input Variable, MOLAP Analysis, Mining Analysis, Campaign Analysis.]

EBH (ETL and Batch Hub) stores temporary and result files, which are shared for further table generation in the EDW and data mart.

Page 14: Tera stream ETL

Over 20 times faster extraction than SQL

✓ High-speed data extraction from commercial databases with SQL is supported.
✓ The extraction query is generated automatically, for example: Select * from table

• High-speed extraction engine (FACT™) with optimized database API.
• DBMSs supported:
  - Oracle
  - Informix
  - DB2 / UDB
  - Sybase IQ / ASE
  - Teradata
  - Greenplum
  - MSSQL / MySQL
  - Altibase
• File split and filtering during extraction
• Time, timestamp, and user data format specification
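The idea of filtering and splitting files during extraction can be sketched generically. The example below is not FACT™ and does not use its API; it assumes a plain SQLite source, and the function name `extract_split`, the table `calls`, and the CSV layout are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def extract_split(conn, table, where, rows_per_file, out_dir):
    """Stream rows matching `where` into numbered CSV files of at most rows_per_file rows."""
    # Table name and filter are interpolated directly; a real tool would validate them.
    cursor = conn.execute(f"SELECT * FROM {table} WHERE {where}")
    paths = []
    part = 0
    while True:
        chunk = cursor.fetchmany(rows_per_file)
        if not chunk:
            break
        path = Path(out_dir) / f"{table}_{part:03d}.csv"
        with open(path, "w", newline="") as handle:
            csv.writer(handle).writerows(chunk)
        paths.append(path)
        part += 1
    return paths


# Hypothetical source table with ten rows, half of them matching the filter.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [(i, "KR" if i % 2 else "DE") for i in range(10)],
)

out_dir = tempfile.mkdtemp()
files = extract_split(conn, "calls", "region = 'KR'", 2, out_dir)
print([path.name for path in files])
```

Filtering in the query and splitting while fetching means no intermediate full-size file is written, which is the design point the slide describes.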

Page 15: Tera stream ETL

Intuitive User Interface

Supports data integration activities (develop, execute, monitor, validate) in an integrated GUI environment.

➢ Intuitive task flow: GUI for developers
➢ Execution log: checking standard output / error / file information / number of files processed
➢ Project Monitor: real-time job monitoring
➢ Scheduler: scheduling by time / period / business calendar
➢ Editor window: mapping creation
➢ Metadata Repository: metadata property, impact analysis, change history manager

[Screenshots: intuitive task flow, project monitor, editor window, scheduler, task block execution log]

Page 16: Tera stream ETL

Work with best of breed DBMS providers

✓ Powerful connection between different DBMS types.
✓ Both DB-to-DB and File-to-DB data transportation are supported.

• N:N mapping
• Conversion during transportation
• Click to choose the record processing type (insert / delete / update / insert-update / delete-insert)
• DBMS types: Oracle, DB2, Sybase, Informix, Teradata, Greenplum, MSSQL, MySQL (Altibase, Tibero)

[Figure: Source Table → Transformation Logic → Target Table]
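Two of the record processing types listed above, insert-update and delete-insert, differ in how they treat an existing key. The sketch below illustrates the distinction against SQLite; the table `target` and both function names are hypothetical and do not reflect how TeraStream implements these modes.

```python
import sqlite3


def insert_update(conn, row):
    """Update the row with this key; insert it when no such row exists."""
    cursor = conn.execute("UPDATE target SET name = ? WHERE id = ?", (row[1], row[0]))
    if cursor.rowcount == 0:
        conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", row)


def delete_insert(conn, row):
    """Remove any existing row with this key, then insert the new one."""
    conn.execute("DELETE FROM target WHERE id = ?", (row[0],))
    conn.execute("INSERT INTO target (id, name) VALUES (?, ?)", row)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO target VALUES (1, 'old')")

insert_update(conn, (1, "updated"))   # key exists, so the row is updated
insert_update(conn, (2, "new"))       # key missing, so the row is inserted
delete_insert(conn, (2, "replaced"))  # existing row is dropped and rewritten

rows = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(rows)
```

Insert-update preserves columns the incoming record does not carry, while delete-insert replaces the whole row; which one fits depends on whether the source delivers complete records.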

Page 17: Tera stream ETL

Easy Data Conversion

✓ Mapping source to target converts formats, types, character sets, dates, bytes/bits, and encryption.

• Easy data conversion using the mapping window of the "converter task block"
• Character set conversion, including EBCDIC to ASCII
• Data conversion from NDB (Unisys 9-bit) or HDB (IBM) data types to RDB
• 300 built-in functions
• DATE and timestamp conversion between different date formats
• CLOB/BLOB supported
• Users can add more functions as needed
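As an illustration of the EBCDIC-to-ASCII conversion mentioned in the list above, the following sketch uses Python's built-in `cp037` codec. It assumes code page 037, which is only one of several EBCDIC code pages; the actual code page depends on the source mainframe.

```python
# cp037 is one common EBCDIC code page; the right one depends on the source system.
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])  # "HELLO" encoded in cp037

text = ebcdic_bytes.decode("cp037")  # EBCDIC bytes to a Unicode string
ascii_bytes = text.encode("ascii")   # Unicode string to ASCII bytes

print(text, ascii_bytes)
```

Packed-decimal and binary fields in mainframe records cannot be converted this way, since they are not character data; they need field-level handling based on the record layout.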

[Screenshots: list of provided functions, CALLED_NO function editor, converter task block]

Function template: =addday(cdate("", ""), (N))

Example: addday(cdate("2005/05/12 12:08:24", "YYYY/MM/DD HH:MI:SS"), 2)
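The `addday(cdate(...), 2)` example from the function editor can be mirrored in Python to show what it plausibly computes. This is a sketch under assumptions: `cdate` is taken to parse a string with a format mask, `addday` to add whole days, and the date portion of the mask is read as year/month/day. The Python functions below are stand-ins, not TeraStream's implementations.

```python
from datetime import datetime, timedelta


def cdate(text, fmt):
    """Parse a date string using a strptime-style format."""
    return datetime.strptime(text, fmt)


def addday(moment, days):
    """Return the given moment shifted by a number of days."""
    return moment + timedelta(days=days)


result = addday(cdate("2005/05/12 12:08:24", "%Y/%m/%d %H:%M:%S"), 2)
print(result)
```

Composing small functions this way is how a mapping expression chains a type conversion with a calculation in a single field rule.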

Page 18: Tera stream ETL

Easy Data Transport

✓ TeraStream uses various transportation methods according to file structure, transportation distance, security, number of records, etc.

• File-to-DB data load for bulk data
• The "load task block" generates load scripts automatically.
• Remote transportation using FTP
• Encryption during transport
• Near real-time and bulk transportation are both possible

[Screenshot: load scripts]

Page 19: Tera stream ETL

Up to 40% Cost Savings

✓ The higher the complexity, the bigger the cost saving in development.
(Courtesy of Hanhwa Insurance Co. and SK C&C, 2007)

Job complexity          No. of recs   Input size (Gb)   TeraStream™   In-house coding   Speed-up
1:1 mapping             90            22                30 min        2 hours           75%
1:N mapping             900           21                2 hours       6 hours           66%
N:1 mapping             1700          15                2 hours       10 hours          80%
N:N mapping, complex    1300          8                 2 hours       20 hours          90%

➢ Avg. 70% development speed-up
➢ 90% speed-up for more complex jobs
➢ Overhead from modification, test, and preliminary data checking.

Effort (M/M)                   Development (4 months)   Test (4 months)   Stabilization (1 month)
TeraStream™                    24                       48                54
In-house coding (estimated)    40                       80                90

40% reduction
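The speed-up and cost-reduction percentages on this slide follow from one formula: the fraction of in-house effort saved. The sketch below recomputes them from the slide's figures; pairing the M/M values with TeraStream™ (24, 48, 54) and in-house coding (40, 80, 90) is an assumption based on the chart labels.

```python
def reduction(tool, in_house):
    """Fraction of the in-house effort saved by using the tool."""
    return 1 - tool / in_house


# Development time per job type: (TeraStream hours, in-house hours)
jobs = {"1:1": (0.5, 2), "1:N": (2, 6), "N:1": (2, 10), "N:N": (2, 20)}
speedups = {name: round(reduction(t, h) * 100) for name, (t, h) in jobs.items()}

# Effort in man-months by phase: (TeraStream, in-house estimate), as read from the chart
phases = {"development": (24, 40), "test": (48, 80), "stabilization": (54, 90)}
savings = {name: round(reduction(t, h) * 100) for name, (t, h) in phases.items()}

print(speedups)
print(savings)
```

Rounded to the nearest percent, the 1:N case comes out at 67 rather than the 66 shown on the slide, which suggests the slide truncates; every phase shows the same 40% reduction in man-months.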

Page 20: Tera stream ETL

Case Studies

Page 21: Tera stream ETL

Kookmin Bank

✓ EDW and integrated DM installation

Issues
• M/F and IMS HDB conversion
• Processing changed data in the absence of a time-series column
• Processing large data volumes within the batch process time (10 TB/day based on source data)
• How to process high-volume files in parallel

Plans
• Converting mainframe data into data in a Unix environment (10 TB → 25 TB) within 18 hours.
• Various data conversion and processing, including Korean character conversion
• ETL task from the accounting system server to the new ODW server (extracting approx. 200 GB of daily changed data within 1 hour and 30 minutes by using the FACT module of TeraStream™)
• ETL and batch processing in a unified way.
• Batch job in the core banking system within 6 hours.

System configuration
[Figure: Source system: IBM M/F (HDB, DB2; IMS HDB) and server RDB. ETL stages use Informover and TS (FACT): segment split, conversion and array split, logic applied; conversion, logic applied; logic applied. Targets: EDW (Sybase ASIQ 12.7) and A-SOR DM. Legend: file process flow, DB query.]

Expected Result
➢ Integration of various DBMSs (IMS HDB, host DB2, Oracle, DB2 UDB) by using TeraStream™
➢ Meeting the batch target time of 2 hours and 30 minutes for 4 TB of EBCDIC data.

Page 22: Tera stream ETL

Health and Welfare Department's e-Voucher

✓ E-Voucher DW Performance Improvement

Issues
• Low data integrity
• Lack of expeditious response
• Fraud detection was hard.
• Low reliability of statistical data caused disputes between data users and generators

Plans
• Daily transportation to ODS
• Build ODS, DW, and DM for a better table model
• e-Voucher System (DB2 -> DW Server)
• Platform
  - OS: AIX 5.3 (AS-IS, TO-BE)
  - CPU: Power5, 2.1 GHz, 6 cores, IBM P-series
  - MEM: 12 GB
  - H/W: 1 TB
• Simple logic made MA easy

System Configuration
[Figure: Operational source DB (Oracle): Voucher Service, Mis-settlement, Pregnancy & Birth, History. FACT loads to ODS (1:1 mapping, daily batch). E-Voucher Statistical DW on IBM P-series, target DB (Oracle): ODS → ETL → DW → ETL → DM. ODS step: ODS data conversion, update/insert at ODS. DW/DM step: ODS/DW data manipulation, update/insert to data mart.]

Expected Result
➢ Statistics reporting time is dramatically reduced from 1~6 days to a few seconds or minutes.
➢ The statistics reporting process is simpler, and reports are easier to obtain.
➢ Consistent data delivery increases data reliability.

Page 23: Tera stream ETL

Deashin Securities

✓ Deashin Securities next-generation system build

Issues
• Bulk file operations required for file types such as ASCII
• Modules in different languages to be executed via shell.

Plans
• Process transactions via data extraction and transformation.
• Build preambles using transformed data.
• Bulk file processing (e.g. ASCII)
• Enable execution of modules in different languages via shell.
• TeraStream use cases
  1. Non-periodic ETL or file processing routine.
     Cybos UI -> TeraStream
     Cybos UI generates a preamble or a report file.
  2. Daily/weekly/monthly/quarterly/yearly data batch and non-periodic data processing routine
     - Linkage between Control-M and TeraStream
     - TeraStream extracts data from core-banking
     - Data are transformed and loaded back to the system.

System Configuration
[Figure: Channel (Service): Cybos Terminal, IE, CB+. Channel (External): FEPX-MINS, FIX. Core-Banking (Business Data): Oracle CORE DB on AIX. Business System: Business Support AP, Batch AP, Websphere, NEFSS, HIS (Web Server), TR (Online), Unix Shell, TeraStream, OTIS, Control-M Scheduler. Flows: 1. Cybos -> TeraStream; 2. Control-M -> TeraStream; 3. Control-M -> TeraStream -> OTIS.]

Expected Result
➢ Services that ensure speed and reliability
➢ Standardized linkage with other systems
➢ 24 × 365 system operation, with faster issue resolution and ease of maintenance

Page 24: Tera stream ETL

Samsung Electronics

✓ Global Database Integration using NRT ETL

Issues
• Registration in one country should provide the same service in the other country.
• Duplicated records caused by cross transportation should be avoided.
• 20-minute near real-time
• A perfect recovery scheme should be presented

Plans
• Real-time data transportation between Germany and China.
• Bi-directional synchronization between the TeraStream instances in Germany and China.
• A maximum loading time of 20 minutes for transported data is implemented using TeraStream NRT.
• Web monitoring is developed

System Configuration
[Figure: Smart Phone System in Germany (DBs in service) and Smart Phone System in China (DBs in service). Each side runs NRT Extract and SAM-to-DB UPSERT toward the other. Web Monitoring shows execution results such as program success or failure.]

Expected Result
➢ Efficiency is maintained despite cross transportation
➢ Bi-directional NRT integration allows the same service regardless of system type and country (time from extraction to loading is 20 minutes).
➢ Bi-directional remote data transportation using TeraStream
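One way to avoid duplicated records in a bi-directional exchange like this one is to tag each record with its site of origin and to load by key. The sketch below is a generic illustration, not the Samsung implementation; the `origin` field, the site codes, and both function names are hypothetical.

```python
def changes_to_send(local_rows, peer_site):
    """Pick rows that did not come from the peer, so nothing is echoed back to it."""
    return [row for row in local_rows if row["origin"] != peer_site]


def upsert(store, rows):
    """Insert or overwrite by key, so re-delivery never creates duplicates."""
    for row in rows:
        store[row["id"]] = row


germany = {}
china = {}
upsert(germany, [{"id": "u1", "origin": "DE", "plan": "basic"}])
upsert(china, [{"id": "u2", "origin": "CN", "plan": "pro"}])

# Run the exchange twice in each direction to show the result is stable.
for _ in range(2):
    upsert(china, changes_to_send(list(germany.values()), peer_site="CN"))
    upsert(germany, changes_to_send(list(china.values()), peer_site="DE"))

print(sorted(germany), sorted(china))
```

The origin filter stops records from bouncing between sites, and the keyed upsert makes repeated delivery harmless, which also simplifies recovery after a failed cycle.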

Page 25: Tera stream ETL

LG Telecom

✓ LG Telecom new billing system data transfer

Issues
• The solution provided by company 'I' requires more than twelve hours to process all billing and call data.
• This delays entire processes and often requires re-processing of data.
• Efficient unique key generation is needed for all business tasks

Plans
• Transition from the old to the new billing system
  - Data size: 3 TB → 3.5 TB; objective: transition in 30 minutes
• Move unchanged data from the large dataset three days prior to the new system open date.
• Separate the files that will be loaded to EDW and DM and load them into different business tables.
• Unique key generation for the entire business process is done first.

System Configuration
[Figure: Legacy: AR, Billing, MPS, ERP, PPS, NMS, CCS, CTI, CSM; SRDF. ODS Server: ODS (Oracle). ETL to EDW Server (NCR 10-node Teradata): Customer, Billing, Call Data, Contacts, PPS/BSS; D+1. ETL to DM Server (IBM P Series, Sybase ASIQ; Oracle/Informatica): OLAP Mart, Campaign Analysis, Mining Input Variables, MOLAP Analysis, Mining Analysis. TeraStream loads data transformed in ODS to EDW and DM at the same time.]

Expected Result
➢ Working hours shortened from D+3 to D+1, reducing the system load
➢ Working hours reduced by 56% on average
➢ Time secured for emergency response and rework, minimizing the impact of delays in providing data

Page 26: Tera stream ETL

Products

Page 27: Tera stream ETL

Real Time Change Data Capture_DeltaStream

DeltaStream is a real-time CDC (Change Data Capture) solution that automatically detects data changes from the transaction log and transfers them to a target system.

[Panels: Features, System Architecture]

Expected Result
• Minimizing the burden on the source system
• Minimizing the business impact
• Real-time data capture
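The general idea behind log-based change data capture can be sketched as replaying ordered log entries against a target and remembering the last applied position. This is a simplified illustration, not DeltaStream's design; the log entry layout (`seq`, `op`, `key`, `value`) is hypothetical.

```python
def apply_changes(target, log, last_applied):
    """Apply log entries newer than last_applied to target; return the new position."""
    for entry in log:
        if entry["seq"] <= last_applied:
            continue  # already applied in an earlier run
        if entry["op"] == "delete":
            target.pop(entry["key"], None)
        else:  # insert or update
            target[entry["key"]] = entry["value"]
        last_applied = entry["seq"]
    return last_applied


log = [
    {"seq": 1, "op": "insert", "key": "a", "value": 10},
    {"seq": 2, "op": "update", "key": "a", "value": 11},
    {"seq": 3, "op": "insert", "key": "b", "value": 20},
    {"seq": 4, "op": "delete", "key": "b", "value": None},
]

target = {}
position = apply_changes(target, log[:2], last_applied=0)    # first capture cycle
position = apply_changes(target, log, last_applied=position)  # later cycle, same log
print(target, position)
```

Because only the log is read, the source tables are never queried, which is why this approach keeps the burden on the source system low.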

Page 28: Tera stream ETL

Metadata Management_MetaStream

MetaStream manages metadata, the data that describes data. It extracts and integrates meta information spread over multiple systems and supports a standardization management system.

[Panels: Features, System Architecture]

Expected Result
• Improving efficiency through consistent meta information management that prevents metadata redundancy.
• Preventing redundant R&R and meta requests through ownership based on standardization and the model.
• Saving analysis time

Page 29: Tera stream ETL

Data Quality Management_QualityStream

QualityStream is a data quality control solution that accesses the target data, diagnoses it, and analyzes the results. It analyzes current data quality by running database profiling. It registers management issues and analyzes the results on a schedule.

[Panels: Features, System Architecture]

Expected Result
• Support for establishing a quality management system
• Six Sigma based approach to generate more accurate statistical indicators and precisely detect errors.
• Efficient data quality control with the registration and management process.
• Error rate reduction with an error data maintenance and control plan.
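Database profiling of the kind described above usually starts with per-column statistics such as null rate and distinct count. The sketch below shows that basic calculation on in-memory rows; it is a generic illustration rather than QualityStream's method, and the `profile` function and sample data are hypothetical.

```python
def profile(rows, column):
    """Return row count, null rate, and distinct count for one column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
    }


rows = [
    {"id": 1, "city": "Seoul"},
    {"id": 2, "city": ""},
    {"id": 3, "city": "Busan"},
    {"id": 4, "city": "Seoul"},
]

report = profile(rows, "city")
print(report)
```

Tracking such indicators on a schedule turns a one-off diagnosis into a trend, which is what makes an error-rate target measurable.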

Page 30: Tera stream ETL

Application Impact Analysis_ImpactStream

ImpactStream is a tool for analyzing the impact of application changes. It builds an Application Knowledge Database to improve understanding and readability. ImpactStream receives the changed source from the change management tool, analyzes it automatically with its parser engine, stores it in the repository, and provides impact analysis information through a search screen.

[Panels: Features, System Architecture]

Expected Result
• Improving development productivity and reducing maintenance costs
• Maintaining management information for IT application development
• Integrating enterprise applications efficiently
• Improving control over outsourcing

Page 31: Tera stream ETL

Master Data Management_MasterStream

MasterStream is a master data management solution that ensures the consistency of master data within an enterprise. It offers a centralized type and a cross-over type to collect, create, verify, and simultaneously distribute data. Data from the legacy system is integrated and verified by business rules before application systems refer to it, and it is then synchronized and monitored.

[Panels: Main Components, System Architecture]

Expected Result
• Improving efficiency in the workplace by sharing high-quality key information with enterprise users
• Supporting quick decision making with reliable statistical analysis
• Reducing maintenance costs by improving the operating system through integration

Page 32: Tera stream ETL

Appendix

Page 33: Tera stream ETL

App.1 : Product Configuration

✓ TeraStream™ includes a sort engine and a high-volume data extraction engine (FACT™); metadata is stored and managed in a DBMS.

➢ User Interface
• Easy-to-use GUI for developers.

➢ Data Processing
• High performance (FACT/CoSORT)
• External command (shell/SortCL)
• Query processing
• Data conversion (Korean/Japanese)
• Function processing

➢ Metadata Management
• Data format, job, and system information in TSDB (Repository)

➢ Operations & Administration
• Job and system log management
• Job scheduling
• File Format Description for metadata
• Real-time job monitoring
• Authentication management

[Figure: component layers.
User Interface: TeraStream Designer, Monitor.
Operations & Administration: Log Manager, Project Scheduler, FFD Manager, Process Manager, Data Access Manager, Message Broker.
Data Processing Engine: FACT™, CoSORT™, Converter, USQL, External command, User SCL.
Metadata Management Engine: TeraStream DB (Repository).]

Page 34: Tera stream ETL

App. 2: Time Table for NRT Implementation

Times cover mapping/processing/loading for each product.

Records (thousands)              TeraStream™                       D product
                                 start      end        time       start      end        time
100                              18:02:39   18:02:55   0:16       15:08:16   15:10:33   00:53
1,000                            18:05:25   18:06:23   0:58       15:11:13   15:20:34   03:32
5,000                            18:07:20   18:12:02   4:42       15:25:14   15:43:44   15:28
10,000                           18:13:54   18:24:20   10:26      15:47:57   16:23:45   31:09
20,000                           18:29:10   18:49:55   20:45      16:31:40   17:36:10   58:41
10,000 (concurrent execution)    11:35:48   11:50:35   14:47      11:35:48   12:17:10   41:22
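The TeraStream™ start and end times in this table can be turned into elapsed seconds and a throughput figure. The sketch below uses only the TeraStream™ rows as listed; the helper name is illustrative.

```python
from datetime import datetime


def elapsed_seconds(start, end):
    """Seconds between two same-day HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# TeraStream rows from the table: (records in thousands, start, end)
runs = [
    (100, "18:02:39", "18:02:55"),
    (1_000, "18:05:25", "18:06:23"),
    (5_000, "18:07:20", "18:12:02"),
    (10_000, "18:13:54", "18:24:20"),
    (20_000, "18:29:10", "18:49:55"),
]

for thousands, start, end in runs:
    seconds = elapsed_seconds(start, end)
    per_minute = thousands * 1000 / seconds * 60
    print(f"{thousands:>6}k records: {seconds:>5}s, {per_minute:,.0f} records/min")
```

For the 20-million-record run this gives a little under one million records per minute, in line with the "up to 1 million records per minute" figure quoted for the NRT implementation.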

Page 35: Tera stream ETL

App. 3: Performance Improvement Details

Job                Task                   Cycle   System      Before   After   Improvement rate
Billing            Sales                  Month   EDW         12:50    5:00    61%
                                                  OLAP Mart   18:35    8:20    55%
Calls              Charges                Day     EDW         5:50     3:00    49%
                                                  OLAP Mart   8:00     4:00    50%
                   ACCUM                  Week    EDW         4:20     1:55    56%
                                                  OLAP Mart   7:20     3:00    60%
                   Receiving CDR (NMS)    Day     EDW         1:00     0:30    50%
                                                  OLAP Mart   2:20     0:55    61%
                   Sending CDR (NMS)      Day     EDW         1:40     1:05    35%
                   ERP batch              Month   EDW         11:20    3:15    71%
                   Receiving CDR (NMS)    Month   EDW         5:00     2:15    55%
                                                  OLAP Mart   11:40    2:20    80%
                   Sending CDR (NMS)      Month   EDW         8:20     4:50    42%
                   ERP provided batch     Month   EDW         16:20    5:15    68%
Customer Service   After service          Month   EDW         5:30     5:05    9%
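The improvement rate column appears to be the relative reduction in run time. The sketch below recomputes it for a few rows, assuming the Before and After values are hours and minutes; the function names are illustrative.

```python
def minutes(hhmm):
    """Convert an H:MM duration string to minutes."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def improvement(before, after):
    """Percentage reduction in run time, rounded to the nearest percent."""
    return round((1 - minutes(after) / minutes(before)) * 100)


samples = {
    "Billing / Sales / EDW": ("12:50", "5:00"),
    "Receiving CDR (month) / OLAP Mart": ("11:40", "2:20"),
    "Sending CDR (day) / EDW": ("1:40", "1:05"),
}
rates = {name: improvement(b, a) for name, (b, a) in samples.items()}
print(rates)
```

These three rows reproduce the table's 61%, 80%, and 35%; a few other rows differ from the listed rate by about a point, which may reflect rounding in the original.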

Page 36: Tera stream ETL

App. 4: TeraStream™ Features & Benefits (1/2)

✓ TeraStream™ guarantees to meet your needs for enterprise data integration and for an excellent batch job hub.

Major Features and Description

Sort Engine: Using TeraSort™, TeraStream™ accelerates sort-related data manipulation (dedup, average, min, max, join, summary, etc.).

FAst extraCT: FACT™ performs high-speed bulk extraction from various commercial DBMSs.

Automatic Metadata Generation: TeraStream™ reads the DBMS data dictionary directly to create its own metadata.

High Speed Lookup: An in-memory lookup function provides high-speed mapping conversion using lookup tables.

Variety of conversion function calls: More than 100 user-friendly mapping functions are provided. Developers can easily add their own functions.

Pre/Post Processing: TeraStream™ provides inter-record and inter-table conversion through pre/post mapping.
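Several of the operations attributed to the sort engine (dedup, min, max, average, summary) share one pattern: sort once so that equal keys become adjacent, then process the runs. The sketch below shows that pattern with Python's standard library; it illustrates the general technique only and says nothing about how TeraSort™ works internally.

```python
from itertools import groupby
from operator import itemgetter

records = [("b", 5), ("a", 3), ("b", 5), ("a", 7), ("c", 1)]

# Sort once; duplicates and records with equal keys become adjacent.
ordered = sorted(records)

# Dedup: keep one copy of each run of identical records.
deduped = [record for record, _ in groupby(ordered)]

# Summary per key: (min, max, average) over the deduplicated values.
summary = {}
for key, group in groupby(deduped, key=itemgetter(0)):
    values = [value for _, value in group]
    summary[key] = (min(values), max(values), sum(values) / len(values))

print(deduped)
print(summary)
```

Since each pass after the sort is a single linear scan, the sort itself dominates the cost, which is why a fast sort engine speeds up all of these operations together.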

Page 37: Tera stream ETL

App. 4: TeraStream™ Features & Benefits (2/2)

✓ TeraStream™ has evolved to meet various parallel processing needs and to offer convenience through highly efficient GUIs.

Major Features and Description

Inter-node Operation: Remote calls between TeraStream™ instances can initiate projects on other nodes. Distributed computing using idle nodes is possible through easy transfer of data.

Near Real-Time ETL: Data can be transported every minute, including complex data mapping.

Efficient GUI: With the GUI, no programming language skills are necessary. Unified monitoring and control on a single screen, or specialized monitoring through a web browser, is possible. Jobs are scheduled in one unified GUI, even for distributed servers.

Multi Language Support: UTF-8 is supported.

Page 38: Tera stream ETL

Thank you
www.datastreams.co.kr