
ETL vs ELT vs eETL


Table of contents

Leveraging analytic insights for business outcomes

Traditional ETL

The emergence of cloud technology and the rise of ELT

ETL vs. ELT

A new approach: entity-based ETL (eETL)

ETL vs. ELT vs. eETL

Bottom line – which is best for you?

About K2View


Leveraging analytic insights for business outcomes

According to Gartner, by 2022, only 20% of analytic insights will lead to business outcomes.

Not having the right data is one of the key reasons big data analytics projects fail. To start with, collecting, creating, and/or purchasing data is not easy. And even if you can get access to the data, you must still answer some serious questions, such as:

• Can the data be processed quickly and cost-effectively?

• Can the data be cleansed and masked?

• Can the data be secured and protected?

• Can the data be used ethically and legally?

In this whitepaper, we will review the common methods used for creating and preparing data for analytical purposes, and discuss the pros and cons of each approach.

Traditional ETL

The traditional approach to data integration, known as extract-transform-load (ETL), has been predominant since the 1970s. At its core, ETL is a standard process in which data is collected from various sources (extracted), converted into a desired format (transformed), and then stored in its new destination (loaded).

It is the industry standard among established organizations, and the acronym ETL is often used colloquially to describe data integration activities in general.

The workflow that data engineers and analysts must perform to produce an ETL pipeline looks like this:

Extract: The first step is to extract the data from its source systems. Data teams must decide how, and how often, to access each data source – whether via recurring batch processes, real-time streaming, or event-triggered extraction.

Transform: This step is about cleansing, formatting, and normalizing the data for storage in the target data lake or warehouse. The resultant data is used by reporting and analytics tools.

Load: This step is about delivering the data into a data store, where applications and reporting tools can access it. The data store can be as simple as an unstructured text file, or as complex as a highly structured data warehouse, depending on the business requirements, applications, and user profiles involved.

ETL workflow:

• Extract: Raw data is read and collected from various sources, including message queues, databases, flat files, spreadsheets, data streams, and event streams.

• Transform: Business rules are applied to clean the data, enrich it, anonymize it if necessary, and format it for analysis.

• Load: The transformed data is loaded into a big data store, such as a data warehouse, data lake, or non-relational database.
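To make the three steps concrete, here is a minimal ETL sketch in Python, using SQLite as a stand-in for both the source system and the data warehouse. The customers table, its columns, and the masking rule are illustrative assumptions, not references to any specific product.

import hashlib
import sqlite3

# --- Extract: read raw customer rows from an (assumed) source database ---
def extract(source_conn):
    cur = source_conn.execute("SELECT id, name, email, country FROM customers")
    return cur.fetchall()

# --- Transform: cleanse, normalize, and anonymize the rows before loading ---
def transform(rows):
    cleaned = []
    for cust_id, name, email, country in rows:
        if not email:                                    # drop incomplete records
            continue
        cleaned.append((
            cust_id,
            name.strip().title(),                        # normalize formatting
            hashlib.sha256(email.encode()).hexdigest(),  # mask PII before it leaves the pipeline
            (country or "UNKNOWN").upper(),
        ))
    return cleaned

# --- Load: write the transformed records into the target warehouse table ---
def load(target_conn, rows):
    target_conn.execute(
        "CREATE TABLE IF NOT EXISTS dim_customer "
        "(id INTEGER, name TEXT, email_hash TEXT, country TEXT)"
    )
    target_conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?)", rows)
    target_conn.commit()

if __name__ == "__main__":
    source = sqlite3.connect("source.db")      # assumed source system
    target = sqlite3.connect("warehouse.db")   # assumed target data store
    load(target, transform(extract(source)))

Note that all cleansing and masking happens before the data reaches the target – the defining trait of ETL, and the reason changing the rules later requires re-engineering the pipeline.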


Traditional ETL has the following disadvantages:

• Smaller extractions: Heavy transformation processing (e.g., I/O- and CPU-intensive handling of high-volume data) often forces a compromise toward smaller data extractions.

• Complexity: Traditional ETL consists of custom-coded programs and scripts, tailored to the needs of specific transformations. This means that the data engineering team must develop highly specialized, and often non-transferrable, skill sets to manage its code base.

• Cost and time consumption: Once set up, adjusting the ETL process can be both costly and time-consuming, often requiring lengthy re-engineering cycles by highly skilled data engineers.

• Rigidity: Traditional ETL limits the agility of data scientists, who receive only the data that has already been transformed and prepared by the data engineers – rather than the entire pool of raw data – to work with.

• Legacy technology: Traditional ETL was primarily designed for periodic, on-premises batch migrations, and does not support continuous data streaming. It is also extremely limited when it comes to real-time data processing, ingestion, or integration.

ETL has evolved quite a bit since the 1970s and 1980s, when the process was sequential, data was more static, systems were monolithic, and reporting was needed only on a weekly or monthly basis.

The emergence of cloud technology and the rise of ELT

ELT stands for extract-load-transform. Unlike traditional ETL, ELT extracts and loads the data into the target first, and then runs transformations there, often using proprietary scripts executed on the target data store. The target is most commonly a data lake or big data store, such as Teradata, Spark, or Hadoop.

ELT workflow:

• Extract: Raw data is read and collected from various sources, including message queues, databases, flat files, spreadsheets, data streams, and event streams.

• Load: The extracted data is loaded into a data store – a data lake, a data warehouse, or a non-relational database.

• Transform: Data transformations are performed in the data lake or warehouse, primarily using scripts.
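For contrast, a minimal ELT sketch in the same style: the raw rows are landed in the target as-is, and the transformation is expressed as SQL that runs inside the target store. The table and column names are illustrative assumptions; in practice the target would be a cloud data warehouse or lake rather than SQLite.

import sqlite3

# Assumed connections; only the extract/load logic touches the source.
source = sqlite3.connect("source.db")
target = sqlite3.connect("lake.db")

# --- Extract and Load: copy raw rows into the target with no in-flight processing ---
rows = source.execute("SELECT id, name, email, country FROM customers").fetchall()
target.execute(
    "CREATE TABLE IF NOT EXISTS raw_customers "
    "(id INTEGER, name TEXT, email TEXT, country TEXT)"
)
target.executemany("INSERT INTO raw_customers VALUES (?, ?, ?, ?)", rows)

# --- Transform: performed later, inside the target store, typically as SQL ---
target.executescript("""
    CREATE TABLE IF NOT EXISTS dim_customer AS
    SELECT id,
           TRIM(name)     AS name,
           LOWER(email)   AS email,
           UPPER(country) AS country
    FROM raw_customers
    WHERE email IS NOT NULL;
""")
target.commit()

Because the raw_customers table keeps the data in its original form, analysts can re-run or change the transformation at any time – but the raw, unmasked data is now sitting in the target store.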


ELT offers several advantages:

• Fast extraction and loading: Data is delivered into the target system immediately, with very little processing in-flight.

• Lower upfront development costs: ELT tools are good at moving source data into target systems with minimal user intervention, since user-defined transformations are not required.

• Low maintenance: ELT was designed for use in the cloud, so things like schema changes can be fully automated.

• Greater flexibility: Data analysts no longer have to determine what insights and data types they need in advance, but can perform transformations on the data as needed in the data warehouse or lake.

• Greater trust: All the data, in its raw format, is available for exploration and analysis. No data is lost or mis-transformed along the way.

While the ability to transform data inside the data store resolves ETL’s volume and scale limitations, it does not eliminate the data transformation work itself, which remains very costly and time-consuming.

Data scientists, who are scarce, high-value company resources, need to match, clean, and transform the data – accounting for 40% of their time – before even engaging in any analytics.

So, ELT has challenges of its own, including:

• Costly and time-consuming: Data scientists need to match, clean, and transform the data before applying analytics.

• Compliance risks: With raw data being loaded into the data store, ELT, by nature, doesn’t anonymize or mask the data, so compliance with privacy laws may be compromised.

• Data migration costs and risks: The movement of massive amounts of data, from on-premise to cloud environments, consumes high network bandwidth.

• Big store requirement: ELT tools require a modern data staging technology, such as a data lake, where the data is loaded. Data teams then transform the data and move it into a data warehouse, where it can be sliced and diced for reporting and analysis.

• Limited connectivity: ELT tools lack connectors to legacy and on-premise systems, although this is becoming less of an issue as ELT products mature and legacy systems are retired.


ETL vs. ELT

The following comparison summarizes the main differences between ETL and ELT:

Process
• ETL: Data is extracted in bulk from the sources, transformed, then loaded into a DWH/lake. Typically batch.
• ELT: Raw data is extracted and loaded directly into a DWH/lake, where it is transformed. Typically batch.

Primary use
• ETL: Smaller sets of structured data that require complex data transformation. Offline, analytical workloads.
• ELT: Massive sets of structured and unstructured data. Offline, analytical workloads.

Flexibility
• ETL: Rigid, requiring data pipelines to be scripted, tested, and deployed. Difficult to adapt, costly to maintain.
• ELT: Data scientists and analysts have access to all the raw data. Data is prepared for analytics when needed, using self-service tools.

Time to insights
• ETL: Slow – data engineers spend a lot of time building data pipelines.
• ELT: Slow – data scientists and analysts spend a lot of time preparing the data for analytics.

Compliance
• ETL: Anonymizes confidential and sensitive information before loading it into the target data store.
• ELT: With raw data loaded directly into the big data store, there is a greater chance of accidental data exposure and breaches.

Technology
• ETL: Mature and stable, used for 20+ years. Supported by many tools.
• ELT: Comparatively new, with fewer data connectors and less advanced transformation capabilities. Supported by fewer professionals and tools.

Bandwidth and computation costs
• ETL: Can be costly due to lengthy, high-scale, and complex data processing. High bandwidth costs for large data loads. Can impact source systems when extracting large data sets.
• ELT: Can be very costly due to cloud-native data transformations. Typically requires a staging area. High bandwidth costs for large data loads.


A new approach: entity-based ETL (eETL)

eETL is a new approach that addresses the limitations of both traditional ETL and ELT, and delivers trusted, clean, and complete data that you can immediately use to generate insights.

An eETL tool pipelines data into a target data store by business entities rather than by database tables. Effectively, it applies the ETL (extract-transform-load) process to a business entity.

At the foundation of the eETL approach is a logical abstraction layer that captures all the attributes of any given business entity (such as a customer, product, or order), from all source systems. Accordingly, data is collected, processed, and delivered per business entity (instance) as a complete, clean, and connected data asset.

In the extract phase, the data for a particular entity is collected from all source systems. In the transform phase, the data is cleansed, enriched, anonymized, and transformed – as an entity – according to predefined rules. In the load phase, the entity data is safely delivered to any big data store.

eETL runs concurrently on thousands of business entity instances at a time, to support enterprise scale and sub-second response times. As opposed to batch processes, this is done continuously, by capturing data changes in real time from all source systems and streaming them through the business entity layer to the destination data store.

Collecting, processing, and pipelining data by business entity, continuously, ensures fresh data and data integrity by design. You wind up with insights you can trust, because you start with data you can trust.
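A minimal sketch of the entity-based idea in Python: the attributes of a single customer are gathered from several assumed source systems (CRM, billing, orders), cleansed and masked as one unit, and only then delivered to the target. This illustrates the concept only; it is not K2View’s actual implementation, and the schemas and table names are hypothetical.

import hashlib
import json
import sqlite3

def build_customer_entity(customer_id, crm, billing, orders):
    """Assemble one complete, connected customer entity from all source systems."""
    name, email = crm.execute(
        "SELECT name, email FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    invoices = billing.execute(
        "SELECT amount, status FROM invoices WHERE customer_id = ?", (customer_id,)
    ).fetchall()
    order_totals = orders.execute(
        "SELECT total FROM orders WHERE customer_id = ?", (customer_id,)
    ).fetchall()

    return {
        "customer_id": customer_id,
        "name": name.strip().title(),                              # cleanse
        "email_hash": hashlib.sha256(email.encode()).hexdigest(),  # anonymize PII
        "open_invoices": [amt for amt, status in invoices if status == "OPEN"],  # enrich
        "lifetime_value": sum(total for (total,) in order_totals),               # enrich
    }

def load_entity(target, entity):
    """Deliver the complete, cleansed entity to the target store as a single unit."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS customer_360 (customer_id INTEGER PRIMARY KEY, doc TEXT)"
    )
    target.execute(
        "INSERT OR REPLACE INTO customer_360 VALUES (?, ?)",
        (entity["customer_id"], json.dumps(entity)),
    )
    target.commit()

Because each unit of work is one small, self-contained entity rather than a join across massive tables, many such entities can be processed in parallel, and each one arrives at the target already complete, connected, and masked.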


ETL vs. ELT vs. eETL

The comparison below summarizes the ETL, ELT, and eETL approaches to data pipelining:

Process
• ETL: Data is extracted in bulk from the sources, transformed, then loaded into a DWH/lake. Typically batch.
• ELT: Raw data is extracted and loaded directly into a DWH/lake, where it is transformed. Typically batch.
• eETL: ETL is multi-threaded by business entities. Data is clean, fresh, and complete by design. Batch or real time.

Primary use
• ETL: Smaller sets of structured data that require complex data transformation. Offline, analytical workloads.
• ELT: Massive sets of structured and unstructured data. Offline, analytical workloads.
• eETL: Massive amounts of structured and unstructured data, with low impact on sources and destinations. Complex data transformation is performed in real time at the entity level, leveraging a 360-degree view of the entity. Operational and analytical workloads.

Flexibility
• ETL: Rigid, requiring data pipelines to be scripted, tested, and deployed. Difficult to adapt, costly to maintain.
• ELT: Data scientists and analysts have access to all the raw data. Data is prepared for analytics when needed, using self-service tools.
• eETL: Highly flexible, easy to set up and adapt. Data engineers define the entity data flows. Data scientists decide on the scope, time, and destination of the data.

Time to insights
• ETL: Slow – data engineers spend a lot of time building data pipelines.
• ELT: Slow – data scientists and analysts spend a lot of time preparing the data for analytics.
• eETL: Quick – data preparation is done instantly and continuously, in real time.

Compliance
• ETL: Anonymizes confidential and sensitive information before loading it into the target data store.
• ELT: With raw data loaded directly into big data stores, there is a greater chance of accidental data exposure and breaches.
• eETL: Data is anonymized, in compliance with privacy regulations (GDPR, CCPA), before it is loaded into the target data store.

Technology
• ETL: Mature and stable, used for 20+ years. Supported by many tools.
• ELT: Comparatively new, with fewer data connectors and less advanced transformations. Supported by fewer professionals and tools.
• eETL: Mature and stable, used for 12+ years. Proven at very large enterprises, at massive scale.

Bandwidth and computation costs
• ETL: Can be costly due to lengthy, high-scale, and complex data processing. High bandwidth costs for large data loads. Can impact source systems when extracting large data sets.
• ELT: Can be very costly due to cloud-native data transformations. Typically requires a staging area. High bandwidth costs for large data loads.
• eETL: Low computing costs, since transformation is done per digital entity, on commodity hardware. No data staging. Bandwidth costs are reduced by 90% due to smart data compression.

The eETL process

Entity-based pipelining resolves the scale and processing drawbacks of ETL, since all phases (extract, transform, load) are performed on the small amount of data belonging to each entity, rather than on massive database tables joined together.

Entity-based ETL technology supports real-time data movement through messaging, streaming, and change data capture (CDC) methods for data integration and delivery – and matches the right integrated data to the right business entity in flight, as sketched below.
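As a rough illustration of this in-flight matching, the following sketch keys each change event to its business entity and refreshes that entity’s state as the event arrives. The event format, field names, and customer identifier are assumptions made purely for the example.

# Hypothetical change-data-capture (CDC) events arriving from several source systems.
cdc_events = [
    {"source": "crm",     "customer_id": 42, "field": "email",          "value": "a@b.com"},
    {"source": "billing", "customer_id": 42, "field": "invoice_status", "value": "OPEN"},
]

entities = {}  # in-flight entity state, keyed by business entity (customer_id)

def apply_change(event):
    """Match each change to its business entity and refresh that entity in flight."""
    entity = entities.setdefault(event["customer_id"], {"customer_id": event["customer_id"]})
    entity[event["field"]] = event["value"]
    return entity

for event in cdc_events:
    refreshed = apply_change(event)
    # In a real pipeline, the refreshed entity would now be masked and streamed
    # onward to the destination data store, as in the load_entity sketch above.
    print(refreshed)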

eETL represents the best of both (ETL and ELT) worlds, because the data in the target store is:

• Extracted, transformed, and loaded – from all sources, to any data store, at any scale – via any data integration method: messaging, CDC, streaming, virtualization, JDBC, and APIs

• Always clean, fresh, and analytics-ready

• Built, packaged, and reused by data engineers, for invocation by business analysts and data scientists

• Continuously enriched and connected, to support complex queries and avoid the need for running


Bottom line – which is best for you?

As described above, both the ETL and ELT methods have their advantages and disadvantages.

Applying an entity-based approach to data preparation and pipelining enables you to overcome the limitations of both ETL and ELT, and to deliver clean, ready-to-use data at high scale, without compromising on flexibility, data security, or privacy compliance requirements.

About K2View

K2View provides an operational data fabric dedicated to making every customer experience personalized and profitable.

The K2View platform continually ingests all customer data from all systems, enriches it with real-time insights, and transforms it into a patented Micro-Database™ – one for every customer. To maximize performance, scale, and security, every micro-DB is compressed and individually encrypted. It is then delivered in milliseconds to fuel quick, effective, and pleasing customer interactions.

Copyright 2021 K2View. All rights reserved. Digital Entity and Micro-Database are trademarks of K2View. Content subject to change without notice.