The Complete Buyer's Guide For Intelligent Data Virtualization
Future-proof your technology choices and avoid vendor lock-in
Buyer's Guide, 2020
“By 2022, 60% of all organizations will implement data virtualization as one key delivery style in their data integration architecture.”
GARTNER, Market Guide for Data Virtualization
ABOUT THIS GUIDE
Data virtualization technology has been around for quite some time, primarily supporting operational
use cases. However, the explosion of data in size and variety, coupled with the increased focus on
analytical use cases, has created new challenges for legacy data virtualization technologies. The
need for ad-hoc access to both live and historical data is steadily increasing among business users, and that demand stretches the limits of even the most robust analytics tools. Moreover, the volume and speed with which data is generated are beyond the capacity and economic bounds of today's typical enterprise infrastructures.
In this guide, we will lay out some key considerations for organizations looking to apply data
virtualization to their analytics use cases. Along with key features and capabilities, we’ll discuss the
differences between data virtualization and query federation, drill down on caching techniques and
include a detailed ranking for evaluating vendors in the space.
TABLE OF CONTENTS
What Is Data Virtualization?
Is Your Organization Ready For Data Virtualization?
Getting Started On Your Search
What Is “Intelligent” Data Virtualization?
Key Considerations
    Analytics Versus Operational Use Cases
    Data Sources & Connections
    Development Environment
    Calculations & Analytical Functions (OLAP)
    Query Performance & Caching Approaches
    Client-side Requirements
    Security & Governance
Feature Ranking
Conclusion
Resources & Further Reading
WHAT IS DATA VIRTUALIZATION?
It wasn't that long ago that most believed a centralized data warehouse could be the data panacea for the enterprise. It didn't take long to realize that the demands of the business were too fluid to wait on IT to collect, normalize and store data in a single physical location. Since then, the goal of providing an enterprise data “single source of truth” has become even more elusive. The explosion of data sizes, varieties and formats has made it all but impossible to standardize on a single storage format. Even worse, with the emerging popularity of the public cloud, data stewards now need to deal with data locale as well. While it's tough enough for those who need to engineer and secure the data, it's even harder for the consumers of data. We are now asking our business users to engineer, wrangle and access data on-premise, in the cloud, in databases and in data lakes.
IS YOUR ORGANIZATION READY FOR DATA VIRTUALIZATION?

If these types of challenges sound familiar, you may be ready to explore data virtualization. By
abstracting away how and where data is stored, data virtualization frees the enterprise from the
inflexibility and costs that data silos beget. By presenting users with logical data views, data
virtualization creates agility for the enterprise while future-proofing the technology choices for
years to come.
Even though data virtualization technology has been around for several years, the emergence of the
public cloud and the explosion in the number of data platforms and analytics tools has made virtualization
an essential component for enterprises. According to Gartner, by 2022, “60% of all organizations will
implement data virtualization as one key delivery style in their data integration architecture.”
GETTING STARTED ON YOUR SEARCH
There are a variety of data virtualization approaches in the market and it’s important to understand
the different styles before launching your search. According to Andrew Brust in his GigaOM report
titled “Data Virtualization: A Spectrum of Approaches”, there are four main approaches or styles for
data virtualization.
APPROACH: Core Data Virtualization Platforms
DESCRIPTION: A standalone virtual data (semantic) layer that abstracts the physical data platform and location and allows queries across disparate data platforms.
PROS: Platform independent; true semantic layer; caching for performance
CONS: Separate service to manage; data needs to be modeled
EXAMPLE VENDORS: AtScale, Denodo, Dremio, TIBCO DV

APPROACH: SQL-On-Anything
DESCRIPTION: A query federation engine that allows a single SQL query to combine data from more than one data platform.
PROS: Good for file-based access; scale out with clustering
CONS: Users need to understand remote schemas; unpredictable performance
EXAMPLE VENDORS: Amazon Athena, Apache Drill, Presto

APPROACH: Remote Data Source Bridges
DESCRIPTION: A database extension embedded in an RDBMS that allows SQL access to remote databases using “external” tables.
PROS: Good when one primary warehouse is used; no separate service to manage
CONS: Users need to understand remote schemas; unpredictable performance
EXAMPLE VENDORS: Amazon Redshift Spectrum, IBM Db2 Big SQL, Microsoft SQL Server PolyBase, Oracle Big Data SQL, Teradata QueryGrid

APPROACH: Autonomous Data Warehouses
DESCRIPTION: A platform that automates the modeling, integration and connectivity of source data, which is then loaded into a target data platform.
PROS: Data stays in one place; performance is easier to predict
CONS: Requires data movement (data copies); data latency or staleness
EXAMPLE VENDORS: Data Virtuality, Infoworks.io, Incorta

Illustration 1: Approaches to data virtualization.
To further summarize, you can think of these approaches as pure data virtualization, database federation and next-generation ETL tools. If you are looking for the most flexible approach, the pure data virtualization platforms are your best choice. These platforms provide the best degree of “future-proofing” since they allow users to plug and play with new data platforms as they emerge. Also, these tools usually allow users to define virtual calculations, which promotes the consolidation of business terms, definitions and calculations in a central location. In contrast, the database federation tools force
you into a single database platform while the autonomous data warehouses require data movement and
lock customers into a single vendor’s ecosystem.
WHAT IS “INTELLIGENT” DATA VIRTUALIZATION?
Traditional data virtualization platforms typically rely on the operator to manually tune query
performance. This becomes a difficult task especially when dealing with multiple data platforms. A
federated query that joins data from more than one data platform can create an unreasonable amount
of data movement in real time. Simple memory and table caching are not enough to handle the variation
in query patterns, and transferring atomic data for heterogeneous joins may create an unacceptable
volume of network traffic which results in a poor and frustrating user experience.
“Intelligent” data virtualization addresses these scale and performance issues. Intelligent data virtualization platforms avoid high volumes of real-time traffic from federated joins by creating and managing a distributed, data-platform-optimized cache. By inspecting query patterns as well as data platform and network performance characteristics, this approach ensures that the heavy query lifting is pushed down to the remote data platforms. By avoiding unnecessary data transfers, intelligent data virtualization delivers more consistent query performance with far less resource consumption.
Recommendation: The Core Data Virtualization style of data virtualization delivers
the most flexibility for future-proofing your information architecture and minimizing
vendor lock-in.
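To make the idea more concrete, here is a minimal, purely illustrative sketch of the kind of routing decision an intelligent virtualization engine makes: check whether an existing cached aggregate covers an incoming query's dimensions and measures, and push the query down to the source platform only when it does not. The class and field names below are assumptions for illustration, not any vendor's actual design.

```python
# Illustrative only: an "aggregate" is a cached summary table described by the
# grouping columns and measures it holds.
from dataclasses import dataclass

@dataclass(frozen=True)
class Aggregate:
    name: str
    dimensions: frozenset   # grouping columns materialized in the cache
    measures: frozenset     # measures materialized in the cache

@dataclass
class Query:
    dimensions: set
    measures: set

def route(query: Query, aggregates: list) -> str:
    """Serve from a cached aggregate when it covers the query; otherwise push down."""
    for agg in aggregates:
        if query.dimensions <= agg.dimensions and query.measures <= agg.measures:
            return f"serve from cached aggregate '{agg.name}'"
    return "push down to the source platform (and consider building a new aggregate)"

# A cache keyed on (region, month) can also answer a region-level query, because its
# rows can be further summed; a customer-level query cannot and is pushed down.
sales_by_region_month = Aggregate(
    name="agg_sales_region_month",
    dimensions=frozenset({"region", "month"}),
    measures=frozenset({"sales_amount"}),
)
print(route(Query({"region"}, {"sales_amount"}), [sales_by_region_month]))
print(route(Query({"customer_id"}, {"sales_amount"}), [sales_by_region_month]))
```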
KEY CONSIDERATIONS
When choosing a vendor, there are a few core capabilities to keep in mind. Depending on your needs,
you can weigh the options accordingly. The following categories are further broken down in our
checklist later in this document.
Analytics Versus Operational Use Cases
Until recently, traditional data virtualization has focused primarily on operational application
integration use cases. In the last few years, enterprises have expanded their use of data virtualization
beyond just development or test deployments. According to Gartner, in 2011 only 11% of surveyed organizations reported using data virtualization, primarily for operational application integration: as a semantic tier over multiple datasets that were not permanently stored in operational data stores, warehouses or marts. By 2018, however, as many as 40% of organizations were using data virtualization, with the majority targeting analytics use cases.
If you are targeting operational application integration as a use case, traditional data virtualization
platforms may be good enough. In these use cases, the data size and query profiles tend to be more predictable and thus better suited to a one-time optimization effort. However, analytics workloads are much more varied and tend to scan and aggregate large amounts of data. For these use cases, it is essential that the virtualization solution support autonomous performance management. One-time, manual performance tuning is not suitable for analytical workloads and will likely lead to low end-user adoption due to unpredictable and poor query performance.
Recommendation: When choosing a vendor, make sure that query performance
management is automated based on user query patterns, network performance and
data platform capabilities.
Recommendation: For analytical use cases, choose an intelligent data virtualization
platform that automatically manages query performance.
Data Sources & Connections
There are two questions to consider when choosing a data virtualization solution with regard to data platform connections.

The first is rather obvious: how many different data platforms or data sources does the solution support? To maximize flexibility, your vendor should support data sources that are (1) relational (e.g., Oracle, Teradata, Snowflake), (2) file-based (e.g., CSV, JSON, XML, HDFS, S3), (3) API-based (e.g., REST, HTTP) and (4) application-based (e.g., Salesforce, Workday, ServiceNow). With this type of coverage, your virtual view designers can incorporate just about any data, on-premise, in the cloud, structured and unstructured, without involving ETL or manual data movement.
The second consideration may be less obvious but is even more critical than the first. The question to
ask is “Where does the query processing happen?”. There are two different answers to this question.
The first approach is a “least common denominator” approach. In this style, the virtualization engine exposes its own proprietary SQL dialect (usually based on Postgres) and is responsible for all of the data processing and aggregation. This approach becomes problematic when your data platforms have varied data types or require platform-specific functions. With a least common denominator approach, the virtualization engine must stream atomic-level data to the vendor's proprietary query engine in order to apply calculations and aggregation functions. This approach tends to break down when data is large, since query pushdown to the data platform is limited.
The second approach is pure query pushdown. In this approach, queries are passed through to the data platform, and aggregations and platform-specific functions are performed remotely. In this case, for table joins, only the pre-aggregated data is streamed to the virtualization engine, which minimizes query time and data movement.
For analytical use cases, it’s imperative to minimize query time data transfer for heterogeneous table joins.
Recommendation: For analytical use cases, choose a data virtualization platform
that maximizes query pushdown to minimize query time data movement.
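To make the difference concrete, the sketch below contrasts the two styles using two in-memory SQLite databases as stand-ins for remote data platforms. It is only an illustration (the table, column and connection names are made up), not any particular vendor's behavior: the point is how many fact rows must cross the network before the heterogeneous join.

```python
# Illustrative contrast: "pull atomic rows, aggregate centrally" versus
# "push aggregation down, join small results".
import sqlite3

warehouse = sqlite3.connect(":memory:")   # stands in for a cloud data warehouse
crm = sqlite3.connect(":memory:")         # stands in for an operational CRM database

warehouse.executescript("""
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 100.0), (1, 50.0), (2, 75.0), (3, 20.0);
""")
crm.executescript("""
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'AMER'), (3, 'EMEA');
""")

# Dimension lookup is small either way; it is the fact rows that matter.
regions = dict(crm.execute("SELECT customer_id, region FROM customers"))

# Least-common-denominator style: stream every atomic fact row to the
# virtualization engine and aggregate there. Data moved grows with fact size.
atomic_rows = warehouse.execute("SELECT customer_id, amount FROM sales").fetchall()
lcd_result = {}
for customer_id, amount in atomic_rows:
    lcd_result[regions[customer_id]] = lcd_result.get(regions[customer_id], 0.0) + amount
print("LCD style :", lcd_result, f"({len(atomic_rows)} fact rows moved)")

# Pushdown style: the warehouse aggregates locally, so only one pre-aggregated
# row per customer crosses the network before the (now tiny) join.
pushed_rows = warehouse.execute(
    "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
).fetchall()
pushdown_result = {}
for customer_id, total in pushed_rows:
    pushdown_result[regions[customer_id]] = pushdown_result.get(regions[customer_id], 0.0) + total
print("Pushdown  :", pushdown_result, f"({len(pushed_rows)} pre-aggregated rows moved)")
```

With real analytical fact tables the gap between the two row counts is typically several orders of magnitude, which is why pushdown matters so much for heterogeneous joins.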
Development Environment
When data virtualization first arrived on the scene in the early 2000s, IT managed data pipelines and the
centralized data warehouse was king. Later, business intelligence (BI) innovators like Tableau and Qlik
democratized reporting and analytics with their user-friendly, self-service tools. As these tools took off in the enterprise, IT's role as the centralized arbiter of data access became the exception rather than the norm.
When evaluating data virtualization vendors, it’s important to keep this democratization and self service
trend in mind. Therefore, a citizen data scientist or business analyst should be able to design virtual
data views and semantic models in the data virtualization platform. In fact, the process of defining
virtual data views should be as easy as building a Tableau worksheet. With this in mind, it’s imperative
that you choose a data virtualization vendor that supports a web-based (as opposed to desktop based)
design environment that promotes multiple, simultaneous designers and leverages a library to promote
re-use and standardization.

Recommendation: Choose a data virtualization platform with a web-based, multi-user design environment with libraries to promote re-use and enforce standardization.

Calculations & Analytical Functions (OLAP)
If you are looking to address analytical use cases, it’s imperative that your chosen platform can handle
the complex analytical calculations required by business applications. For example, just about every
analytical use case requires time calculations and time intelligence. At a minimum, this means that your
data virtualization platform must support period-over-period (e.g., this year/last year), period-to-date (e.g., year to date, quarter to date) and moving averages (e.g., monthly average sales). In addition, many analytical use cases require semi-additive and non-additive metrics: distinct counts for counting unique customers, and first and last functions for reporting on beginning and ending inventory levels. These types of calculations are often referred to as OLAP calculations. In addition to OLAP functionality, support for the OLAP protocol (MDX) means that the virtual platform can also serve queries from tools like Excel, Cognos, BusinessObjects and MicroStrategy that speak the MDX protocol.
Without support for analytical functions and MDX (OLAP), analytical workloads must be pushed down
into the tooling layer which subverts the power of the universal semantic layer and often results in
inconsistent business definitions and conflicting reports.
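As a rough illustration of the calculations involved, the pandas sketch below computes a period-over-period comparison, a year-to-date running total and a distinct customer count: the kinds of definitions a semantic layer should hold once so every tool reports the same numbers. The column names and data are hypothetical, and this is plain Python rather than MDX.

```python
# Hypothetical order data; the semantic layer would define these measures once.
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(
        ["2019-01-15", "2019-02-10", "2020-01-20", "2020-02-05", "2020-02-25"]),
    "customer_id": [1, 2, 1, 3, 3],
    "sales": [100.0, 80.0, 120.0, 90.0, 60.0],
})
orders["year"] = orders["order_date"].dt.year
orders["month"] = orders["order_date"].dt.month

# Monthly sales plus a non-additive distinct count of customers.
monthly = (orders.groupby(["year", "month"])
           .agg(sales=("sales", "sum"),
                unique_customers=("customer_id", "nunique"))
           .reset_index())

# Period-over-period: compare each month with the same month of the prior year.
prior = monthly.rename(columns={"sales": "sales_ly", "year": "ly"})
prior["year"] = prior["ly"] + 1
pop = monthly.merge(prior[["year", "month", "sales_ly"]], on=["year", "month"], how="left")
pop["yoy_growth_pct"] = (pop["sales"] - pop["sales_ly"]) / pop["sales_ly"] * 100

# Period-to-date: running year-to-date sales within each year.
pop["sales_ytd"] = pop.sort_values(["year", "month"]).groupby("year")["sales"].cumsum()
print(pop)
```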
Query Performance & Caching Approaches
When evaluating vendors, this is arguably the area where you should spend most of your time. Without
consistent and performant query serving, a virtualization platform has little value.
In analytical use cases, business users are accustomed to interactive query performance since they typically query proprietary analytical databases or cubes that are designed for fast queries. As a result, a virtualization platform needs to match or beat the performance of the purpose-built solutions it replaces.
To make matters worse, data virtualization queries often include heterogeneous database joins that
further tax query performance.
Data virtualization solutions that simply cache query results or create cached tables are not sufficient
for analytical use cases. As mentioned above, analytical queries are too variable and often scan and
aggregate massive amounts of data on the fly.
Recommendation: Choose a data virtualization vendor that includes a
comprehensive performance management system that goes beyond simple caching
techniques.
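The toy example below shows why caching exact query results is fragile for ad-hoc analytics: keying a cache on the query text means any new filter or grouping misses it, whereas an aggregate-aware engine (as sketched earlier) can answer many variations from one summary table. The queries and cache logic here are purely illustrative.

```python
# Illustrative result cache keyed on the SQL text; nothing vendor-specific.
import hashlib

result_cache = {}

def run_query(sql: str) -> str:
    key = hashlib.sha256(sql.encode()).hexdigest()
    if key in result_cache:
        return "cache hit"
    result_cache[key] = "...rows..."   # pretend we executed and stored the result
    return "cache miss (full scan on the data platform)"

print(run_query("SELECT region, SUM(sales) FROM sales GROUP BY region"))            # miss
print(run_query("SELECT region, SUM(sales) FROM sales GROUP BY region"))            # hit
print(run_query("SELECT region, SUM(sales) FROM sales WHERE yr = 2020 GROUP BY region"))  # miss
print(run_query("SELECT city, SUM(sales) FROM sales GROUP BY city"))                # miss
```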
Client-side Requirements
In order to maximize the ROI for a data virtualization platform investment, it’s extremely important
to drive broad adoption for your data consumers in the enterprise. There’s little value to a universal
semantic layer if it’s not used by business analysts and data scientists as their primary means to
access data. Therefore, it's imperative that your chosen data virtualization vendor's platform supports a lightweight client-side footprint. For example, does your data virtualization vendor require the installation of custom drivers on client desktops? Does the virtualization platform require a desktop application for managing or discovering new virtual views? If the answer to either question is “yes”, IT must push client-side software and manage versioning on every desktop that needs access to data. In a large enterprise, this can be a huge inhibitor to end-user adoption.

Recommendation: Choose a data virtualization vendor with a zero client-side footprint, leveraging existing tooling connectors for accessing the virtualization layer.
Security & Governance
Since data virtualization platforms serve as middleware for analytical queries, it’s imperative that the
platform integrates with the enterprise’s security infrastructure. There are two main forms of security to
consider: authentication & authorization.
First, a data virtualization platform must integrate with the enterprise’s single sign-on infrastructure
in order to authenticate users, whether that be Active Directory (AD), LDAP, OAuth or other third party
authentication platforms. The authenticated user's identity must flow through from the client applications, and the data virtualization platform must synchronize users automatically.
Second, the data virtualization platform must include the ability to hide or mask sensitive columns,
limit data rows based on user access rules and impersonate users when querying the underlying data
sources. Impersonation is especially crucial since using a proxy user (instead of the query user) to query
underlying data sources may circumvent security policies for those data platforms and force users to
duplicate security policies in the virtualization layer.

Recommendation: Choose a data virtualization vendor that integrates with your single sign-on standards and supports column-level security, row-level security and impersonation.
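Below is a simplified sketch of how the three governance behaviors discussed above (column masking, row-level filtering and impersonation) might combine when a virtualization layer composes a query against an underlying source. The roles, rules and connection handling are hypothetical and only meant to make the concepts concrete.

```python
# Hypothetical governance rules; not any vendor's actual configuration model.
MASKED_COLUMNS = {"analyst": {"ssn", "credit_card"}}      # column-level security
ROW_FILTERS = {"analyst": "region = 'EMEA'"}               # row-level security predicate

def build_source_query(user: str, role: str, columns: list, table: str) -> dict:
    # Mask sensitive columns for this role instead of returning their values.
    visible = [c if c not in MASKED_COLUMNS.get(role, set()) else "'***' AS " + c
               for c in columns]
    sql = f"SELECT {', '.join(visible)} FROM {table}"
    predicate = ROW_FILTERS.get(role)
    if predicate:
        sql += f" WHERE {predicate}"
    # Impersonation: connect to the underlying platform as the querying user so the
    # source's own policies still apply, rather than as a privileged proxy account.
    return {"connect_as": user, "sql": sql}

print(build_source_query("jdoe", "analyst", ["customer_id", "ssn", "sales"], "crm.customers"))
```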
FEATURE RANKING
The following feature table is a tool for evaluating different vendors along the capability categories
described above. Use a number between 1 and 5 (5 being best) to score the vendor’s capabilities for
each feature. You may also use the weighting column to personalize the scoring results based on your
most important priorities.
FEATURE CATEGORY | FEATURE | SCORE (1-5, 5 = BEST) | WEIGHT (1-5, 5 = BEST) | WEIGHTED SCORE (CALCULATED)
Use Cases
Supports analytical workloads
Supports data science workloads
Supports operational workloads
Data Sources & Connections
Supports a large number of data sources
Supports relational/MPP data sources
Supports SaaS data sources (Salesforce, Workday)
Supports file-based data sources (CSV, Excel)
Supports web application based data sources (HTTP, REST)
Supports Big Data data sources (Hadoop, Spark)
Supports non-scalar data sources (JSON, Arrays, Structs)
Development Environment
Supports web-based development (versus client application)
Supports multiple, simultaneous editors for virtual view development
Supports object/calc re-usable across virtual views/projects
Supports development lifecycle (dev/test/prod)
Calculations and Analytical Functions (OLAP)
Supports Time Intelligence (period over period, period to date)
Supports post query calculations
Supports aggregation functions (SUM, AVG, MAX, MIN)
Supports non-additive metrics (Distinct Count, First, Last)
Supports MDX expressions
Supports live Excel pivot tables
Query Performance & Caching
Supports automated query performance management
Supports in-memory caching
Supports persistent (table) caching
Supports query & predicate pushdown (no least common denominator)
Supports data source/network capability-aware optimization
Client-side Requirements
Supports zero client install for data consumers
Supports zero client install for virtual view developers
Supports wide range of tooling natively (i.e. built-in)
Total
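If it helps to automate the arithmetic, the short sketch below computes weighted scores under the assumption that each feature's weighted score is its score multiplied by its weight (the guide does not prescribe a formula, so adjust as needed). The feature names and numbers are placeholders.

```python
# Placeholder scores for a single vendor; fill in one entry per feature above.
scores = {
    "Supports analytical workloads":                      {"score": 4, "weight": 5},
    "Supports automated query performance management":    {"score": 3, "weight": 5},
    "Supports zero client install for data consumers":    {"score": 5, "weight": 3},
}

total = 0
for feature, s in scores.items():
    weighted = s["score"] * s["weight"]   # assumed formula: score x weight
    total += weighted
    print(f"{feature:<55} {weighted:>3}")
print(f"{'Total':<55} {total:>3}")
```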
CONCLUSION
So much has changed since data virtualization platforms hit the scene in the early 2000s. Data
has grown in scale and has become more varied, the data lake arrived to compete with the data
warehouse, the public cloud became a compelling destination and data platforms continue to
proliferate, not consolidate. To summarize, here are the key recommendations to keep in mind as you
choose your vendor:
1. The Core Data Virtualization style of data virtualization delivers the most flexibility for future-proofing
your information architecture and minimizing vendor lock-in.
2. When choosing a vendor, make sure that query performance management is automated based on
user query patterns, network performance and data platform capabilities.
3. For analytical use cases, choose an intelligent data virtualization platform that automatically
manages query performance.
4. For analytical use cases, choose a data virtualization platform that maximizes query pushdown to
minimize query time data movement.
5. Choose a data virtualization platform with a web-based, multi-user design environment with libraries
to promote re-use and enforce standardization.
6. Choose a data virtualization vendor that includes a comprehensive performance management system
that goes beyond simple caching techniques.
7. Choose a data virtualization vendor with a zero client-side footprint, leveraging existing tooling
connectors for accessing the virtualization layer.
8. Choose a data virtualization vendor that integrates with your single sign-on standards and supports
column level security, row level security and impersonation.
As you can see, there's a lot to consider when choosing a data virtualization platform. Refer to this guide and ranking to help you bring order to the chaos and have the confidence to choose the best vendor to realize your ambitions of simplifying your data analytics stack.
ABOUT ATSCALE
The Global 2000 relies on AtScale – the intelligent data virtualization company – to provide a single, secured and governed
workspace for distributed data. The combination of the company’s Autonomous Data Engineering™ and Universal Semantic
Layer™ powers business intelligence and machine learning resulting in faster, more accurate business decisions at scale. For
more information, visit www.atscale.com.
RESOURCES & FURTHER READING
GigaOM Market Landscape Report: “Data Virtualization: A Spectrum of Approaches”
Gartner Research: “Market Guide for Data Virtualization”
ReadWrite: “How Intelligent Data Addresses the Chasm in Cloud”
AtScale Blog: “What is Intelligent Data Virtualization and Why Do You Need It?”