Drawing the Big Picture
Multi-Platform Data Architectures, Queries, and Analytics
Philip Russom, TDWI Research Director for Data Management
August 26, 2015
Sponsor
Speakers
Imad Birouty, Director, Technical Product Marketing, Teradata
Philip Russom, TDWI Research Director, Data Management
Agenda
• The Mission
– Queries, analytics, and other BI that reach multiple warehouse and data platforms simultaneously
• Enabling Technologies
– Modern data warehouse environments (DWEs)
– Single-console tools
– Data exploration and discovery
– Standard SQL, but extended
– Grid, fabric, virtualization, logical DW…
• Benefits of the single big picture
– New ways to view data and develop queries or analytics
– Simplification for architecture, governance, stewardship, compliance, auditing, security...
• Recommendations
PLEASE TWEET
@pRussom, @Teradata,
#TDWI, #Analytics, #BigData
The Mission Redux
• Today’s BI/DW/analytics demands:
– As much data as possible
– From more sources and source types
– In many structures or structure free
– Persisted on old and new data platform types
– Virtualized, as appropriate
– All the above, available all the time, for everyone
• We’ve always aspired toward these goals:
– But success is more likely today, because we have better
software, hardware, skills, best practices…
– We also have better executive support
• Organizations want more business value from big data, new data, analytics,
new data-driven business programs…
Enablers for the Revised Mission
• New tool types and functions, plus their disciplines & practices
– Data exploration and data discovery
– More agile data preparation
– Data visualization – ease of use, analytics, fun & compelling presentations, storytelling…
• New data platforms
– Hadoop, whether open source or vendor distro
– MPP RDBMSs, appliances & columnar
• Old skills and technologies, too
– SQL & other relational techs are as important as ever
• All the above, integrated and interoperable
– Single console – or as few tools as possible
– Single access & query method – SQL, but for any data, platform
– Data architecture – to integrate the back end
DEFINITION
Multi-Platform Data
Warehouse Environments
• Many enterprise data warehouses (EDWs) are evolving into
multi-platform data warehouse environments (DWEs).
• Users continue to add additional standalone data platforms to
their warehouse tool and platform portfolio.
• The new platforms don’t replace the core warehouse, because
it is still the best platform for the data that goes into standard
reports, dashboards, performance management, and OLAP.
• Instead, the new platforms complement the warehouse,
because they are optimized for workloads that manage,
process, and analyze new forms of big data, non-structured
data, and real-time data.
Modern DW Architectures are Complex
• The tech stack for DW, BI, DI, & analytics has always been a multi-platform environment.
• What’s new? The trend toward a portfolio of many physical data platforms has accelerated. A logical architecture that integrates them is very important.
• Why do it? More platform types serve more types of users, data & workloads.
[Diagram: a modern multi-platform data warehouse environment, accumulated over the passage of time. Components shown include: a data warehouse with a star or snowflake schema; federated data marts; customer marts and operational data stores (ODSs); a real-time ODS; metrics for performance management; multidimensional data models and OLAP cubes; OLAP DBMSs; data staging areas; detailed source data; a DW from a merger; DW appliances; columnar DBMSs; cloud-based DBMSs; Hadoop Distributed File System with MapReduce; NoSQL databases; analytic sandboxes; data federation & virtualization; and complex event processing with streaming data tools.]
Logical Data Warehouse
It’s a logical and/or virtual layer of the DW
architecture that complements the
physical layer of architecture under it.
DEFINITIONS OF THE
Logical Data Warehouse
• TDWI: A Data Warehouse is a user-defined data architecture
– The architecture & its design components must be populated by data
– But the data can be physical, logical/virtual, or both
– So, most DW architectures have two key layers: physical & logical
• Gartner’s view: A Logical DW depends on virtual tech
– From simple federation to object-oriented virtualization, plus virtual
views, indices, semantics, server memory…
• Building out the Logical Layer of your DW is important
– The logical layer enables cross-platform integration and
interoperability, for broad queries, exploration, analytics…
• The LDW layer provides a unified view (or a collection of views) of data in multiple platforms
– Plus a simplified (yet diverse & high-performance) collection of interfaces into such sources and targets to achieve interoperability, especially for queries
• The point of the LDW layer is to provide
– A fairly comprehensive big picture of data in the DWE
– A single layer through which data can be accessed, thereby reducing data redundancy, movement, and processing
– A simplified view & related mechanisms that enable more user types
• Similar concepts:
– Virtual DW (an LDW is often partially virtual, but mostly physical)
– Real-Time DW, Operational DW, Active DW, Dynamic DW
– Query Grid, Data Grid, Data Fabric
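As a small illustration of such a unified view, a logical-layer object can be defined once over two physical platforms. This is a hypothetical sketch: the `table@server` notation follows the Teradata QueryGrid convention used in the demo later in this deck, and all object names are assumptions.

```sql
-- Hypothetical logical-layer view: one object that unions warehouse
-- click data with raw clickstream data stored in Hadoop.
CREATE VIEW ldw.all_clicks AS
SELECT session_id, page_id, click_ts
FROM   dw.web_clicks                -- physical warehouse table
UNION ALL
SELECT session_id, page_id, click_ts
FROM   raw_clicks@hadoop_cluster;   -- Hadoop table reached via QueryGrid
```

Queries and analytic tools then target `ldw.all_clicks` without knowing which platform holds which rows.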
NEW ARCHITECTURES
Hadoop Integrated with a Relational DBMS
The strengths of one balance the weaknesses of the other
• A Relational DBMS is good at:
– Metadata management
– Complex query optimization
– Table joins, views, keys, etc.
– Security, including roles, directories
• HDFS & other Hadoop tools are good at:
– Massive, linear scalability
– Multi-structured & no-schema data
– Some ETL and ELT functions
– Custom code for algorithmic analytics
• Other platforms are also being tightly integrated with the relational DW
– Analytic DBMSs based on columnar, appliance, MapReduce, graph
• To make this integration of diverse data platforms practical
– Good design by users for the logical DW architectural layer
– Vendor tools that can reach all the above and more from one query
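One hedged sketch of this division of labor: Hadoop performs heavy ELT over raw data at scale, and the relational DW persists only the small refined result, where it benefits from query optimization and joins. The foreign-table notation follows the QueryGrid style shown in the demo later in this deck; all names are hypothetical.

```sql
-- Hypothetical sketch: let Hadoop do the heavy ELT over raw logs, and
-- persist only the small, refined result in the relational warehouse.
INSERT INTO dw.daily_error_counts (log_date, error_code, error_cnt)
SELECT log_date, error_code, COUNT(*)
FROM   raw_app_logs@hadoop_cluster   -- processed where the data resides
WHERE  severity = 'ERROR'
GROUP BY log_date, error_code;
```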
Importance of Data Exploration
• Exploring data is a first step to leveraging new data
– Never allow new data into a DW without proper vetting
– Assess the value & use cases for new (big) data via exploration
• Exploring data is a prerequisite to analyzing data
– By its nature, analysis makes correlations across data of diverse sources, structures, subjects, and vintages
– Finding just the right combination for successful analysis depends on data exploration as a first step
• High ease of use for user productivity
– Some users are business people who need a business-friendly view
– Ease of use accelerates developers’ productivity, too
• Support for all data platforms, from relational to Hadoop
– A modern data exploration tool will merge diverse data via a single complex query
• A data exploration tool must do more than exploration
– Profile data to understand its content and condition
– Extract data, model the result set, index big data
– Deduce data’s structure and develop metadata
– Perform tasks as you go, not ahead of time, for greater agility
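For instance, profiling a newly landed table before admitting it into the warehouse can be done with plain SQL. The table and column names below are hypothetical, chosen only to illustrate the kind of content-and-condition checks described above.

```sql
-- Hypothetical profiling query against a newly arrived staging table:
-- row counts, cardinality, null rates, and value ranges in one pass.
SELECT COUNT(*)                       AS row_cnt,
       COUNT(DISTINCT customer_id)    AS distinct_customers,
       SUM(CASE WHEN email IS NULL
                THEN 1 ELSE 0 END)    AS null_emails,
       MIN(order_date)                AS earliest_order,
       MAX(order_date)                AS latest_order
FROM   staging.new_orders;
```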
ITERATIVE, FOUR-STEP PROCESS FOR
Exploratory Analytics with New (Big) Data
[Diagram: an iterative cycle of four steps: Data Prep, Explore, Analyze, Visualize.]
A FEW REQUIREMENTS FOR
Advanced Analytics
• Market direction: Seamless integration
– In one tool environment, exploration, data prep, analysis, visualization, and more
– The iterative, four-step process of exploratory analytics demands tight tool integration
• Advanced forms of analytics
– Mining, predictive, statistics, NLP (not OLAP)
– Algorithmic, as well as query based
• Both canned and home-grown algorithms
– Tool should include library of pre-built algorithms
– Tool should also help you write your own
• High ease-of-use for broad collaboration
– Functions for both technical and business users
– Both develop analytic apps and consume them
– Assume that many user types will share their work
SQL is More Important than Ever
• Data professionals want and depend on SQL
– It must be ANSI standard, high performance, iterative, optimized
– Why? To leverage user skills and SQL-based tool portfolios
• SQL on Hadoop versus SQL off Hadoop argument
– Users interviewed want BOTH !
– In survey, SQL on Hadoop is a “must have” (69%)
– Only 4% don’t need SQL on Hadoop
Source: TDWI survey run in late 2014, based on 99 respondents.
SQL-Based Analytics
• Data Exploration = Ad-hoc queries on steroids
– A query grows in size, scope, and complexity with each iteration
• KLOCs = Thousands of Lines of [SQL] Code
– Whether tool-generated, hand-written, or both
• Complex SQL expresses many things
– Data access via many interfaces, near real time
– Data models, even dimensional ones
– Multi-way joins, but also complex transformations
• Growing number and diversity of users
– Data analysts, data scientists, BI/DW pros,
business analysts
• All the above demand a hefty tool environment
– As described on the next slide…
SUMMARY & CONCLUSION: TOOLS AND REQUIREMENTS FOR
Logical Data Warehousing and Other
Complex Data Ecosystems
• Look for tools and environments that enable:
– Designing and architecting a “big picture”
– Interoperability among diverse systems and data types
– Data operations optimized across multiple platforms
– ANSI SQL support; performance for iterative queries
• Features that help with complex data architectures:
– Distributed queries, in the extreme
– High performance, even with multiple platforms
– Metadata management and metadata deduction
– Easy ingestion of new data, whether streaming or static
– Real-time indexing, to keep pace with data ingestion
– Single-sign-on security, despite multiple systems
RECOMMENDATIONS
Draw the Big Picture for its Benefits
• Benefits of the unified big picture of data.
– New ways to view data & develop queries & analytics
– Simplification for data architecture, governance,
stewardship, compliance, auditing, security...
• Revisit your mission as a data professional
– Tons of data, sources, and source types, in many
structures (or structure free) persisted on old and
new data platform types (virtualized, as appropriate)
– All the above, available all the time, for everyone
• Satisfy new requirements with tools/platforms that provide a unified view
– Virtual DW and miscellaneous approaches to Real-Time DW
– Query Grid, Data Grid, Data Fabric
– Special functions: Hadoop, exploration, SQL-based analytics…
Teradata QueryGrid™
Imad Birouty, Director, Teradata Product Marketing
• 1990’s, the Data Mart: “Just Give Me Some Data, and Fast!”
• 2000’s, the EDW/IDW: “Give Me Good Data, But Do It Efficiently!”
• 2010’s, the Logical Data Warehouse: “Give Me All Data: Fast, Simple & Effective!”
What’s Different Today?
There Is No Single Technology That Can Do Everything
Higher volume of data
New sources of data
New types of data
New technologies
New economic models
Increased prevalence of analytics
What’s The Same Today?
• Users need access to all relevant data to make informed business decisions
• Users need timely access to data when they need it
• User skills and tools
Shift from a Single Platform to an Ecosystem
“We will abandon the old models based on the desire to implement for high-value analytic applications.”
“Logical” Data Warehouse
Not All Data Should Be Treated Equally
• Data of different value
– High value density: ERP, CRM, …
– Low value density: sensors, weblogs, social, …
• Different processing techniques required
– Structured data: SQL
– Multi-structured data: SQL, NoSQL
• Different integration requirements
– Pre-defined schema, integrated upon data acquisition (schema-on-write)
– Schema defined during query runtime (schema-on-read)
Regardless, data and analytics should be accessible.
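The contrast between the two integration styles can be sketched in DDL. The first statement is generic warehouse SQL; the second uses Hive-style external-table syntax over files already in Hadoop. All table names, columns, and the file path are hypothetical.

```sql
-- Schema-on-write (hypothetical): the schema is defined and enforced
-- when data is loaded into the warehouse.
CREATE TABLE dw.orders (
    order_id   INTEGER,
    vin        VARCHAR(17),
    order_date DATE
);

-- Schema-on-read (hypothetical, Hive-style): the raw files already sit
-- in Hadoop; the schema is merely projected onto them at query time.
CREATE EXTERNAL TABLE raw_orders (
    order_id   INT,
    vin        STRING,
    order_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/orders';
```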
Data Fabric Enabled by QueryGrid™
Analytic Flexibility to Meet Your Business Needs
• Pick your best-of-breed technology:
– Data types
– Analytic engines
– Economic options
• Run the right analytic on the right platform:
– Minimize data movement; process data where it resides
– Minimize data duplication
– Optimized work distribution through “push-down” processing
– Bi-directional data movement
Users direct their queries to a cohesive data fabric using existing SQL skills & tools. Focus on data and business questions, not integrating separate systems.
Teradata QueryGrid™ Demo
Metadata
Teradata Confidential
Goal: View Database in
Hadoop
HELP FOREIGN SERVER hdp21;
Metadata
Goal: View Tables in
Hadoop
HELP FOREIGN DATABASE "default"@hdp21;
Metadata
Goal: View Specific
Table in Hadoop
HELP FOREIGN TABLE "default".carpricedata@hdp21;
Querying Hadoop Table
Goal: Select a Sample of
Rows From a Hadoop Table
SELECT *
FROM sample_08@HDP21;
Multi-System Query
• For all cars that received warranty repair, find the reported Diagnostic Trouble Code (DTC)
– Requires data from both Hadoop and the Teradata data warehouse
– Query is passed through; data is not persisted
[Diagram: Teradata production data (VINs, service records, warranty data, DTC descriptions) joined via Teradata QueryGrid with raw multi-structured data in Hadoop (massive amounts of detailed sensor data).]
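A query of the shape just described might look like the following sketch. The `table@server` syntax matches the QueryGrid examples in the demo, but the table and column names are hypothetical, not taken from the demo itself.

```sql
-- Hypothetical multi-system query: join warranty repairs stored in the
-- Teradata warehouse with raw sensor readings stored in Hadoop.
SELECT w.vin,
       s.dtc_code,
       d.dtc_description
FROM   warranty_repairs w            -- Teradata warehouse table
JOIN   sensor_readings@hdp21 s       -- Hadoop table reached via QueryGrid
       ON w.vin = s.vin
JOIN   dtc_descriptions d            -- Teradata reference table
       ON s.dtc_code = d.dtc_code
WHERE  w.repair_type = 'WARRANTY';
```

The Hadoop rows flow through the join at query time; nothing is copied into the warehouse.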
Questions?
Contact Information
If you have further questions or comments:
Philip Russom, TDWI: [email protected]
Imad Birouty, Teradata