Modern Data Warehouse

Stephen Alex

BI & Big Data Architect

AGENDA

History and Milestones

Traditional Data Warehouse

Key trends breaking the traditional data warehouse

Modern Data Warehouse

Massively parallel processing (MPP) architecture

Hadoop Ecosystem

Technical Innovation on Hadoop

Big Data Value Assessment

Rolta AdvizeX Confidential & Proprietary 9/11/2016

History and Milestones

1970s: Relational model invented

1984: DB2 released, RDBMS declared mainstream

1990: RDBMS takes over

The Traditional Data Warehouse

Central repository for all internal data in a company.

Overall relational schema.

Predictable data structure and quality optimize processing and reporting.

Data is stored in disk-block format.

The fundamental operation is reading a row.

Indexing via B-trees.

Dynamic row-level locking.

Data transfer usually happens at end of day (EOD).
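As a rough illustration of the row-oriented reads and B-tree indexing listed above, the sketch below uses Python's built-in sqlite3 module as a stand-in for a warehouse RDBMS. The `sales` table, its columns, and the index name are hypothetical. SQLite has no row-level locking, so only the indexing and row-read points are shown.

```python
import sqlite3

# In-memory SQLite database standing in for a relational warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)

# SQLite stores indexes as B-trees; this one is keyed on customer_id.
conn.execute("CREATE INDEX idx_sales_customer ON sales (customer_id)")

# Ask the query planner how it would locate the rows for one customer.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM sales WHERE customer_id = ?", (42,)
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # column 3 holds the plan detail

# The engine still returns whole rows; the index only narrows which rows to read.
rows = conn.execute("SELECT * FROM sales WHERE customer_id = ?", (42,)).fetchall()
print(plan_text)
print(len(rows), "rows read")
```

The plan text should mention the index, which means the engine walks the B-tree rather than scanning every row.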

Key Trends Breaking The Traditional Data Warehouse

Key Related Business and IT Trends

Emerging Technologies are disruptive by nature and play a key role in driving digital business and the related business trends.

Business Ecosystems enable each of the business trends, and organizations are aggressively searching for ways to leverage the role they play in the business ecosystem.

Business Moments provide opportunities to capture value by setting in motion a series of events and actions involving a network of people, businesses and things that spans or crosses multiple industries and business ecosystems.

Digital Economics seeks to harvest value from across the business ecosystem by identifying business moments of opportunity and exploiting the economics of connections. This early-stage trend will have increasing importance as business models evolve to leverage algorithmic business.

Algorithmic Business propels organizations to leverage business algorithms to drive value in the business ecosystem. In this early-stage trend, we are starting to see organizations transforming data with algorithms to drive intelligent actions, particularly with the IoT.

The Risks of Bottlenecks in Data Movement

Hadoop Changes the Game

Storage and Compute on One Platform

Modern Data Warehouse

Incorporates Hadoop, traditional data warehouses, and other data stores.

Includes multiple repositories that may reside in different locations.

Includes data from the cloud, mobile devices, sensors, and the Internet of Things.

Includes structured, semi-structured, unstructured, and raw data.

Inexpensive commodity hardware in cluster mode

Massively parallel processing (MPP) architecture

MPP architecture enables extremely powerful distributed computing and scale.

Resources can be added for a near-linear scale-out to the largest data warehousing projects.

MPP architecture uses a “shared-nothing” design: there are multiple physical nodes, each running its own instance. This can result in performance many times faster than traditional architectures.
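The shared-nothing idea can be sketched in a few lines of Python, assuming a simple hash-partitioning scheme. All names here (`node_for`, `distribute`, `mpp_sum_by_region`) and the sample rows are hypothetical, and the "nodes" are plain lists processed one after another. A real MPP system would run each node's work in parallel on separate hardware.

```python
import zlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Pick the node that owns a key, using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes):
    """Hash-partition rows so every node holds a disjoint slice of the data."""
    nodes = [[] for _ in range(num_nodes)]
    for region, amount in rows:
        nodes[node_for(region, num_nodes)].append((region, amount))
    return nodes


def local_sum(node_rows):
    """Aggregation done on one node, touching only that node's own rows."""
    totals = defaultdict(float)
    for region, amount in node_rows:
        totals[region] += amount
    return dict(totals)


def mpp_sum_by_region(rows, num_nodes=4):
    """Coordinator: scatter rows, aggregate per node, then merge the results."""
    merged = {}
    for node_rows in distribute(rows, num_nodes):
        # Partitioning by region means no region spans two nodes,
        # so merging is a plain dictionary update.
        merged.update(local_sum(node_rows))
    return merged


rows = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("north", 1.0), ("west", 4.0)]
result = mpp_sum_by_region(rows)
print(result)
```

Because no node needs another node's rows, adding nodes mostly just shrinks each node's share of the work, which is where the near-linear scale-out comes from.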

Apache Hadoop Ecosystem

The Hadoop ecosystem components are Apache Software Foundation projects.

The components are categorized into file system and data store, serialization, job execution, and others as shown on the image.

Hadoop / BDD Ecosystem

Technology: Purpose

Hadoop Distributed File System (HDFS): Distributed file system that provides high-throughput access to application data. Data is split into blocks and distributed across multiple nodes in the cluster.

Hadoop YARN: Framework for job scheduling/monitoring and cluster resource management.

Hive: Facilitates ad hoc queries over data stored in HDFS. Uses HiveQL, which is a SQL-like language. Provides a relational view of data stored in HDFS.

HCatalog: Built on the Hive metastore, HCatalog provides a table and storage management layer for Hadoop.

Spark: Powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.

Pig: A high-level platform for creating MapReduce programs. BDD uses Pig to manipulate data prior to ingesting via data processing.

Oozie: The workflow scheduler system for managing Apache Hadoop jobs. BDD uses Oozie for workflow management (sampling, profiling, enrichment).

Sqoop: Tool for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

Flume: Tool for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.

ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Hue: A set of web applications that enable you to interact with a CDH cluster.
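To make the HDFS entry concrete, here is a minimal sketch of splitting a file into blocks and spreading replicas across nodes. The function names, node names, and tiny block size are hypothetical. Round-robin placement is a simplification: real HDFS placement is rack-aware and decided by the NameNode.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for HDFS block placement, which is rack-aware.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"x" * 1000                                # stand-in for a file's contents
blocks = split_into_blocks(data, block_size=256)  # toy size; HDFS defaults to 128 MB
nodes = ["node1", "node2", "node3", "node4"]      # hypothetical DataNode names
placement = place_blocks(len(blocks), nodes)      # 3 replicas, the HDFS default
print(len(blocks), "blocks")
print(placement)
```

Each block lands on several nodes, so a reader can fetch different blocks from different machines at once and the file survives the loss of a single node.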

Top Three Hadoop Vendors

Oracle BDD Technical Innovation on Hadoop

Key Features and Functionality:

Find

• Access a rich, interactive catalog of all data in Hadoop

• Use familiar search and guided navigation to find information quickly

• See data set summaries, user annotation and recommendations

• Provision personal and enterprise data to Hadoop via self-service

Explore

• Visualize all attributes by type

• Sort attributes by information potential

• Assess attribute statistics, data quality and outliers

• Use a scratch pad to uncover correlations between attributes

Transform

• Get the data ready for analytics via intuitive, user-driven data wrangling

• Leverage an extensive library of data transformations and enrichments

• Preview results, undo, commit and replay transforms

• Test on sample data in memory then apply to full data set in Hadoop

Discover

• Join and blend data for deeper perspectives

• Compose project pages via drag and drop

• Use powerful search and guided navigation to ask questions

• See new patterns in rich, interactive data visualizations

Share

• Share projects, bookmarks and snapshots with others

• Build galleries and tell Big Data stories

• Collaborate and iterate as a team

• Publish blended data to HDFS for leverage in other tools

Components of Big Data Discovery

Big Data Value Assessment

Descriptive analytics looks at past performance and explains it by mining historical data for the reasons behind past success or failure. This is traditional BI work.

Predictive analytics answers the question "what will happen?" Historical performance data is combined with rules, algorithms, and external data to determine the probable future outcome of an event or the likelihood of a situation occurring.

Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen.
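The difference between the first two levels can be shown with a small sketch. The quarterly revenue figures are made up for illustration, and a straight-line fit is only one of many possible predictive models. Prescriptive analytics is left out because it depends on business rules the slides do not specify.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical history: revenue (in millions) for four past quarters.
quarters = [1, 2, 3, 4]
revenue = [3.0, 5.0, 7.0, 9.0]

# Descriptive: summarize what already happened.
average_revenue = mean(revenue)

# Predictive: extrapolate the historical trend to the next quarter.
slope, intercept = fit_line(quarters, revenue)
forecast_q5 = slope * 5 + intercept

print("average so far:", average_revenue)
print("forecast for quarter 5:", forecast_q5)
```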

Figure labels: Basic Analytics, Advanced Analytics; Descriptive, Predictive, Prescriptive.

Thank You!!!

Stephen Alex

BI & Big Data Architect

(732) 485-0011(m)
