stephen-alex
AGENDA
History and Milestones
Traditional Data Warehouse
Key trends breaking the traditional data warehouse
Modern Data Warehouse
Massively parallel processing (MPP) architecture
Hadoop Ecosystem
Technical Innovation on Hadoop
Big Data Value Assessment
Rolta AdvizeX Confidential & Proprietary 9/11/2016
History and Milestones
1970s: Relational Model Invented
1984: DB2 released, RDBMS declared mainstream
1990: RDBMS takes over
The Traditional Data Warehouse
Central repository for all internal data in a company.
Overall relational schema.
Predictable data structure and quality optimize processing and reporting.
Data is stored in disk-block format
Fundamental operation is reading a row
Indexing via B-trees
Dynamic row-level locking
Data transfer usually happens at end of day (EOD)
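The row-oriented, B-tree-indexed access pattern listed above can be sketched with Python's built-in sqlite3 module. This is only an illustration, not a warehouse product: the sales table, its columns, and the index name are hypothetical. SQLite happens to store its indexes as B-trees, which is why it is used here.

```python
import sqlite3

# In-memory database standing in for a warehouse table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("west", 25.5), ("east", 7.25), ("north", 3.0)],
)

# SQLite keeps indexes in B-trees, mirroring "indexing via B-trees".
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")

# Ask the planner how it would locate rows for one region.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id, amount FROM sales WHERE region = ?", ("east",)
).fetchall()
plan_text = " ".join(row[3] for row in plan)  # 4th column holds the plan detail

# The fundamental operation: read the whole rows that match the key.
east_rows = conn.execute(
    "SELECT id, region, amount FROM sales WHERE region = ? ORDER BY id", ("east",)
).fetchall()
print(plan_text)
print(east_rows)
```

The printed plan should mention the index, showing that the lookup goes through the B-tree rather than scanning every row.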
Key Related Business and IT Trends
Emerging Technologies are disruptive by nature and play a key role in driving digital business and the related business trends.
Business Ecosystems enable each of the business trends, and organizations are aggressively searching for ways to leverage the role they play in the business ecosystem.
Business Moments provide opportunities to capture value by setting in motion a series of events and actions involving a network of people, businesses and things that spans or crosses multiple industries and business ecosystems.
Digital Economics seeks to harvest value from across the business ecosystem by identifying business moments of opportunity and exploiting the economics of connections. This early-stage trend will have increasing importance as business models evolve to leverage algorithmic business.
Algorithmic Business propels organizations to leverage business algorithms to drive value in the business ecosystem. In this early-stage trend, we are starting to see organizations transforming data with algorithms to drive intelligent actions, particularly with the IoT.
Modern Data Warehouse
Incorporates Hadoop, traditional data warehouses, and other data stores.
Includes multiple repositories that may reside in different locations.
Includes data from cloud, mobile devices, sensors, and the Internet of Things
Includes structured/semi-structured/unstructured, raw data
Inexpensive commodity hardware in cluster mode
Massively parallel processing (MPP) architecture
Massively parallel processing (MPP) architecture enables extremely powerful distributed computing and scale.
Resources can be added for near-linear scale-out, up to the largest data warehousing projects.
MPP architecture uses a “shared-nothing” design: there are multiple physical nodes, each running its own instance. This can result in performance many times faster than traditional architectures.
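A minimal sketch of the shared-nothing idea, assuming a hypothetical three-node cluster simulated in a single Python process: each row is routed to exactly one node by hashing its key, each node aggregates only its own slice, and a coordinator merges the partial results. The function names and the choice of CRC32 as the hash are illustrative assumptions, not part of any MPP product.

```python
from collections import defaultdict
import zlib

NUM_NODES = 3  # hypothetical cluster size


def node_for(key: str) -> int:
    """Pick the owning node with a stable hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Give each node its own private slice of the rows (nothing is shared)."""
    nodes = [[] for _ in range(NUM_NODES)]
    for key, amount in rows:
        nodes[node_for(key)].append((key, amount))
    return nodes


def local_sum(partition):
    """Work done independently on one node, using only its local data."""
    totals = defaultdict(float)
    for key, amount in partition:
        totals[key] += amount
    return dict(totals)


def mpp_sum(rows):
    """Coordinator step: merge the per-node partial results."""
    merged = {}
    for partial in map(local_sum, distribute(rows)):
        merged.update(partial)  # a key lives on one node only, so no overlap
    return merged


rows = [("east", 10.0), ("west", 25.5), ("east", 7.25), ("north", 3.0)]
totals = mpp_sum(rows)
print(totals)
```

Because every key lives on one node, adding nodes spreads both the data and the work, which is the intuition behind near-linear scale-out. A real system would run the per-node step in parallel on separate machines.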
Apache Hadoop Ecosystem
Hadoop ecosystem components are developed as Apache Software Foundation projects.
The components are categorized into file system and data store, serialization, job execution, and others, as shown in the image.
Hadoop / BDD Ecosystem
Technology | Purpose
Hadoop Distributed File System (HDFS) | Distributed file system that provides high-throughput access to application data. Data is split into blocks and distributed across multiple nodes in the cluster.
Hadoop YARN | Framework for job scheduling/monitoring and cluster resource management.
Hive | Facilitates ad hoc queries over data stored in HDFS. Uses HiveQL, which is a SQL-like language. Provides a relational view of data stored in HDFS.
HCatalog | Provides a table and storage management layer for Hadoop, built on the Hive Metastore.
Spark | Powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
Pig | High-level platform for creating MapReduce programs. BDD uses Pig to manipulate data prior to ingesting via data processing.
Oozie | Workflow scheduler system to manage Apache Hadoop jobs. BDD uses Oozie for workflow management (sampling, profiling, enrichment).
Sqoop | Tool for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
Flume | Tool for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.
ZooKeeper | Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Hue | Set of web applications that enable you to interact with a CDH cluster.
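The HDFS description above (data split into blocks and distributed across nodes) can be sketched as follows. This is a simplified illustration, not HDFS code: the block size, node names, replication factor, and round-robin placement are all assumptions made for the example. Real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the file contents into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes, replication: int):
    """Assign each block to `replication` distinct nodes, round-robin (assumed policy)."""
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"x" * 1000                      # hypothetical 1000-byte file
blocks = split_into_blocks(data, 256)   # toy block size for illustration
nodes = ["node1", "node2", "node3", "node4"]  # hypothetical data nodes
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, owners in placement.items():
    print(block_id, len(blocks[block_id]), owners)
```

Keeping several copies of each block on different nodes is what lets a cluster of commodity machines tolerate the loss of a single node.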
Oracle BDD Technical Innovation on Hadoop
Key Features and Functionality:
Find
• Access a rich, interactive catalog of all data in Hadoop
• Use familiar search and guided navigation to find information quickly
• See data set summaries, user annotations and recommendations
• Provision personal and enterprise data to Hadoop via self-service
Explore
• Visualize all attributes by type
• Sort attributes by information potential
• Assess attribute statistics, data quality and outliers
• Use a scratch pad to uncover correlations between attributes
Transform
• Get the data ready for analytics via intuitive, user-driven data wrangling
• Leverage an extensive library of data transformations and enrichments
• Preview results, undo, commit and replay transforms
• Test on sample data in memory then apply to full data set in Hadoop
Discover
• Join and blend data for deeper perspectives
• Compose project pages via drag and drop
• Use powerful search and guided navigation to ask questions
• See new patterns in rich, interactive data visualizations
Share
• Share projects, bookmarks and snapshots with others
• Build galleries and tell Big Data stories
• Collaborate and iterate as a team
• Publish blended data to HDFS for leverage in other tools
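The Transform workflow listed above (preview on a sample, undo, commit, then replay on the full data set) can be sketched generically. This is not Oracle BDD's API: the TransformLog class and its method names are hypothetical, and the "full data set" here is just a short in-memory list.

```python
import random


class TransformLog:
    """Hypothetical record of committed transform steps that can be replayed."""

    def __init__(self):
        self.steps = []

    def preview(self, sample, step):
        """Show what a step would do on a sample without committing it."""
        return [step(record) for record in sample]

    def commit(self, step):
        self.steps.append(step)

    def undo(self):
        """Drop the most recently committed step, if any."""
        if self.steps:
            self.steps.pop()

    def replay(self, records):
        """Apply every committed step, in order, to the given records."""
        out = list(records)
        for step in self.steps:
            out = [step(record) for record in out]
        return out


full_data = [" Alice ", "BOB", " carol"]          # stand-in for data in Hadoop
sample = random.Random(0).sample(full_data, 2)    # small in-memory sample

log = TransformLog()
previewed = log.preview(sample, str.strip)  # inspect before committing
log.commit(str.strip)
log.commit(str.lower)
log.commit(str.upper)
log.undo()                                  # change of mind: drop the last step
result = log.replay(full_data)
print(result)
```

Recording steps rather than results is what makes it possible to test cheaply on a sample and then apply the same sequence to the full data set.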
Big Data Value Assessment
Descriptive analytics looks at past performance and explains it by mining
historical data for the reasons behind past success or failure. This is
traditional BI work.
Predictive analytics answers the question of what will happen. Here,
historical performance data is combined with rules, algorithms, and external
data to determine the probable future outcome of an event or the likelihood
of a situation occurring.
Prescriptive analytics not only anticipates what will happen and when it will
happen, but also why it will happen, and it suggests decision options for
acting on that prediction.
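The three levels of analytics can be contrasted on one small data set. This is a minimal sketch under stated assumptions: the monthly sales figures and stock level are invented, the prediction is a plain least-squares trend line, and the prescriptive step is reduced to a single decision rule chosen for illustration.

```python
from statistics import mean

monthly_sales = [100.0, 110.0, 120.0, 130.0]  # hypothetical history

# Descriptive: summarize what already happened.
average = mean(monthly_sales)

# Predictive: fit a least-squares line and extrapolate one month ahead.
xs = list(range(len(monthly_sales)))
x_bar, y_bar = mean(xs), mean(monthly_sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, monthly_sales)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar
forecast = intercept + slope * len(monthly_sales)

# Prescriptive: turn the forecast into a recommended action (assumed rule).
stock_on_hand = 125.0  # hypothetical inventory level
action = "reorder" if forecast > stock_on_hand else "hold"

print(average, forecast, action)
```

Each step builds on the previous one: the summary describes the past, the trend line projects it forward, and the rule converts the projection into a decision.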
[Figure: analytics levels from Basic Analytics to Advanced Analytics: Descriptive, Predictive, Prescriptive]