Running Together in a Retail Environment
1
Author: Douglas Bernardini
Big Data Platform
2
Big Data Platform
3
A collection of Hadoop and Apache solutions running together, fully integrated.
Open source: Apache Software Foundation.
Works across component technologies and integrates with pre-existing EDW, RDBMS and MPP systems.
Runs on Linux and Windows.
Authentication, authorization and data protection.
Native integration with major BI/analytics developers and vendors.
HDP Platform overview
Integration – real-time ingest: Flume, Storm; batch integration: Sqoop
Data Management – processing: YARN; storage: HDFS
Data Access – script/ETL: Pig; process: MapReduce; SQL-like: Hive; online: HBase; in-memory: Spark
Hortonworks Data Platform (HDP)
Big Data Platform
4
Scalable: Stores and distributes very large data sets across hundreds or thousands of servers operating in parallel, handling thousands of terabytes of data.
Cost-effective: The savings are staggering, offering computing and storage capabilities for hundreds of dollars per terabyte.
Flexible: Can be used for a wide variety of purposes, such as log processing, recommendation systems, data warehousing, marketing campaign analysis and fraud detection.
Fast: Able to efficiently process terabytes of data in just minutes, and petabytes in hours.
Resilient to failure: Data in individual nodes is replicated to other nodes in the cluster, so processing continues in the event of a node failure.
Hadoop Technology Advantages & Profile
Source – External: in almost all cases, data comes from outside the corporation (social networks, suppliers).
Size – Big: normally tens to hundreds of terabytes, up to petabyte scale.
Structure – Not structured: data is not separated into columns/rows and has no schema.
Data Management
5
Stores data across multiple clusters and servers
NameNode and DataNodes
Large volume: 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks;
Minimal data motion: Hadoop moves compute processes to the data on HDFS, not the other way around; moving computation is cheaper than moving data;
Dynamic diagnosis: Dynamically diagnoses the health of the file system and rebalances the data across nodes;
Rollback: Allows operators to bring back the previous version of HDFS after an upgrade;
Node redundancy: Supports high availability (HA);
Storage
Hadoop Distributed File System (HDFS)
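A minimal sketch of the NameNode/DataNode interaction from a client's point of view, using pyarrow's HDFS binding. It assumes pyarrow and a local libhdfs/Hadoop client configuration are available; the host name, path and replication factor are placeholder values.

```python
# Write and read a file on HDFS: the NameNode resolves the path, the
# blocks land on DataNodes, replicated 3x so a node failure loses no data.
# "namenode.example.com" is a placeholder; requires pyarrow + libhdfs.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode.example.com", port=8020, replication=3)

with hdfs.open_output_stream("/retail/pos/sales.csv") as out:
    out.write(b"store_id,sku,amount\n001,4711,19.90\n")

with hdfs.open_input_stream("/retail/pos/sales.csv") as src:
    print(src.read().decode())
```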
Data Management
6
Manages cluster resources and job scheduling
Multi-tenancy: Multiple access engines use Hadoop as the common standard, so batch, interactive and real-time engines can simultaneously access the same data set.
Cluster utilization: Dynamic allocation of cluster resources improves utilization over the more static MapReduce rules used in early versions.
Scalability: Data center processing power continues to expand rapidly. ResourceManager scheduling lets clusters expand to thousands of nodes managing petabytes of data.
Processing
Hadoop YARN
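A minimal sketch of inspecting that dynamic allocation through the YARN ResourceManager REST API (the `/ws/v1/cluster/metrics` endpoint on the ResourceManager's web port). The host name is a placeholder.

```python
# Query cluster utilization from the ResourceManager REST API.
# "resourcemanager.example.com" is a placeholder; 8088 is the default port.
import requests

resp = requests.get(
    "http://resourcemanager.example.com:8088/ws/v1/cluster/metrics")
m = resp.json()["clusterMetrics"]

# YARN reports memory actually granted to containers vs. the cluster total.
print(f"apps running : {m['appsRunning']}")
print(f"memory in use: {m['allocatedMB']} / {m['totalMB']} MB")
print(f"active nodes : {m['activeNodes']}")
```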
Data Access
7
Processes data
The Map function: The InputFormat divides input data into ranges (splits), and a map task is created for each range in the input.
The JobTracker distributes those tasks to the worker nodes. The output of each map task is partitioned into a group of key-value pairs for each reducer.
The Reduce function: Collects the various results and combines them to answer the larger problem that the master node needs to solve.
Each reducer collects the data from all of the maps for its keys and combines it to solve the problem.
Batch
MapReduce
Non-structured data
MapReduce data analysis
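A minimal Hadoop Streaming sketch of the two roles, using the classic word count. In a real job, `mapper()` and `reducer()` would live in two separate scripts passed to the hadoop-streaming JAR; the framework sorts the map output by key before the reduce phase.

```python
# Word count as Hadoop Streaming scripts (both roles shown in one file).
import sys

def mapper():
    # Emit one key<TAB>value pair per word; the framework partitions by key.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # All values for a key arrive consecutively (input is sorted by key);
    # combine them into the final count per word.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```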
Data Integration & Governance
8
High volume data ingestion
Stream data: Ingests streaming data from multiple sources into Hadoop for storage and analysis.
Guaranteed data delivery: Channel-based transactions guarantee reliable message delivery. When a message moves from one agent to another, two transactions are started: one on the agent that delivers the event and one on the agent that receives it. This ensures guaranteed-delivery semantics.
Scales horizontally: To ingest new data streams and additional volume.
Real-time Ingest
Apache Flume
Non-structured data
Agent nodes → Collector nodes → HDFS storage area
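A minimal sketch of pushing events into that agent-to-HDFS pipeline, assuming a Flume agent whose source is configured as type `http` with the default JSONHandler (which expects a JSON array of `{headers, body}` objects). Host and port are placeholders.

```python
# POST two clickstream events to a Flume HTTP source; the agent's channel
# transaction commits them toward the collector and HDFS sink.
# "flume-agent.example.com:44444" is a placeholder address.
import json
import requests

events = [
    {"headers": {"source": "webshop"},
     "body": "user=42 action=add_to_cart sku=4711"},
    {"headers": {"source": "webshop"},
     "body": "user=42 action=checkout total=19.90"},
]

resp = requests.post("http://flume-agent.example.com:44444",
                     data=json.dumps(events),
                     headers={"Content-Type": "application/json"})
resp.raise_for_status()  # 200 OK means the channel transaction committed
```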
Data Integration & Governance
9
Very-large volume data ingestion
Fast – benchmarked at over a million messages/records per second per node
Scalable – with parallel calculations that run across a cluster of machines
Fault-tolerant – when workers die, Storm automatically restarts them; if a node dies, the worker is restarted on another node
Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once; messages are replayed only when there are failures
Real-time Ingest
Apache STORM
Data Integration & Governance
10
Connects to traditional RDBMS
Data imports: Moves certain data from external stores and EDWs into Hadoop to optimize cost-effectiveness of combined data storage and processing
Improvements: Compression, indexing for query performance
Parallel data transfer: For faster performance and optimal system utilization
Fast data copies: From external systems into Hadoop
Load balancing: Mitigates excessive storage and processing loads to other systems
Efficient data analysis: Improves efficiency of data analysis by combining structured data with unstructured data in a schema-on-read data lake
Batch Integration
Apache Sqoop
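A minimal sketch of driving such a batch import from Python by wrapping the Sqoop CLI. The JDBC connection string, credentials, table and target directory are placeholders; `--num-mappers` controls the parallel data transfer described above.

```python
# Launch a parallel Sqoop import from a relational store into HDFS.
# All connection details and paths below are illustrative.
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://erp-db.example.com/sales",
    "--username", "etl_user",
    "--password-file", "/user/etl/.pw",
    "--table", "pos_transactions",
    "--target-dir", "/retail/staging/pos_transactions",
    "--num-mappers", "4",   # parallel data transfer
    "--compress",           # compression for cheaper storage
]
subprocess.run(cmd, check=True)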
Data Access
11
Easy programming language
Easily programmed: Complex tasks involving interrelated data transformations can be simplified and encoded as data flow sequences. Pig programs accomplish huge tasks, but they are easy to write and maintain
Iterative data processing: Extract-transform-load (ETL) data pipelines. Research tools on raw data.
Extensible: Pig users can create custom functions to meet their particular processing requirements
Self-optimizing: Because the system automatically optimizes execution of Pig jobs, the user can focus on semantics.
Script
Pig
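A minimal sketch of such a data-flow sequence: a short Pig Latin script (counting page hits per URL) written to a file and submitted through the `pig` CLI. Paths and field names are illustrative.

```python
# Encode an ETL flow as a Pig script and run it; Pig compiles the flow
# into MapReduce jobs and optimizes execution automatically.
import subprocess

PIG_SCRIPT = """
logs   = LOAD '/retail/weblogs' USING PigStorage('\\t')
         AS (user:chararray, url:chararray);
by_url = GROUP logs BY url;
hits   = FOREACH by_url GENERATE group AS url, COUNT(logs) AS n;
STORE hits INTO '/retail/out/url_hits';
"""

with open("url_hits.pig", "w") as f:
    f.write(PIG_SCRIPT)

subprocess.run(["pig", "-f", "url_hits.pig"], check=True)
```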
Data Access
12
SQL like tools
Familiar: Query data with a SQL-based language. Hive tables are similar to tables in a relational database, and data units are organized in a taxonomy from larger to more granular units.
Fast: Interactive response times, even over huge datasets.
Partitioned: Each table can be sub-divided into partitions that determine how data is distributed within sub-directories of the table directory.
Scalable and extensible: As data variety and volume grow, more commodity machines can be added without a corresponding reduction in performance.
Uses JobTracker (MapReduce) functionalities
SQL
Hive
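A minimal sketch of the familiar SQL interface against a partitioned table, here via a Spark session with Hive support. The database, table and column names (including the `sale_date` partition column) are illustrative.

```python
# Query a partitioned Hive table with plain SQL. Filtering on the
# partition column lets the engine read only the matching sub-directories.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("retail-hive")
         .enableHiveSupport()
         .getOrCreate())

top_skus = spark.sql("""
    SELECT sku, SUM(amount) AS revenue
    FROM retail.pos_transactions
    WHERE sale_date = '2016-03-01'
    GROUP BY sku
    ORDER BY revenue DESC
    LIMIT 10
""")
top_skus.show()
```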
Data Access
13
NoSQL tool with a SQL-like command interface
Apache HBase is an open-source NoSQL database that provides real-time read/write access to large datasets.
Scales linearly to handle huge data sets with billions of rows and millions of columns
Easily combines data sources that use a wide variety of different structures and schemas.
Natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
A natural choice for storing semi-structured data such as log data.
OnLine
Hbase
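A minimal sketch of that real-time read/write access using the happybase client, which talks to an HBase Thrift server. The host, table name, row-key scheme and column family are illustrative.

```python
# Point write and point read against HBase; row lookups by key stay fast
# even on tables with billions of rows.
import happybase

conn = happybase.Connection("hbase-thrift.example.com")
table = conn.table("weblog")

# Row key = user id + timestamp; columns live in the "log" column family.
table.put(b"user42#20160301T101500",
          {b"log:url": b"/products/4711", b"log:referrer": b"google"})

print(table.row(b"user42#20160301T101500"))
```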
Data Access
14
Brings fast, in-memory data processing to Hadoop.
Elegant and expressive development APIs in Scala, Java, R, and Python.
Allows data workers to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets.
Designed for data science; its abstractions make data science easier.
Data scientists commonly use machine learning – a set of techniques and algorithms that can learn from data. These algorithms are often iterative, which is exactly where in-memory processing pays off.
InMemory
Spark
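A minimal sketch of why in-memory matters for iterative machine learning, using PySpark's ML library. The dataset path and feature columns are illustrative.

```python
# Cache a feature table in memory, then fit k-means: each iteration
# revisits the same data, hitting RAM instead of re-reading from HDFS.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("retail-ml").getOrCreate()

df = spark.read.parquet("/retail/features/customers.parquet")
features = (VectorAssembler(inputCols=["visits", "basket_value"],
                            outputCol="features")
            .transform(df)
            .cache())  # keep the dataset in memory across iterations

model = KMeans(k=5, seed=1).fit(features)
print(model.clusterCenters())
```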
ERP/DW-BI Platform
15
16
Fast In-memory Database
Traditional DBMS features: SQL interface, transactional isolation and recovery (ACID).
Parallel data flow model: calculations can be executed in parallel, distributed across hosts.
Latest-generation data storage:
Columnar and row-based storage
Near elimination of indexes
High data compression
Automatic recovery: From memory errors without system reboot.
Native tools: Predictive Analysis Library & Analytical and Special Interfaces
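A minimal sketch of that traditional-DBMS face of HANA: plain SQL over SAP's `hdbcli` Python driver. Host, port, credentials and the table definition are placeholders.

```python
# Connect to SAP HANA and work with a COLUMN table: values are stored per
# column, compressed, and scanned in parallel, which is why most indexes
# become unnecessary. All connection details are illustrative.
from hdbcli import dbapi

conn = dbapi.connect(address="hana.example.com", port=30015,
                     user="ANALYST", password="***")
cur = conn.cursor()

cur.execute("CREATE COLUMN TABLE sales (sku INT, amount DECIMAL(10,2))")
cur.execute("SELECT sku, SUM(amount) FROM sales GROUP BY sku")
print(cur.fetchall())
conn.close()
```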
SAP/HANA Architecture
ERP/DW-BI Platform
17
100x faster
Optimization: InfoCubes and DataStore Objects (DSOs) with better performance.
Faster remodeling: Improved, lean data models; simplified data modeling and fewer materialized layers.
Datamarts: Integrated, with embedded flexibility; OLAP and OLTP can be executed in one system.
Increased flexibility: Optimized Layered Scalable Architecture; aggregates and cubes are no longer required (optional).
Improved response times: For existing transactions and entire business processes, through the general performance improvement of the underlying HANA database.
SAP/Hana Technology Advantages & Profile
Size – Big: not considered BIG for the web 2.0 era; tens of terabytes, not reaching petabytes.
Structure – Structured: separated into columns/rows, with a schema.
Source – Internal: in almost all cases, data comes from INSIDE the corporation (ERP/CRM/SCM).
ERP/DW-BI Platform
SAP/Hana Evolution
Starting point: The SAP landscape consists of SAP ERP running on a relational database, connected to an OLAP engine (e.g. SAP BW) and perhaps using an analytics front end such as SAP BusinessObjects (BOBJ)
OLTP (SAP/ECC) → ETL → OLAP (SAP/BW) → Analytics (SAP/BOBJ)
ERP/DW-BI Platform
Introducing HANA in parallel: Install and run the in-memory engine (HANA) TOGETHER with the traditional SAP instances
Two BW extractors running at the same time, exporting the same data
Key factor: a real performance comparison on production data
OLTP (SAP/ECC) → ETL → OLAP (SAP/BW) → Analytics (SAP/BOBJ)
OLTP (SAP/ECC) → 2nd ETL → SAP/HANA → Analytics (SAP/BOBJ)
SAP/Hana Evolution
BW database upgrade: Re-created traditional-style BI in memory
ERP/DW-BI Platform
OLTP (SAP/ECC) → ETL → OLAP (SAP/BW on SAP/HANA) → Analytics (SAP/BOBJ)
ERP/BI full database upgrade: Eliminate the traditional database and run both instances in-memory, using non-materialized views
OLTP (SAP/ECC) + OLAP (BI 2.0), both on SAP/HANA → Analytics (SAP/BOBJ)
20
Sizing on SAP/HANA
ERP/DW-BI Platform
• Memory
  • Traditional sizing is driven by CPU performance; SAP HANA sizing is driven by memory.
  • Master and transactional data volumes determine the main memory needed.
  • Main memory is required for: storing the business data; temporary memory space; supporting complex queries; buffers & caches.
• CPU
  • Behaves differently with SAP HANA compared to traditional databases.
  • Queries: complex, and executed at maximum speed.
• Disk size
  • Disk storage space is still required.
  • Preserves database information if the system shuts down (either intentionally or due to a power loss).
  • Data changes: periodically copied to disk (ensures a full image of the business data on disk).
  • Logging mechanism: enables system recovery.
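An illustrative back-of-the-envelope sketch of the memory/disk relationship described above. The compression factor, the 2x work-space rule and the disk multiplier are assumed rules of thumb, not an official SAP sizing formula; real projects should rely on SAP's sizing reports.

```python
# Simplified HANA sizing arithmetic with assumed parameters only.
source_data_tb = 19    # uncompressed business data in the source DB
compression = 5        # assumed columnar compression factor
workspace_factor = 2   # temporary space for queries, buffers, caches

ram_tb = source_data_tb / compression * workspace_factor
disk_tb = source_data_tb / compression * 4  # assumed persistence + logs

print(f"estimated RAM : {ram_tb:.1f} TB")
print(f"estimated disk: {disk_tb:.1f} TB")
```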
21
SAP/HANA on VM
ERP/DW-BI Platform
• SAP HANA on vSphere is fully supported.
• Combining SAP HANA and vSphere provides additional benefits with regard to deployment and availability.
• Some remaining customer slots for SAP-on-SAP-HANA controlled-availability proofs of concept.
• SAP HANA monitoring via the Blue Medora plug-in:
  • Monitor memory and vCPU utilization
  • Add/delete resources
  • Underutilized – deploy more SAP HANA instances
  • Over-utilized – release SAP HANA resources
  • Workload management
  • Determine consolidation ratios
• Amazon AWS: SAP partner
  • SAP BW on HANA trial – PoC
  • The AWS server provides a ready HANA instance
  • Ready to go in 30 minutes
• OLAP: BW on HANA, or any other data warehouse application with predominantly OLAP workloads, including data marts running many complex queries
• OLTP: any transactional application, such as Business Suite on HANA, predominantly running simple queries or CRUD operations
22
ERP/DW-BI Platform
SAP/Hana on Cloud
• Storage replication: The storage itself replicates all data to another location, within one data center or between several data centers. The technology is hardware-vendor-specific, and multiple concepts are available on the market.
• System replication: SAP HANA replicates all data to another location, within one data center or between several data centers. The technology is independent of hardware vendor concepts and reusable with a changing infrastructure.
23
ERP/DW-BI Platform
Disaster Recovery
Host Auto-Failover
• Standby mode: no data, requests or queries.
• When an active (worker) host fails, a standby host automatically takes its place.
• Since the standby host can take over operations from any of the primary hosts, it needs access to all of the database volumes.
• Once repaired, the failed host can be rejoined to the system as the new standby host, re-establishing the failure-recovery capability.
24
ERP/DW-BI Platform
SAP/HANA High Availability
ERP/DW-BI & Big Data Platform
25
ERP/DW-BI & Big Data Platform
26
Non-structured data sources and structured data sources (ERP/CRM/SCM) feed the platform:
Hortonworks Data Platform – Integration (real-time ingest: Flume, Storm; batch integration: Sqoop); Data Management (processing: YARN; storage: HDFS); Data Access (script/ETL: Pig; process: MapReduce; SQL-like: Hive; online: HBase; in-memory: Spark)
SAP HANA – data repositories; OLAP engine; predictive engine; spatial engine; application logic & rendering
Analytics
Architecture Proposal
ERP/DW-BI & Big Data Platform
27
Business Case: CRM/Retail
Internal structured data sources
• Point-of-sale data – Captured when the customer makes purchases, either in-store or on the company's e-commerce site. (4 TB)
• Inventory and stock information – Which products are in stock at which locations/promotions. (7 TB)
• CRM data – From all the interactions the customer has had with the company's support site. (8 TB)
• Total data size: 19 TB
External unstructured data sources
• Social media data – Sentiment analysis of the customer's social media activity, such as Facebook. (70 TB)
• Historical web log information – Record of the customer's past browsing behavior on the company's web site. (30 TB)
• Geographic customer behavior – Origin/destination of potential customers near stores. (20 TB)
• Total data size: 120 TB
ERP/DW-BI & Big Data Platform
28
Business Case: Data Process