Big Data Analytics - Accelerated stream-horizon.com

StreamHorizon and bigdata overview

DESCRIPTION

StreamHorizon adaptiveETL

Page 1: StreamHorizon and bigdata overview

Big Data Analytics - Accelerated

stream-horizon.com

Page 2: StreamHorizon and bigdata overview

StreamHorizon & Big Data

Integrates into your Data Processing Pipeline…

• Seamlessly integrates at any point of your data processing pipeline

• Implements generic input/output HDFS connectivity.

• Enables you to implement your own, customised input/output HDFS connectivity.

• Processes data from heterogeneous sources such as:

• Spark

• Storm

• Kafka

• Netty

• TCP Streams

• Local File System

Accelerates your clients…

• Reduces Network data congestion

• Reduces query latency of Impala, Hive or any other Massively Parallel Processing (MPP) SQL query engine.

And more…

• Portable Across Heterogeneous Hardware and Software Platforms

• Portable from one platform to another

• Impala

• Hive

• HBase

• Any Other…

Page 3: StreamHorizon and bigdata overview

StreamHorizon – Big Data Processing Pipeline

Page 4: StreamHorizon and bigdata overview

StreamHorizon - Flavours – Big Data Processing

Storm - Reactive, Fast, Real Time Processing

• Guaranteed data processing

• Guarantees no data loss

• Real-time processing

• Horizontal scalability

• Fault-tolerance

• Stateless nodes

• Open Source

Kafka

• Designed for processing real-time activity stream data (metrics, KPIs, collections, social media streams)

• A distributed Publish-Subscribe messaging system for Big Data

• Acts as Producer, Broker, Consumer of message topics

• Persists messages (has ability to rewind)

• Initially developed by LinkedIn (now an Apache project)
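
The "persists messages (has ability to rewind)" point can be illustrated with a minimal sketch. This is not Kafka's client API: `TopicLog` and `Consumer` are hypothetical in-memory stand-ins, assuming only that messages are appended to a log and that a consumer tracks its own read offset.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TopicLog:
    """Hypothetical in-memory stand-in for one persisted topic log."""
    messages: List[str] = field(default_factory=list)

    def publish(self, message: str) -> int:
        """Append a message and return the offset it was stored at."""
        self.messages.append(message)
        return len(self.messages) - 1


@dataclass
class Consumer:
    """Hypothetical consumer that tracks its own position in the log."""
    log: TopicLog
    offset: int = 0

    def poll(self) -> List[str]:
        """Return every message from the current offset, then advance."""
        batch = self.log.messages[self.offset:]
        self.offset = len(self.log.messages)
        return batch

    def rewind(self, offset: int) -> None:
        """Move back to an earlier offset so old messages can be re-read."""
        if not 0 <= offset <= len(self.log.messages):
            raise ValueError("offset out of range")
        self.offset = offset
```

Because messages stay in the log after they are read, a consumer that calls `rewind(1)` after consuming three messages will receive the second and third messages again on its next `poll()`.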

Hadoop – Big Batch Oriented Processing

• Batch processing

• Jobs run to completion

• Stateful nodes

• Scalable

• Guarantees no data loss

• Open Source

Seamless Integration

Page 5: StreamHorizon and bigdata overview

Big Data & StreamHorizon – Data Persistence (Example: Finance Industry)

Data Calculation, Data Processing & HDFS Persistence

Big Data – DataNode Cluster

[Diagram: a cluster of DataNodes with an HDFS NameNode. Legend: StreamHorizon instance (daemon) running Data Processing Tasks; QL - Data Source (Quant Library Instance); HDFS storage on each DataNode.]

Page 6: StreamHorizon and bigdata overview

HDFS vs. Tier 2 Storage – Performance Benchmark

Big Data (HDFS) vs. Commodity Tier 2 Storage Benchmarks

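Non-HDFS filesystem (Single Server)

• File to File: 0.95 million records/sec

• Stream to File: 1.04 million records/sec

HDFS filesystem (Single DataNode)

• File to File: 1.1 million records/sec per node

• Stream to File: 1.2 million records/sec per node

HDFS filesystem (Cluster of 10 DataNodes)

• File to File: 11 million records/sec across the cluster (1.1 million per node)

• Stream to File: 12 million records/sec across the cluster (1.2 million per node)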
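
The cluster figures in the benchmark are ten times the single-DataNode figures, which suggests linear scaling. A small sketch of that arithmetic, assuming perfectly linear scaling (the function name `cluster_throughput` is illustrative and not part of StreamHorizon):

```python
def cluster_throughput(per_node_records_per_sec: int, data_nodes: int) -> int:
    """Aggregate throughput, assuming every DataNode sustains the same rate."""
    if data_nodes < 1:
        raise ValueError("a cluster needs at least one DataNode")
    return per_node_records_per_sec * data_nodes


# Per-node rates taken from the benchmark slide.
print(cluster_throughput(1_100_000, 10))  # 11000000
print(cluster_throughput(1_200_000, 10))  # 12000000
```

Real clusters rarely scale perfectly linearly, so treat the result as an upper bound for capacity planning rather than a measured value.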

Page 7: StreamHorizon and bigdata overview

StreamHorizon & Big Data – Advanced Concepts

Streaming Data Aggregations – SDA

• SDA is an integral part of every StreamHorizon instance (daemon)

• StreamHorizon SDA aggregates data on the fly, as each record is processed

• Aggregated data is directly persisted to HDFS (or any alternative Data Target)
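
As an illustration of aggregating on the fly, here is a minimal sketch. It is not StreamHorizon's implementation: the record fields (book, currency, amount) and the function `aggregate_stream` are assumptions chosen to fit the finance example used elsewhere in this deck.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (book, currency, amount).
Record = Tuple[str, str, float]


def aggregate_stream(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Sum amounts per (book, currency) in a single pass over the stream.

    Only the running totals are held in memory, so the raw records never
    need to be stored before the aggregate is written to the data target.
    """
    totals: Dict[Tuple[str, str], float] = defaultdict(float)
    for book, currency, amount in records:
        totals[(book, currency)] += amount
    return dict(totals)


stream = [
    ("rates", "USD", 100.0),
    ("rates", "USD", 50.0),
    ("fx", "EUR", 25.0),
]
print(aggregate_stream(stream))
```

Three input records collapse into two aggregated rows here; the persisted volume shrinks in proportion to how many records share a grouping key.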

Page 8: StreamHorizon and bigdata overview

StreamHorizon ‘Accelerator Method’ - persisting your Big Data (faster)

Accelerate Hadoop and/or HDFS persistence by writing less:

• StreamHorizon enables you to persist only transactional data (traditionally known as ‘Fact table data’) and to omit repetitive, low cardinality dimensional data from the DataNodes across your Big Data cluster

• The client front end tool (or any end user application, such as Excel) simply merges Dimensional data with the Fact (transactional) data brought from your Big Data cluster.

Reduced Client & Network footprint

• The StreamHorizon Accelerator Method reduces network traffic between your clients and the Big Data cluster to ~10% of its nominal size, because low cardinality (dimensional) data is not shipped over the network
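
The client-side merge described above can be sketched as follows. This is a minimal illustration, not StreamHorizon code: the field names (`currency_id`, `amount`, `currency`, `region`) and the function `merge_facts_with_dimensions` are hypothetical.

```python
from typing import Any, Dict, List

# Dimensional data (low cardinality) is assumed to be held on the client.
CURRENCY_DIMENSION: Dict[int, Dict[str, str]] = {
    1: {"currency": "USD", "region": "Americas"},
    2: {"currency": "EUR", "region": "Europe"},
}


def merge_facts_with_dimensions(
    facts: List[Dict[str, Any]],
    dimension: Dict[int, Dict[str, str]],
) -> List[Dict[str, Any]]:
    """Attach dimensional attributes to each fact row by its surrogate key."""
    return [{**fact, **dimension[fact["currency_id"]]} for fact in facts]


# Fact rows as they might arrive from the cluster: keys and measures only.
facts_from_cluster = [
    {"currency_id": 1, "amount": 100.0},
    {"currency_id": 2, "amount": 250.0},
    {"currency_id": 1, "amount": 75.0},
]

for row in merge_facts_with_dimensions(facts_from_cluster, CURRENCY_DIMENSION):
    print(row)
```

Each fact row carries only a small key over the network; the descriptive attributes are looked up locally, which is where the traffic saving comes from.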

Page 9: StreamHorizon and bigdata overview

Big Data & StreamHorizon – Data Retrieval (Example: Finance Industry)

[Diagram: two panels, "Nominal Network Congestion" and "Network Congestion – StreamHorizon ‘Accelerator Method’". Each panel shows User Groups, a NameNode and DataNodes running HDFS, Impala and Hive. Data flow labels: Dimensional Data (Low Cardinality), Dimensional and Fact Data, Fact Data (High Cardinality).]

Page 10: StreamHorizon and bigdata overview

StreamHorizon beneficial to Big Data filesystem (HDFS)

• StreamHorizon accelerates Hadoop processing. MapReduce has two main disadvantages: processing is slow and it is inconvenient to use.

• Hive + StreamHorizon SDA outperforms Impala. A Hive-based reporting stack usually has higher query latency than an Impala stack; StreamHorizon improves this significantly.

• Due to aggregation, a single HDFS file contains more of your business data. Benefits are:

• Hive queries are more effective

• Reduces number of I/O requests

• A single I/O request executes as a sequential read, in comparison to the default I/O footprint

• Reduces HDFS replication latency

• Moves only high cardinality (Fact) data over the network, not low cardinality (Dimensional) data. This reduces network traffic to ~10% of its nominal value.

• StreamHorizon SDA accelerates heavy data operations like joins etc.

Page 11: StreamHorizon and bigdata overview

Client Queries - accelerated by StreamHorizon

StreamHorizon accelerates ad-hoc queries for Hive, Impala or any other MPP SQL Query Engine. This can be achieved with:

• StreamHorizon SDA (Streaming Data Aggregations)

• StreamHorizon ‘Accelerator Method’ data topology

StreamHorizon reduces memory pressure for Impala (or any other memory dependent data access component)

StreamHorizon reduces MapReduce processing latency for Hive (when utilizing StreamHorizon SDA)

Hive query latency is reduced by an order of magnitude (the gain is a function of how much your StreamHorizon SDA aggregations reduce data volume)

Page 12: StreamHorizon and bigdata overview

Streaming Data Aggregations – impact on Big Data Query Latency

Out of the Box

                        Impala          Hive
Query Latency           Medium-Low      High
Memory Footprint        Medium-Low      Medium-High
Processing Footprint    High            Nominal
Space Consumption       Medium          High

With Streaming Data Aggregations

                        Impala          Hive
Query Latency           Low             Low
Memory Footprint        Low             Medium-Low
Processing Footprint    Medium-Low      Nominal-Low
Space Consumption       Low             Low

Page 13: StreamHorizon and bigdata overview

Q&A stream-horizon.com