Building Distributed Data Streaming System


Ashish Tadose

Lead Software Engineer, Big Data Analytics

Agenda

• What is stream processing

• Streaming architecture

• Scalable Data Ingestion

• Real-time stream processing system

What is Stream Processing?

• Reactive Programming
• Streaming
• Server-Sent Events
• Change Data Capture
• Event Sourcing
• Complex Event Processing

In simple words, Streaming is…

Processing events in the order they occur
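As a minimal sketch of this idea, the Python snippet below handles events one at a time, in arrival order, and keeps a running count per event type. The event shape and the `process_stream` name are hypothetical and chosen only for illustration; they do not come from any particular streaming framework.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (sequence number, event type); hypothetical shape


def process_stream(events: Iterable[Event]) -> Iterator[Tuple[int, str, int]]:
    """Handle each event as it arrives and emit the running count for its type."""
    counts: Counter = Counter()
    for seq, kind in events:           # events are consumed in arrival order
        counts[kind] += 1
        yield seq, kind, counts[kind]  # a result is available after every event


events = [(1, "click"), (2, "view"), (3, "click")]
results = list(process_stream(events))
```

Because the function is a generator, each result is produced as soon as its event has been seen, without waiting for the input to end.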

Batch & Streaming processing

Batch processing: Data Generator → Ingestion → Distributed File System → Processing → Data Store

Stream data processing: Data Generator → Ingestion → Message Queue → Processing → Data Store

Lambda Architecture: Velocity & Volume

Stream path (velocity): Data Generator → Ingestion → Message Queue → Processing → Data Store

Batch path (volume): Data Generator → Ingestion → Distributed File System → Processing → Data Store
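One way to picture how the two paths of a Lambda architecture combine is the rough Python sketch below. It assumes a simple per-key count as the metric; `batch_view`, `SpeedView`, and `query` are hypothetical names used only for illustration, and in-memory lists stand in for the distributed file system and the message queue.

```python
from collections import Counter
from typing import Iterable


def batch_view(history: Iterable[str]) -> Counter:
    """Batch path (volume): recompute counts from the full stored history."""
    return Counter(history)


class SpeedView:
    """Stream path (velocity): update counts incrementally, one event at a time."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_event(self, key: str) -> None:
        self.counts[key] += 1


def query(batch: Counter, speed: SpeedView, key: str) -> int:
    """Answer a query by merging the batch result with the recent stream result."""
    return batch[key] + speed.counts[key]


stored = ["a", "b", "a"]  # stand-in for events already in the distributed file system
recent = ["a", "c"]       # stand-in for events that arrived after the last batch run

batch = batch_view(stored)
speed = SpeedView()
for key in recent:
    speed.on_event(key)

total_a = query(batch, speed, "a")
total_c = query(batch, speed, "c")
```

The batch view can be rebuilt from scratch at any time, while the stream view only needs to cover events that arrived since the last batch run.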

Streaming Ingestion Technologies

Ingestion Ecosystem

• Sources
  • Machine data
  • External stream & syslogs
• Data Collection
  • Flume
  • Kafka
  • Kinesis
  • Confluent

Flume

• Easier to set up
• Rich set of built-in tools
• No inherent support for data replication
• Nodes work in isolation
• Memory channel vs. file channel

Kinesis

Kafka

• http://kafka.apache.org/
• Originated at LinkedIn, open sourced in early 2011
• Implemented in Scala, some Java
• 9 core committers, plus ~20 contributors

Why is Kafka so fast?

• Fast writes:
  • While Kafka persists all data to disk, essentially all writes go to the page cache of the OS, i.e. RAM.
• Fast reads:
  • Very efficient to transfer data from page cache to a network socket
  • Linux: sendfile() system call
• Combination of the two = fast Kafka!
  • Example (Operations): On a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they will be serving data entirely from cache.

http://kafka.apache.org/documentation.html#persistence
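The file-to-socket transfer mentioned above can be tried out with Python's standard library. This is only an illustration of the general technique, not Kafka's own code: a temporary file stands in for a log segment, a local socket pair stands in for the broker-to-consumer connection, and the payload is made up. `socket.sendfile()` uses `os.sendfile()` where the platform provides it and otherwise falls back to ordinary `send()` calls.

```python
import socket
import tempfile

# Stand-in for the bytes of a log segment (hypothetical data).
payload = b"kafka-log-segment-bytes" * 100

with tempfile.TemporaryFile() as segment:
    segment.write(payload)
    segment.flush()
    segment.seek(0)

    # A connected socket pair stands in for the broker-to-consumer connection.
    broker_side, consumer_side = socket.socketpair()
    with broker_side, consumer_side:
        # Where os.sendfile() is available, the kernel moves file pages to the
        # socket directly, without copying them through user-space buffers.
        sent = broker_side.sendfile(segment)
        broker_side.shutdown(socket.SHUT_WR)  # signal end of data to the reader

        received = bytearray()
        while chunk := consumer_side.recv(4096):
            received.extend(chunk)
```

The payload is kept small so that it fits in the socket buffer and the example can run in a single thread.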

Flafka – Flume meets Kafka

Confluent - Centralized Ingestion with Kafka Pipeline

Stream Processing

Real-Time Stream Processing

• Processing system
  • Apache Storm
  • Apache Samza
  • Apache Spark (Streaming)
  • Project Apex - DataTorrent
• Storage
  • Hive, HDFS
  • HBase
  • MySQL
  • Custom
• Access
  • Depends on the data storage
  • Scalable query interface - Kafka

Streaming Design Patterns

• Micro batching
• Unpredictable incoming data
• Creating multiple streams
• Out of sequence events
• Stream joins
• Top N metrics
• External lookup
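Two of the patterns listed above, micro batching and Top N metrics, can be sketched together in a few lines of Python. This is a simplified illustration under stated assumptions: batches are cut by event count rather than by time interval (time-based batching is the more common choice in practice), and `micro_batches` and `top_n` are hypothetical helper names.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an event iterator into small batches of at most `size` events."""
    it = iter(events)
    while batch := list(islice(it, size)):
        yield batch


def top_n(events: Iterable[str], n: int, batch_size: int) -> List[Tuple[str, int]]:
    """Update counts one micro-batch at a time and return the N most frequent keys."""
    counts: Counter = Counter()
    for batch in micro_batches(events, batch_size):
        counts.update(batch)  # one state update per batch, not per event
    return counts.most_common(n)


stream = ["a", "b", "a", "c", "a", "b", "d"]
batches = list(micro_batches(stream, 3))
top2 = top_n(stream, n=2, batch_size=3)
```

Updating state once per batch trades a little latency for fewer state updates, which is the basic motivation behind micro batching.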

Thank You
