Building a Real-Time Data Pipeline with Spark, Kafka, and Python


Douglas Butler, Product Manager

massively parallel, lock-free, FAST distributed SQL database

in-memory and on-disk, ACID

JSON and geospatial, transactions and analytics

2 Minute Install

A Simple Pipeline

from pystreamliner.api import Extractor

class CustomExtractor(Extractor):
    def initialize(self, streaming_context, sql_context, config, interval, logger):
        logger.info("Initialized Extractor")

    def next(self, streaming_context, time, sql_context, config, interval, logger):
        # Emit ten single-column rows each batch interval
        rdd = streaming_context._sc.parallelize([[x] for x in range(10)])
        return sql_context.createDataFrame(rdd, ["number"])
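As a rough illustration of what the extractor above produces, here is a pure-Python sketch (no Spark or Streamliner cluster required) of the rows emitted per batch interval. The helper name `extract_batch` is hypothetical; only the row structure mirrors the `[[x] for x in range(10)]` and the `["number"]` column label in the real code.

    # Illustration only: mimics the per-batch output of CustomExtractor
    # without a running Spark cluster. extract_batch is a made-up helper.
    def extract_batch():
        # Same row shape the extractor parallelizes: one value per row.
        rows = [[x] for x in range(10)]
        # createDataFrame(rdd, ["number"]) would label the column "number".
        return [dict(zip(["number"], row)) for row in rows]

    batch = extract_batch()
    print(batch[0])    # {'number': 0}
    print(len(batch))  # 10

Each call stands in for one invocation of `next()`: Streamliner calls it once per batch interval and loads the returned DataFrame downstream.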

> memsql-ops pip install [package]

distributed cluster-wide

any Python package

bring your own

Real-time pipeline

Q & A time
