Streaming ETL for All

Joey Echeverria, Platform Technical Lead
San Francisco Hadoop Users Group, June 14th 2016
San Francisco, CA

© Rocana, Inc. All Rights Reserved.

Page 3: Streaming ETL for All

Context

Page 4: Streaming ETL for All

Joey

• Where I work: Rocana – Platform Technical Lead

• Where I used to work: Cloudera (’11-’15), NSA

• Distributed systems, security, data processing, big data

Page 6: Streaming ETL for All

History

Page 7: Streaming ETL for All

“Legacy” data architecture

[Diagram. Components shown: Data Producers; Flume/Sqoop; HDFS (Avro/Parquet files); MapReduce, Spark, Impala; Visualization/Query]

Page 8: Streaming ETL for All

Stream data architecture

[Diagram. Components shown: Data Producers; Kafka (Avro serialized records); Storm, Flink, Spark Streaming; Real-time Visualization; Kafka Consumers; HDFS (Avro/Parquet files)]

Page 10: Streaming ETL for All

Stream processing

A primer

Page 13: Streaming ETL for All

Stream processing

• Filter

• Extract

• Project

• Aggregate

• Join

• Model

• Data transformation
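The first few operations in this list can be sketched with plain Python generators. This is only an illustration of the ideas, not the API of any framework mentioned in the talk; the record fields (`user`, `status`, `size`) and function names are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Record = Dict[str, object]


def filter_records(records: Iterable[Record], status: str) -> Iterator[Record]:
    """Filter: pass through only records with the given status."""
    return (r for r in records if r["status"] == status)


def project(records: Iterable[Record], fields: Tuple[str, ...]) -> Iterator[Record]:
    """Project: keep only the named fields of each record."""
    return ({f: r[f] for f in fields} for r in records)


def count_by(records: Iterable[Record], key: str) -> Counter:
    """Aggregate: count records per value of `key` (consumes the stream)."""
    return Counter(r[key] for r in records)


# Hypothetical input records, standing in for an unbounded stream.
stream = [
    {"user": "laura", "status": "200", "size": 2326},
    {"user": "joey", "status": "404", "size": 0},
    {"user": "laura", "status": "200", "size": 512},
]

ok_users = list(project(filter_records(stream, "200"), ("user",)))
hits = count_by(filter_records(stream, "200"), "user")
```

Filter and project work one record at a time, so they compose lazily; the aggregate has to see many records before it can answer, which is where windows come in for unbounded streams.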

Page 14: Streaming ETL for All

Apache Storm

• "Distributed real-time computation system"

• Applications packaged into topologies (think MapReduce job)

• Topologies operate over streams of tuples

• Spout: source of a stream

• Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions

Page 15: Streaming ETL for All

Apache Spark

• Supports batch and stream processing

• Continuous stream of records discretized into a DStream

• DStream: a sequence of RDDs (batches of records)

• Micro-batch
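As a rough sketch of what "discretized" means here, the function below chops a continuous iterator into small batches. It is plain Python, not the Spark API, and it batches by record count so the example is deterministic; Spark Streaming itself cuts batches by time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of up to `batch_size` records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


batches = list(micro_batches(range(5), 2))
```

Each batch can then be processed with ordinary batch-style code, which is the appeal of the micro-batch model.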

Page 16: Streaming ETL for All

Apache Flink

• Supports batch and stream processing

• DataStream: unbounded collection of records

• Operations can apply to individual records or windows of records

• Supports record-at-a-time processing (like Storm)
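To make the difference between per-record and windowed operations concrete, here is a small plain-Python sketch (not the Flink API). The `(timestamp_ms, payload)` tuple layout and the function names are assumptions for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Event = Tuple[int, str]  # (timestamp in ms, payload); hypothetical layout


def record_at_a_time(events: Iterable[Event], fn: Callable[[Event], Event]) -> Iterator[Event]:
    """Apply `fn` to each record as it arrives, with no buffering."""
    return (fn(e) for e in events)


def tumbling_window_counts(events: Iterable[Event], window_ms: int) -> Dict[int, int]:
    """Count events per fixed, non-overlapping window, keyed by window start."""
    counts: Dict[int, int] = {}
    for ts, _ in events:
        start = ts - ts % window_ms
        counts[start] = counts.get(start, 0) + 1
    return counts


events = [(0, "a"), (999, "b"), (1000, "c"), (2500, "d")]
upper = list(record_at_a_time(events, lambda e: (e[0], e[1].upper())))
counts = tumbling_window_counts(events, 1000)
```

This sketch returns all window counts at the end of a finite input; a real streaming engine would emit each window's result as the window closes.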

Page 17: Streaming ETL for All

Apache Kafka

• Pub-sub messaging system implemented as a distributed commit log

• Popular as a source and sink for data streams

• Scalability, durability, and easy-to-understand delivery guarantees

• Can do stream processing directly in Kafka consumers

• Kafka Streams
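The "commit log" idea can be illustrated with a toy in-memory model: an append-only list plus one committed offset per consumer group. This is a teaching sketch only; the class and method names are invented and it is not the Kafka client API.

```python
from typing import Dict, List


class CommitLog:
    """Toy append-only log with per-group committed offsets."""

    def __init__(self) -> None:
        self._records: List[bytes] = []
        self._offsets: Dict[str, int] = {}

    def append(self, record: bytes) -> int:
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, group: str, max_records: int = 10) -> List[bytes]:
        """Read from the group's committed offset without advancing it."""
        start = self._offsets.get(group, 0)
        return self._records[start:start + max_records]

    def commit(self, group: str, count: int) -> None:
        """Advance the group's offset after `count` records were processed."""
        self._offsets[group] = self._offsets.get(group, 0) + count


log = CommitLog()
for rec in (b"e1", b"e2", b"e3"):
    log.append(rec)

first = log.poll("etl", max_records=2)
log.commit("etl", len(first))
rest = log.poll("etl")
```

Because records are never removed on read, a second group (say, one feeding a dashboard) sees the full log independently, and a consumer that crashes before committing simply re-reads the same records.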

Page 18: Streaming ETL for All

Data transformation

Page 19: Streaming ETL for All

Filter

[Diagram: filter]

Page 20: Streaming ETL for All

Extract

Input log line:

    127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326

Event before extract:

    ts: 1436576671000
    body: <binary blob>
    event_type_id: 100
    ...

Event after extract:

    ts: 1436576671000
    body: <binary blob>
    event_type_id: 100
    attributes: {
      ip: "127.0.0.1"
      user_agent: "Mozilla/5.0"
      user_id: "laura"
      date: "[31/Mar/2016]"
      request: "GET /index.html HTTP/1.0"
      status_code: "200"
      size: "2326"
    }
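One way to picture the extract step is a regular expression with named groups applied to the raw log line. The sketch below is plain Python, not Rocana Transform; the pattern and the `extract` function are assumptions written to fit the sample line on this slide.

```python
import re

# Hypothetical pattern for the sample line:
# ip, user agent, user id, [date], "request", status code, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<user_agent>\S+) (?P<user_id>\S+) '
    r'(?P<date>\[[^\]]+\]) "(?P<request>[^"]*)" '
    r'(?P<status_code>\d{3}) (?P<size>\d+)'
)


def extract(event: dict) -> dict:
    """Return a copy of `event` with string attributes parsed from its body."""
    line = event["body"].decode("utf-8")
    match = LOG_PATTERN.fullmatch(line)
    attributes = match.groupdict() if match else {}
    return {**event, "attributes": attributes}


raw = {
    "ts": 1436576671000,
    "event_type_id": 100,
    "body": b'127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] '
            b'"GET /index.html HTTP/1.0" 200 2326',
}
extracted = extract(raw)
```

Every extracted value is still a string at this point, matching the quoted values on the slide; typing happens in the project step.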

Page 21: Streaming ETL for All

Project

Event before project:

    ts: 1436576671000
    body: <binary blob>
    event_type_id: 100
    attributes: {
      ip: "127.0.0.1"
      user_agent: "Mozilla/5.0"
      user_id: "laura"
      date: "[31/Mar/2016]"
      request: "GET /index.html HTTP/1.0"
      status_code: "200"
      size: "2326"
    }

Record after project:

    ts: 1459444413000
    ip: "127.0.0.1"
    user_agent: "Mozilla/5.0"
    user_id: "laura"
    request: "GET /index.html HTTP/1.0"
    status_code: 200
    size: 2326
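A minimal sketch of the project step, in plain Python: lift the attributes to top-level fields, convert the numeric ones, and drop what is no longer needed. The function name and its `int_fields`/`drop_fields` parameters are hypothetical. Note that the slide's output record carries a different `ts` than its input, presumably derived from the log's own timestamp; this sketch simply copies `ts` through, which is an assumption.

```python
from typing import Dict, Tuple


def project(
    event: Dict[str, object],
    int_fields: Tuple[str, ...] = ("status_code", "size"),
    drop_fields: Tuple[str, ...] = ("date",),
) -> Dict[str, object]:
    """Flatten attributes into a typed, top-level record."""
    record: Dict[str, object] = {"ts": event["ts"]}
    for name, value in event["attributes"].items():
        if name in drop_fields:
            continue
        record[name] = int(value) if name in int_fields else value
    return record


event = {
    "ts": 1436576671000,
    "attributes": {
        "ip": "127.0.0.1",
        "date": "[31/Mar/2016]",
        "status_code": "200",
        "size": "2326",
    },
}
flat = project(event)
```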

Page 22: Streaming ETL for All

Problem

Page 23: Streaming ETL for All

Who

• Developers

• Data engineers

• Sysadmins

• Analysts

Page 24: Streaming ETL for All

Tools

Page 25: Streaming ETL for All

The dark art of data science

• Feature engineering

• “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills

• Video from Midwest.io 2014

Page 26: Streaming ETL for All

Data transformation for all

Page 27: Streaming ETL for All

Rocana Transform

• Library

• Java

• Rocana configuration
  • JSON plus comments and specific numeric types, minus excess quoting

Page 28: Streaming ETL for All

Data model

• Event schema
  • id: A globally unique identifier for this event
  • ts: Epoch timestamp in milliseconds
  • event_type_id: ID indicating the type of the event
  • location: Location from which the event was generated
  • host: Hostname, IP, or other device identifier from which the event was generated
  • service: Service or process from which the event was generated
  • body: Raw event content in bytes
  • attributes: Event type-specific key/value pairs
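The event schema above maps naturally onto a small record type. The dataclass below is an illustrative Python rendering, not the actual Avro schema; the concrete field types (for example `int` for `ts`, `Dict[str, str]` for `attributes`) and the sample values are assumptions based on the field descriptions.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    id: str                      # globally unique identifier
    ts: int                      # epoch timestamp in milliseconds
    event_type_id: int           # type of the event
    location: str                # where the event was generated
    host: str                    # hostname, IP, or other device identifier
    service: str                 # service or process that generated it
    body: bytes = b""            # raw event content
    attributes: Dict[str, str] = field(default_factory=dict)


# Hypothetical sample values for illustration.
event = Event(
    id="example-id",
    ts=1436576671000,
    event_type_id=100,
    location="aws/us-west-2a",
    host="example01.rocana.com",
    service="dhclient",
    attributes={"syslog_pid": "865"},
)
```

Keeping type-specific data in a string-to-string `attributes` map is what lets one schema carry syslog, HTTP logs, and metrics alike; the cost is that values need converting when they are flattened into typed fields.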

Page 29: Streaming ETL for All

Example event

{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 100,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "example01.rocana.com",
  "service": "dhclient",
  "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
  "attributes": {
    "syslog_timestamp": "1436576671000",
    "syslog_process": "dhclient",
    "syslog_pid": "865",
    "syslog_facility": "3",
    "syslog_severity": "6",
    "syslog_hostname": "example01",
    "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
  }
}

Page 30: Streaming ETL for All

Filter, extract, and flatten

Page 31: Streaming ETL for All

Filter, extract, and flatten

• Filter out events without type id 100

• Filter out events without hostname prefix "ex"

• Extract a numeric prefix from the syslog message

• Flatten syslog attributes to top-level fields in a different Avro schema

Page 32: Streaming ETL for All

Filter, extract, and flatten

{
  load-event: {},
  // Filter by event_type_id
  filter: { expression: "${event_type_id == 100}" },
  // Extract hostname prefix
  regex: { ... },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  // Extract a numeric prefix from the syslog message
  regex: { ... },
  // Build flattened record
  build-avro-record: { ... },
  // Accumulate output record
  accumulate-output: { value: "${output_record}" }
}

Page 33: Streaming ETL for All

Extract hostname prefix

{
  load-event: {},
  filter: { expression: "${event_type_id == 100}" },
  regex: {
    pattern: "^(.{2}).*$",
    value: "${attr.syslog_hostname}",
    destination: "host_prefix"
  },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  ...
}

Page 34: Streaming ETL for All

Extract numeric prefix

...
filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
regex: {
  pattern: "^([0-9]*)",
  value: "${attributes['syslog_message']}",
  destination: "msg",
  match-actions: {
    set-values: { extracted_field: "${msg.match.group.1}" }
  },
  no-match-actions: {
    set-values: { extracted_field: "" }
  }
},
...

Page 35: Streaming ETL for All

Build flattened record

...
build-avro-record: {
  schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
  destination: "output_record",
  field-mapping: {
    ts: "${ts}",
    event_type_id: "${event_type_id}",
    source: "${source}",
    syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
    syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
    ...
    syslog_message: "${attributes['syslog_message']}",
    syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
    extracted_field: "${extracted_field}"
  },
},
...
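For readers who think in code rather than configuration, the filter, extract, and flatten steps correspond roughly to the hand-written Python below. This is a sketch of equivalent logic under stated assumptions, not how Rocana Transform executes the configuration: the `transform` function is hypothetical, the elided field mappings are left out, and `source` is omitted because the example event has no such field.

```python
import re
from typing import Optional


def transform(event: dict) -> Optional[dict]:
    """Return a flattened record, or None if the event is filtered out."""
    # Filter by event type.
    if event["event_type_id"] != 100:
        return None
    attrs = event["attributes"]

    # Extract the two-character hostname prefix and filter on it.
    prefix = re.match(r"^(.{2}).*$", attrs["syslog_hostname"])
    if prefix is None or prefix.group(1) != "ex":
        return None

    # Extract a numeric prefix from the message ("" when there are no digits).
    msg = re.match(r"^([0-9]*)", attrs["syslog_message"])
    extracted = msg.group(1) if msg else ""

    # Flatten attributes into typed top-level fields.
    return {
        "ts": event["ts"],
        "event_type_id": event["event_type_id"],
        "syslog_facility": int(attrs["syslog_facility"]),
        "syslog_severity": int(attrs["syslog_severity"]),
        "syslog_message": attrs["syslog_message"],
        "syslog_pid": int(attrs["syslog_pid"]),
        "extracted_field": extracted,
    }


sample = {
    "event_type_id": 100,
    "ts": 1436576671000,
    "attributes": {
        "syslog_pid": "865",
        "syslog_facility": "3",
        "syslog_severity": "6",
        "syslog_hostname": "example01",
        "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)",
    },
}
record = transform(sample)
```

The point of the configuration approach is that this logic no longer has to be written, compiled, and deployed by a developer for every new log format.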

Page 36: Streaming ETL for All

Extract metrics from log data

Page 37: Streaming ETL for All

Extract metrics

• Input: HTTP status logs

• Extract request latency

• Extract counts by HTTP status code

• Metric types
  • Gauge: A value that varies over time (think latency, CPU %, etc.)
  • Counter: A value that accumulates over time (think event volume, status codes, etc.)

Page 38: Streaming ETL for All

Example metric event

{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 107,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "web01.rocana.com",
  "service": "httpd",
  "attributes": {
    "m.http.request.latency": "4.2000000000E1|g",
    "m.http.status.401.count": "1.0000000000E0|c"
  }
}
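The metric attribute values in this example appear to encode a number in scientific notation followed by a type suffix. Reading `|g` as gauge and `|c` as counter is an inference from the example, not a documented format, so treat this small Python decoder as a sketch under that assumption; the function name is made up.

```python
from typing import Tuple

# Assumed meaning of the suffixes, inferred from the example event.
METRIC_TYPES = {"g": "gauge", "c": "counter"}


def parse_metric(encoded: str) -> Tuple[float, str]:
    """Split 'value|type' into a float and a metric type name."""
    value, _, type_code = encoded.rpartition("|")
    return float(value), METRIC_TYPES[type_code]


latency = parse_metric("4.2000000000E1|g")
status_count = parse_metric("1.0000000000E0|c")
```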

Page 39: Streaming ETL for All

Extract metrics

{
  load-event: {},
  build-metric: {
    gauge-mapping: {
      http.request.latency: "${convert:toDouble(attributes['latency'])}"
    },
    destination: "latency_metric"
  },
  accumulate-output: { value: "${latency_metric}" },
  build-metric: {
    dynamic-counter-mapping: [
      "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
    ],
    destination: "status_metric"
  },
  accumulate-output: { value: "${status_metric}" }
}
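In code, the two metric-building steps amount to roughly the following. This is a plain-Python sketch of the same logic, not the library's behavior: the `extract_metrics` function and the `gauges`/`counters` output shape are assumptions for illustration.

```python
from typing import Dict, List


def extract_metrics(event: dict) -> List[Dict[str, Dict[str, float]]]:
    """Build one gauge metric and one counter metric from an HTTP log event."""
    attrs = event["attributes"]

    # Gauge: the request latency, converted from string to float.
    latency_metric = {
        "gauges": {"http.request.latency": float(attrs["latency"])}
    }

    # Counter: the metric name is built from the status code at runtime.
    counter_name = "http.status.%s.count" % attrs["sc_status"]
    status_metric = {"counters": {counter_name: 1.0}}

    return [latency_metric, status_metric]


metrics = extract_metrics({"attributes": {"latency": "42", "sc_status": "401"}})
```

The dynamic counter name is the interesting part: one rule yields a separate counter per status code without listing the codes in advance.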

Page 40: Streaming ETL for All

Architecture

Page 41: Streaming ETL for All

Architecture

[Diagram. Components shown: Configuration file; Java action objects; Context (Variables); Driver]

1. Parse config

2. Initialize context

3. Execute actions

4. Read/write variables

5. Copy output
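The five numbered steps can be sketched as a tiny driver loop. Everything here is hypothetical and written in Python for brevity (the real library is Java): the `Context` class, the two toy actions, and the registry are stand-ins meant only to show the control flow.

```python
from typing import Callable, Dict, List


class Context:
    """Holds the variables actions share and the accumulated output."""

    def __init__(self, event: dict) -> None:
        self.variables: Dict[str, object] = {"event": event}
        self.output: List[object] = []


Action = Callable[[Context], bool]  # return False to stop processing


def set_upper(ctx: Context) -> bool:
    ctx.variables["host_upper"] = ctx.variables["event"]["host"].upper()
    return True


def accumulate_output(ctx: Context) -> bool:
    ctx.output.append(ctx.variables["host_upper"])
    return True


# Hypothetical mapping from config keywords to action implementations.
REGISTRY: Dict[str, Action] = {
    "set-upper": set_upper,
    "accumulate-output": accumulate_output,
}


def run(config: List[str], event: dict) -> List[object]:
    actions = [REGISTRY[name] for name in config]  # 1. parse config
    ctx = Context(event)                           # 2. initialize context
    for action in actions:                         # 3. execute actions
        if not action(ctx):                        # 4. actions read/write variables
            break
    return list(ctx.output)                        # 5. copy output


result = run(["set-upper", "accumulate-output"], {"host": "example01"})
```

Because actions communicate only through the context, the same parsed configuration can be driven from any host: a Kafka consumer, a Spark job, or a batch process.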

Page 42: Streaming ETL for All

Custom actions

• Actions loaded at runtime using Java services framework

• Add your jar to the classpath

• Custom actions appear as top-level keywords just like regular actions

• Implement the execute() method of the Action interface

• Implement the build() method of the ActionBuilder interface
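The builder/action split described above might look like the following. The slide names only the `Action.execute()` and `ActionBuilder.build()` methods; the signatures, the `keyword` attribute, the lookup action, and the `BUILDERS` registry are all assumptions, and the sketch is in Python rather than Java, where discovery would go through the Java services framework instead of a dictionary.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Action(ABC):
    @abstractmethod
    def execute(self, variables: Dict[str, object]) -> None:
        """Read and write variables for one event."""


class ActionBuilder(ABC):
    keyword: str  # hypothetical: the top-level config keyword for this action

    @abstractmethod
    def build(self, config: Dict[str, object]) -> Action:
        """Create a configured action from its config block."""


class LookupAction(Action):
    """Hypothetical custom action: map an id to a name via a reference table."""

    def __init__(self, table: Dict[str, str], source: str, destination: str) -> None:
        self.table = table
        self.source = source
        self.destination = destination

    def execute(self, variables: Dict[str, object]) -> None:
        key = variables.get(self.source)
        variables[self.destination] = self.table.get(key, "unknown")


class LookupActionBuilder(ActionBuilder):
    keyword = "lookup"

    def build(self, config: Dict[str, object]) -> Action:
        return LookupAction(config["table"], config["source"], config["destination"])


# Stand-in for runtime discovery of builders on the classpath.
BUILDERS = {b.keyword: b for b in (LookupActionBuilder(),)}

action = BUILDERS["lookup"].build(
    {"table": {"d-17": "core-switch"}, "source": "device_id", "destination": "device_name"}
)
variables: Dict[str, object] = {"device_id": "d-17"}
action.execute(variables)
```

Separating `build()` from `execute()` means configuration is validated once at parse time, while `execute()` stays cheap on the per-event path.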

Page 43: Streaming ETL for All

Custom actions

• Parse custom log formats
  • Cisco ACS
  • Citrix
  • Juniper
  • Customer-specific formats

• Look up IP addresses in the MaxMind GeoIP2 database

• Reference dataset lookups
  • Device id to device name

Page 44: Streaming ETL for All

Putting it all together

• Stream processing is causing us to re-think how we analyze data

• Limiting who can work on the data transformation side increases costs and decreases velocity

• Reduce your reliance on developers to code custom pipelines

• Re-use transformation configuration in any stream processing framework or batch job

Page 45: Streaming ETL for All

Coming soon

• Rocana Transform will be released under the ASL 2.0

• The configuration library is available today:
  • https://github.com/scalingdata/rocana-configuration

Page 46: Streaming ETL for All

Questions?