© Rocana, Inc. All Rights Reserved. | 1 Joey Echeverria, Platform Technical Lead Strata+Hadoop World, March 31st 2016 San Jose, CA Embeddable data transformation for real-time streams

Embeddable data transformation for real time streams

Page 1: Embeddable data transformation for real time streams

© Rocana, Inc. All Rights Reserved. | 1

Joey Echeverria, Platform Technical Lead

Strata+Hadoop World, March 31st 2016

San Jose, CA

Embeddable data transformation for real-time streams

Page 2: Embeddable data transformation for real time streams

Slides

http://j.mp/rocana-transform-slides

Page 3: Embeddable data transformation for real time streams

Questions

http://j.mp/hw-questions

Page 4: Embeddable data transformation for real time streams

Context

Page 5: Embeddable data transformation for real time streams

Joey

• Where I work: Rocana – Platform Technical Lead

• Where I used to work: Cloudera (’11-’15), NSA

• Distributed systems, security, data processing, big data

Page 6: Embeddable data transformation for real time streams

Signing today at 1pm at the Cloudera booth

Page 7: Embeddable data transformation for real time streams

History

Page 8: Embeddable data transformation for real time streams

“Legacy” data architecture

[Diagram. Labels: Data Producers, Flume/Sqoop, HDFS, Avro/Parquet Files, MapReduce, Spark, Impala, Visualization/Query]

Page 9: Embeddable data transformation for real time streams

Stream data architecture

[Diagram. Labels: Data Producers, Kafka, Avro Serialized Records, Storm, Flink, Spark Streaming, Kafka Consumers, Real-time Visualization, HDFS, Avro/Parquet Files]

Page 11: Embeddable data transformation for real time streams

Stream processing

A primer

Page 12: Embeddable data transformation for real time streams

Stream processing

• Filter

• Extract

• Project

• Aggregate

• Join

• Model

Page 14: Embeddable data transformation for real time streams

Stream processing

• Filter

• Extract

• Project

• Aggregate

• Join

• Model

• Data transformation
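
As a rough illustration of the first three operations in the list, here is a minimal Python sketch over plain dictionaries. The record fields, the `parse_kv` helper, and the function names are invented for this example and do not come from any framework discussed in this talk.

```python
def filter_records(records, predicate):
    """Filter: keep only the records that satisfy the predicate."""
    return (r for r in records if predicate(r))


def extract(record, field, parser):
    """Extract: derive new fields from a raw field and merge them in."""
    enriched = dict(record)
    enriched.update(parser(record[field]))
    return enriched


def project(record, fields):
    """Project: keep only the named fields."""
    return {name: record[name] for name in fields}


def parse_kv(body):
    # Hypothetical parser: "status=200 size=2326" -> {"status": "200", ...}
    return dict(pair.split("=", 1) for pair in body.split())


def pipeline(records):
    kept = filter_records(records, lambda r: r["event_type_id"] == 100)
    extracted = (extract(r, "body", parse_kv) for r in kept)
    return [project(r, ["ts", "status", "size"]) for r in extracted]


events = [
    {"ts": 1, "event_type_id": 100, "body": "status=200 size=2326"},
    {"ts": 2, "event_type_id": 101, "body": "status=500 size=10"},
]
```

Calling `pipeline(events)` keeps only the first event and reduces it to the three projected fields.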

Page 15: Embeddable data transformation for real time streams

Apache Storm

• "Distributed real-time computation system"

• Applications packaged into topologies (think MapReduce job)

• Topologies operate over streams of tuples

• Spout: source of a stream

• Bolt: arbitrary operation such as filtering, aggregating, joining, or executing arbitrary functions

Page 16: Embeddable data transformation for real time streams

Apache Spark

• Supports batch and stream processing

• Continuous stream of records discretized into a DStream

• DStream: a sequence of RDDs (batches of records)

• Micro-batch
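
The micro-batch idea can be sketched in a few lines of plain Python. This is not Spark code: Spark Streaming forms batches by time interval, whereas this simplified sketch groups by record count, and the function name `micro_batches` is made up for illustration.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an iterator of records into lists of at most batch_size records."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded list plays the role of one batch in the sequence; downstream code processes a whole batch at a time rather than one record at a time.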

Page 17: Embeddable data transformation for real time streams

Apache Flink

• Supports batch and stream processing

• DataStream: unbounded collection of records

• Operations can apply to individual records or windows of records

• Supports record-at-a-time processing (like Storm)
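
To make the difference between per-record and windowed operations concrete, here is a small plain-Python sketch. It is not Flink's API. It assumes each record is a dictionary with a millisecond `ts` field, and it works over a finite list, whereas a real stream processor would emit each window as it closes.

```python
from collections import defaultdict


def per_record(records, fn):
    """Record-at-a-time: apply fn to each record as it arrives."""
    for record in records:
        yield fn(record)


def tumbling_window_counts(records, window_ms):
    """Count records per fixed-size, non-overlapping time window."""
    counts = defaultdict(int)
    for record in records:
        # Align each timestamp to the start of its window.
        window_start = record["ts"] - record["ts"] % window_ms
        counts[window_start] += 1
    return dict(counts)
```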

Page 18: Embeddable data transformation for real time streams

Apache Kafka

• Pub-sub messaging system implemented as a distributed commit log

• Popular as a source and sink for data streams

• Scalability, durability, and easy-to-understand delivery guarantees

• Can do stream processing directly in Kafka consumers
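
The commit-log model can be pictured with a toy, single-process sketch. It does not use or imitate the Kafka client API; the `CommitLog` and `Consumer` classes are hypothetical and leave out partitions, replication, and persistence. The point is only that the log is append-only and each consumer tracks its own offset.

```python
class CommitLog:
    """A single-partition, in-memory, append-only log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Tracks its own position in the log, so consumers are independent."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because reading never removes records, a second consumer can start from offset 0 and see everything the first one already processed.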

Page 19: Embeddable data transformation for real time streams

Data transformation

Page 20: Embeddable data transformation for real time streams

Filter

filter

Page 21: Embeddable data transformation for real time streams

Extract

127.0.0.1 Mozilla/5.0 laura [31/Mar/2016] "GET /index.html HTTP/1.0" 200 2326

ts: 1436576671000
body: <binary blob>
event_type_id: 100
...

extract

ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/Mar/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}
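
A minimal Python sketch of the extract step, assuming the event body holds the log line shown on this slide as UTF-8 bytes. The regular expression and the `extract_attributes` name are written for this example only and are not taken from Rocana Transform.

```python
import re

# One named group per attribute; the layout follows the sample log line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) (?P<user_agent>\S+) (?P<user_id>\S+) '
    r'(?P<date>\[[^\]]+\]) "(?P<request>[^"]*)" '
    r'(?P<status_code>\d{3}) (?P<size>\d+)$'
)


def extract_attributes(event):
    """Return a copy of the event with attributes parsed from its body."""
    match = LOG_PATTERN.match(event["body"].decode("utf-8"))
    enriched = dict(event)
    # Every extracted value is still a string at this stage.
    enriched["attributes"] = match.groupdict() if match else {}
    return enriched
```

Extraction leaves the original body in place and only adds string-valued attributes; type conversion is deferred to a later step.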

Page 22: Embeddable data transformation for real time streams

Project

ts: 1436576671000
body: <binary blob>
event_type_id: 100
attributes: {
  ip: "127.0.0.1"
  user_agent: "Mozilla/5.0"
  user_id: "laura"
  date: "[31/Mar/2016]"
  request: "GET /index.html HTTP/1.0"
  status_code: "200"
  size: "2326"
}

project

ts: 1459444413000
ip: "127.0.0.1"
user_agent: "Mozilla/5.0"
user_id: "laura"
request: "GET /index.html HTTP/1.0"
status_code: 200
size: 2326
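
The projection can be sketched in Python as promoting selected attributes to top-level fields and converting the numeric ones. The `project` function and its parameters are hypothetical. The ts value in the slide's output differs from the input; the slides do not say how it is derived, so this sketch simply copies ts.

```python
def project(event, string_fields, int_fields):
    """Promote selected attributes to typed top-level fields."""
    attrs = event["attributes"]
    record = {"ts": event["ts"]}
    record.update({name: attrs[name] for name in string_fields})
    # Numeric attributes arrive as strings and are converted here.
    record.update({name: int(attrs[name]) for name in int_fields})
    return record
```

Fields that are not listed, such as body and event_type_id, are dropped from the output record.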

Page 23: Embeddable data transformation for real time streams

Problem

Page 24: Embeddable data transformation for real time streams

Who

• Developers

• Data engineers

• Sysadmins

• Analysts

Page 25: Embeddable data transformation for real time streams

Tools

Page 26: Embeddable data transformation for real time streams

The dark art of data science

• Feature engineering

• “Getting a mess of raw data that can be used as input to a machine learning algorithm” - @josh_wills

• Video from Midwest.io 2014

Page 27: Embeddable data transformation for real time streams

Data transformation for all

Page 28: Embeddable data transformation for real time streams

Rocana Transform

• Library

• Java

• Rocana configuration

  • JSON + comments + specific numeric types - excess quoting

Page 29: Embeddable data transformation for real time streams

Data model

• Event schema

  • id: A globally unique identifier for this event

  • ts: Epoch timestamp in milliseconds

  • event_type_id: ID indicating the type of the event

  • location: Location from which the event was generated

  • host: Hostname, IP, or other device identifier from which the event was generated

  • service: Service or process from which the event was generated

  • body: Raw event content in bytes

  • attributes: Event type-specific key/value pairs
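
For readers who think in code, the schema can be mirrored as a Python dataclass. This is only an illustration: the field names come from the list above, but the Python types (and the assumption that attribute values are strings) are guesses, and the real schema is not defined in Python.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    id: str             # globally unique identifier
    ts: int             # epoch timestamp in milliseconds
    event_type_id: int  # type of the event
    location: str       # where the event was generated
    host: str           # hostname, IP, or other device identifier
    service: str        # service or process that generated the event
    body: bytes = b""   # raw event content
    # Assumed here to map string keys to string values.
    attributes: Dict[str, str] = field(default_factory=dict)
```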

Page 30: Embeddable data transformation for real time streams

Example event

{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXYXV7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 100,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "example01.rocana.com",
  "service": "dhclient",
  "body": "<36>Jul 10 18:04:31 gs09.example.com dhclient[865] DHCPACK from …",
  "attributes": {
    "syslog_timestamp": "1436576671000",
    "syslog_process": "dhclient",
    "syslog_pid": "865",
    "syslog_facility": "3",
    "syslog_severity": "6",
    "syslog_hostname": "example01",
    "syslog_message": "DHCPACK from 10.10.1.1 (xid=0x5c64bdb0)"
  }
}

Page 31: Embeddable data transformation for real time streams

Filter, extract, and flatten

Page 32: Embeddable data transformation for real time streams

Filter, extract, and flatten

• Filter out events without type id 100

• Filter out events without hostname prefix "ex"

• Extract a numeric prefix from the syslog message

• Flatten syslog attributes to top-level fields in a different Avro schema

Page 33: Embeddable data transformation for real time streams

Filter, extract, and flatten

{
  load-event: {},

  // Filter by event_type_id
  filter: { expression: "${event_type_id == 100}" },

  // Extract hostname prefix
  regex: { ... },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },

  // Extract a numeric prefix from the syslog message
  regex: { ... },

  // Build flattened record
  build-avro-record: { ... },

  // Accumulate output record
  accumulate-output: { value: "${output_record}" }
}

Page 34: Embeddable data transformation for real time streams

Extract hostname prefix

{
  load-event: {},
  filter: { expression: "${event_type_id == 100}" },
  regex: {
    pattern: "^(.{2}).*$",
    value: "${attr.syslog_hostname}",
    destination: "host_prefix"
  },
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  ...
}
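
The same two checks, written as a plain Python sketch to show what the pattern captures. The helper names are invented, and Python's `re` module stands in for whatever regex engine the library uses.

```python
import re


def host_prefix(hostname):
    """Return the first two characters of the hostname, or None if too short."""
    match = re.match(r"^(.{2}).*$", hostname)
    return match.group(1) if match else None


def keep_event(event):
    """Mirror both filters: event type 100 and hostname prefix 'ex'."""
    return (
        event["event_type_id"] == 100
        and host_prefix(event["attributes"]["syslog_hostname"]) == "ex"
    )
```

For the example event shown earlier, the hostname "example01" yields the prefix "ex", so the event passes both filters.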

Page 35: Embeddable data transformation for real time streams

Extract numeric prefix

  ...
  filter: { expression: "${host_prefix.match.group.1 == 'ex'}" },
  regex: {
    pattern: "^([0-9]*)",
    value: "${attributes['syslog_message']}",
    destination: "msg",
    match-actions: {
      set-values: { extracted_field: "${msg.match.group.1}" }
    },
    no-match-actions: {
      set-values: { extracted_field: "" }
    }
  },
  ...
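
One subtlety of the pattern `^([0-9]*)` is worth seeing in code: because `*` allows zero digits, the result depends on whether the engine searches for a prefix or requires the whole string to match. The slides do not say which mode the regex action uses, so the Python sketch below shows both; the function names are made up for illustration.

```python
import re

PATTERN = re.compile(r"^([0-9]*)")


def numeric_prefix_search(message):
    """Prefix-style matching: [0-9]* may match the empty string."""
    match = PATTERN.match(message)
    return match.group(1) if match else ""


def numeric_prefix_full(message):
    """Full-string matching: the pattern must consume the whole message."""
    match = PATTERN.fullmatch(message)
    return match.group(1) if match else ""
```

With prefix-style matching the pattern always matches, so a no-match branch would never run and the empty string comes from the match itself. With full-string matching, any message containing non-digits falls through to the no-match branch.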

Page 36: Embeddable data transformation for real time streams

Build flattened record

  ...
  build-avro-record: {
    schema-uri: "resource:avro-schemas/flattened-syslog.avsc",
    destination: "output_record",
    field-mapping: {
      ts: "${ts}",
      event_type_id: "${event_type_id}",
      source: "${source}",
      syslog_facility: "${convert:toInt(attributes['syslog_facility'])}",
      syslog_severity: "${convert:toInt(attributes['syslog_severity'])}",
      ...
      syslog_message: "${attributes['syslog_message']}",
      syslog_pid: "${convert:toInt(attributes['syslog_pid'])}",
      extracted_field: "${extracted_field}"
    }
  },
  ...

Page 37: Embeddable data transformation for real time streams

Extract metrics from log data

Page 38: Embeddable data transformation for real time streams

Extract metrics

• Input: HTTP status logs

• Extract request latency

• Extract counts by HTTP status code

• Metric types

  • Gauge: A value that varies over time (think latency, CPU %, etc.)

  • Counter: A value that accumulates over time (think event volume, status codes, etc.)

Page 39: Embeddable data transformation for real time streams

Example metric event

{
  "id": "JRHAIDMLCKLEAPMIQDHFLO3MXBBQ7NVBEJNDKZGS2XVSEINGGBHA====",
  "event_type_id": 107,
  "ts": 1436576671000,
  "location": "aws/us-west-2a",
  "host": "web01.rocana.com",
  "service": "httpd",
  "attributes": {
    "m.http.request.latency": "4.2000000000E1|g",
    "m.http.status.401.count": "1.0000000000E0|c"
  }
}
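
Judging only from this example, metric attributes appear to use an "m." key prefix and a value of the form number, pipe, type code, with "g" for gauge and "c" for counter. That reading is an inference, not a documented format. Under that assumption, a Python sketch for decoding them could look like this (the `parse_metrics` name is hypothetical):

```python
METRIC_TYPES = {"g": "gauge", "c": "counter"}


def parse_metrics(attributes):
    """Decode attributes named m.<metric> with values <number>|<type>."""
    metrics = {}
    for key, raw in attributes.items():
        if not key.startswith("m."):
            continue  # not a metric attribute
        number, _, type_code = raw.partition("|")
        metrics[key[2:]] = (float(number), METRIC_TYPES[type_code])
    return metrics
```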

Page 40: Embeddable data transformation for real time streams

Extract metrics

{
  load-event: {},
  build-metric: {
    gauge-mapping: {
      http.request.latency: "${convert:toDouble(attributes['latency'])}"
    },
    destination: "latency_metric"
  },
  accumulate-output: { value: "${latency_metric}" },
  build-metric: {
    dynamic-counter-mapping: [
      "${string:format('http.status.%s.count', attributes['sc_status'])}", 1D
    ],
    destination: "status_metric"
  },
  accumulate-output: { value: "${status_metric}" }
}
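
In plain Python terms, the two build-metric steps amount to roughly the following. This is a sketch of the intent rather than of the library: `build_metrics` is a made-up name, and it returns two dictionaries instead of metric events.

```python
def build_metrics(attributes):
    """Derive one gauge and one dynamically named counter from an HTTP log event."""
    # Gauge: the latency attribute converted to a double.
    gauges = {"http.request.latency": float(attributes["latency"])}
    # Counter: the metric name is built from the status code; each event counts once.
    counters = {"http.status.%s.count" % attributes["sc_status"]: 1.0}
    return gauges, counters
```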

Page 41: Embeddable data transformation for real time streams

Architecture

Page 42: Embeddable data transformation for real time streams

Architecture

[Diagram. Labels: Configuration file, Driver, Java action objects, Context, Variables]

1. Parse config

2. Initialize context

3. Execute actions

4. Read/write variables

5. Copy output
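
The five steps can be sketched as a tiny driver loop. The real library is Java and its interfaces are not shown in these slides, so every class and function below is hypothetical, and the expression language is replaced by simple settings dictionaries. The configuration is a list of (keyword, settings) pairs because the same keyword can appear more than once.

```python
class Context:
    """Holds the variables that actions read and write, plus the output."""

    def __init__(self, event):
        self.variables = dict(event)
        self.output = []


class Filter:
    def __init__(self, settings):
        self.field = settings["field"]
        self.equals = settings["equals"]

    def execute(self, context):
        # Returning False tells the driver to stop processing this event.
        return context.variables.get(self.field) == self.equals


class SetValue:
    def __init__(self, settings):
        self.name = settings["name"]
        self.value = settings["value"]

    def execute(self, context):
        context.variables[self.name] = self.value
        return True


class AccumulateOutput:
    def __init__(self, settings):
        self.fields = settings["fields"]

    def execute(self, context):
        context.output.append({f: context.variables[f] for f in self.fields})
        return True


# Keyword-to-class registry; a custom action would add one more entry here.
ACTION_TYPES = {
    "filter": Filter,
    "set-value": SetValue,
    "accumulate-output": AccumulateOutput,
}


def parse_config(config):
    """Step 1: turn (keyword, settings) pairs into action objects."""
    return [ACTION_TYPES[keyword](settings) for keyword, settings in config]


def run(actions, event):
    context = Context(event)             # 2. initialize context
    for action in actions:               # 3. execute actions
        if not action.execute(context):  # 4. actions read/write variables
            break
    return list(context.output)          # 5. copy output
```

Parsing happens once and the resulting action objects are reused for every event, which is what makes this shape easy to embed in a per-record callback.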

Page 43: Embeddable data transformation for real time streams

Custom actions

• Actions loaded at runtime using Java services framework

• Add your jar to the classpath

• Custom actions appear as top-level keywords just like regular actions

• Implement the execute() method of the Action interface

• Implement the build() method of the ActionBuilder interface

Page 44: Embeddable data transformation for real time streams

Custom actions

• Parse custom log formats

  • Cisco ACS

  • Citrix

  • Juniper

  • Customer-specific formats

• Look up IP addresses in the MaxMind GeoIP2 database

• Reference dataset lookups

  • Device ID to device name

Page 45: Embeddable data transformation for real time streams

Putting it all together

• Stream processing is causing us to re-think how we analyze data

• Limiting who can work on data transformation increases costs and decreases velocity

• Reduce your reliance on developers to code custom pipelines

• Re-use transformation configuration in any stream processing framework or batch job

Page 46: Embeddable data transformation for real time streams

Coming soon

• Rocana Transform will be released under the ASL 2.0

• The base configuration library is available today:

  • https://github.com/scalingdata/rocana-configuration

Page 47: Embeddable data transformation for real time streams

Questions?

• Signing "Hadoop Security" today at 1pm at the Cloudera booth