Building a system for machine and event-oriented data with Rocana


building a system for machine and event-oriented data
e. sammer | @esammer | september 9, 2015
silicon valley data engineering meetup


context


me
• i work here: rocana – cto and cofounder
• i used to work here: cloudera (‘10 – ’14), magnetic, experian, …
• i do this: systems / distributed systems (storage, query, messaging, ...)
• i wrote this: Hadoop Operations (O’Reilly)


what we do
• we build a system for the operation of modern data centers
• triage and diagnostics, exploration, trends, advanced analytics of complex systems
• our data: logs, metrics, human activity, anything that occurs in the data center
• “enterprise software” (i.e. we build for others.)
• today: how we built what we built


our typical customer use cases
• >100K events / sec (8.6B events / day), sub-second end-to-end latency, full fidelity retention, critical use cases
• quality of service – “are credit card transactions happening fast enough?”
• fraud detection – “detect, investigate, prosecute, and learn from fraud.”
• forensic diagnostics – “what really caused the outage last friday?”
• security – “who’s doing what, where, when, why, and how, and is that ok?”
• user behavior – “capture and correlate user behavior with system performance, then feed it to downstream systems in realtime.”


depth: 3 meters


high level architecture


guarantees
• no single point of failure exists
• all components scale horizontally [1]
• data retention and latency are a function of cost, not tech [1]
• every event is delivered provided no more than N - 1 failures occur (where N is the kafka replication level; see the producer sketch below)
• all operations, including upgrade, are online [2]
• every event is (or appears to be) delivered exactly once [3]

[1] we’re positive there’s a limit, but thus far it has been cost.
[2] from the user’s perspective, at a system level.
[3] when queried via our UI. lots of details here.
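
the N - 1 delivery guarantee leans on kafka replication plus producer acknowledgement settings. a minimal sketch of a durably-configured producer, assuming the kafka java client; the broker address and the "events" topic name are hypothetical, not taken from the deck:

  import java.util.Properties;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class DurableProducer {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
      props.put("acks", "all");  // wait for all in-sync replicas to acknowledge
      props.put("retries", Integer.toString(Integer.MAX_VALUE)); // retry transient failures
      props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
      props.put("value.serializer",
          "org.apache.kafka.common.serialization.ByteArraySerializer");

      try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
        // with a topic replication factor of N, an acknowledged write
        // survives up to N - 1 broker failures
        producer.send(new ProducerRecord<>("events", "event-id-1", "payload".getBytes()));
      }
    }
  }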


events


modeling our world
• everything is an event
• each event contains a timestamp, type, location, host, service, body, and type-specific attributes (k/v pairs)
• build specialized aggregates as necessary – just optimized views of the data


event schema

{
  id: string,
  ts: long,
  event_type_id: int,
  location: string,
  host: string,
  service: string,
  body: [ null, string ],
  attributes: map<string>
}
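
the [ null, string ] union and map<string> read like avro shorthand. a sketch of the equivalent avro schema, assuming avro is the wire format (the record name is hypothetical; the fields follow the slide above):

  {
    "type": "record",
    "name": "Event",
    "doc": "hypothetical record name; fields transcribed from the schema slide",
    "fields": [
      { "name": "id",            "type": "string" },
      { "name": "ts",            "type": "long" },
      { "name": "event_type_id", "type": "int" },
      { "name": "location",      "type": "string" },
      { "name": "host",          "type": "string" },
      { "name": "service",       "type": "string" },
      { "name": "body",          "type": [ "null", "string" ] },
      { "name": "attributes",    "type": { "type": "map", "values": "string" } }
    ]
  }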


event types
• some event types are standard
  – syslog, http, log4j, generic text record, …
• users define custom event types
• producers populate event type
• transformations can turn one event type into another (see the sketch below)
• event type metadata tells downstream systems how to interpret body and attributes
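
a minimal java sketch of an event-type transformation; the Event and Transformation names are hypothetical (the deck doesn’t show rocana’s actual api), and the regex is only illustrative:

  import java.util.*;
  import java.util.regex.*;

  // hypothetical event shape mirroring the schema slide
  class Event {
    String id; long ts; int eventTypeId;
    String location, host, service, body;
    Map<String, String> attributes = new HashMap<>();
  }

  // hypothetical plugin interface: one event in, zero or more out
  interface Transformation {
    List<Event> apply(Event in);
  }

  // re-type a generic text record as a syslog event (type 100, per the deck)
  // by extracting fields from the body into attributes
  class SyslogTransformation implements Transformation {
    private static final Pattern PROC = Pattern.compile("(\\w+)\\[(\\d+)\\]:");

    public List<Event> apply(Event in) {
      in.eventTypeId = 100; // rfc3164 / rfc5424 syslog
      Matcher m = PROC.matcher(in.body == null ? "" : in.body);
      if (m.find()) {
        in.attributes.put("syslog_process", m.group(1));
        in.attributes.put("syslog_pid", m.group(2));
      }
      return Collections.singletonList(in);
    }
  }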


ex: generic syslog event

event_type_id: 100, // rfc3164, rfc5424 (syslog)
body: …             // raw syslog message bytes
attributes: {       // extracted fields from body
  syslog_message: "DHCPACK from 10.10.0.1 (xid=0x45b63bdc)",
  syslog_severity: "6", // info severity
  syslog_facility: "3", // daemon facility
  syslog_process: "dhclient",
  syslog_pid: "668",
  …
}


ex: generic http event

event_type_id: 102, // generic http event
body: …             // raw http log message bytes
attributes: {
  http_req_method: "GET",
  http_req_vhost: "w2a-demo-02",
  http_req_path: "/api/v1/search?q=service%3Asshd&p=1&s=200",
  http_req_query: "q=service%3Asshd&p=1&s=200",
  http_resp_code: "200",
  …
}


consumers


consumers
• …do most of the work (see the sketch below)
• parallelism
• kafka offset management
• message de-duplication
• transformation (embedded library)
• dead letter queue support
• downstream system knowledge
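
a sketch of a consumer loop covering manual offset management, de-duplication, and a dead letter queue, assuming the modern kafka java client; the broker, group, and topic names are hypothetical, and the real consumers also run the transformation library and manage parallelism:

  import java.time.Duration;
  import java.util.*;
  import org.apache.kafka.clients.consumer.*;
  import org.apache.kafka.clients.producer.*;

  public class EventConsumer {
    public static void main(String[] args) {
      Properties cc = new Properties();
      cc.put("bootstrap.servers", "broker1:9092");  // hypothetical broker
      cc.put("group.id", "event-consumers");        // hypothetical group
      cc.put("enable.auto.commit", "false");        // offsets are committed by hand
      cc.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      cc.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

      Properties pc = new Properties();
      pc.put("bootstrap.servers", "broker1:9092");
      pc.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      pc.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

      Set<String> seen = new HashSet<>(); // toy de-dup; a real one is bounded and persistent
      try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(cc);
           KafkaProducer<String, byte[]> dlq = new KafkaProducer<>(pc)) {
        consumer.subscribe(Collections.singletonList("events"));
        while (true) {
          for (ConsumerRecord<String, byte[]> rec : consumer.poll(Duration.ofSeconds(1))) {
            if (!seen.add(rec.key())) continue;  // de-duplicate on event id
            try {
              process(rec.value());              // transform + write to downstream systems
            } catch (Exception e) {
              // dead letter queue: park the bad record instead of blocking the stream
              dlq.send(new ProducerRecord<>("events.dlq", rec.key(), rec.value()));
            }
          }
          consumer.commitSync(); // commit offsets only after the batch is handled
        }
      }
    }

    static void process(byte[] payload) { /* parse, transform, index, … */ }
  }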


inside a consumer


metrics and time series


aggregation
• mostly for time series metrics
• two halves: on write and on query
• data model: (dimensions) => (aggregates)
• on write
  – reduce(a: A, b: A) => A over window (see the sketch below)
  – store “base” aggregates, all associative and commutative
• on query
  – perform same aggregate or derivative aggregates
  – group by the same dimensions
  – we use SQL (Impala)
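
a minimal java sketch of a base aggregate whose reduce is associative and commutative; the Agg class and its fields are hypothetical, matching the count/sum/min/max/last set used later in the deck:

  // hypothetical base aggregate
  class Agg {
    long count;
    double sum;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    long lastTs;  // observation time of the "last" value
    double last;

    static Agg of(long ts, double v) {
      Agg a = new Agg();
      a.count = 1; a.sum = v; a.min = v; a.max = v; a.lastTs = ts; a.last = v;
      return a;
    }

    // associative and commutative: partial aggregates from any worker,
    // merged in any order, produce the same result
    static Agg reduce(Agg a, Agg b) {
      Agg r = new Agg();
      r.count = a.count + b.count;
      r.sum = a.sum + b.sum;
      r.min = Math.min(a.min, b.min);
      r.max = Math.max(a.max, b.max);
      // "last" is made order-insensitive by carrying the observation timestamp
      Agg newer = a.lastTs >= b.lastTs ? a : b;
      r.lastTs = newer.lastTs; r.last = newer.last;
      return r;
    }
  }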


aside: late arriving data (it’s a thing)
• never trust a (wall) clock
• the producer determines observation time; the rest of the system always uses it
• data that shows up late is always processed according to observation time
• aggregation consequences
  – the same time window can appear multiple times
  – solution: aggregate every N seconds, potentially generating multiple aggregates for the same time bin (see the sketch below)
• this is real and you must deal with it
  – do what we did, or
  – build a system that mutates/replaces aggregates already output (eww), or
  – delay aggregate output for some slop time and drop data that arrives later than that
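
a sketch of the “aggregate every N seconds” approach, building on the hypothetical Agg sketch above; the Windower name and scheduler wiring are assumptions, not rocana’s actual code:

  import java.util.*;

  // flush partial aggregates every N seconds: a late event for an old bin
  // simply starts a new partial for that bin, emitted on the next flush,
  // and the query-time GROUP BY rolls the partials back together
  class Windower {
    final Map<Long, Agg> open = new HashMap<>(); // time bin -> partial aggregate
    final long binMs = 60_000;                   // one-minute bins, as in the deck

    void observe(long ts, double value) {
      long bin = ts - (ts % binMs);              // observation time, never arrival time
      open.merge(bin, Agg.of(ts, value), Agg::reduce);
    }

    void flush() {                               // called every N seconds by a scheduler
      for (Map.Entry<Long, Agg> e : open.entrySet())
        emit(e.getKey(), e.getValue());
      open.clear();
    }

    void emit(long bin, Agg agg) { /* append the partial aggregate to storage */ }
  }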


ex: service event volume by host and minute
• dimensions: ts, window, location, host, service, metric
• on write, aggregates: count, sum, min, max, last
• epoch, 60000, us-west-2a, w2a-demo-1, sshd, event_volume => 17, 42, 1, 10, 8
• on query:

  SELECT floor(ts / 60000) AS bin, host, service, metric, sum(value_sum)
  FROM events
  WHERE ts BETWEEN x AND y
    AND metric = 'event_volume'
  GROUP BY bin, host, service, metric

• if late arriving data existed in events, the same dimensions would repeat with another set of aggregates and would be rolled up as a result of the group by
• tl;dr: normal window aggregation operations


extension, pain, and advice


extending the system
• custom producers
• custom consumers
• event types
• parser / transformation plugins
• custom metric definition and aggregate functions
• custom processing jobs on landed data


pain (aka: the struggle is real)
• lots of tradeoffs when picking a stream processing solution
  – samza: right features, but a low-level programming model; not supported by vendors; missing security features.
  – storm: too rigid, too slow. not supported by all Hadoop vendors.
  – spark streaming: tons of issues initially, but lots of community energy. improving.
  – @digitallogic: “my heart says samza, but my head says spark streaming.”
  – our (current) needs are meager; we do the work inside consumers.
• stack complexity, (relative im)maturity
• scaling solr cloud to billions of events per day


if you’re going to try this…
• read all the literature on stream processing [1]
• treat it like the distributed systems problem it is
• understand, make, and make good on guarantees
• find the right abstractions
• never trust the hand waving or “hello worlds”
• fully evaluate the projects/products in this space
• understand it’s not just about search

[1] wait, like all of it? yea, like all of it.


things I didn’t talk about
• reprocessing data when bad code / transformations are detected
• dealing with data quality issues (“the struggle is real” part 2)
• the user interface and all the fancy analytics
  – data visualization and exploration
  – event search
  – anomalous trend and event detection
  – metric, source, and event correlation
  – motif finding
  – noise reduction and dithering
• event delivery semantics (e.g. at least once, exactly once, etc.)
• alerting


questions?

thank you.

@esammer | esammer@rocana.com
