simon-belak
• Started in machine learning
• Turned to data science and helped 20+ companies become data-driven
• Now leading data science department at GoOpti
Self-service infrastructure for data scientists
The analytics chasm
• < 2 min: ideal. Almost real-time, can be done during brainstorming without disrupting flow
• < 20 min: squeeze in somewhere in the day
• Project: roadmap ahoy!
• Fail
My go-to architecture
[Diagram: Kafka, DB, Events, Onyx]
Persist all events to S3
• time travel
• query with AWS Athena
Onyx
a masterless, cloud scale, fault tolerant, high performance distributed computation system
… written entirely in Clojure
Clojure at a glance
• Lisp running on JVM
• Functional, dynamic, immutable
• Excellent concurrency and state management support
• Unparalleled data manipulation
• Good Java interoperability
Onyx at
• In production for almost a year
• ETL
• online machine learning
• offline (batch) machine learning
• ad-hoc analysis
Onyx at a glance
Job = workflow + flow conditions + catalogue

Workflow
[[:input :processing-1]
 [:input :processing-2]
 [:processing-1 :output-1]
 [:processing-2 :output-2]]

Flow conditions
[{:flow/from :input-stream
  :flow/to [:process-adults]
  :flow/predicate :my.ns/adult?
  :flow/doc "Emits segment if an adult."}]
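The `:flow/predicate` entry names an ordinary Clojure function. A minimal sketch of what `:my.ns/adult?` could look like, assuming segments carry an `:age` key and that adulthood starts at 18 (neither is stated on the slide). Onyx calls flow predicates with four arguments: the event map, the incoming segment, the newly produced segment, and all segments produced from the incoming one.

```clojure
;; Would live in the namespace my.ns so that :my.ns/adult? resolves to it.
;; The :age key and the threshold of 18 are illustrative assumptions.
(defn adult?
  [event old-segment new-segment all-new-segments]
  (>= (:age new-segment) 18))

;; Plain function, so it can be called directly at the REPL:
(adult? nil nil {:name "Ana" :age 30} nil)
(adult? nil nil {:name "Bor" :age 12} nil)
```

Because the predicate is referenced by keyword, the flow condition itself stays pure data and can be stored, diffed, or generated.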
Catalogue
[{:onyx/name :add-5
  :onyx/fn :my/adder
  :onyx/type :function
  :my/n 5
  :onyx/params [:my/n]}
 {:onyx/name :in
  :onyx/plugin :onyx.plugin.core-async/input
  :onyx/type :input
  :onyx/medium :core.async
  :onyx/batch-size batch-size
  :onyx/max-peers 1
  :onyx/doc "Reads segments from a core.async channel"}
 {:onyx/name :out
  :onyx/plugin :onyx.plugin.core-async/output
  :onyx/type :output
  :onyx/medium :core.async
  :onyx/doc "Writes segments to a core.async channel"}]
Vanilla Clojure function
(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))
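To see how the catalogue entry and the function fit together: the keys listed in `:onyx/params` are looked up in the entry and passed as leading arguments, with the segment last. The sketch below imitates that with a hand-written helper; `apply-task` is a hypothetical stand-in for illustration, not part of Onyx.

```clojure
(defn adder [n {:keys [x] :as segment}]
  (assoc segment :x (+ n x)))

(def catalog-entry
  {:onyx/name   :add-5
   :onyx/fn     :my/adder
   :onyx/type   :function
   :my/n        5
   :onyx/params [:my/n]})

;; Hypothetical helper: resolve :onyx/params against the entry and
;; call f with those values followed by the segment.
(defn apply-task [f entry segment]
  (apply f (concat (map entry (:onyx/params entry)) [segment])))

(apply-task adder catalog-entry {:x 1 :id "a"})
```

Changing `:my/n` in the catalogue changes the behaviour without touching the function, which is what makes parameters useful as a tuning knob.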
Plugins (I/O)
seq, async, Kafka, Datomic, SQL, S3, SQS, …
[Slide callouts: parameter, self-documenting]
Computation entirely described with data
Data is code!
Everything can be run locally!
Testing without mocking
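Since a workflow is just a vector of edges, it can be walked with plain Clojure. The following is a toy interpreter, not Onyx's runtime or API: `task-fns`, `downstream`, and `run-local` are hypothetical names, and the two processing functions are made up. It only illustrates why a computation described as data is easy to exercise locally without mocks.

```clojure
(def workflow
  [[:input :processing-1]
   [:input :processing-2]
   [:processing-1 :output-1]
   [:processing-2 :output-2]])

;; Hypothetical task functions; tasks without an entry pass segments through.
(def task-fns
  {:processing-1 (fn [segment] (update segment :x inc))
   :processing-2 (fn [segment] (update segment :x * 10))})

(defn downstream
  "Tasks that receive output from `task` according to the workflow edges."
  [workflow task]
  (for [[from to] workflow :when (= from task)] to))

(defn run-local
  "Pushes one segment through the graph starting at `task`.
  Returns a map of leaf task -> vector of segments that reached it."
  [workflow task-fns task segment]
  (let [segment' ((get task-fns task identity) segment)
        nexts    (downstream workflow task)]
    (if (empty? nexts)
      {task [segment']}
      (apply merge-with into
             (map #(run-local workflow task-fns % segment') nexts)))))

(run-local workflow task-fns :input {:x 1})
;; => {:output-1 [{:x 2}], :output-2 [{:x 10}]}
```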
Resilience and handling state
• Activity log
• Window and trigger states checkpointed
• Resume points
• Configurable flux policies
How Onyx rewired my brain
It’s not about scaling, but clean architecture
Decomplect everything
Computation graphs
Machine learning with Onyx
• Hyperparameter server built on top of Onyx parameters
• Batch & streaming mode
• Monitoring: performance metrics, side channels for partial results/introspection into computation
• Everything is data so easy to build tools around
Onyx/Pyroclast
Putting “data is code” to work
Describing data with clojure.spec
Composing smaller parts into the whole
Code is data!
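A minimal sketch of composing small specs into a larger one with `clojure.spec.alpha`. The spec names and predicates are illustrative assumptions (loosely echoing an order with a user and an optional promo code), not the specs from the talk.

```clojure
(require '[clojure.spec.alpha :as s])

;; Small parts (names and predicates are assumptions for illustration)
(s/def :user/account-age nat-int?)
(s/def :user/country string?)
(s/def :order/promo-code string?)

;; ... composed into larger ones
(s/def :order/user (s/keys :req [:user/account-age :user/country]))
(s/def :shop/order (s/keys :req [:order/user] :opt [:order/promo-code]))

(def sample-order
  {:order/user {:user/account-age 42 :user/country "SI"}})

;; Required parts present, optional promo code absent
(s/valid? :shop/order sample-order)

;; An optional key is still checked when present: 7 is not a string
(s/valid? :shop/order (assoc sample-order :order/promo-code 7))
```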
Queryable data descriptions
Turn spec into a graph
A fully interactive and open type system!
[Graph: nodes order, promo code, user, account age, country; edges labelled always, always, always, maybe]
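One way to turn specs into a queryable graph is to read them back with `s/form` and emit labelled edges. The sketch below handles only the `:req` and `:opt` keys of `s/keys` specs and ignores everything else (`:req-un`, `or`/`and` groups, collections). The spec names and the helpers `spec-edges` and `spec-graph` are hypothetical.

```clojure
(require '[clojure.spec.alpha :as s])

;; Illustrative specs (assumptions, not from the talk)
(s/def :user/account-age nat-int?)
(s/def :user/country string?)
(s/def :order/promo-code string?)
(s/def :order/user (s/keys :req [:user/account-age :user/country]))
(s/def :shop/order (s/keys :req [:order/user] :opt [:order/promo-code]))

(defn spec-edges
  "Returns [parent child :always|:maybe] triples for a registered s/keys spec,
  or nil for any other kind of spec."
  [spec-name]
  (let [form (s/form spec-name)]
    (when (and (seq? form) (= 'clojure.spec.alpha/keys (first form)))
      (let [{:keys [req opt]} (apply hash-map (rest form))]
        (concat (for [k req] [spec-name k :always])
                (for [k opt] [spec-name k :maybe]))))))

(defn spec-graph
  "All edges reachable from `root`, depth first."
  [root]
  (let [edges (spec-edges root)]
    (concat edges (mapcat (fn [[_ child _]] (spec-graph child)) edges))))

;; The graph is plain data, so ordinary sequence functions query it:
(filter (fn [[_ _ label]] (= :maybe label)) (spec-graph :shop/order))
```

Required keys become `:always` edges and optional keys become `:maybe` edges, matching the edge labels on the slide.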
“Composition is about decomposing.”
— E. Normand
Case study: autogenerating materialised views
[Diagram: Kafka, Events, External data, Materialised views]
Automatic view generation
• Event & attribute ontology
  • Manual (via spec)
  • Inferred
• Statistical analysis (seasonality detection, outlier removal, …)
[Diagram: Onyx jobs]
Automatic view generation
1. Walk spec registry
2. Apply rules
   1. Define new view (spec)
   2. Trigger Onyx job that creates the view
⤾
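The loop above can be sketched in a few lines. This is an assumption-laden illustration, not the author's implementation: the rule format (`:rule/applies?`, `:rule/view`), the view description keys, and `generate-views` are all hypothetical, and submitting the Onyx job is deliberately left out.

```clojure
(require '[clojure.spec.alpha :as s])

;; Illustrative event attributes (assumptions)
(s/def :order/amount number?)
(s/def :order/country string?)

;; Hypothetical rule: every numeric attribute gets a running-sum view.
(def rules
  [{:rule/applies? (fn [form] (= form 'clojure.core/number?))
    :rule/view     (fn [k]
                     {:view/name        (keyword "view" (str (name k) "-sum"))
                      :view/source      k
                      :view/aggregation :sum})}])

(defn generate-views
  "Step 1 and 2: walk a spec registry and apply every rule to every spec.
  Returns view descriptions as data; turning them into Onyx jobs is not shown."
  [registry rules]
  (for [[k spec] registry
        :when (keyword? k)
        rule rules
        :when ((:rule/applies? rule) (s/form spec))]
    ((:rule/view rule) k)))

;; Restrict the walk to our own specs; (s/registry) also holds core specs.
(generate-views (select-keys (s/registry) [:order/amount :order/country])
                rules)
```

Because each generated view is itself described as data, it could be registered as a new spec and picked up by the next pass, which is one reading of the loop arrow on the slide.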
Takeaways
Everything should be live and interactive
Computation graphs are a great way to structure data processing code
Queryable data and computation descriptions supercharge interactive development and are a great building block for automation
viebel.github.io/klipse/examples/onyx.html
onyxplatform.org
onyxplatform.org/jekyll/update/2017/02/08/Pyroclast-Preview-Simulation.html