Building data processing apps in Scala: the Snowplow experience
London Scala Users’ Group
Building data processing apps in Scala
1. Snowplow – what is it?
2. Snowplow and Scala
3. Deep dive into our Scala code
4. Modularization and non-Snowplow code you can use
5. Roadmap
6. Questions
7. Appendix: even more roadmap
Snowplow – what is it?
Today, Snowplow is primarily an open source web analytics platform
[Diagram: the website / webapp feeds the Snowplow data pipeline (collect, then transform and enrich), landing data in Amazon Redshift / PostgreSQL and Amazon S3.]
• Your granular, event-level and customer-level data, in your own data warehouse
• Connect any analytics tool to your data
• Join your web analytics data with any other data set
Snowplow was born out of our frustration with traditional web analytics tools…
• Limited set of reports that don’t answer business questions
  • Traffic levels by source
  • Conversion levels
  • Bounce rates
  • Pages / visit
• Web analytics tools don’t understand the entities that matter to business
  • Customers, intentions, behaviours, articles, videos, authors, subjects, services…
  • …vs pages, conversions, goals, clicks, transactions
• Web analytics tools are siloed
  • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
…and out of the opportunities to tame new “big data” technologies
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable
1. Trackers → 2. Collectors → 3. Enrich → 4. Storage → 5. Analytics, connected by standardised data protocols (A–D)
• Trackers generate event data. Examples: JavaScript tracker; Python / Lua / No-JS / Arduino trackers
• Collectors receive data from trackers and log it to S3. Examples: CloudFront collector; Clojure collector for Amazon Elastic Beanstalk
• Enrich cleans and enriches the raw data. Built on Scalding / Cascading / Hadoop and powered by Amazon EMR
• Storage holds the data ready for analysis. Examples: Amazon Redshift; PostgreSQL; Amazon S3
• The pipeline is batch-based: normally run overnight, sometimes every 4-6 hours
Snowplow and Scala
Our initial skunkworks version of Snowplow had no Scala
[Diagram: Snowplow data pipeline v1 – the JavaScript event tracker on the website / webapp sends events to a CloudFront-based pixel collector; a HiveQL + Java UDF “ETL” processes the logs into Amazon S3.]
But our schema-first, loosely coupled approach made it possible to start swapping out existing components…
[Diagram: Snowplow data pipeline v2 – the JavaScript event tracker on the website / webapp sends events to the CloudFront-based event collector or the Clojure-based event collector; the Scalding-based enrichment (alongside the original HiveQL + Java UDF “ETL”) writes the data to Amazon S3 and Amazon Redshift / PostgreSQL.]
What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:
[Diagram: the stack – Scalding, Cascalog, PyCascading and cascading.jruby are DSLs/APIs on top of Cascading (Java), which sits alongside Hive and Pig on Hadoop MapReduce and Hadoop DFS.]
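For a flavour of what Scalding code looks like, here is the canonical word-count job from the Scalding documentation (not Snowplow code); the whole pipeline compiles down to Cascading flows running on Hadoop:

```scala
import com.twitter.scalding._

// The canonical Scalding example: read lines, split into words, count per word.
// Each step (flatMap, groupBy, write) becomes part of a Cascading flow on Hadoop.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```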
We chose Cascading because we liked its “plumbing” abstraction over vanilla MapReduce
Why did we choose Scalding instead of one of the other Cascading DSLs/APIs?
• Lots of internal experience with Scala – could hit the ground running (only very basic awareness of Clojure when we started the project)
• Scalding created and supported by Twitter, who use it throughout their organization – so we knew it was a safe long-term bet
• More controversial opinion (although maybe not at a Scala UG): we believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing
Strongly typed data pipelines – why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
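The code from the original slide is not reproduced here; as an illustration of the style, here is a sketch using Scalding’s typed API (the PageView class and field names are made up for this example):

```scala
import com.twitter.scalding._

// Hypothetical event type – the real Snowplow event has many more fields
case class PageView(userId: String, pageUri: String, timestamp: Long)

class PageViewsPerUriJob(args: Args) extends Job(args) {
  // TypedPipe makes the types flowing through each step explicit, so a
  // mismatched field is a compile error rather than a runtime surprise
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (user, uri, ts) => PageView(user, uri, ts) }
    .groupBy(_.pageUri)
    .size
    .write(TypedTsv[(String, Long)](args("output")))
}
```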
Deep dive into our Scala code
The secret sauce for data processing in Scala: the Scalaz Validation (1/3)
• Our basic processing model for Snowplow looks like this:
[Diagram: raw events flow into the Scalding enrichment process, which emits “good” enriched events plus “bad” raw events together with the reasons why they are bad.]
• This fits incredibly well onto the Validation applicative functor from the Scalaz project
The secret sauce for data processing in Scala: the Scalaz Validation (2/3)
• We were able to express our data flow in terms of some relatively simple types:
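The slide showed our real type aliases; as a stand-in, here is a minimal sketch of their shape, assuming Scalaz 7.1 syntax (the real aliases wrap Snowplow’s own event classes, not these made-up ones):

```scala
import scalaz._
import Scalaz._

object EnrichSketch {
  // Made-up stand-ins for the real Snowplow event classes
  case class RawEvent(line: String)
  case class EnrichedEvent(userId: String, pageUri: String)

  // ValidationNel accumulates failures in a NonEmptyList rather than failing fast:
  // a raw event becomes Success(enriched event) or Failure(all the reasons it is bad)
  type ValidatedEvent = ValidationNel[String, EnrichedEvent]

  def enrich(raw: RawEvent): ValidatedEvent =
    raw.line.split("\t") match {
      case Array(user, uri) => EnrichedEvent(user, uri).successNel
      case _                => s"Expected 2 tab-separated fields, got: ${raw.line}".failureNel
    }
}
```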
The secret sauce for data processing in Scala: the Scalaz Validation (3/3)
• Scalaz Validation lets us do a variety of different validations and enrichments, and then collate the failures
• This is really powerful!
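Continuing the sketch above (still made-up field checks, Scalaz 7.1 syntax): independent validations are combined applicatively with |@|, so every failure is reported, not just the first one:

```scala
import scalaz._
import Scalaz._

object CollateSketch {
  case class EnrichedEvent(userId: String, pageUri: String)

  def validUser(s: String): ValidationNel[String, String] =
    if (s.nonEmpty) s.successNel else "Missing user ID".failureNel

  def validUri(s: String): ValidationNel[String, String] =
    if (s.startsWith("http")) s.successNel else s"Not a URI: [$s]".failureNel

  // |@| combines the Validations applicatively: if both succeed we build the event;
  // if either (or both) fail, all failure messages are collated in one NonEmptyList
  def enrich(user: String, uri: String): ValidationNel[String, EnrichedEvent] =
    (validUser(user) |@| validUri(uri)) { EnrichedEvent(_, _) }

  // enrich("", "foo") == Failure(NonEmptyList("Missing user ID", "Not a URI: [foo]"))
}
```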
On the testing side: we love Specs2 data tables…
• They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
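The slide showed one of our real tables; as a stand-alone illustration of the pattern (against a trivial made-up function rather than our enrichment code):

```scala
import org.specs2.mutable.Specification
import org.specs2.matcher.DataTables

// Each row of the table is an input/expected-output pair; the closure at the end
// is run against every row, so adding a test case is just adding a line
class AdditionSpec extends Specification with DataTables {

  "integer addition" should {
    "hold for a range of examples" in {
      "a" | "b" | "sum" |>
       1  !  1  !  2    |
       2  !  3  !  5    |
       0  !  0  !  0    |
      { (a: Int, b: Int, sum: Int) => (a + b) must_== sum }
    }
  }
}
```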
… and are starting to do more with ScalaCheck
• ScalaCheck is a property-based testing framework, originally inspired by Haskell’s QuickCheck
• We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
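Again, the real property from the slide is not reproduced; a minimal sketch of the style (the enrich function here is a trivial stand-in, not the Snowplow one):

```scala
import org.scalacheck.Properties
import org.scalacheck.Prop.forAll

object EnrichProps extends Properties("enrich") {

  // Trivial stand-in for the real enrichment step
  def enrich(line: String): Either[String, Array[String]] =
    line.split("\t") match {
      case fields if fields.length == 2 => Right(fields)
      case _                            => Left(s"Expected 2 fields: $line")
    }

  // ScalaCheck generates hundreds of arbitrary (often nasty) strings for us
  property("never throws on arbitrary input") = forAll { (line: String) =>
    scala.util.Try(enrich(line)).isSuccess
  }

  property("lines without a tab are rejected") = forAll { (line: String) =>
    line.contains("\t") || enrich(line).isLeft
  }
}
```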
Build and deployment: we have learnt to love (or at least peacefully co-exist with) SBT
• .scala based SBT build, not .sbt
• We use sbt assembly to create a fat jar for our Scalding ETL process – with some custom exclusions to play nicely on Amazon Elastic MapReduce
• Deployment is incredibly easy compared to the pain we have had with our two Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
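For illustration, the kind of exclusion we mean looks roughly like this in a .scala build definition using the sbt-assembly plugin (the jar filter here is illustrative, not our actual build file):

```scala
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._

object BuildSettings {
  // Sketch: drop jars that Amazon EMR already provides on the cluster classpath,
  // so they are not bundled into the fat jar produced by `sbt assembly`
  lazy val sbtAssemblySettings = assemblySettings ++ Seq(
    excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
      cp filter { jar => jar.data.getName.startsWith("hadoop-") } // illustrative filter
    }
  )
}
```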
Modularization and non-Snowplow code you can use
We try to make our validation and enrichment process as modular as possible
[Diagram: the Enrichment Manager and its enrichments, some of which are not yet integrated.]
• This encourages testability and re-use – and it widens the pool of potential contributors, versus embedding this functionality inside Snowplow
• The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects:
We also have a few standalone Scala projects which might be of interest
• None of these projects assume that you are running Snowplow:
Snowplow roadmap
[Diagram: the unified log architecture. Narrow data silos – CMS, CRM, e-commerce, ERP, search, email marketing and other SaaS vendors – feed a unified log via streaming APIs / web hooks, hosted in your cloud vendor / own data center. On the high-latency side, Hadoop archives the eventstream for wide data coverage and full data history (ad hoc analytics, management reporting); on the low-latency side, local loops drive product recommendations, fraud detection, churn prevention, systems monitoring and APIs.]
We want to move Snowplow to a unified log-based architecture
Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2)
• In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness:
[Diagram: pre-0.8.12, the record-level enrichment functionality lived inside hadoop-etl; from 0.8.12 it lives in scala-common-enrich, which is shared by scala-hadoop-enrich and scala-kinesis-enrich.]
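Conceptually the split looks like this (hypothetical names and signatures, not the real scala-common-enrich API):

```scala
// scala-common-enrich: a pure, harness-agnostic enrichment function (sketch only)
object CommonEnrich {
  import scalaz._
  import Scalaz._

  def enrichEvent(rawLine: String): ValidationNel[String, String] =
    if (rawLine.trim.nonEmpty) rawLine.successNel // stand-in for the real enrichments
    else "Empty event line".failureNel
}

// scala-hadoop-enrich wraps the same function in a Scalding job;
// scala-kinesis-enrich calls it for each record read from the Kinesis stream.
```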
Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis:
[Diagram: Snowplow Trackers send events to the Scala Stream Collector, which writes them to a raw event stream; the Enrich Kinesis app reads that stream and writes to an enriched event stream plus a bad raw events stream; an S3 sink Kinesis app and a Redshift sink Kinesis app load the enriched events into S3 and Redshift respectively.]
• The parts in grey are still under development – we are working with Snowplow community members on these collaboratively
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To have a coffee or beer and talk Scala/data – @alexcrdean or
Appendix: even more roadmap!
Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (1/3)
• Our current approach involves a “Tracker Protocol” which is defined in a wiki page, processed in the Enrichment Manager and then written out to TSV files for loading into Redshift and Postgres (see over)
Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (3/3)
• We are planning to replace the existing flow with a JSON Schema-driven approach:
[Diagram: the Enrichment Manager takes raw events in JSON format; a JSON Schema (1) defines the structure of those raw events, (2) validates them in the Enrichment Manager, (3) defines the structure of the enriched events in Thrift or Avro format, (4) drives the Shredder, and (5) defines the structure of the enriched events in TSV, ready for loading into the db.]