LSUG talk - Building data processing apps in Scala, the Snowplow experience

DESCRIPTION

I talked to the London Scala Users' Group about building Snowplow, an open source event analytics platform, on top of Scala and key libraries and frameworks including Scalding, Scalaz and Spray. The talk highlights some of the data processing tricks and techniques we picked up along the way, in particular: schema-first development; monadic ETL; data table-based testing; and data transformation maps. It also introduces some of the Scala libraries the Snowplow team has open sourced along the way (such as scala-forex, referer-parser and scala-maxmind-geoip).

Building data processing apps in Scala: the Snowplow experience

London Scala Users’ Group

Building data processing apps in Scala

1. Snowplow – what is it?

2. Snowplow and Scala

3. Deep dive into our Scala code

4. Modularization and non-Snowplow code you can use

5. Roadmap

6. Questions

7. Appendix: even more roadmap

Snowplow – what is it?

Today, Snowplow is primarily an open source web analytics platform

[Diagram: website / webapp → Snowplow data pipeline (collect → transform and enrich) → storage in Amazon S3 and Amazon Redshift / PostgreSQL]

• Your granular, event-level and customer-level data, in your own data warehouse

• Connect any analytics tool to your data

• Join your web analytics data with any other data set

Snowplow was born out of our frustration with traditional web analytics tools…

• Limited set of reports that don’t answer business questions
  • Traffic levels by source
  • Conversion levels
  • Bounce rates
  • Pages / visit

• Web analytics tools don’t understand the entities that matter to business
  • Customers, intentions, behaviours, articles, videos, authors, subjects, services…
  • …vs pages, conversions, goals, clicks, transactions

• Web analytics tools are siloed
  • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)

…and out of the opportunities to tame new “big data” technologies

These tools make it possible to capture, transform, store and analyse all of your granular, event-level data, so you can perform any analysis

Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable

1. Trackers → 2. Collectors → 3. Enrich → 4. Storage → 5. Analytics, linked by standardised data protocols (A–D)

1. Trackers – generate event data. Examples: JavaScript tracker; Python / Lua / No-JS / Arduino trackers

2. Collectors – receive data from trackers and log it to S3. Examples: CloudFront collector; Clojure collector for Amazon Elastic Beanstalk

3. Enrich – clean and enrich raw data. Built on Scalding / Cascading / Hadoop and powered by Amazon EMR

4. Storage – store data ready for analysis. Examples: Amazon Redshift; PostgreSQL; Amazon S3

• The pipeline is batch-based – normally run overnight, sometimes every 4–6 hours

Snowplow and Scala

Our initial skunkworks version of Snowplow had no Scala

[Diagram: website / webapp with a JavaScript event tracker → CloudFront-based pixel collector → HiveQL + Java UDF “ETL” → Amazon S3]

But our schema-first, loosely coupled approach made it possible to start swapping out existing components…

[Diagram: website / webapp with a JavaScript event tracker → CloudFront-based event collector or Clojure-based event collector → Scalding-based enrichment (replacing the HiveQL + Java UDF “ETL”) → Amazon S3 → Amazon Redshift / PostgreSQL]

What is Scalding?

• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:

[Diagram: Hadoop DFS → Hadoop MapReduce → Cascading (Java), alongside Hive and Pig → DSLs/APIs on top of Cascading: Scalding, Cascalog, PyCascading, cascading.jruby]

We chose Cascading because we liked its “plumbing” abstraction over vanilla MapReduce
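
• For a flavour of that “plumbing” style, here is a minimal Scalding job in the fields-based API – an illustrative word count, not Snowplow code:

import com.twitter.scalding._

// Reads lines from an input file, tokenises them and counts word frequencies
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}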

Why did we choose Scalding instead of one of the other Cascading DSLs/APIs?

• Lots of internal experience with Scala – could hit the ground running (only very basic awareness of Clojure when we started the project)

• Scalding created and supported by Twitter, who use it throughout their organization – so we knew it was a safe long-term bet

• More controversial opinion (although maybe not at a Scala UG): we believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing

Strongly typed data pipelines – why?

• Catch errors as soon as possible – and report them in a strongly typed way too

• Define the inputs and outputs of each of your data processing steps in an unambiguous way

• Forces you to formally address the data types flowing through your system

• Lets you write code like this:
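
• The code from the slide isn’t reproduced here; in the same spirit, a small illustrative sketch (hypothetical event types, not the actual Snowplow code) using Scalding’s typed API, where the function signature documents exactly what the step consumes and produces:

import com.twitter.scalding.typed.TypedPipe

// Hypothetical event types, for illustration only
case class RawEvent(timestamp: Long, payload: String)
case class EnrichedEvent(timestamp: Long, payload: String, geo: Option[String])

// A strongly typed pipeline step: the compiler checks that only RawEvents
// flow in and only EnrichedEvents flow out
def enrich(raw: TypedPipe[RawEvent]): TypedPipe[EnrichedEvent] =
  raw.map(e => EnrichedEvent(e.timestamp, e.payload, geo = None)) // enrichment stubbed out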

Deep dive into our Scala code

The secret sauce for data processing in Scala: the Scalaz Validation (1/3)

• Our basic processing model for Snowplow looks like this:

• This fits incredibly well onto the Validation applicative functor from the Scalaz project

[Diagram: raw events → Scalding enrichment process → “good” enriched events, plus “bad” raw events with the reasons why they are bad]

The secret sauce for data processing in Scala: the Scalaz Validation (2/3)

• We were able to express our data flow in terms of some relatively simple types:
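
• The slide’s type definitions aren’t reproduced here; as a simplified sketch of the idea (names approximate, not the actual Snowplow types), each raw event either becomes an enriched event or fails with a non-empty list of error messages:

import scalaz._

object EnrichTypes {
  // Hypothetical, simplified event types
  case class RawEvent(querystring: String)
  case class EnrichedEvent(pageUrl: String, userId: String)

  // Success carries the enriched event; Failure carries one or more reasons
  type ValidatedEvent = ValidationNel[String, EnrichedEvent]

  def enrich(raw: RawEvent): ValidatedEvent =
    if (raw.querystring.isEmpty)
      Failure(NonEmptyList("Empty querystring"))
    else
      Success(EnrichedEvent(pageUrl = "http://example.com", userId = "abc123"))
}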

The secret sauce for data processing in Scala: the Scalaz Validation (3/3)

• Scalaz Validation lets us do a variety of different validations and enrichments, and then collate the failures

• This is really powerful!
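
• As an illustration of the collation (a sketch, not the actual Snowplow code): each field is validated independently, and the applicative |@| syntax gathers every failure into one NonEmptyList instead of stopping at the first error:

import scalaz._
import Scalaz._

object Collation {
  case class Event(userId: String, timestamp: Long)

  def validateUserId(raw: String): ValidationNel[String, String] =
    if (raw.nonEmpty) Success(raw) else Failure(NonEmptyList("Missing user ID"))

  def validateTimestamp(raw: String): ValidationNel[String, Long] =
    try Success(raw.toLong)
    catch { case _: NumberFormatException => Failure(NonEmptyList(s"Bad timestamp: $raw")) }

  // Both validations run; their failures are collated rather than short-circuited
  def validateEvent(userId: String, timestamp: String): ValidationNel[String, Event] =
    (validateUserId(userId) |@| validateTimestamp(timestamp)) { Event(_, _) }
}

// Collation.validateEvent("", "not-a-number")
//   => a Failure collating both "Missing user ID" and "Bad timestamp: not-a-number"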

On the testing side: we love Specs2 data tables…

• They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
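
• The slide’s test code isn’t reproduced here; a small illustrative Specs2 data table (toy function, not a Snowplow enrichment) looks like this – each row is an input plus its expected output, so new cases are just new rows:

import org.specs2.mutable.Specification
import org.specs2.matcher.DataTables

class StatusSpec extends Specification with DataTables {

  // Toy stand-in for an enrichment helper
  def isSuccessful(status: Int): Boolean = status >= 200 && status < 300

  "isSuccessful" should {
    "classify HTTP status codes" in {
      "status" | "expected" |>
       200     ! true       |
       204     ! true       |
       404     ! false      |
       500     ! false      | { (status, expected) =>
        isSuccessful(status) must_== expected
      }
    }
  }
}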

… and are starting to do more with ScalaCheck

• ScalaCheck is a property-based testing framework, originally inspired by Haskell’s QuickCheck

• We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
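
• Again, the real properties aren’t reproduced here; a minimal illustrative ScalaCheck property (toy parser, not the Snowplow schema checks) generates deliberately messy strings and asserts that the parsing step degrades gracefully instead of throwing:

import org.scalacheck.{Gen, Properties}
import org.scalacheck.Prop.forAll
import org.scalacheck.Arbitrary.arbitrary

object ParsingProps extends Properties("parsing") {

  // Toy stand-in for a raw-event field parser
  def parseTimestamp(raw: String): Option[Long] =
    try Some(raw.toLong) catch { case _: NumberFormatException => None }

  // A mix of empty, alphabetic, numeric and completely arbitrary strings
  val messyString: Gen[String] =
    Gen.oneOf(Gen.const(""), Gen.alphaStr, Gen.numStr, arbitrary[String])

  property("never throws on unpredictable input") = forAll(messyString) { s =>
    scala.util.Try(parseTimestamp(s)).isSuccess
  }
}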

Build and deployment: we have learnt to love (or at least peacefully co-exist with) SBT

• .scala based SBT build, not .sbt

• We use sbt assembly to create a fat jar for our Scalding ETL process – with some custom exclusions to play nicely on Amazon Elastic MapReduce

• Deployment is incredibly easy compared to the pain we have had with our two Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
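
• For reference, a minimal sketch of what a .scala-based build with sbt-assembly exclusions can look like (illustrative only: setting names vary across sbt-assembly releases, and these are not Snowplow’s actual dependencies or exclusions):

// project/ExampleBuild.scala
import sbt._
import Keys._
import sbtassembly.Plugin._
import AssemblyKeys._

object ExampleBuild extends Build {

  lazy val etl = Project("example-etl", file("."))
    .settings(
      organization := "com.example",
      scalaVersion := "2.10.4",
      libraryDependencies ++= Seq(
        "com.twitter"       %% "scalding-core" % "0.10.0",
        // Hadoop is provided by the EMR cluster, so keep it out of the fat jar
        "org.apache.hadoop" %  "hadoop-core"   % "1.2.1" % "provided"
      )
    )
    .settings(assemblySettings: _*)
    .settings(
      // Exclude jars that Elastic MapReduce already puts on the classpath
      excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
        cp filter { _.data.getName.startsWith("hadoop-") }
      }
    )
}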

Modularization and non-Snowplow code you can use

We try to make our validation and enrichment process as modular as possible

[Diagram: the Enrichment Manager and its pluggable enrichment components, some not yet integrated]

• This encourages testability and re-use – it also widens the pool of contributors compared with embedding this functionality directly in Snowplow

• The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects:

We also have a few standalone Scala projects which might be of interest

• None of these projects assume that you are running Snowplow:

Snowplow roadmap

[Diagram: the target unified log architecture, running in your cloud vendor or own data centre. Narrow data silos – search, e-commerce, CRM, email marketing, ERP and CMS, spread across SaaS vendors and in-house systems – plus systems monitoring and eventstream APIs feed a unified log via streaming APIs / web hooks. The log is archived to Hadoop, giving wide data coverage and full data history on the high-latency side, while low-latency local loops drive product recommendations, ad hoc analytics, management reporting, fraud detection and churn prevention.]

We want to move Snowplow to a unified log-based architecture

Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2)

[Diagram: pre-0.8.12, the record-level enrichment functionality lived inside hadoop-etl; from 0.8.12 it lives in scala-common-enrich, shared by scala-hadoop-enrich and scala-kinesis-enrich]

• In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness:

Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis:

[Diagram: Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → enriched event stream (plus a bad raw events stream) → S3 sink Kinesis app writing to S3, and Redshift sink Kinesis app writing to Redshift]

• Some of these components (shown in grey on the original slide) are still under development – we are working on them collaboratively with Snowplow community members

Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To have a coffee or beer and talk Scala/data – @alexcrdean or alex@snowplowanalytics.com

Appendix: even more roadmap!

Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (1/3)

• Our current approach involves a “Tracker Protocol” which is defined in a wiki page, processed in the Enrichment Manager and then written out to TSV files for loading into Redshift and Postgres (see over)

Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (3/3)

• We are planning to replace the existing flow with a JSON Schema-driven approach:

[Diagram: raw events in JSON format flow into the Enrichment Manager, which produces enriched events in Thrift or Avro format; the Shredder then produces enriched events in TSV, ready for loading into the database. A JSON Schema defining the events (1) defines the structure of the raw events, (2) validates events during enrichment, (3) defines the structure of the enriched events, (4) drives the shredding and (5) defines the structure of the TSV output.]
