
LSUG talk - Building data processing apps in Scala, the Snowplow experience

I talked to the London Scala Users' Group about building Snowplow, an open source event analytics platform, on top of Scala and key libraries and frameworks including Scalding, Scalaz and Spray. The talk highlights some of the data processing tricks and techniques picked up along the way, particularly: schema-first development; monadic ETL; datatable-based testing; and data transformation maps. It also introduces some of the Scala libraries the Snowplow team have open sourced along the way (such as scala-forex, referer-parser and scala-maxmind-geoip).

Page 1: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Building data processing apps in Scala: the Snowplow experience

London Scala Users’ Group

Page 2: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Building data processing apps in Scala

1. Snowplow – what is it?

2. Snowplow and Scala

3. Deep dive into our Scala code

4. Modularization and non-Snowplow code you can use

5. Roadmap

6. Questions

7. Appendix: even more roadmap

Page 3: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Snowplow – what is it?

Page 4: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Today, Snowplow is primarily an open source web analytics platform

Website / webapp → Snowplow data pipeline (Collect → Transform and enrich) → Amazon Redshift / PostgreSQL + Amazon S3

• Your granular, event-level and customer-level data, in your own data warehouse
• Connect any analytics tool to your data
• Join your web analytics data with any other data set

Page 5: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Snowplow was born out of our frustration with traditional web analytics tools…

• Limited set of reports that don’t answer business questions
  • Traffic levels by source
  • Conversion levels
  • Bounce rates
  • Pages / visit

• Web analytics tools don’t understand the entities that matter to business
  • Customers, intentions, behaviours, articles, videos, authors, subjects, services…
  • …vs pages, conversions, goals, clicks, transactions

• Web analytics tools are siloed
  • Hard to integrate with other data sets incl. digital (marketing spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)

Page 6: LSUG talk - Building data processing apps in Scala, the Snowplow experience

…and out of the opportunities to tame new “big data” technologies

These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis

Page 7: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable

1. Trackers – generate event data. Examples: JavaScript tracker; Python / Lua / No-JS / Arduino trackers

2. Collectors – receive data from trackers and log it to S3. Examples: CloudFront collector; Clojure collector for Amazon Elastic Beanstalk

3. Enrich – clean and enrich raw data. Built on Scalding / Cascading / Hadoop and powered by Amazon EMR. Batch-based – normally run overnight, sometimes every 4-6 hours

4. Storage – store data ready for analysis. Examples: Amazon Redshift; PostgreSQL; Amazon S3

5. Analytics

A–D: standardised data protocols between the subsystems

Page 8: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Snowplow and Scala

Page 9: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Our initial skunkworks version of Snowplow had no Scala

Snowplow data pipeline v1: Website / webapp (JavaScript event tracker) → CloudFront-based pixel collector → HiveQL + Java UDF “ETL” → Amazon S3

Page 10: LSUG talk - Building data processing apps in Scala, the Snowplow experience

But our schema-first, loosely coupled approach made it possible to start swapping out existing components…

Snowplow data pipeline v2: Website / webapp (JavaScript event tracker) → CloudFront-based event collector or Clojure-based event collector → Scalding-based enrichment or HiveQL + Java UDF “ETL” → Amazon S3 → Amazon Redshift / PostgreSQL

Page 11: LSUG talk - Building data processing apps in Scala, the Snowplow experience

What is Scalding?

• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:

• Hadoop DFS and Hadoop MapReduce at the base
• Cascading (Java) on top of MapReduce, alongside Hive and Pig
• On top of Cascading: Scalding, Cascalog, PyCascading, cascading.jruby

Page 12: LSUG talk - Building data processing apps in Scala, the Snowplow experience

We chose Cascading because we liked their “plumbing” abstraction over vanilla MapReduce
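To give a flavour of that plumbing, here is a minimal Scalding job in the classic fields-based API – a generic word count for illustration, not Snowplow code; the input and output paths come from job arguments:

import com.twitter.scalding._

// A generic word count in Scalding's fields-based API: read a text file,
// split lines into words, count per word, write the counts out as TSV.
// The whole pipeline is expressed as "plumbing" – no explicit mappers or reducers.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}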

Page 13: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Why did we choose Scalding instead of one of the other Cascading DSLs/APIs?

• Lots of internal experience with Scala – could hit the ground running (only very basic awareness of Clojure when we started the project)

• Scalding created and supported by Twitter, who use it throughout their organization – so we knew it was a safe long-term bet

• More controversial opinion (although maybe not at a Scala UG): we believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing

Page 14: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Strongly typed data pipelines – why?

• Catch errors as soon as possible – and report them in a strongly typed way too

• Define the inputs and outputs of each of your data processing steps in an unambiguous way

• Forces you to formally address the data types flowing through your system

• Lets you write code like this:
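The code from the slide isn't reproduced here, but a minimal sketch of what a strongly typed pipeline looks like in Scalding's typed API follows; the PageView case class and the field layout are made up for illustration, not Snowplow's canonical event (which has far more fields):

import com.twitter.scalding._

// Hypothetical event type – Snowplow's real canonical event has many more fields.
case class PageView(userId: String, url: String, timestamp: Long)

class TypedEtlJob(args: Args) extends Job(args) {
  // Every stage is typed, so the compiler rejects a pipeline whose steps
  // disagree about the shape of the data flowing through it.
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (user, url, ts) => PageView(user, url, ts) }
    .filter(_.url.nonEmpty)
    .groupBy(_.userId)
    .size                              // page views per user, still typed
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}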

Page 15: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Deep dive into our Scala code

Page 16: LSUG talk - Building data processing apps in Scala, the Snowplow experience

The secret sauce for data processing in Scala: the Scalaz Validation (1/3)

• Our basic processing model for Snowplow looks like this:

• This fits incredibly well onto the Validation applicative functor from the Scalaz project

Raw events → Scalding enrichment process → “Good” enriched events + “Bad” raw events (with the reasons why they are bad)

Page 17: LSUG talk - Building data processing apps in Scala, the Snowplow experience

The secret sauce for data processing in Scala: the Scalaz Validation (2/3)

• We were able to express our data flow in terms of some relatively simple types:
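The exact aliases aren't shown here, but a sketch of the flavour of those types, assuming hypothetical RawEvent and EnrichedEvent classes, might look like this:

import scalaz._

// Hypothetical input and output types – stand-ins for Snowplow's raw and
// canonical event classes.
case class RawEvent(querystring: String)
case class EnrichedEvent(pageUrl: String)

object EnrichTypes {
  // A raw event either becomes an enriched event, or a non-empty list of
  // reasons explaining why it had to be rejected.
  type ValidatedEvent = ValidationNel[String, EnrichedEvent]

  def enrich(raw: RawEvent): ValidatedEvent =
    if (raw.querystring.nonEmpty) Success(EnrichedEvent(raw.querystring))
    else Failure(NonEmptyList("Empty querystring"))
}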

Page 18: LSUG talk - Building data processing apps in Scala, the Snowplow experience

The secret sauce for data processing in Scala: the Scalaz Validation (3/3)

• Scalaz Validation lets us do a variety of different validations and enrichments, and then collate the failures

• This is really powerful!
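As a small illustration (the field names and checks below are invented, not Snowplow's actual enrichments), collating failures with Scalaz's applicative builder looks like this:

import scalaz._
import Scalaz._

// Each validation runs independently, and |@| combines them so that *all*
// failures are collated into one NonEmptyList instead of stopping at the first.
object CollateExample {
  case class Event(url: String, tstamp: Long)

  def validateUrl(s: String): ValidationNel[String, String] =
    if (s.startsWith("http")) Success(s)
    else Failure(NonEmptyList("Bad URL: " + s))

  def validateTstamp(s: String): ValidationNel[String, Long] =
    try Success(s.toLong)
    catch { case _: NumberFormatException => Failure(NonEmptyList("Bad timestamp: " + s)) }

  def buildEvent(rawUrl: String, rawTstamp: String): ValidationNel[String, Event] =
    (validateUrl(rawUrl) |@| validateTstamp(rawTstamp)) { Event(_, _) }

  // buildEvent("ftp://x", "oops") collects both failures:
  // Failure(NonEmptyList("Bad URL: ftp://x", "Bad timestamp: oops"))
}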

Page 19: LSUG talk - Building data processing apps in Scala, the Snowplow experience

On the testing side: we love Specs2 data tables…

• They let us test a variety of inputs and expected outputs without making the mistake of just duplicating the data processing functionality in the test:
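For illustration, a minimal Specs2 DataTable in this style might look as follows – extractPlatform is a hypothetical function under test, not Snowplow code:

import org.specs2.mutable.Specification
import org.specs2.matcher.DataTables

// Inputs and expected outputs sit side by side in the table; the test never
// re-implements the logic it is checking.
class PlatformSpec extends Specification with DataTables {

  def extractPlatform(code: String): Option[String] = code match {
    case "web" => Some("web")
    case "mob" => Some("mobile")
    case _     => None
  }

  "extractPlatform" should {
    "map platform codes to canonical names" in {
      "input" | "expected"              |>
      "web"   ! Option("web")           |
      "mob"   ! Option("mobile")        |
      "tv"    ! (None: Option[String])  | { (input, expected) =>
        extractPlatform(input) must_== expected
      }
    }
  }
}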

Page 20: LSUG talk - Building data processing apps in Scala, the Snowplow experience

… and are starting to do more with ScalaCheck

• ScalaCheck is a property-based testing framework, originally inspired by Haskell’s QuickCheck

• We use it in a few places – including to generate unpredictable bad data and also to validate our new Thrift schema for raw Snowplow events:
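A small sketch of a ScalaCheck property in this spirit follows; parseQuerystring is a hypothetical stand-in for the real parsing code, not Snowplow's actual implementation:

import org.scalacheck.{Gen, Prop, Properties}

// parseQuerystring splits "k1=v1&k2=v2" style input into a map, ignoring
// malformed pairs – a stand-in so the properties have something to exercise.
object QuerystringProps extends Properties("parseQuerystring") {

  def parseQuerystring(qs: String): Map[String, String] =
    qs.split("&").toList
      .collect { case kv if kv.contains("=") =>
        val Array(k, v) = kv.split("=", 2)
        k -> v
      }.toMap

  // Generate simple alphanumeric tokens for keys and values.
  val token: Gen[String] = Gen.nonEmptyListOf(Gen.alphaNumChar).map(_.mkString)

  property("round-trips a single key/value pair") =
    Prop.forAll(token, token) { (k, v) =>
      parseQuerystring(k + "=" + v) == Map(k -> v)
    }

  property("never throws, whatever junk it is fed") =
    Prop.forAll { (s: String) => parseQuerystring(s); true }
}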

Page 21: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Build and deployment: we have learnt to love (or at least peacefully co-exist with) SBT

• .scala based SBT build, not .sbt

• We use sbt-assembly to create a fat jar for our Scalding ETL process – with some custom exclusions to play nicely on Amazon Elastic MapReduce (a sketch follows after this list)

• Deployment is incredibly easy compared to the pain we have had with our two Ruby instrumentation apps (EmrEtlRunner and StorageLoader)
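For illustration only, a .scala-based build in this style might mark Hadoop as "provided" so the fat jar plays nicely on EMR – the project name, library versions and coordinates below are assumptions, not Snowplow's actual build:

// project/BuildExample.scala – a sketch of a .scala-based build, not
// Snowplow's actual build file; names and versions are illustrative.
import sbt._
import Keys._

object BuildExample extends Build {
  lazy val etl = Project("etl-example", file("."))
    .settings(
      organization := "com.example",
      scalaVersion := "2.10.4",
      libraryDependencies ++= Seq(
        "com.twitter"       %% "scalding-core" % "0.11.2",
        // "provided" keeps Hadoop out of the fat jar – Amazon EMR supplies
        // its own Hadoop on the cluster, so bundling it only causes clashes
        "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided",
        "org.scalaz"        %% "scalaz-core"   % "7.0.6"
      )
    )
  // `sbt assembly` (from the sbt-assembly plugin declared in project/plugins.sbt)
  // then produces the single fat jar shipped to Elastic MapReduce.
}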

Page 22: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Modularization and non-Snowplow code you can use

Page 23: LSUG talk - Building data processing apps in Scala, the Snowplow experience

We try to make our validation and enrichment process as modular as possible

Enrichment Manager

Not yet integrated

• This encourages testability and re-use – it also widens the pool of potential contributors compared with keeping this functionality embedded in Snowplow

• The Enrichment Manager uses external libraries (hosted in a Snowplow repository) which can be used in non-Snowplow projects:

Page 24: LSUG talk - Building data processing apps in Scala, the Snowplow experience

We also have a few standalone Scala projects which might be of interest

• None of these projects assume that you are running Snowplow:

Page 25: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Snowplow roadmap

Page 26: LSUG talk - Building data processing apps in Scala, the Snowplow experience

We want to move Snowplow to a unified log-based architecture

[Architecture diagram] Narrow data silos – search, e-comm, CRM, email marketing, ERP, CMS – spread across a cloud vendor / own data center and SaaS vendors, feed a unified log via streaming APIs / web hooks. The log is archived to Hadoop (wide data coverage, full data history) for high-latency uses such as ad hoc analytics and management reporting, and also drives low-latency local loops such as systems monitoring, an eventstream, product rec’s, fraud detection, churn prevention and APIs.

Page 27: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Again, our schema-first approach is letting us get to this architecture through a set of baby steps (1/2)

• In 0.8.12 at the start of the year we performed some surgery to de-couple our core enrichment code from its Scalding harness:

[Diagram] pre-0.8.12: the record-level enrichment functionality lived inside hadoop-etl → 0.8.12: it was extracted into scala-common-enrich, which is shared by scala-hadoop-enrich and scala-kinesis-enrich
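The idea, sketched with hypothetical names below, is that the record-level enrichment is a plain library function which both the Hadoop and the Kinesis harnesses can call:

import com.twitter.scalding._
import scalaz._

// All names here are hypothetical. The point is the shape of the split:
// record-level enrichment is a plain function with no Scalding or Kinesis
// dependency, and each harness just maps it over its own event source.
object CommonEnrich {
  def enrichLine(line: String): ValidationNel[String, String] =
    if (line.trim.nonEmpty) Success(line.trim.toLowerCase)
    else Failure(NonEmptyList("Empty line"))
}

// Hadoop harness: a Scalding job that applies the shared function to a file.
class HadoopEnrichJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .map('line -> 'result) { line: String => CommonEnrich.enrichLine(line).toString }
    .write(Tsv(args("output")))
}

// A Kinesis harness would call CommonEnrich.enrichLine on each record it
// reads from the stream, leaving the shared enrichment code untouched.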

Page 28: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Then in 0.9.0 we released our first new Scala components leveraging Amazon Kinesis:

Snowplow Trackers → Scala Stream Collector → raw event stream → Enrich Kinesis app → enriched event stream (plus a bad raw events stream) → S3 sink Kinesis app → S3, and Redshift sink Kinesis app → Redshift

• The parts in grey are still under development – we are working with Snowplow community members on these collaboratively

Page 29: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Questions?

http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata

To have a coffee or beer and talk Scala/data – @alexcrdean or [email protected]

Page 30: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Appendix: even more roadmap!

Page 31: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (1/3)

• Our current approach involves a “Tracker Protocol” which is defined in a wiki page, processed in the Enrichment Manager and then written out to TSV files for loading into Redshift and Postgres (see over)

Page 32: LSUG talk - Building data processing apps in Scala, the Snowplow experience
Page 33: LSUG talk - Building data processing apps in Scala, the Snowplow experience

Separately, we want to re-architect our data processing pipeline to make it even more schema’ed! (3/3)

• We are planning to replace the existing flow with a JSON Schema-driven approach:

Raw events in JSON format → Enrichment Manager → enriched events in Thrift or Avro format → Shredder → enriched events in TSV ready for loading into the db

The JSON Schema defining the events (1) defines the structure of the raw JSON events, (2) validates events in the Enrichment Manager, (3) defines the structure of the enriched Thrift/Avro events, (4) drives the shredding and (5) defines the structure of the TSV output.