IDML Deep Dive Strata

Page 1: IDML Deep Dive Strata
Page 2: IDML Deep Dive Strata

IDML Deep Dive

Data preparation without the pain

Jon Davey

Strata 2015

Page 3: IDML Deep Dive Strata

1. Background

Page 4: IDML Deep Dive Strata

Remember 2011?

That McKinsey whitepaper

The world discovered Hadoop

The first Strata

Page 5: IDML Deep Dive Strata

Also in 2011

๏DataSift launched

๏Twitter firehose re-syndication

๏+ handful of other data sources

๏500-1000 lines of data preparation code per source

Page 6: IDML Deep Dive Strata

Now

๏More data sources - More to build and maintain

๏Many people with an interest in how data is prepared - Support, Product, Solutions

๏Lots of problems to solve - Scaling, stability, training customers and new staff

Page 7: IDML Deep Dive Strata

Many stakeholders in data ingestion

๏Support - “Why can’t customer X see field Y?”

๏Data Science - “Is field A populated enough to be statistically significant?”

๏Documentation - “What is the purpose of field A and how does it relate to field B?”

๏Test - “How do we measure the entropy in random IDs so we can be sure we aren’t losing data during de-duplication after redundancy?”

Page 8: IDML Deep Dive Strata

Engineering challenges

๏Detecting upstream schema changes

๏Supporting multiple data versions

๏Reducing boilerplate code

๏Software reusability

Page 9: IDML Deep Dive Strata

IDML (Ingestion Data Mapping Language)

๏Cleaner than a general purpose programming language

๏Readable by people who aren’t writing code every day

๏Wide range of features, extensible

Page 10: IDML Deep Dive Strata

2. What it does

Page 11: IDML Deep Dive Strata

A sample preparation task: Sanitize scraped content

[Slide image: scraped input => sanitized output]

Page 12: IDML Deep Dive Strata

Data preparation can be verbose..
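The code shown on this slide is not in the transcript. As a rough stand-in, a hand-written clean-up of one scraped record might look like the Python sketch below. The record shape and field names (`title`, `body`, `author.name`, `likes`) are invented for illustration, and the tag-stripping regex is deliberately naive.

```python
import html
import re

# Naive tag stripper, good enough for a sketch; not a real HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def sanitize(record):
    """Hand-written clean-up of one scraped record (hypothetical field names)."""
    out = {}

    title = record.get("title")
    if isinstance(title, str) and title.strip():
        out["title"] = html.unescape(TAG_RE.sub("", title)).strip()

    body = record.get("body")
    if isinstance(body, str) and body.strip():
        out["content"] = html.unescape(TAG_RE.sub("", body)).strip()

    author = record.get("author")
    if isinstance(author, dict):
        name = author.get("name")
        if isinstance(name, str) and name.strip():
            out["author"] = name.strip()

    try:
        out["likes"] = int(record.get("likes"))
    except (TypeError, ValueError):
        pass  # leave the field out rather than emit a bad value

    return out
```

Four output fields already need four separate blocks of defensive checks, which is the kind of repetition the slide title refers to.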

Page 13: IDML Deep Dive Strata

Data preparation can be verbose..

Page 14: IDML Deep Dive Strata

It’s simpler if you use something designed for it

Page 15: IDML Deep Dive Strata

IDML is designed for data preparation
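The IDML example from this slide is not in the transcript, and the sketch below is not IDML syntax. It only illustrates the general idea of a purpose-built mapping layer, in Python: the rules are plain data and one small generic runner handles missing and invalid values. All field names and paths are hypothetical.

```python
def get_path(doc, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


# Each rule: (output field, input path, transform). Names are illustrative only.
RULES = [
    ("title", "page.title", str.strip),
    ("author", "page.author.name", str.strip),
    ("likes", "page.stats.likes", int),
]


def apply_rules(doc, rules):
    """Run every rule; a rule whose input is missing or invalid produces no field."""
    out = {}
    for field, path, transform in rules:
        value = get_path(doc, path)
        if value is None:
            continue
        try:
            out[field] = transform(value)
        except (TypeError, ValueError):
            continue
    return out
```

Adding a field becomes a one-line change to `RULES`, and the rule list is readable by people who do not write code every day.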

Page 16: IDML Deep Dive Strata

3. Closer look at features

Page 17: IDML Deep Dive Strata

Deeply nested structures (without NPEs)
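The slide's own example is missing from the transcript. The idea of walking a deep structure without null-pointer errors can be sketched in Python as follows; `dig` is a hypothetical helper, not an IDML function.

```python
def dig(doc, *steps):
    """Walk nested dicts and lists; return None instead of raising when a step is absent."""
    for step in steps:
        if isinstance(doc, dict) and step in doc:
            doc = doc[step]
        elif (isinstance(doc, list) and isinstance(step, int)
              and -len(doc) <= step < len(doc)):
            doc = doc[step]
        else:
            return None
    return doc


tweet = {"user": {"profile": {"urls": [{"expanded": "http://example.com"}]}}}
first_url = dig(tweet, "user", "profile", "urls", 0, "expanded")
city = dig(tweet, "user", "location", "city")  # absent branch, no exception
```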

Page 18: IDML Deep Dive Strata

Aliasing with coalesce
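Coalescing means taking the first candidate that is actually present, which is useful when different sources or versions name the same field differently. A minimal Python sketch, with invented field names (`screen_name`, `username`, `author`):

```python
def coalesce(*values):
    """Return the first value that is not None, or None if every value is None."""
    for value in values:
        if value is not None:
            return value
    return None


def author_name(doc):
    """Resolve one output field from several possible input names (all hypothetical)."""
    user = doc.get("user")
    if not isinstance(user, dict):
        user = {}
    return coalesce(user.get("screen_name"), user.get("username"), doc.get("author"))
```

Note that `coalesce` tests for `None` rather than truthiness, so legitimate values such as `0` or an empty string are not skipped.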

Page 19: IDML Deep Dive Strata

Wide range of validation and transform functions
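The transcript does not list the functions IDML ships with. The Python sketch below only shows the general shape such functions tend to have: each takes a raw value and returns either a cleaned value or `None`. The three functions are hypothetical examples, and the email check is deliberately loose.

```python
from datetime import datetime, timezone


def to_int(value):
    """Parse an integer, returning None on anything unparseable."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def to_email(value):
    """Normalise an email-like string; a loose check, not RFC validation."""
    if isinstance(value, str) and value.count("@") == 1:
        local, domain = value.strip().lower().split("@")
        if local and "." in domain:
            return f"{local}@{domain}"
    return None


def to_iso_date(value):
    """Convert a Unix timestamp in seconds to an ISO 8601 string in UTC."""
    try:
        return datetime.fromtimestamp(int(value), tz=timezone.utc).isoformat()
    except (TypeError, ValueError, OverflowError, OSError):
        return None
```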

Page 20: IDML Deep Dive Strata

It’s there or it’s not - No try..catch
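One way to get "there or not" behaviour is to let a missing value absorb every later operation, so the person writing the mapping never handles exceptions. The Python sketch below is an assumption about the general technique, not a description of IDML internals; `Value`, `MISSING` and `emit` are hypothetical names.

```python
class _Missing:
    """Stands in for 'no value'; any lookup or transform on it yields itself."""

    def __getitem__(self, key):
        return self

    def apply(self, fn):
        return self


MISSING = _Missing()


class Value:
    """Wraps a raw value so lookups and transforms never raise to the caller."""

    def __init__(self, raw):
        self.raw = raw

    def __getitem__(self, key):
        try:
            return Value(self.raw[key])
        except (KeyError, IndexError, TypeError):
            return MISSING

    def apply(self, fn):
        try:
            return Value(fn(self.raw))
        except (TypeError, ValueError):
            return MISSING


def emit(**fields):
    """Build the output record, leaving out every field that resolved to MISSING."""
    return {name: v.raw for name, v in fields.items() if v is not MISSING}


doc = Value({"user": {"followers": "120"}})
record = emit(
    followers=doc["user"]["followers"].apply(int),
    city=doc["user"]["location"]["city"],            # absent branch
    bad=doc["user"]["followers"].apply(lambda s: int(s + "x")),  # failed transform
)
```

The error handling still exists, but it lives once in the wrapper instead of around every field.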

Page 21: IDML Deep Dive Strata

Lenient but consistent

Page 22: IDML Deep Dive Strata

The runtime figures things out

Page 23: IDML Deep Dive Strata

Arrays are easy to work with
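The slide's example is not in the transcript. A common convenience for arrays is applying a field lookup to every element at once; the Python sketch below shows that idea with a hypothetical helper `pluck` and invented data.

```python
def pluck(items, key):
    """Collect `key` from every dict in a list, skipping elements that lack it."""
    if not isinstance(items, list):
        return []
    return [item[key] for item in items if isinstance(item, dict) and key in item]


hashtags = [{"text": "strata"}, {"indices": [0, 5]}, {"text": "data"}]
tags = pluck(hashtags, "text")
```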

Page 24: IDML Deep Dive Strata

Filter things
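Filtering keeps only the array elements that satisfy a condition. In the Python sketch below, an element that cannot be tested (for example because the field is absent) simply does not match; `where` is a hypothetical helper and the data is invented.

```python
def where(items, predicate):
    """Keep elements for which predicate(item) is truthy; errors count as 'no match'."""
    if not isinstance(items, list):
        return []
    kept = []
    for item in items:
        try:
            if predicate(item):
                kept.append(item)
        except (KeyError, TypeError):
            pass  # element lacks the field being tested
    return kept


links = [
    {"url": "http://a.example", "lang": "en"},
    {"url": "http://b.example"},
    {"url": "http://c.example", "lang": "fr"},
]
english = where(links, lambda link: link["lang"] == "en")
```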

Page 25: IDML Deep Dive Strata

In-place validation
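In-place validation means a value is checked at the point where it is mapped, and a value that fails the check never reaches the output. A Python sketch under that reading of the slide; the validators and the `href` and `lang` fields are hypothetical.

```python
import re
from urllib.parse import urlparse


def valid_url(value):
    """Return the trimmed URL if it is http(s) with a host, else None."""
    if not isinstance(value, str):
        return None
    candidate = value.strip()
    try:
        parts = urlparse(candidate)
    except ValueError:
        return None
    if parts.scheme in ("http", "https") and parts.netloc:
        return candidate
    return None


def valid_lang(value):
    """Return a lower-cased two-letter language code, else None."""
    if isinstance(value, str) and re.fullmatch(r"[A-Za-z]{2}", value):
        return value.lower()
    return None


def map_link(doc):
    """Map and validate in one step; fields that fail validation are left out."""
    candidates = {
        "url": valid_url(doc.get("href")),
        "lang": valid_lang(doc.get("lang")),
    }
    return {name: v for name, v in candidates.items() if v is not None}
```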

Page 26: IDML Deep Dive Strata

Other features

๏Detects fields that have not been mapped, making it easy to find data that’s not understood

๏Generates metrics about why a rule failed

๏Uniform interface allows the same syntax for JSON and XML
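The first two features above can be sketched in Python as follows: record which input paths the rules read, report the leaf paths nobody read, and count why each rule failed. This is an illustration of the idea under invented names (`run`, `leaf_paths`, the metric keys), not IDML's actual mechanism.

```python
from collections import Counter


def get_path(doc, path):
    """Follow a dotted path through nested dicts; return None if any step is missing."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def leaf_paths(doc, prefix=""):
    """Yield the dotted path of every non-dict value in a nested dict."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path


def run(doc, rules, metrics):
    """Apply rules, count failure reasons, and list input fields no rule consumed."""
    out, used = {}, set()
    for field, path, transform in rules:
        value = get_path(doc, path)
        if value is None:
            metrics[f"{field}:missing"] += 1
            continue
        used.add(path)
        try:
            out[field] = transform(value)
        except (TypeError, ValueError):
            metrics[f"{field}:invalid"] += 1
    unmapped = sorted(set(leaf_paths(doc)) - used)
    return out, unmapped


metrics = Counter()
rules = [("likes", "stats.likes", int), ("name", "user.name", str)]
doc = {"stats": {"likes": "many", "shares": 3}, "geo": {"lat": 1.0}}
out, unmapped = run(doc, rules, metrics)
```

Here `unmapped` lists `geo.lat` and `stats.shares`, the input nobody has looked at yet, and `metrics` records one invalid `likes` and one missing `name`.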

Page 27: IDML Deep Dive Strata

4. Where it fits

Page 28: IDML Deep Dive Strata

Multiple deployment patterns

๏Deployable as a standalone service

๏Usable as a library

๏ Kafka consumer

๏ MapReduce mapper

๏ NSQ consumer

๏ Amazon SQS consumer

๏Command line, including REPL
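What these patterns have in common is a thin wrapper around one mapping call per record. A Python sketch of such a wrapper for newline-delimited JSON follows; `transform_stream` and `mapper` are hypothetical names, and no Kafka, NSQ or SQS client code is shown.

```python
import json


def transform_stream(lines, mapper):
    """Apply `mapper` to each JSON line and yield JSON output lines.

    Blank and unparseable lines are skipped. `mapper` stands in for whatever
    mapping library is in use.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        yield json.dumps(mapper(record), sort_keys=True)
```

As a command-line filter this would be driven by `for out in transform_stream(sys.stdin, mapper): print(out)`; in a queue consumer the same mapper would be called from the message handler instead.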

Page 29: IDML Deep Dive Strata

Performance

๏It’s an interpreter so it’s noticeably slower than hand-written code in contrived benchmarks

๏In real cases, IO has usually been the bottleneck

๏Unstructured data is inherently suboptimal - dynamic structures like JsonNode are backed with HashMaps and Trees

๏One day it might be faster. Runtimes can often be optimized in much smarter ways: consider how a JIT compiler can devirtualize and inline hot calls, which is why Java can be faster than C++ at virtual method calls

Page 30: IDML Deep Dive Strata

Open sourcing it soon

๏May be rebranded as Ptolemy

๏Support for JSON and XML (and SGML - don’t ask)

๏May improve any of these areas, depending on interest:

๏ Performance

๏ More input and output types

๏ More integration: Spark, Kinesis

๏Would you use it on your own projects? Would you help?

Page 31: IDML Deep Dive Strata

QUESTIONS?

Page 32: IDML Deep Dive Strata

THANK YOU!