
Dirty Data? Clean it up!

Or, how to do data science in the real world.

Dan Lynn

CEO, AgilData

@danklynn

Patrick Russell

Independent Consultant (formerly Data Science @Craftsy)

@patrickrm101

© Phil Mislinksi -

Patrick Russell - Bass

Data Scientist between things ;)

Dan Lynn - Guitar

CEO, AgilData



At AgilData, we help you get the most out of your data. We provide Software and Services to help firms deliver on the promise of Big Data and complex data infrastructures:

● AgilData Scalable Cluster for MySQL – Massively scalable and performant MySQL databases combined with 24×7 remote managed services for DBA/DevOps

● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data pipeline orchestration, ETL, Data Engineering & DevOps, APIs and custom applications.

Hey, you’re a data scientist, right? Great!

We have millions of users. How can we use email to monetize our user base better?

— Marketing

1 / (1 + exp(-x))


Data Cleansing


● Dates & Times

● Numbers & Strings

● Addresses

● Clickstream Data

● Handling missing data

● Tidy Data

Dates & Times

● Timestamps can mean different things

○ ingested_date, event_timestamp

● Clocks can’t be trusted

○ Server time: which server? Is it synchronized?

○ Client time? Is there a synchronizing time scheme?

● Timezones

○ What tz is your own data in?

○ Your email provider? Your adwords account? Your Google Analytics?
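One way to keep timestamps from different systems comparable is to normalize everything to UTC as it is ingested. The sketch below is a minimal illustration using only the standard library. The function name `to_utc`, the sample values, and the rule "a timestamp without an offset is assumed to be in a known source offset" are assumptions for illustration; the right assumption depends on what you know about each source.

```python
from datetime import datetime, timedelta, timezone


def to_utc(timestamp: str, assumed_offset_hours: float = 0.0) -> datetime:
    """Parse an ISO 8601 timestamp and return an aware datetime in UTC.

    If the string carries no UTC offset, assumed_offset_hours is applied.
    That value is an assumption you must confirm for each data source.
    """
    parsed = datetime.fromisoformat(timestamp)
    if parsed.tzinfo is None:
        parsed = parsed.replace(
            tzinfo=timezone(timedelta(hours=assumed_offset_hours))
        )
    return parsed.astimezone(timezone.utc)


# Hypothetical record: the event carries its own offset, while the
# ingestion timestamp is naive and assumed to already be in UTC.
event_timestamp = to_utc("2016-03-01T09:00:00-07:00")
ingested_date = to_utc("2016-03-01 16:00:05")

# Once both are in UTC, the ingestion lag can be computed safely.
ingestion_lag = ingested_date - event_timestamp
```

A fixed offset does not follow daylight saving changes. For named zones, the standard library's `zoneinfo` module (Python 3.9+) is the usual choice.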

Numbers & Strings

● Use the right types for your numbers (int, bigint, float, numeric)


● Murphy’s Law of text inputs: If a user can put something in a text field, anything and everything will happen.

● Watch out for floating point precision mistakes
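A short illustration of the floating point issue, assuming money-like values where exact decimal arithmetic matters. The variable names are made up for the example.

```python
import math
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2
float_is_exact = float_total == 0.3          # False

# Option 1: compare floats with a tolerance instead of ==.
float_is_close = math.isclose(float_total, 0.3)  # True

# Option 2: use Decimal (built from strings) when exactness matters,
# which mirrors choosing a numeric column type in the database.
decimal_total = Decimal("0.10") + Decimal("0.20")
decimal_is_exact = decimal_total == Decimal("0.30")  # True
```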


Addresses

● Parsing / validation is not something you want to do yourself

○ USPS has validation and zip lookup for US addresses:

● Remember zip codes are strings. And the rest of the world does not use U.S. zips.

● IP geolocation: Get lat/long, state, city, postal & ISP, from visitor


○ This is ALWAYS approximate

● If working with GIS, recommend

○ Vanilla postgres also has earthdistance for great circle distance
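For readers who want to see what a great circle distance involves, here is a minimal sketch of the haversine formula in plain Python. It treats the Earth as a sphere with an assumed mean radius of 6371 km. It is not how earthdistance is implemented, and the function name and sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the Earth is not a perfect sphere


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates for Denver and Boulder, Colorado.
denver_to_boulder_km = great_circle_km(39.7392, -104.9903, 40.0150, -105.2705)
```

With IP geolocation as the input, the result inherits the approximation of the lookup, so treat distances as rough.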

Clickstream Data

● User agent => Device: Don’t do this yourself (we use WURFL and Google


● Query strings follow the rules of text. Everything will show up

○ They might be truncated

○ URL encoding might be missing characters (%2 instead of %20)

○ Use a library to parse params (e.g. Python ships with urllib.parse.parse_qs, formerly urlparse.parse_qs in Python 2)

● If your system creates sessions (Tomcat, Google Analytics), don’t be afraid to create your own sessions on top of the pageview data

○ You’ll get cross-channel and cross-device behavior this way
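The two ideas above can be sketched in a few lines: parse query strings with the standard library rather than by hand, and derive your own sessions from raw pageviews. The 30-minute inactivity cutoff, the `utm_campaign` parameter, and the function names are assumptions for illustration, not a description of any particular analytics product.

```python
from datetime import datetime, timedelta
from urllib.parse import parse_qs, urlsplit

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff


def campaign_of(url: str):
    """Return the first utm_campaign value, or None if it is absent."""
    params = parse_qs(urlsplit(url).query)
    values = params.get("utm_campaign")
    return values[0] if values else None


def sessionize(pageviews):
    """Assign a per-user session number to each (user_id, timestamp) pageview.

    A new session starts when the gap since the user's previous pageview
    exceeds SESSION_GAP. Returns (user_id, timestamp, session_number) tuples.
    """
    last_seen = {}
    session_number = {}
    labelled = []
    for user_id, ts in sorted(pageviews):
        previous = last_seen.get(user_id)
        if previous is None or ts - previous > SESSION_GAP:
            session_number[user_id] = session_number.get(user_id, 0) + 1
        last_seen[user_id] = ts
        labelled.append((user_id, ts, session_number[user_id]))
    return labelled


# Hypothetical pageviews for one user; the last one follows a 50-minute gap.
views = [
    ("u1", datetime(2016, 3, 1, 9, 0)),
    ("u1", datetime(2016, 3, 1, 9, 10)),
    ("u1", datetime(2016, 3, 1, 10, 0)),
]
sessions = sessionize(views)
```

Keying sessions on your own user identifier rather than a cookie is what lets the same logic span devices and channels.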


Missing / empty data

● Easy to overlook but important

● What does missing data in the context of your analysis mean?

○ Not collected (why not?)

○ Error state

○ N/A or undefined

○ Especially for histograms, missing data can lead to very poor conclusions.

● Does your data use sentinel values? (e.g. -9999 or “null”)

○ df['nps_score'].replace(-9999, np.nan)

● Imputation

● Storage
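A small pandas sketch of the sentinel and imputation points. The column names and values are invented, and median imputation is just one simple choice, not a recommendation for every analysis. Note that `replace` returns a new Series, so the result has to be assigned back.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data where -9999 marks "no answer".
df = pd.DataFrame(
    {"user_id": [1, 2, 3, 4], "nps_score": [9, -9999, 7, -9999]}
)

# Turn the sentinel into a real missing value so it stays out of
# means, histograms, and model inputs.
df["nps_score"] = df["nps_score"].replace(-9999, np.nan)

# Keep a flag so "was missing" remains available as a signal.
df["nps_missing"] = df["nps_score"].isna()

# One simple imputation: fill with the median of the observed values.
observed_median = df["nps_score"].median()
df["nps_imputed"] = df["nps_score"].fillna(observed_median)
```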

Tidy Data

● Conceptual framework for structuring data for analysis and fitting

○ Each variable forms a column

○ Each observation is a row

○ Each type of observational unit forms a table

● Pretty much normal form from relational databases for stats

● Tidy can be different depending on the question asked

● R (dplyr, tidyr) and Python (pandas) have functions for making your long data wide & wide data long (stack, unstack, melt, pivot)

● Paper:

● Python tutorial:

Tidy Data

● Example might be marketplace transaction data with 1 row per


● You might want to do analysis on participants, 1 row per participant
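To make the marketplace example concrete, here is a hedged pandas sketch. It assumes the source table has one row per transaction with a buyer and a seller column. That layout, the column names, and the values are assumptions for illustration.

```python
import pandas as pd

# Assumed starting shape: one row per transaction.
transactions = pd.DataFrame(
    {
        "transaction_id": [101, 102],
        "buyer_id": ["a", "b"],
        "seller_id": ["b", "c"],
        "amount": [20.0, 35.0],
    }
)

# Wide to long: one row per (transaction, participant).
participants = transactions.melt(
    id_vars=["transaction_id", "amount"],
    value_vars=["buyer_id", "seller_id"],
    var_name="role",
    value_name="participant_id",
)
participants["role"] = participants["role"].str.replace("_id", "", regex=False)

# One row per participant: total volume they took part in, in either role.
volume_per_participant = participants.groupby("participant_id")["amount"].sum()
```

Which of the three tables counts as tidy depends on whether the question is about transactions, roles, or participants.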

Hey, that’s a great model. How can we build it into our decision-making process?

— Marketing

Operationalizing Data Science

● Doing an analysis once rarely delivers lasting value.

● The business needs continuous insight, so you need to get this stuff into production.

○ Hosting


○ Pipelines

Operationalizing Data Science


● Delivering continuous analyses requires operational infrastructure

○ Database(s)

○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc.)

○ REST services / microservices

● These all have uptime requirements. You need to involve your (dev)ops team earlier rather than later.

● Microservices / REST endpoints have architectural implications

● Visualization tools

○ Local (e.g. Jupyter, Zeppelin)

○ On-premise (Arcadia Data, Tableau, Qlik)

○ Hosted (Chartio)

● Visualization tools often require a SQL interface, thus…

ETL - Extract, Transform, Load

● Often used to herd data into some kind of data warehouse (e.g. RDBMS + star schema, Hadoop w/ unstructured data, etc.)

● Not just for data warehousing

● Not just for modeling

● No general solution

● Tooling

○ Apache Spark, Apache Sqoop

○ Commercial Tools: Informatica, Vertica, SQL Server, DataVirtuality etc…

● And then there is Apache Kafka…and the “NoETL” movement

○ Book: “I <3 Logs” by Jay Kreps

○ Replay history from the beginning of time as needed

ETL - Extract, Transform, Load - Example

● Not just for production runs

○ For example, Patrick does a lot of ad hoc time-to-event analysis on email opens, transactions, visits.

■ Survival functions, etc...

○ Set up ETL that builds tables with the right shape to throw right into models
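As an illustration of "the right shape" for time-to-event work, the sketch below turns email send and open events into rows of (id, duration, event observed), which is the layout survival models typically expect. The function name, field names, and cutoff handling are assumptions for the example, not the presenters' actual ETL.

```python
from datetime import datetime


def time_to_event_rows(sends, opens, observed_until):
    """Build (email_id, duration_hours, event_observed) rows.

    sends:  {email_id: sent_at}
    opens:  {email_id: first_opened_at}, only for emails that were opened
    Emails not opened by observed_until are right-censored: their duration
    runs to the cutoff and event_observed is False.
    """
    rows = []
    for email_id, sent_at in sorted(sends.items()):
        opened_at = opens.get(email_id)
        if opened_at is not None and opened_at <= observed_until:
            end, observed = opened_at, True
        else:
            end, observed = observed_until, False
        duration_hours = (end - sent_at).total_seconds() / 3600
        rows.append((email_id, duration_hours, observed))
    return rows


# Hypothetical data: e1 was opened after 3 hours, e2 never was.
sends = {"e1": datetime(2016, 3, 1, 9), "e2": datetime(2016, 3, 1, 9)}
opens = {"e1": datetime(2016, 3, 1, 12)}
rows = time_to_event_rows(sends, opens, observed_until=datetime(2016, 3, 2, 9))
```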

Pipelines

● From data to model output

● Define dependencies and define DAG for the work

○ Steps defined by assigning input as output of prior steps

○ Luigi (

○ Drake (

○ scikit-learn has its own Pipeline

■ That can be part of your bigger pipeline

● Scheduling can be trickier than you think

○ Resource contention

○ Loose dependencies

○ Cron is fine but Jenkins works really well for this!

● Don’t be afraid to create and tear down full environments as steps

○ For example, spin up and configure an EMR cluster, do stuff, tear it down*

* make your VP of Infrastructure less miserable
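The idea of declaring dependencies and letting a DAG decide the run order can be sketched with the standard library alone. This is a toy runner, not Luigi, Drake, or a scheduler; the step names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_pipeline(steps, dependencies):
    """Run each step after all of its upstream steps.

    steps:        {name: callable taking the dict of results so far}
    dependencies: {name: set of upstream step names}
    Returns the execution order and the results of every step.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        results[name] = steps[name](results)
    return order, results


# Hypothetical three-step pipeline: each step reads the prior step's output.
steps = {
    "extract": lambda r: [3, -9999, 5],
    "clean": lambda r: [x for x in r["extract"] if x != -9999],
    "score": lambda r: sum(r["clean"]) / len(r["clean"]),
}
dependencies = {"extract": set(), "clean": {"extract"}, "score": {"clean"}}

order, results = run_pipeline(steps, dependencies)
```

A real tool adds what this sketch leaves out: retries, persisted outputs, parallelism, and scheduling.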

Pipelines - Luigi

● Written in Python. Steps implemented by subclassing Task

● Visualize your DAG

● Supports data in relational DBs, Redshift, HDFS, S3, file system

● Flexible and extensible

● Can parallelize jobs

● Workflow runs by executing the last step, which schedules all of its dependencies
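The pattern described above (subclass a task, declare what it requires and what it outputs, then ask for the last step) can be imitated in plain Python. The `Task` class and `build` function below are toy stand-ins written for this example. They are not Luigi's real classes, and Luigi's actual API differs in detail, for example in how outputs are wrapped in targets.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a Luigi-style task (NOT the real luigi.Task)."""

    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError

    def complete(self):
        # A task counts as done when its output file already exists.
        return os.path.exists(self.output())


def build(task, log):
    """Run the requested task, first building any incomplete upstream tasks."""
    if task.complete():
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    log.append(type(task).__name__)


WORKDIR = tempfile.mkdtemp()  # scratch directory for the example outputs


class Extract(Task):
    def output(self):
        return os.path.join(WORKDIR, "raw.txt")

    def run(self):
        with open(self.output(), "w") as f:
            f.write("3\n-9999\n5\n")


class Clean(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return os.path.join(WORKDIR, "clean.txt")

    def run(self):
        with open(Extract().output()) as src, open(self.output(), "w") as dst:
            dst.writelines(line for line in src if line.strip() != "-9999")


log = []
build(Clean(), log)  # asking for the last step pulls in Extract first
build(Clean(), log)  # outputs already exist, so nothing is re-run
```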


Pipelines - Drake

● JVM (written in Clojure)

● Like a Makefile but for data work

● Supports commands in Shell, Python, Ruby, Clojure

Pipelines - More Tools

● Oozie

○ The default job orchestration engine for Hadoop. Can chain together multiple jobs to form a complete DAG.

○ Open source

● Kettle

○ Old-school, but still relevant.

○ Visual pipeline designer. Execution engine

○ Open source

● Informatica

○ Visual pipeline designer, mature toolset

○ Commercial

● DataVirtuality

○ Treats all your stores (including Google Analytics) like schemas in a single db

○ Great for microservice architectures

○ Commercial

© Patrick Coppinger


@danklynn — @patrickrm101


● I Heart Logs

● Tidy Data

Additional Tools

● Scientific python stack (ipython, numpy, scipy, pandas, matplotlib…)

● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…)

● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data

● jq: fast command line tool for working with JSON (e.g. pipe cURL to jq)

● psql (if you use postgresql or Redshift)
