DataEngConf SF16 - Data Asserts: Defensive Data Science


Data Asserts: Defensive Data Science

Tommy Guy

Microsoft

Observation: Complexity In Pipeline

Our pipeline: DATA!!! → Insight! Direction! Strategy!

Our pipeline in reality: bugs tend to compound.

How do Engineers Manage Complexity?

Encapsulate: create functions/classes/subsystems with clear APIs. This helps isolate complexity.

Integration Tests: ensure that the components interact correctly. This helps identify breaking changes.
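As a purely illustrative sketch of both techniques in a data-science setting, consider a small encapsulated transformation with its own test. The function name, column names, and gap threshold below are hypothetical:

```python
import pandas as pd

# Hypothetical encapsulated transformation: session ids behind a clear API.
def sessionize(events: pd.DataFrame, gap_minutes: int = 30) -> pd.DataFrame:
    """Assign a session id per user; a new session starts after a long gap."""
    events = events.sort_values(["user_id", "timestamp"])
    gap = events.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=gap_minutes)
    events["session_id"] = gap.astype(int).groupby(events["user_id"]).cumsum()
    return events

# Integration-style test: catches breaking changes before they compound downstream.
def test_sessionize_splits_on_gap():
    df = pd.DataFrame({
        "user_id": ["a", "a", "a"],
        "timestamp": pd.to_datetime(["2016-01-01 00:00",
                                     "2016-01-01 00:10",
                                     "2016-01-01 02:00"]),
    })
    assert sessionize(df)["session_id"].tolist() == [0, 0, 1]
```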

Data introduces a few complications

Pipelines take many upstream dependencies

Researcher use cases are frequently unknown and unanticipated by data providers.

Pushing requirements upstream to all producers is Sisyphean.

We are not talking about data pipeline tests

The data pipeline teams ask:

• Are all rows that are produced stored?
  • Counter fields to ensure no dropped rows
  • Sentinel events to measure join fidelity
• Are availability SLAs being met?
  • Progressive server-client merging
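For contrast, here is a hedged sketch of what those pipeline-level checks might look like; the function names, tolerances, and inputs are illustrative, not a real system's API:

```python
# Completeness via counter fields: producer-side counts vs. stored rows.
def check_no_dropped_rows(produced: int, stored: int, tolerance: float = 0.001) -> None:
    dropped = produced - stored
    assert dropped / max(produced, 1) <= tolerance, (
        f"{dropped} rows dropped ({dropped / max(produced, 1):.3%})")

# Join fidelity via sentinel events: known ids injected at the source
# should all survive the join.
def check_join_fidelity(sentinel_ids: set, joined_ids: set, min_recall: float = 0.99) -> None:
    recall = len(sentinel_ids & joined_ids) / max(len(sentinel_ids), 1)
    assert recall >= min_recall, f"only {recall:.1%} of sentinel events survived the join"
```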

Data Scientists Require Semantic Correctness

Does this field mean what I think it does?

How do Data Scientists identify potential errors?

Some follow-on fact is absurd…

… which leads to investigation …

… which finds a broader problem

If [potential conclusion], then we must have 3 billion OneDrive users…

… because my user table doesn’t have a primary key …

… so I should aggregate by user.
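A single up-front assert would have caught this long before the absurd conclusion surfaced. A minimal sketch, assuming a pandas DataFrame and a hypothetical `user_id` column:

```python
import pandas as pd

def assert_primary_key(df: pd.DataFrame, key: str = "user_id") -> None:
    """Fail fast if `key` is not a primary key of `df`."""
    assert df[key].notna().all(), f"null values in '{key}'"
    dupes = int(df[key].duplicated().sum())
    assert dupes == 0, f"{dupes} duplicate '{key}' values: aggregate by {key} before counting"
```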

What are your Assumptions?

If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions:

Field             | Assumptions
------------------|---------------------------------------------------------------
User Id           | Logged and PII-encrypted similarly in Outlook and OneDrive; timestamp for Office purchase is logged correctly; User Id isn't empty or missing
OneDrive activity | Wasn't automated traffic [identified by a certain flag]
Email Activity    | Mobile client identifiers are correct
All               | Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners
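Each of these silent assumptions can be made loud. The sketch below is illustrative only: the column names (`user_id`, `is_automated`, `client`) and the set of mobile client identifiers are hypothetical stand-ins, not the real schema:

```python
import pandas as pd

def assert_join_assumptions(onedrive: pd.DataFrame, outlook: pd.DataFrame) -> None:
    # User Id: never empty or missing in either source.
    for name, df in [("OneDrive", onedrive), ("Outlook", outlook)]:
        assert df["user_id"].notna().all(), f"missing user_id in {name}"
        assert (df["user_id"] != "").all(), f"empty user_id in {name}"
    # OneDrive activity: automated traffic has been filtered out.
    assert not onedrive["is_automated"].any(), "automated OneDrive traffic present"
    # Email activity: only known mobile client identifiers appear.
    known_mobile_clients = {"outlook_ios", "outlook_android"}  # hypothetical values
    unknown = set(outlook["client"].unique()) - known_mobile_clients
    assert not unknown, f"unrecognized client identifiers: {unknown}"
```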

What are your Sanity Checks?

• If a column “OfficeId” is really a user id, it has certain known properties:

• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.

Assumption                        | Why does it matter?
----------------------------------|---------------------------------------------------------------
Never null/empty                  | Causes job-breaking data skew issues.
Users are 1:* with Tenants        | Logical constraint: a violation is a sign you are missing something.
Very high cardinality             | If this isn't true, it's unlikely that it's a user id.
All rows in event data join to it | Otherwise, your data is incomplete.
Matches a certain regex           | Sanity check: if this isn't true, it's unlikely that it's a user id.
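Those properties translate directly into asserts that can be re-run on every refresh, not just at pipeline setup. A minimal sketch, assuming pandas; the regex, cardinality threshold, `tenant_id` column, and reading of the 1:* constraint are assumptions for illustration:

```python
import re
import pandas as pd

USER_ID_RE = re.compile(r"^[0-9a-f]{32}$")  # illustrative format only

def assert_is_user_id(users: pd.DataFrame, events: pd.DataFrame, col: str = "OfficeId") -> None:
    s = users[col]
    # Never null/empty: avoids job-breaking data skew.
    assert s.notna().all() and (s != "").all(), "null/empty ids"
    # One reading of "Users are 1:* with Tenants": a user maps to at most one tenant.
    assert users.groupby(col)["tenant_id"].nunique().le(1).all(), "user spans tenants"
    # Very high cardinality: otherwise it's unlikely to be a user id.
    assert s.nunique() / max(len(s), 1) > 0.9, "cardinality too low for a user id"
    # All event rows join to it: otherwise the data is incomplete.
    unjoined = int((~events[col].isin(set(s))).sum())
    assert unjoined == 0, f"{unjoined} event rows fail to join"
    # Matches the expected format.
    assert s.str.match(USER_ID_RE).all(), "ids do not match expected format"
```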

Data Asserts: Defensive Data Science

Data Asserts: Maintain Quality

Data Asserts: Clear Trust Boundaries

[Diagram: the producer's guarantees and the consumer's assumptions at the trust boundary. These should match!]
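One way to make the boundary explicit is to run the consumer's asserted expectations before any analysis touches the data. A hedged sketch; the decorator and check names are hypothetical, not a shipped API:

```python
from typing import Callable
import pandas as pd

def trust_boundary(*checks: Callable[[pd.DataFrame], None]):
    """Run every data assert at the boundary, before the analysis body."""
    def decorator(analysis):
        def wrapped(df: pd.DataFrame):
            for check in checks:
                check(df)  # fail loudly here, not deep inside the analysis
            return analysis(df)
        return wrapped
    return decorator

def no_duplicate_users(df: pd.DataFrame) -> None:
    assert df["user_id"].is_unique, "user_id is not a primary key"

@trust_boundary(no_duplicate_users)
def weekly_active_users(df: pd.DataFrame) -> int:
    return df["user_id"].nunique()
```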


Data Asserts in Production: A Few Observations

• Most of the analysis-impacting assertion failures we've seen were actually errors in our assumptions, not errors in the pipeline.

• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used in pipelines.

• Data Asserts is the backbone of data provenance. A data conclusion can directly link to all of the assumptions we made about its inputs.
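As a rough sketch of that provenance link (the record shape and persistence strategy are assumptions for illustration), each conclusion can carry the named asserts it depended on:

```python
import datetime
import json

def run_with_provenance(checks: dict, df) -> dict:
    """Run named data asserts and emit a provenance record for the conclusion."""
    record = {"run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
              "asserts": {}}
    try:
        for name, check in checks.items():
            try:
                check(df)
                record["asserts"][name] = "passed"
            except AssertionError as err:
                record["asserts"][name] = f"failed: {err}"
                raise
    finally:
        # In practice, persist this next to the published result so the
        # conclusion links back to every input assumption that was checked.
        print(json.dumps(record, indent=2))
    return record
```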
