DataEngConf SF16 - Data Asserts: Defensive Data Science


Data Asserts: Defensive Data Science

Tommy Guy

Microsoft

Observation: Complexity In Pipeline

Our pipeline: DATA!!! → Insight! Direction! Strategy!

Our pipeline in reality: bugs tend to compound.

How do Engineers Manage Complexity?

Encapsulate: create functions/classes/subsystems with clear APIs. This helps isolate complexity.

Integration Tests: ensure that the components interact correctly. This helps identify breaking changes.
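As a purely illustrative sketch of both techniques in a data-science setting, consider a small encapsulated transformation with its own test. The function name, column names, and gap threshold below are hypothetical:

```python
import pandas as pd

# Hypothetical encapsulated transformation: session ids behind a clear API.
def sessionize(events: pd.DataFrame, gap_minutes: int = 30) -> pd.DataFrame:
    """Assign a session id per user; a new session starts after a long gap."""
    events = events.sort_values(["user_id", "timestamp"])
    gap = events.groupby("user_id")["timestamp"].diff() > pd.Timedelta(minutes=gap_minutes)
    events["session_id"] = gap.astype(int).groupby(events["user_id"]).cumsum()
    return events

# Integration-style test: catches breaking changes before they compound downstream.
def test_sessionize_splits_on_gap():
    df = pd.DataFrame({
        "user_id": ["a", "a", "a"],
        "timestamp": pd.to_datetime(["2016-01-01 00:00",
                                     "2016-01-01 00:10",
                                     "2016-01-01 02:00"]),
    })
    assert sessionize(df)["session_id"].tolist() == [0, 0, 1]
```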

Data introduces a few complications

Pipelines take many upstream dependencies

Researcher use cases are frequently unknown and unanticipated by data providers.

Pushing requirements upstream to all producers is Sisyphean.

We are not talking about data pipeline tests

The data pipeline teams ask:

• Are all rows that are produced stored?
  • Counter fields to ensure no dropped rows
  • Sentinel events to measure join fidelity
• Are availability SLAs being met?
  • Progressive server-client merging
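For contrast, here is a hedged sketch of what those pipeline-level checks might look like; the function names, tolerances, and inputs are illustrative, not a real system's API:

```python
# Completeness via counter fields: producer-side counts vs. stored rows.
def check_no_dropped_rows(produced: int, stored: int, tolerance: float = 0.001) -> None:
    dropped = produced - stored
    assert dropped / max(produced, 1) <= tolerance, (
        f"{dropped} rows dropped ({dropped / max(produced, 1):.3%})")

# Join fidelity via sentinel events: known ids injected at the source
# should all survive the join.
def check_join_fidelity(sentinel_ids: set, joined_ids: set, min_recall: float = 0.99) -> None:
    recall = len(sentinel_ids & joined_ids) / max(len(sentinel_ids), 1)
    assert recall >= min_recall, f"only {recall:.1%} of sentinel events survived the join"
```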

Data Scientists Require Semantic Correctness

Does this field mean what I think it does?

How do Data Scientists identify potential errors?

Some follow-on fact is absurd…

… which leads to investigation …

… which finds a broader problem

If [potential conclusion], then we must have 3 billion OneDrive users…

… because my user table doesn’t have a primary key …

… so I should aggregate by user.
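A single up-front assert would have caught this long before the absurd conclusion surfaced. A minimal sketch, assuming a pandas DataFrame and a hypothetical `user_id` column:

```python
import pandas as pd

def assert_primary_key(df: pd.DataFrame, key: str = "user_id") -> None:
    """Fail fast if `key` is not a primary key of `df`."""
    assert df[key].notna().all(), f"null values in '{key}'"
    dupes = int(df[key].duplicated().sum())
    assert dupes == 0, f"{dupes} duplicate '{key}' values: aggregate by {key} before counting"
```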

What are your Assumptions?

If I conclude “Users who upload files to OneDrive are XXX% more likely to buy Office if they also sent mail through Mobile Outlook”, I’m making many silent assumptions:

Field             | Assumptions
------------------|---------------------------------------------------------------
User Id           | Logged and PII-encrypted similarly in Outlook and OneDrive; timestamp for Office purchase is logged correctly; User Id isn't empty or missing
OneDrive activity | Wasn't automated traffic [identified by a certain flag]
Email Activity    | Mobile client identifiers are correct
All               | Any upstream changes to OneDrive, Office, or Exchange data have been communicated to pipeline owners
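Each of these silent assumptions can be made loud. The sketch below is illustrative only: the column names (`user_id`, `is_automated`, `client`) and the set of mobile client identifiers are hypothetical stand-ins, not the real schema:

```python
import pandas as pd

def assert_join_assumptions(onedrive: pd.DataFrame, outlook: pd.DataFrame) -> None:
    # User Id: never empty or missing in either source.
    for name, df in [("OneDrive", onedrive), ("Outlook", outlook)]:
        assert df["user_id"].notna().all(), f"missing user_id in {name}"
        assert (df["user_id"] != "").all(), f"empty user_id in {name}"
    # OneDrive activity: automated traffic has been filtered out.
    assert not onedrive["is_automated"].any(), "automated OneDrive traffic present"
    # Email activity: only known mobile client identifiers appear.
    known_mobile_clients = {"outlook_ios", "outlook_android"}  # hypothetical values
    unknown = set(outlook["client"].unique()) - known_mobile_clients
    assert not unknown, f"unrecognized client identifiers: {unknown}"
```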

What are your Sanity Checks?

• If a column “OfficeId” is really a user id, it has certain known properties:

• Observation: these sorts of checks take place when the pipeline is set up, but they may not be re-checked very often.

Assumption                        | Why does it matter?
----------------------------------|---------------------------------------------------------------
Never null/empty                  | Causes job-breaking data skew issues.
Users are 1:* with Tenants        | Logical constraint: a violation is a sign you are missing something.
Very high cardinality             | If this isn't true, it's unlikely that it's a user id.
All rows in event data join to it | Otherwise, your data is incomplete.
Matches a certain regex           | Sanity check: if this isn't true, it's unlikely that it's a user id.
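Those properties translate directly into asserts that can be re-run on every refresh, not just at pipeline setup. A minimal sketch, assuming pandas; the regex, cardinality threshold, `tenant_id` column, and reading of the 1:* constraint are assumptions for illustration:

```python
import re
import pandas as pd

USER_ID_RE = re.compile(r"^[0-9a-f]{32}$")  # illustrative format only

def assert_is_user_id(users: pd.DataFrame, events: pd.DataFrame, col: str = "OfficeId") -> None:
    s = users[col]
    # Never null/empty: avoids job-breaking data skew.
    assert s.notna().all() and (s != "").all(), "null/empty ids"
    # One reading of "Users are 1:* with Tenants": a user maps to at most one tenant.
    assert users.groupby(col)["tenant_id"].nunique().le(1).all(), "user spans tenants"
    # Very high cardinality: otherwise it's unlikely to be a user id.
    assert s.nunique() / max(len(s), 1) > 0.9, "cardinality too low for a user id"
    # All event rows join to it: otherwise the data is incomplete.
    unjoined = int((~events[col].isin(set(s))).sum())
    assert unjoined == 0, f"{unjoined} event rows fail to join"
    # Matches the expected format.
    assert s.str.match(USER_ID_RE).all(), "ids do not match expected format"
```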

Data Asserts: Defensive Data Science

Data Asserts: Maintain Quality

Data Asserts: Clear Trust Boundaries

[Diagram: the producer's guarantees and the consumer's assumptions at the trust boundary. These should match!]
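One way to make the boundary explicit is to run the consumer's asserted expectations before any analysis touches the data. A hedged sketch; the decorator and check names are hypothetical, not a shipped API:

```python
from typing import Callable
import pandas as pd

def trust_boundary(*checks: Callable[[pd.DataFrame], None]):
    """Run every data assert at the boundary, before the analysis body."""
    def decorator(analysis):
        def wrapped(df: pd.DataFrame):
            for check in checks:
                check(df)  # fail loudly here, not deep inside the analysis
            return analysis(df)
        return wrapped
    return decorator

def no_duplicate_users(df: pd.DataFrame) -> None:
    assert df["user_id"].is_unique, "user_id is not a primary key"

@trust_boundary(no_duplicate_users)
def weekly_active_users(df: pd.DataFrame) -> int:
    return df["user_id"].nunique()
```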


Data Asserts in Production: A Few Observations

• Most of the analysis-impacting assertion failures we've seen were actually errors in our assumptions, not errors in the pipeline.

• Good tests beget good code: we’ve had to modularize our code in order to produce testable chunks that get re-used in pipelines.

• Data Asserts is the backbone of data provenance. A data conclusion can directly link to all of the assumptions we made about its inputs.
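As a rough sketch of that provenance link (the record shape and persistence strategy are assumptions for illustration), each conclusion can carry the named asserts it depended on:

```python
import datetime
import json

def run_with_provenance(checks: dict, df) -> dict:
    """Run named data asserts and emit a provenance record for the conclusion."""
    record = {"run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
              "asserts": {}}
    try:
        for name, check in checks.items():
            try:
                check(df)
                record["asserts"][name] = "passed"
            except AssertionError as err:
                record["asserts"][name] = f"failed: {err}"
                raise
    finally:
        # In practice, persist this next to the published result so the
        # conclusion links back to every input assumption that was checked.
        print(json.dumps(record, indent=2))
    return record
```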
