Unit Testing Data: Can I Trust This Data?

Abstract

"Can I trust this data?" This question can be difficult to measure and answer objectively.

Just as unit tests provide metrics for code coverage and bug regressions, this talk shows techniques and recipes for quantifying data sanitisation and coverage. It also demonstrates an extensible design pattern that allows further tests to be developed.

If you can write a query, you can write data unit tests.

These strategies have been used in the ETL pipeline at Invoice2go for the last 2 years to detect data regressions in its Amazon Redshift data warehouse.

● Introduction / Scenario (5 min)
● Design Pattern (10 min)
  ○ Test History Table
  ○ Test Scaffold
  ○ Current Status View
  ○ Deep Linking - Trail of Bread Crumbs
● Test Recipes (10 min)
  ○ Basic Tests
  ○ Time Series tests
  ○ Meta testing
● Conclusion and Questions (5 min)

Who am I?

Josh Wilson

Senior Software Engineer @ Invoice2go

Data Engineering Team

Twitter: @_neozenith

LinkedIn: https://au.linkedin.com/in/neozenith

Scenario

● You work for the Data Engineering and/or Data Analytics team within your company.
● You have a daily KPI dashboard for business managers.
● Every day at 6am your boss gets emailed this dashboard.
● At 6.05am you get an email from your boss saying: "Can I trust this data?"

Scenario - Can I trust this data?

● How do you even define trust?
  ○ “firm belief in the reliability, truth, or ability of someone or something”
● What do you do?
  ○ Run a bunch of queries
● Then how do you measure trust?
  ○ Build a series of queries (tests) that each affirm whether an assumption is true.
  ○ Assertions of truth… sounds like Unit Testing.

Design Pattern

● Test History Table: a place to save all test results
● Test Scaffold: the basic boilerplate for a test
● Current Status View: the latest status for each test
● Deep Links: reproducibility, traceability, deeper analysis

Design - Test History Table

CREATE TABLE "public"."etl_qa_history" (
    "id"            int4 NOT NULL DEFAULT "identity"(1, 0, '0,1'::text),
    "table_name"    varchar(255),                     -- each table is a unit
    "test_name"     varchar(255),                     -- create a naming convention for tests
    "test_time_utc" timestamp NULL DEFAULT getdate(), -- timestamp everything
    "status"        varchar(4),                       -- PASS / FAIL / WARN / NA
    "description"   varchar(MAX),                     -- human readable error messages
    "url"           varchar(256)                      -- deep linking
);
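To try the history table without a Redshift cluster, the following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. The SQLite column types, the CHECK constraint on status, the record_result helper and the sample descriptions are all assumptions made for illustration; they are not part of the original DDL.

```python
import sqlite3

# SQLite stand-in for the Redshift table above. Assumption: types and defaults
# are approximated, because Redshift's identity column and getdate() do not
# exist in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_qa_history (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name    TEXT,
        test_name     TEXT,
        test_time_utc TEXT DEFAULT CURRENT_TIMESTAMP,
        status        TEXT CHECK (status IN ('PASS', 'FAIL', 'WARN', 'NA')),
        description   TEXT,
        url           TEXT
    )
""")


def record_result(table_name, test_name, status, description, url=None):
    """Append one test result; id and test_time_utc are filled in by defaults."""
    conn.execute(
        "INSERT INTO etl_qa_history "
        "(table_name, test_name, status, description, url) "
        "VALUES (?, ?, ?, ?, ?)",
        (table_name, test_name, status, description, url),
    )


# Results are only ever appended, so the table keeps the full history.
record_result("internal.reminders", "row_count", "PASS", "42 rows found.")
record_result("internal.reminders", "row_count", "FAIL", "0 rows found.")

history = conn.execute(
    "SELECT id, table_name, test_name, status FROM etl_qa_history ORDER BY id"
).fetchall()
for row in history:
    print(row)
```

Because every run appends a row rather than updating one, the same table can later answer both "what is the status now?" and "when did this test start failing?".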

Design - Test Scaffold

INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH -- make use of CTEs to keep queries neat
test_source AS (...),
test_cases AS (...)
SELECT
    'internal.ios_hardware_strings'::text as table_name,
    'detect unmapped ios hardware strings'::text as test_name,
    status::text,
    description::text,
    url::text
FROM test_cases;
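The scaffold leaves the bodies of test_source and test_cases elided. As one possible illustration of how they might be filled in, here is a sketch run against SQLite through Python's sqlite3 module. The device_events table, the columns of ios_hardware_strings, the sample rows and the CTE bodies are all hypothetical; only the table name, the test name and the overall INSERT ... WITH ... SELECT shape come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_qa_history (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name TEXT, test_name TEXT,
        test_time_utc TEXT DEFAULT CURRENT_TIMESTAMP,
        status TEXT, description TEXT, url TEXT
    );
    -- Hypothetical tables: a mapping from hardware string to model name,
    -- and the hardware strings actually reported by devices.
    CREATE TABLE ios_hardware_strings (hardware_string TEXT, model_name TEXT);
    CREATE TABLE device_events (hardware_string TEXT);
    INSERT INTO ios_hardware_strings VALUES ('iPhone9,1', 'iPhone 7');
    INSERT INTO device_events VALUES ('iPhone9,1'), ('iPhone10,3'), ('iPhone10,3');
""")

SCAFFOLD = """
INSERT INTO etl_qa_history (table_name, test_name, status, description, url)
WITH test_source AS (
    -- rows that violate the assumption "every hardware string is mapped"
    SELECT DISTINCT e.hardware_string
    FROM device_events e
    LEFT JOIN ios_hardware_strings m ON m.hardware_string = e.hardware_string
    WHERE m.hardware_string IS NULL
),
test_cases AS (
    -- collapse the violations into a single result row
    SELECT
        CASE WHEN COUNT(*) = 0 THEN 'PASS' ELSE 'FAIL' END AS status,
        COUNT(*) || ' unmapped hardware strings found.' AS description,
        NULL AS url
    FROM test_source
)
SELECT 'internal.ios_hardware_strings', 'detect unmapped ios hardware strings',
       status, description, url
FROM test_cases
"""

conn.execute(SCAFFOLD)
result = conn.execute("SELECT status, description FROM etl_qa_history").fetchone()
print(result)
```

Splitting the work into "find the offending rows" and "turn them into a status" keeps each test readable, and the final SELECT stays identical from test to test apart from the two labels.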

Design - Test Scaffold - Combine Multiple Tests

INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH
test_duplicate_test_name as (...),
watchdog_issues as (...),
warning_watchdog as (...),
valid_status as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_duplicate_test_name
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM warning_watchdog
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM valid_status
)
SELECT table_name::text, test_name::text, status::text, description::text, url::text
FROM all_tests;

“If you can write a query, you can write a data unit test” - Josh Wilson

Design - Current Status View

CREATE OR REPLACE VIEW "public"."etl_qa_current_status" AS
SELECT table_name, test_name, last_run_time_utc, status, description, url
FROM (
    SELECT table_name, test_name, test_time_utc as last_run_time_utc, status, description, url,
           rank() OVER (PARTITION BY table_name, test_name ORDER BY test_time_utc DESC) as rank
    FROM public.etl_qa_history
) as rank_event
WHERE (
    (rank_event.rank = 1) -- Only the most recent test result
    AND (rank_event.last_run_time_utc >= date_add('day', -7, getdate())) -- window only a week worth of tests
);
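The effect of the view is easiest to see with a few rows of history. The sketch below rebuilds it in SQLite through Python's sqlite3 module, assuming a SQLite build with window function support (3.25 or later). The datetime('now', ...) calls replace Redshift's getdate() and date_add, the alias rnk replaces rank, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_qa_history (
        table_name TEXT, test_name TEXT, test_time_utc TEXT,
        status TEXT, description TEXT, url TEXT
    );

    -- Keep the most recent result per (table_name, test_name),
    -- and only if it ran within the last 7 days.
    CREATE VIEW etl_qa_current_status AS
    SELECT table_name, test_name, last_run_time_utc, status, description, url
    FROM (
        SELECT table_name, test_name, test_time_utc AS last_run_time_utc,
               status, description, url,
               rank() OVER (PARTITION BY table_name, test_name
                            ORDER BY test_time_utc DESC) AS rnk
        FROM etl_qa_history
    ) AS rank_event
    WHERE rnk = 1
      AND last_run_time_utc >= datetime('now', '-7 days');
""")

# Invented history: (table, test, offset from now, status).
rows = [
    ("internal.reminders", "row_count", "-2 days", "FAIL"),   # superseded
    ("internal.reminders", "row_count", "-1 hours", "PASS"),  # latest result
    ("internal.reminders", "dedupe", "-30 days", "PASS"),     # stale, hidden
]
for table_name, test_name, offset, status in rows:
    conn.execute(
        "INSERT INTO etl_qa_history (table_name, test_name, test_time_utc, status) "
        "VALUES (?, ?, datetime('now', ?), ?)",
        (table_name, test_name, offset, status),
    )

current = conn.execute(
    "SELECT table_name, test_name, status FROM etl_qa_current_status"
).fetchall()
print(current)
```

Note that rank() returns more than one row per test if two results share exactly the same timestamp; row_number() would be a stricter alternative if that matters in your warehouse.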

Design - Deep Linking

● Deep Links: reproducibility, traceability, deeper analysis and 3am PagerDuty

Design - Deep Linking + Parameters
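The slides for this part are screenshots, so the exact link format used in the talk is not recoverable here. As a rough sketch of the idea, the value stored in the url column could be a link to a saved query or dashboard with the failing test's parameters already filled in. The base URL, the parameter names and the deep_link helper below are all hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def deep_link(base_url, **params):
    """Build a link to a saved query or dashboard, pre-filled with parameters."""
    if not params:
        return base_url
    return f"{base_url}?{urlencode(params)}"


# Hypothetical BI tool URL; the parameters identify the failing test so that
# whoever follows the link lands on the relevant rows.
url = deep_link(
    "https://bi.example.com/queries/42",
    table_name="internal.reminders",
    test_name="row_count",
)
print(url)

# The parameters can be recovered from the link again.
print(parse_qs(urlsplit(url).query))
```

Storing a ready-made link next to each result means the person paged at 3am can go straight from the failure to the underlying data instead of reconstructing the query by hand.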

Test Recipes

● Basic tests
  ○ Some simple property tests
● Time Series tests
  ○ Time evolving tests
● Meta testing
  ○ Testing the tests

Test Recipes - Basic Tests

● 0 row count

● Duplicates

● Data Age

INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH
source_tables as (...),
test_row_count as (...),
test_dedupe as (...),
test_data_age as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_row_count
    UNION
    SELECT table_name, test_name, status, description, url FROM test_dedupe
    UNION
    SELECT table_name, test_name, status, description, url FROM test_data_age
)
SELECT table_name, test_name, status, description, url
FROM all_tests;

source_tables as (
    SELECT 'internal.reminders' as table_name -- Give table data a label to identify it
         , id as pk_id                        -- Alias primary key id to make it uniform
         , updated_at as timestamp_utc        -- Alias timestamp to make it uniform
    FROM internal.reminders

    UNION ALL

    SELECT 'internal.reminder_notifications' as table_name
         , id as pk_id
         , updated_at as timestamp_utc
    FROM internal.reminder_notifications
)

Test Recipes - Basic Tests - Row Count

test_row_count as (
    SELECT table_name
         , 'row_count' as test_name
         , CASE WHEN count(1) > 0 THEN 'PASS'::text ELSE 'FAIL'::text END as status
         , (count(1))::text || ' rows found between ' || min(timestamp_utc)::text || ' UTC and ' || max(timestamp_utc)::text || ' UTC.' as description
         , NULL::text as url
    FROM source_tables
    WHERE timestamp_utc >= dateadd(hour, -24, getdate()) -- SLA check last 24 hour window
    GROUP BY 1
)
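The row count recipe can be exercised end to end with a small sketch in SQLite through Python's sqlite3 module. One detail worth knowing about the query as written: because the WHERE clause filters before GROUP BY, a table with no rows in the 24 hour window produces no result row at all, so nothing is recorded for it. The variant below is an alternative, not the query from the slides: it moves the window into a conditional SUM so that such a table still reports FAIL. The sample rows and the SQLite date functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reminders (id INTEGER, updated_at TEXT);
    CREATE TABLE reminder_notifications (id INTEGER, updated_at TEXT);
    -- Invented data: reminders is fresh, reminder_notifications is 3 days old.
    INSERT INTO reminders VALUES
        (1, datetime('now', '-1 hours')), (2, datetime('now', '-2 hours'));
    INSERT INTO reminder_notifications VALUES (1, datetime('now', '-3 days'));
""")

ROW_COUNT_TEST = """
WITH source_tables AS (
    -- Label each table and alias its columns so every test sees one shape.
    SELECT 'internal.reminders' AS table_name, id AS pk_id,
           updated_at AS timestamp_utc
    FROM reminders
    UNION ALL
    SELECT 'internal.reminder_notifications', id, updated_at
    FROM reminder_notifications
),
test_row_count AS (
    -- Count only rows inside the 24 hour window, but keep every table in
    -- the result so that an empty window still yields a row.
    SELECT table_name,
           'row_count' AS test_name,
           SUM(CASE WHEN timestamp_utc >= datetime('now', '-24 hours')
                    THEN 1 ELSE 0 END) AS recent_rows
    FROM source_tables
    GROUP BY table_name
)
SELECT table_name, test_name,
       CASE WHEN recent_rows > 0 THEN 'PASS' ELSE 'FAIL' END AS status,
       recent_rows || ' rows found in the last 24 hours.' AS description
FROM test_row_count
ORDER BY table_name
"""

results = conn.execute(ROW_COUNT_TEST).fetchall()
for row in results:
    print(row)
```

A table that is completely empty still produces no row in either version, since it contributes nothing to source_tables; catching that case needs a separate check on which tests have reported recently.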


INSERT INTO public.etl_qa_history ( table_name, test_name, status, description, url)

WITHsource_tables as (...), test_row_count as (...), test_dedupe as (...), test_data_age as (...)

, all_tests as ( SELECT table_name, test_name, status, description, url FROM test_row_count UNION SELECT table_name, test_name, status, description, url FROM test_dedupe UNION SELECT table_name, test_name, status, description, url FROM test_data_age)

SELECT table_name, test_name,status, description, urlFROM all_tests;

Test Recipes - Basic Tests - Duplicates


test_dedupe as (
    SELECT table_name
         , 'duplicates' as test_name
         , CASE WHEN count(1) - count(distinct pk_id) = 0 THEN 'PASS'::text ELSE 'FAIL'::text END as status
         , (count(1) - count(distinct pk_id))::text || ' duplicates found' as description
         , NULL::text as url
    FROM source_tables
    GROUP BY 1
)
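The duplicates recipe can be checked the same way. This is a small sketch against an in-memory SQLite table; the two-column `source_tables` layout and the `run_dedupe` helper are assumptions for the example only.

```python
import sqlite3

# SQLite translation of the duplicates recipe (illustrative).
DEDUPE_SQL = """
SELECT table_name
     , 'duplicates' AS test_name
     , CASE WHEN count(1) - count(DISTINCT pk_id) = 0 THEN 'PASS' ELSE 'FAIL' END AS status
     , (count(1) - count(DISTINCT pk_id)) || ' duplicates found' AS description
FROM source_tables
GROUP BY 1
ORDER BY 1
"""


def run_dedupe(rows):
    """rows: (table_name, pk_id) tuples, one per source row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_tables (table_name TEXT, pk_id INTEGER)")
    conn.executemany("INSERT INTO source_tables VALUES (?, ?)", rows)
    result = conn.execute(DEDUPE_SQL).fetchall()
    conn.close()
    return result
```

Because `count(DISTINCT pk_id)` skips NULLs, rows with a NULL key are counted as duplicates by this expression, which is usually the desired outcome for a primary key check.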


Test Recipes - Basic Tests - Data Age


test_data_age as (
    SELECT table_name
         , 'data_age' as test_name
         , CASE WHEN date_diff('hour', max(timestamp_utc), getdate())::int <= 24
                 AND date_diff('hour', max(timestamp_utc), getdate())::int >= 0
                THEN 'PASS'::text ELSE 'FAIL'::text END as status
         , (date_diff('hour', max(timestamp_utc), getdate())::int)::text
             || ' hour(s) since last update at ' || (max(timestamp_utc))::text as description
         , NULL::text as url
    FROM source_tables
    GROUP BY 1
)
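A rough SQLite sketch of the data age recipe follows. The `run_data_age` helper and the `:now` parameter are assumptions for this illustration, and the age is computed from epoch seconds, which only approximates how a warehouse `date_diff('hour', ...)` counts hours.

```python
import sqlite3

# SQLite translation of the data_age recipe (illustrative).
DATA_AGE_SQL = """
WITH age AS (
    SELECT table_name
         , max(timestamp_utc) AS last_update
         , CAST(strftime('%s', :now) AS INTEGER)
           - CAST(strftime('%s', max(timestamp_utc)) AS INTEGER) AS seconds_old
    FROM source_tables
    GROUP BY 1
)
SELECT table_name
     , 'data_age' AS test_name
     -- PASS only when the newest row is 0 to 24 hours old; future timestamps FAIL
     , CASE WHEN seconds_old BETWEEN 0 AND 24 * 3600 THEN 'PASS' ELSE 'FAIL' END AS status
     , (seconds_old / 3600) || ' hour(s) since last update at ' || last_update AS description
FROM age
ORDER BY 1
"""


def run_data_age(rows, now):
    """rows: (table_name, timestamp_utc) tuples; now: 'YYYY-MM-DD HH:MM:SS' in UTC."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_tables (table_name TEXT, timestamp_utc TEXT)")
    conn.executemany("INSERT INTO source_tables VALUES (?, ?)", rows)
    result = conn.execute(DATA_AGE_SQL, {"now": now}).fetchall()
    conn.close()
    return result
```

The lower bound matters as much as the upper one: a newest timestamp in the future usually points at a timezone or clock problem upstream, so it is reported as a failure rather than as very fresh data.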


Test Recipes - Basic Tests


INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)

WITH source_tables as (...) -- Add more tables in here and have them receive the same test treatment
, test_row_count as (...)
, test_dedupe as (...)
, test_data_age as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_row_count
    UNION
    SELECT table_name, test_name, status, description, url FROM test_dedupe
    UNION
    SELECT table_name, test_name, status, description, url FROM test_data_age
)
SELECT table_name, test_name, status, description, url
FROM all_tests;


Test Recipes

● Basic tests
    ○ Some simple property tests
● Time Series tests
    ○ Time evolving tests
● Meta testing
    ○ Testing the tests


Test Recipes - Time Series

“This has been left as an exercise for the reader”
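The slides do not include a time series recipe. As one possible starting point, and not the author's method, the sketch below compares the latest daily row count of each table with the average of its earlier days and emits a WARN when it drops below a ratio of that baseline. The `daily_counts` table, the `row_count_trend` test name, the `min_ratio` threshold and the `run_trend` helper are all hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3

# Hypothetical time-evolving test: latest day versus the average of earlier days.
TREND_SQL = """
WITH latest AS (
    SELECT table_name, max(day) AS day
    FROM daily_counts
    GROUP BY 1
)
, baseline AS (
    SELECT d.table_name, avg(d.row_count) AS avg_rows
    FROM daily_counts AS d
    JOIN latest AS l ON l.table_name = d.table_name AND d.day < l.day
    GROUP BY 1
)
SELECT d.table_name
     , 'row_count_trend' AS test_name
     , CASE WHEN d.row_count >= :min_ratio * b.avg_rows THEN 'PASS' ELSE 'WARN' END AS status
     , d.row_count || ' rows on ' || d.day || ' vs trailing average of '
         || round(b.avg_rows, 1) AS description
FROM daily_counts AS d
JOIN latest AS l ON l.table_name = d.table_name AND l.day = d.day
JOIN baseline AS b ON b.table_name = d.table_name
ORDER BY 1
"""


def run_trend(rows, min_ratio=0.5):
    """rows: (table_name, day, row_count) tuples, with day as 'YYYY-MM-DD' text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE daily_counts (table_name TEXT, day TEXT, row_count INTEGER)")
    conn.executemany("INSERT INTO daily_counts VALUES (?, ?, ?)", rows)
    result = conn.execute(TREND_SQL, {"min_ratio": min_ratio}).fetchall()
    conn.close()
    return result
```

Tables with only a single day of history have no baseline and are left out of the result in this sketch. The output keeps the same `table_name, test_name, status, description` shape as the basic tests, so it could feed the same history table.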


Test Recipes - Meta Testing


● Warnings Watchdog
    ○ Promote persistent warnings to failures. Don't let yourself get lazy.
● Dependencies
    ○ When derived tables are built from failing tables, we want to know that errors are propagating.


Test Recipes - Meta Testing - Watchdog


INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)

WITH test_duplicate_test_name as (...)
, watchdog_issues as (...)
, warning_watchdog as (...)
, valid_status as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_duplicate_test_name
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM warning_watchdog
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM valid_status
)
SELECT table_name::text, test_name::text, status::text, description::text, url::text
FROM all_tests;


, watchdog_issues as (
    SELECT table_name, test_name
    FROM public.etl_qa_history
    WHERE test_time_utc >= dateadd(day, -5, getdate())
      AND upper(status) = 'WARN'
    -- 5 / 5 distinct days this test had warnings and did not pass once
    GROUP BY 1,2
    HAVING count(distinct test_time_utc::date) >= 5

    EXCEPT

    -- Once the test has 1 pass it will get excluded from these results
    -- And won't have the FAIL override generated
    SELECT distinct table_name, test_name
    FROM public.etl_qa_history
    WHERE test_time_utc >= dateadd(day, -5, getdate())
      AND upper(status) = 'PASS'
)


, warning_watchdog as (
    SELECT W.table_name, W.test_name
         , 'FAIL' as status -- hard code this test to FAIL
         , 'WARNING WATCHDOG: ' || CS.description as description -- Prefix why this warning became a FAIL
         , CS.url
    FROM public.etl_qa_current_status as CS
    -- Intersection of current issues and watchdog flagged issues
    INNER JOIN watchdog_issues as W
        ON CS.table_name = W.table_name
       AND CS.test_name = W.test_name
)
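The watchdog selection rule (WARN on at least five distinct days in the window, with no PASS in the same window) can be exercised on a toy history table. This is a sketch under assumptions: SQLite replaces the warehouse, the simplified `etl_qa_history` layout has only the columns the rule needs, and the `run_watchdog` helper and `:window_start` parameter are invented for the example.

```python
import sqlite3

# SQLite translation of the watchdog_issues rule (illustrative).
WATCHDOG_SQL = """
SELECT table_name, test_name
FROM etl_qa_history
WHERE test_time_utc >= :window_start
  AND upper(status) = 'WARN'
GROUP BY 1, 2
HAVING count(DISTINCT date(test_time_utc)) >= 5

EXCEPT

-- a single PASS inside the window removes the test from the watchdog list
SELECT DISTINCT table_name, test_name
FROM etl_qa_history
WHERE test_time_utc >= :window_start
  AND upper(status) = 'PASS'

ORDER BY 1, 2
"""


def run_watchdog(history, window_start):
    """history: (table_name, test_name, status, test_time_utc) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE etl_qa_history "
        "(table_name TEXT, test_name TEXT, status TEXT, test_time_utc TEXT)"
    )
    conn.executemany("INSERT INTO etl_qa_history VALUES (?, ?, ?, ?)", history)
    result = conn.execute(WATCHDOG_SQL, {"window_start": window_start}).fetchall()
    conn.close()
    return result
```

Each pair returned here is what the `warning_watchdog` step would then join against the current status view and re-emit as a hard FAIL.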


Test Recipes - Meta Testing - Dependencies


INSERT INTO public.etl_qa_history (table_name, test_name, status, description)

WITH test_source AS (...)
, test_stats AS (...)
, test_cases AS (...)

SELECT 'public.customer'::text as table_name
     , ('data_dependency_'||table_name)::text as test_name -- each dependency needs to be its own test
     , status::text
     , description::text
FROM test_cases;


test_source AS (
    SELECT table_name, status
         , count(*)::float as test_count
    FROM public.etl_qa_current_status
    WHERE -- dependency list
        table_name IN (
            'public.account'
          , 'public.subscription'
          , 'public.document'
            -- etc etc etc
        )
    GROUP BY 1, 2 -- one row per table and status
)
, test_stats AS (...)
, test_cases AS (...)

SELECT 'public.customer'::text as table_name
     , ('data_dependency_'||table_name)::text as test_name -- each dependency needs to be its own test
     , status::text
     , description::text
FROM test_cases;


test_source AS (...)
, test_stats AS (
    SELECT table_name
         -- pivot on status
         , sum(CASE WHEN status = 'PASS' THEN test_count ELSE 0 END) / sum(test_count) as PASS
         , sum(CASE WHEN status = 'WARN' THEN test_count ELSE 0 END) / sum(test_count) as WARN
         , sum(CASE WHEN status = 'FAIL' THEN test_count ELSE 0 END) / sum(test_count) as FAIL
         , sum(CASE WHEN status = 'N/A' THEN test_count ELSE 0 END) / sum(test_count) as NA
         , sum(test_count) as total
    FROM test_source
    GROUP BY 1
)
, test_cases AS (...)

SELECT ...
FROM test_cases;


test_source AS (...)
, test_stats AS (...)
, test_cases AS (
    SELECT table_name
         , CASE WHEN pass = 1 THEN 'PASS'::text
                WHEN fail > 0 THEN 'FAIL'::text
                WHEN warn > 0 THEN 'WARN'::text
                WHEN na > 0 THEN 'N/A'::text
                ELSE 'FAIL'::text END as status
         -- Generate Description
         , round(pass*100,1) || '% tests PASS; '
             || round(fail*100,1) || '% tests FAIL; '
             || round(warn*100,1) || '% tests WARN; '
             || round(na*100,1) || '% tests N/A; '
             || round((1-pass-fail-warn-na)*100,1) || '% tests Unaccounted; ' -- catch unexpected test statuses causing non-passes
           as description
    FROM test_stats
)

SELECT ...
FROM test_cases;
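The precedence in the status roll-up is easy to misread, so here is a small plain-Python sketch of the same decision order for a single dependency table. The `dependency_status` function is a hypothetical helper written for this illustration; it works on a list of current test statuses instead of the pivoted fractions.

```python
from collections import Counter

KNOWN_STATUSES = ("PASS", "FAIL", "WARN", "N/A")


def dependency_status(statuses):
    """Roll the current test statuses of one dependency up into (status, description)."""
    if not statuses:
        raise ValueError("at least one test status is required")
    counts = Counter(statuses)
    total = len(statuses)
    unaccounted = total - sum(counts[s] for s in KNOWN_STATUSES)

    # Same precedence as the CASE expression: everything passes, then FAIL, WARN, N/A.
    if counts["PASS"] == total:
        status = "PASS"
    elif counts["FAIL"] > 0:
        status = "FAIL"
    elif counts["WARN"] > 0:
        status = "WARN"
    elif counts["N/A"] > 0:
        status = "N/A"
    else:
        status = "FAIL"  # only unexpected statuses are left, treat as a failure

    parts = [f"{100 * counts[s] / total:.1f}% tests {s}" for s in KNOWN_STATUSES]
    parts.append(f"{100 * unaccounted / total:.1f}% tests Unaccounted")
    return status, "; ".join(parts) + "; "
```

A dependency with one passing test and one test in an unrecognised status falls through to the final branch and is reported as FAIL, which is what the "Unaccounted" percentage in the description is there to explain.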


Conclusion

● These tests are an evolution: they are not complete and probably never will be, which is a good thing.

● This is an engineering process that others in your team can contribute to and expand upon.

● Rather than frantically running a variety of queries from memory, reuse those skills in an automated manner.

● It also quickly isolates the scope of what is and is not passing.

● It helps you leave a breadcrumb trail of deeper diagnostics in advance.


Thank you.


Questions?