Unit Testing Data: Can I Trust This Data?
"Can I trust this data?" is a question that can be difficult to measure and answer objectively.
Similar to how unit tests have provided metrics for code coverage and bug regressions, this talk aims to show techniques and recipes developed to quantify data sanitisation and coverage. It also demonstrates an extensible design pattern that allows further tests to be developed.
If you can write a query, you can write data unit tests.
These strategies have been implemented at Invoice2go in their ETL pipeline for the last 2 years to detect data regressions in their Amazon Redshift data warehouse.
Abstract
● Introduction / Scenario (5 min)
● Design Pattern (10 min)
  ○ Test History Table
  ○ Test Scaffold
  ○ Current Status View
  ○ Deep Linking - Trail of Bread Crumbs
● Test Recipes (10 min)
  ○ Basic Tests
  ○ Time Series Tests
  ○ Meta Testing
● Conclusion and Questions (5 min)
Who am I?
Josh Wilson
Senior Software Engineer @ Invoice2go
Data Engineering Team
Twitter: @_neozenith
LinkedIn: https://au.linkedin.com/in/neozenith
Scenario

● You work for the Data Engineering and/or Data Analytics team within your company.
● You have a Daily KPI dashboard for Business Managers.
● Every day at 6am your boss gets emailed this dashboard.
● At 6.05am you get an email from your boss asking: "Can I trust this data?"
Scenario - Can I trust this data?

● How do you even define trust?
  ○ "firm belief in the reliability, truth, or ability of someone or something"
● What do you do?
  ○ Run a bunch of queries
● Then how do you measure trust?
  ○ Build a series of queries (tests) which affirm whether an assumption is true.
  ○ Assertions of truth… sounds like unit testing.
Design Pattern

● Test History Table – a place to save all test results
● Test Scaffold – the basic boilerplate for a test
● Current Status View – the latest status for each test
● Deep Links – reproducibility, traceability, deeper analysis
Design - Test History Table

CREATE TABLE "public"."etl_qa_history" (
    "id" int4 NOT NULL DEFAULT "identity"(1, 0, '0,1'::text),
    "table_name" varchar(255),                        -- each table is a unit
    "test_name" varchar(255),                         -- create a naming convention for tests
    "test_time_utc" timestamp NULL DEFAULT getdate(), -- timestamp everything
    "status" varchar(4),                              -- PASS / FAIL / WARN / NA
    "description" varchar(MAX),                       -- human readable error messages
    "url" varchar(256)                                -- deep linking
);
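To experiment with the pattern without a Redshift cluster, a minimal sketch of the same history table can be built with Python's sqlite3. The table, columns, and sample row below are hypothetical stand-ins; Redshift-specific types and the identity() default are simplified to SQLite equivalents.

```python
import sqlite3

# Hypothetical local stand-in for etl_qa_history; column names follow the talk,
# types are simplified for SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_qa_history (
        id            INTEGER PRIMARY KEY AUTOINCREMENT,
        table_name    TEXT,                            -- each table is a unit
        test_name     TEXT,                            -- naming convention for tests
        test_time_utc TEXT DEFAULT CURRENT_TIMESTAMP,  -- timestamp everything
        status        TEXT,                            -- PASS / FAIL / WARN / NA
        description   TEXT,                            -- human readable error messages
        url           TEXT                             -- deep linking
    )
""")
conn.execute(
    "INSERT INTO etl_qa_history (table_name, test_name, status, description) "
    "VALUES (?, ?, ?, ?)",
    ("public.example", "row_count", "PASS", "42 rows found"),
)
row = conn.execute(
    "SELECT table_name, test_name, status FROM etl_qa_history"
).fetchone()
print(row)  # ('public.example', 'row_count', 'PASS')
```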
Design - Test Scaffold

INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH -- make use of CTEs to keep queries neat
test_source AS (...),
test_cases AS (...)
SELECT 'internal.ios_hardware_strings'::text as table_name,
       'detect unmapped ios hardware strings'::text as test_name,
       status::text, description::text, url::text
FROM test_cases;
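An end-to-end version of the scaffold can be sketched in SQLite: a CTE computes the test result and the final SELECT feeds the INSERT. The source table, its rows, and the unmapped-model check below are hypothetical, invented only to make the scaffold runnable locally.

```python
import sqlite3

# Hypothetical miniature of the scaffold: CTEs compute the result,
# the SELECT feeds INSERT INTO the history table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_qa_history (
        table_name TEXT, test_name TEXT, status TEXT, description TEXT, url TEXT
    );
    CREATE TABLE ios_hardware_strings (hardware TEXT, model TEXT);
    INSERT INTO ios_hardware_strings VALUES ('iPhone9,1', NULL);  -- unmapped row
""")
conn.execute("""
    INSERT INTO etl_qa_history (table_name, test_name, status, description, url)
    WITH test_source AS (
        SELECT count(*) AS unmapped
        FROM ios_hardware_strings
        WHERE model IS NULL
    ),
    test_cases AS (
        SELECT CASE WHEN unmapped = 0 THEN 'PASS' ELSE 'FAIL' END AS status,
               unmapped || ' unmapped hardware strings' AS description,
               NULL AS url
        FROM test_source
    )
    SELECT 'internal.ios_hardware_strings',
           'detect unmapped ios hardware strings',
           status, description, url
    FROM test_cases
""")
result = conn.execute(
    "SELECT status, description FROM etl_qa_history"
).fetchone()
print(result)  # ('FAIL', '1 unmapped hardware strings')
```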
INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH test_duplicate_test_name as (...),
     watchdog_issues as (...),
     warning_watchdog as (...),
     valid_status as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_duplicate_test_name
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM warning_watchdog
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM valid_status
)
SELECT table_name::text, test_name::text, status::text, description::text, url::text
FROM all_tests;

Design - Test Scaffold - Combine Multiple Tests
“If you can write a query, you can write a data unit test” - Josh Wilson
Design - Test Scaffold
Design - Current Status View

CREATE OR REPLACE VIEW "public"."etl_qa_current_status" AS
SELECT table_name, test_name, last_run_time_utc, status, description, url
FROM (
    SELECT table_name, test_name, test_time_utc as last_run_time_utc,
           status, description, url,
           rank() OVER (PARTITION BY table_name, test_name ORDER BY test_time_utc DESC) as rank
    FROM public.etl_qa_history
) as rank_event
WHERE (
    (rank_event.rank = 1) -- Only the most recent test result
    AND (rank_event.last_run_time_utc >= date_add('day', -7, getdate())) -- window only a week's worth of tests
);
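The same latest-result-per-test idea can be sketched with SQLite's window functions (3.25+). The history rows below are hypothetical, and the one-week freshness filter is omitted so the example stays deterministic.

```python
import sqlite3

# Sketch of the current-status view: rank history per (table_name, test_name)
# and keep only the most recent row for each test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_qa_history (
        table_name TEXT, test_name TEXT, test_time_utc TEXT, status TEXT
    );
    INSERT INTO etl_qa_history VALUES
        ('public.account', 'row_count', '2018-08-01 06:00', 'FAIL'),
        ('public.account', 'row_count', '2018-08-02 06:00', 'PASS');
    CREATE VIEW etl_qa_current_status AS
    SELECT table_name, test_name, last_run_time_utc, status
    FROM (
        SELECT table_name, test_name, test_time_utc AS last_run_time_utc, status,
               rank() OVER (PARTITION BY table_name, test_name
                            ORDER BY test_time_utc DESC) AS rnk
        FROM etl_qa_history
    )
    WHERE rnk = 1;  -- only the most recent result per test
""")
latest = conn.execute("SELECT status FROM etl_qa_current_status").fetchone()
print(latest)  # ('PASS',)
```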
Design Pattern

● Test History Table – a place to save all test results
● Test Scaffold – the basic boilerplate for a test
● Current Status View – the latest status for each test
● Deep Links – reproducibility, traceability, deeper analysis and 3am PagerDuty
Design - Deep Linking

[Slides: screenshots of deep links attached to test results]

Design - Deep Linking + Parameters

[Slides: screenshots of deep links with URL parameters]
Test Recipes
● Basic tests
  ○ Some simple property tests
● Time Series tests
  ○ Time evolving tests
● Meta testing
  ○ Testing the tests
Test Recipes - Basic Tests
● 0 row count
● Duplicates
● Data Age
INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH
source_tables as (...),
test_row_count as (...),
test_dedupe as (...),
test_data_age as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_row_count
    UNION
    SELECT table_name, test_name, status, description, url FROM test_dedupe
    UNION
    SELECT table_name, test_name, status, description, url FROM test_data_age
)
SELECT table_name, test_name, status, description, url
FROM all_tests;

Test Recipes - Basic Tests
source_tables as (
    SELECT 'internal.reminders' as table_name -- Give table data a label to identify it
         , id as pk_id                        -- Alias primary key id to make it uniform
         , updated_at as timestamp_utc        -- Alias timestamp to make it uniform
    FROM internal.reminders

    UNION ALL

    SELECT 'internal.reminder_notifications' as table_name
         , id as pk_id
         , updated_at as timestamp_utc
    FROM internal.reminder_notifications
)

Test Recipes - Basic Tests
Test Recipes - Basic Tests - Row Count

test_row_count as (
    SELECT table_name
         , 'row_count' as test_name
         , CASE WHEN count(1) > 0 THEN 'PASS'::text ELSE 'FAIL'::text END as status
         , (count(1))::text || ' rows found between ' || min(timestamp_utc)::text
             || ' UTC and ' || max(timestamp_utc)::text || ' UTC.' as description
         , NULL::text as url
    FROM source_tables
    WHERE timestamp_utc >= dateadd(hour, -24, getdate()) -- SLA check last 24 hour window
    GROUP BY 1
)
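The row_count recipe can be tried locally in SQLite. The table and rows are hypothetical, a fixed "now" replaces getdate() so the 24-hour SLA window is deterministic, and SQLite's datetime() stands in for Redshift's dateadd().

```python
import sqlite3

# Hypothetical row_count check: PASS if any rows landed in the SLA window.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_tables (table_name TEXT, pk_id INT, timestamp_utc TEXT);
    INSERT INTO source_tables VALUES
        ('internal.reminders', 1, '2018-08-02 01:00'),
        ('internal.reminders', 2, '2018-08-02 03:00');
""")
row = conn.execute("""
    SELECT table_name
         , 'row_count' AS test_name
         , CASE WHEN count(1) > 0 THEN 'PASS' ELSE 'FAIL' END AS status
         , count(1) || ' rows found between ' || min(timestamp_utc)
                    || ' UTC and ' || max(timestamp_utc) || ' UTC.' AS description
    FROM source_tables
    WHERE timestamp_utc >= datetime('2018-08-02 06:00', '-24 hours')  -- SLA window
    GROUP BY table_name
""").fetchone()
print(row[2])  # PASS
```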
Test Recipes - Basic Tests - Duplicates

test_dedupe as (
    SELECT table_name
         , 'duplicates' as test_name
         , CASE WHEN count(1) - count(distinct pk_id) = 0 THEN 'PASS'::text ELSE 'FAIL'::text END as status
         , (count(1) - count(distinct pk_id))::text || ' duplicates found' as description
         , NULL::text as url
    FROM source_tables
    GROUP BY 1
)
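The duplicates recipe is the same count(*) minus count(distinct pk_id) comparison; here is a hypothetical SQLite run with one deliberately duplicated primary key.

```python
import sqlite3

# Hypothetical duplicates check: total rows minus distinct primary keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_tables (table_name TEXT, pk_id INT);
    INSERT INTO source_tables VALUES
        ('internal.reminders', 1),
        ('internal.reminders', 1),   -- duplicate primary key
        ('internal.reminders', 2);
""")
status, description = conn.execute("""
    SELECT CASE WHEN count(1) - count(DISTINCT pk_id) = 0
                THEN 'PASS' ELSE 'FAIL' END,
           (count(1) - count(DISTINCT pk_id)) || ' duplicates found'
    FROM source_tables
    GROUP BY table_name
""").fetchone()
print(status, description)  # FAIL 1 duplicates found
```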
Test Recipes - Basic Tests - Data Age

test_data_age as (
    SELECT table_name
         , 'data_age' as test_name
         , CASE WHEN date_diff('hour', max(timestamp_utc), getdate())::int <= 24
                 AND date_diff('hour', max(timestamp_utc), getdate())::int >= 0
                THEN 'PASS'::text ELSE 'FAIL'::text END as status
         , (date_diff('hour', max(timestamp_utc), getdate())::int)::text
             || ' hour(s) since last update at ' || (max(timestamp_utc))::text as description
         , NULL::text as url
    FROM source_tables
    GROUP BY 1
)
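The data_age recipe fails when the newest row is more than 24 hours old (or, suspiciously, in the future). SQLite has no date_diff, so this hypothetical sketch computes the age in hours from julianday() with a fixed "now".

```python
import sqlite3

# Hypothetical data_age check with a fixed "now" for determinism.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_tables (table_name TEXT, timestamp_utc TEXT);
    INSERT INTO source_tables VALUES ('internal.reminders', '2018-08-01 00:00');
""")
now = "2018-08-02 06:00"
age_hours, = conn.execute(
    "SELECT CAST((julianday(?) - julianday(max(timestamp_utc))) * 24 AS INT) "
    "FROM source_tables",
    (now,),
).fetchone()
# FAIL outside the 0..24 hour window, matching the talk's CASE expression.
status = "PASS" if 0 <= age_hours <= 24 else "FAIL"
print(age_hours, status)  # 30 FAIL
```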
INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH
source_tables as (...) -- Add more tables in here and have them receive the same test treatment
, test_row_count as (...)
, test_dedupe as (...)
, test_data_age as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_row_count
    UNION
    SELECT table_name, test_name, status, description, url FROM test_dedupe
    UNION
    SELECT table_name, test_name, status, description, url FROM test_data_age
)
SELECT table_name, test_name, status, description, url
FROM all_tests;

Test Recipes - Basic Tests
Test Recipes - Time Series

"This has been left as an exercise for the reader"

[Slides: time series chart examples]
Test Recipes - Meta Testing

● Warnings Watchdog
  ○ Promote persistent warnings to failures. Make yourself not lazy.
● Dependencies
  ○ When derived tables use other failing tables we want to know about errors propagating.
Test Recipes - Meta Testing - Watchdog

INSERT INTO public.etl_qa_history (table_name, test_name, status, description, url)
WITH test_duplicate_test_name as (...),
     watchdog_issues as (...),
     warning_watchdog as (...),
     valid_status as (...)
, all_tests as (
    SELECT table_name, test_name, status, description, url FROM test_duplicate_test_name
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM warning_watchdog
    UNION ALL
    SELECT table_name, test_name, status, description, url FROM valid_status
)
SELECT table_name::text, test_name::text, status::text, description::text, url::text
FROM all_tests;
Test Recipes - Meta Testing - Watchdog

, watchdog_issues as (
    SELECT table_name, test_name
    FROM public.etl_qa_history
    WHERE test_time_utc >= dateadd(day, -5, getdate())
      AND upper(status) = 'WARN'
    -- 5 / 5 distinct days this test had warnings and did not pass once
    GROUP BY 1,2
    HAVING count(distinct test_time_utc::date) >= 5

    EXCEPT

    -- Once the test has 1 pass it will get excluded from these results
    -- And won't have the FAIL override generated
    SELECT distinct table_name, test_name
    FROM public.etl_qa_history
    WHERE test_time_utc >= dateadd(day, -5, getdate())
      AND upper(status) = 'PASS'
)
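The watchdog idea, a test that has warned on 5 distinct days without a single pass gets promoted to FAIL, can be sketched in SQLite. The history rows are hypothetical and use fixed dates so the 5-day window is deterministic; the EXCEPT branch shows how one PASS rescues a test.

```python
import sqlite3

# Hypothetical watchdog: warned 5 distinct days, never passed -> flag it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_qa_history (
        table_name TEXT, test_name TEXT, test_time_utc TEXT, status TEXT
    );
    INSERT INTO etl_qa_history VALUES
        ('public.account', 'data_age', '2018-07-28 06:00', 'WARN'),
        ('public.account', 'data_age', '2018-07-29 06:00', 'WARN'),
        ('public.account', 'data_age', '2018-07-30 06:00', 'WARN'),
        ('public.account', 'data_age', '2018-07-31 06:00', 'WARN'),
        ('public.account', 'data_age', '2018-08-01 06:00', 'WARN');
""")
issues = conn.execute("""
    SELECT table_name, test_name
    FROM etl_qa_history
    WHERE upper(status) = 'WARN'
    GROUP BY table_name, test_name
    HAVING count(DISTINCT date(test_time_utc)) >= 5

    EXCEPT

    -- a single PASS excludes the test from the FAIL override
    SELECT DISTINCT table_name, test_name
    FROM etl_qa_history
    WHERE upper(status) = 'PASS'
""").fetchall()
print(issues)  # [('public.account', 'data_age')]
```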
Test Recipes - Meta Testing - Watchdog

, warning_watchdog as (
    SELECT W.table_name, W.test_name
         , 'FAIL' as status -- hard code this test to FAIL
         , 'WARNING WATCHDOG: ' || CS.description as description -- Prefix why this warning became a FAIL
         , CS.url
    FROM public.etl_qa_current_status as CS
    -- Intersection of current issues and watchdog flagged issues
    INNER JOIN watchdog_issues as W
        ON CS.table_name = W.table_name
        AND CS.test_name = W.test_name
)
Test Recipes - Meta Testing

● Warnings Watchdog
○ Promote persistent warnings to failures so they cannot be quietly ignored.
● Dependencies
○ When a derived table is built from failing tables, surface the errors propagating into it.
Test Recipes - Meta Testing - Dependencies
INSERT INTO public.etl_qa_history (table_name, test_name, status, description)

WITH test_source AS (...), test_stats AS (...), test_cases AS (...)

SELECT 'public.customer'::text as table_name
     , ('data_dependency_' || table_name)::text as test_name -- each dependency needs to be its own test
     , status::text
     , description::text
FROM test_cases;
Test Recipes - Meta Testing - Dependencies

test_source AS (
    SELECT table_name, status
         , count(*)::float as test_count
    FROM public.etl_qa_current_status
    WHERE -- dependency list
        table_name IN (
            'public.account'
          , 'public.subscription'
          , 'public.document'
          -- etc etc etc
        )
),
test_stats AS (...),
test_cases AS (...)

SELECT 'public.customer'::text as table_name
     , ('data_dependency_' || table_name)::text as test_name -- each dependency needs to be its own test
     , status::text
     , description::text
FROM test_cases;
Test Recipes - Meta Testing - Dependencies
test_source AS (...),
test_stats AS (
    SELECT table_name
         -- pivot on status
         , sum(CASE WHEN status = 'PASS' THEN test_count ELSE 0 END) / sum(test_count) as PASS
         , sum(CASE WHEN status = 'WARN' THEN test_count ELSE 0 END) / sum(test_count) as WARN
         , sum(CASE WHEN status = 'FAIL' THEN test_count ELSE 0 END) / sum(test_count) as FAIL
         , sum(CASE WHEN status = 'N/A'  THEN test_count ELSE 0 END) / sum(test_count) as NA
         , sum(test_count) as total
    FROM test_source
    GROUP BY 1
),
test_cases AS (...)

SELECT ... FROM test_cases;
Test Recipes - Meta Testing - Dependencies

test_source AS (...),
test_stats AS (...),
test_cases AS (
    SELECT table_name
         , CASE WHEN pass = 1 THEN 'PASS'::text
                WHEN fail > 0 THEN 'FAIL'::text
                WHEN warn > 0 THEN 'WARN'::text
                WHEN na > 0   THEN 'N/A'::text
                ELSE 'FAIL'::text
           END as status
         -- Generate description
         , round(pass * 100, 1) || '% tests PASS; '
           || round(fail * 100, 1) || '% tests FAIL; '
           || round(warn * 100, 1) || '% tests WARN; '
           || round(na * 100, 1) || '% tests N/A; '
           -- catch unexpected test statuses causing non-passes
           || round((1 - pass - fail - warn - na) * 100, 1) || '% tests Unaccounted; '
           as description
    FROM test_stats
)

SELECT ... FROM test_cases;
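As a worked example (the numbers here are made up, not from the talk): suppose a derived table's dependency list yields ten current-status rows, of which 8 PASS, 1 WARN and 1 FAIL. The pivot gives pass = 0.8, warn = 0.1, fail = 0.1, so the CASE takes the fail > 0 branch and the dependency test is recorded as FAIL. This can be checked standalone with hard-coded counts:

```sql
-- Self-contained check of the rollup logic (standard SQL, hypothetical counts).
WITH test_source AS (
    SELECT 'PASS' as status, 8.0 as test_count
    UNION ALL SELECT 'WARN', 1.0
    UNION ALL SELECT 'FAIL', 1.0
), test_stats AS (
    SELECT sum(CASE WHEN status = 'PASS' THEN test_count ELSE 0 END) / sum(test_count) as pass
         , sum(CASE WHEN status = 'WARN' THEN test_count ELSE 0 END) / sum(test_count) as warn
         , sum(CASE WHEN status = 'FAIL' THEN test_count ELSE 0 END) / sum(test_count) as fail
    FROM test_source
)
SELECT CASE WHEN pass = 1 THEN 'PASS'
            WHEN fail > 0 THEN 'FAIL'
            WHEN warn > 0 THEN 'WARN'
            ELSE 'FAIL'
       END as status
FROM test_stats;
-- returns: FAIL
```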
Conclusion
● These tests are an evolution: they are not complete and probably never will be, which is a good thing.
● This is an engineering process that others in your team can contribute to and expand upon.
● Rather than frantically re-running diagnostic queries from memory, capture those skills once and automate them.
● The tests quickly isolate the scope of what is and is not passing.
● They let you leave a breadcrumb trail of deeper diagnostics in advance.
Conclusion
Thank you.
Questions?