How to Build a Data Warehouse? Martin Loetzsch Project A Ventures, Berlin http://project-a.com http://twitter.com/martin-loetzsch

How to build a data warehouse - code.talks 2014


Lightweight data warehouse development with open source technologies


Page 1: How to build a data warehouse - code.talks 2014

How to Build a Data Warehouse?

Martin Loetzsch
Project A Ventures, Berlin
http://project-a.com
http://twitter.com/martin-loetzsch

Page 2: How to build a data warehouse - code.talks 2014

The “typical startup”

‣ Has data in
  • application database
  • Excel & csv files
  • external tools

‣ Excel based reporting chains
  • manual sql queries, CSVs
  • copy & paste from external data sources
  • difficult to debug and test
  • sometimes cranky

‣ Everybody pulls their own numbers. # Orders?

‣ Does not have “big data”

‣ Will not have “big data” in the relevant future

-- count rows
SELECT count(*) FROM orders;

-- count everything except test orders
SELECT count(*) FROM orders
WHERE is_test IS NULL;

-- count everything that was once paid
SELECT count(*) FROM orders
JOIN order_history ON order_fk = order_id
WHERE status_id = 17;
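The three queries all answer “# Orders?” and can still disagree. A minimal sketch of that effect, assuming a toy SQLite database; the sample rows and the reading of status 17 as “paid” are assumptions made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, is_test INTEGER);
    CREATE TABLE order_history (order_fk INTEGER, status_id INTEGER);
    -- hypothetical sample data: order 4 is a test order
    INSERT INTO orders VALUES (1, NULL), (2, NULL), (3, NULL), (4, 1);
    -- status 17 is assumed to mean "paid"; orders 1 and 2 reached it
    INSERT INTO order_history VALUES (1, 17), (2, 5), (2, 17), (3, 5);
""")

QUERIES = {
    "all rows": "SELECT count(*) FROM orders",
    "without test orders": "SELECT count(*) FROM orders WHERE is_test IS NULL",
    "once paid": "SELECT count(*) FROM orders "
                 "JOIN order_history ON order_fk = order_id "
                 "WHERE status_id = 17",
}

# Run every definition of "# Orders" against the same data.
counts = {name: conn.execute(sql).fetchone()[0] for name, sql in QUERIES.items()}
print(counts)
```

With this sample data each definition yields a different number, which is the situation the slide describes when everybody pulls their own numbers.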

If Excel works for your company, stick to it

Page 3: How to build a data warehouse - code.talks 2014

Integrate data!

Data driven growth requires integrated data

‣ Integrated data = Data Warehouse

‣ Data in the Data Warehouse is
  • the single point of truth
  • cleaned up & validated
  • easy to access
  • embedded in the organisation

‣ Connect data from different domains

[Diagram: application databases, json files, csv files and apis feed the DWH (orders, users, products, stocks, prices, emails, clicks), which serves reporting, crm, marketing, search and pricing]

Page 4: How to build a data warehouse - code.talks 2014

How to build a Data Warehouse?

‣ 1. Use a BI solution from one of the big vendors
  • classic agency business
  • takes forever in startup time
  • usually too expensive

‣ 2. Use a cloud based DWH solution
  • covers only 80% of your business questions
  • usually not possible to extend

‣ 3. Build your own, it’s easy!
  • with technology that existed in the 1990s
  • simple ETL scripts running inside PostgreSQL
  • open source Pentaho Mondrian as query processor
  • own lightweight reporting frontend
  • integrated in own shop system

‣ Keep it simple & pragmatic

‣ Don’t use big data technologies if you don’t have big data

Invest in own BI infrastructure

Page 5: How to build a data warehouse - code.talks 2014

Works with Excel, SQL frontends, Elasticsearch, Mondrian & other BI front ends

Basis of any Data Warehouse: fact tables

‣ KPIs: aggregations on single columns

‣ All time orders?

‣ Revenue October 1st?

‣ Sales by product?

‣ Allowed query operations
  • Aggregations (count, distinct-count, sum, avg)
  • Filtering
  • Grouping

item_id | order_id | has_voucher | price | day   | product
1       | 1        |             | 20    | 09-30 | Cat
2       | 1        |             | 10    | 09-30 | Dog
3       | 2        | 2           | 20    | 09-30 | Cat
4       | 3        |             | 30    | 09-30 | Cow
5       | 4        | 4           | 10    | 10-01 | Dog
6       | 4        | 4           | 30    | 10-01 | Cow

# Sold items: count(item_id)

# Orders: distinct-count(order_id)

# Orders with vouchers: distinct-count(has_voucher)

Revenue: sum(price)

Avg product price: avg(price)

SELECT count(distinct order_id) FROM order_item;

SELECT sum(price) FROM order_item WHERE day = '10-01';

SELECT product, count(item_id) FROM order_item GROUP BY product;
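The KPIs above can be checked against the six sample rows. A minimal sketch, assuming SQLite as a stand-in database; the helper `kpi` is a hypothetical name, and `has_voucher` is read as “order id if the order used a voucher, otherwise empty”, which is how the sample table is filled:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE order_item (
    item_id INTEGER, order_id INTEGER, has_voucher INTEGER,
    price INTEGER, day TEXT, product TEXT)""")
conn.executemany(
    "INSERT INTO order_item VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, None, 20, "09-30", "Cat"),
     (2, 1, None, 10, "09-30", "Dog"),
     (3, 2, 2, 20, "09-30", "Cat"),
     (4, 3, None, 30, "09-30", "Cow"),
     (5, 4, 4, 10, "10-01", "Dog"),
     (6, 4, 4, 30, "10-01", "Cow")])


def kpi(sql):
    """Run an aggregation that returns a single value."""
    return conn.execute(sql).fetchone()[0]


sold_items = kpi("SELECT count(item_id) FROM order_item")
orders = kpi("SELECT count(DISTINCT order_id) FROM order_item")
# count(DISTINCT ...) ignores NULLs, so only orders with a voucher are counted
orders_with_vouchers = kpi("SELECT count(DISTINCT has_voucher) FROM order_item")
revenue = kpi("SELECT sum(price) FROM order_item")
avg_price = kpi("SELECT avg(price) FROM order_item")

# Filtering and grouping on the same fact table
revenue_oct_1 = kpi("SELECT sum(price) FROM order_item WHERE day = '10-01'")
sales_by_product = dict(conn.execute(
    "SELECT product, count(item_id) FROM order_item GROUP BY product"))
```

Every KPI is a single aggregation over one column, optionally filtered or grouped, so the same table serves all three questions on the slide.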

Page 6: How to build a data warehouse - code.talks 2014

Dimensional modelling

‣ Move redundant categorical data to “dimension” tables

order_item (fact table)

item_id | order_id | has_voucher | price | day_fk | product_fk
1       | 1        |             | 20    | 930    | 1
2       | 1        |             | 10    | 930    | 2
3       | 2        | 2           | 20    | 930    | 1
4       | 3        |             | 30    | 930    | 3
5       | 4        | 4           | 10    | 1001   | 2
6       | 4        | 4           | 30    | 1001   | 3

day (dimension table)

day_id | day_name | month_id | month_name
930    | 09-30    | 9        | Sep
1001   | 10-01    | 10       | Oct

product (dimension table)

product_id | product_name
1          | Cat
2          | Dog
3          | Cow

Key challenge: finding good keys
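Queries against this layout join the fact table to the dimension tables through the `_fk` columns. A minimal sketch, assuming SQLite as a stand-in database and using the sample rows from the slide; the two queries are illustrative, not taken from the talk:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE day (day_id INTEGER PRIMARY KEY, day_name TEXT,
                      month_id INTEGER, month_name TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE order_item (
        item_id INTEGER PRIMARY KEY, order_id INTEGER, has_voucher INTEGER,
        price INTEGER,
        day_fk INTEGER REFERENCES day (day_id),
        product_fk INTEGER REFERENCES product (product_id));
    INSERT INTO day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
    INSERT INTO product VALUES (1, 'Cat'), (2, 'Dog'), (3, 'Cow');
    INSERT INTO order_item VALUES
        (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
        (3, 2, 2, 20, 930, 1), (4, 3, NULL, 30, 930, 3),
        (5, 4, 4, 10, 1001, 2), (6, 4, 4, 30, 1001, 3);
""")

# Revenue per month: the month lives in the day dimension, not in the fact table.
revenue_by_month = conn.execute("""
    SELECT day.month_name, sum(order_item.price)
    FROM order_item JOIN day ON order_item.day_fk = day.day_id
    GROUP BY day.month_id, day.month_name
    ORDER BY day.month_id
""").fetchall()

# Sold items per product: names come from the product dimension.
items_by_product = conn.execute("""
    SELECT product.product_name, count(order_item.item_id)
    FROM order_item JOIN product ON order_item.product_fk = product.product_id
    GROUP BY product.product_id, product.product_name
    ORDER BY product.product_id
""").fetchall()
```

The integer keys (930, 1001) keep the fact table narrow, while attributes such as the month name are stored once per day instead of once per order item.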

Page 7: How to build a data warehouse - code.talks 2014

Real life schemas I

‣ https://www.contorion.de/ (early stage project)

[Schema diagram with the tables: order_item_status_mapping, order_item_status, order_item_status_partition, sales_event, order_item, customer_address, country, order, order_rank, customer, order_type, order_date, order_date_perspective, hour_of_day, day_of_week, day, customer_date, customer_date_perspective, gender, customer_type, customer_group, sales_event_duration, sales_event_duration_perspective, duration, sales_event_date, sales_event_date_perspective, product, category, newsletter_event, url, campaign, cost_per_campaign_and_day, channel]

Star schema, galaxy schema, nth normal form? Doesn’t matter, do what’s fastest.

Page 8: How to build a data warehouse - code.talks 2014

Real life schemas II

‣ https://www.worldremit.com/ (finished soon* project)

[Schema diagram with the tables: campaign_click_performance, campaign_click, path_segment, performance_attribution_model, campaign, conversion_path_transition, path_position, reverse_path_position, user, day, duration, cost_per_campaign_and_day, channel, corridor, campaign_cohort, age, user_city, gender, referral_source, total_transaction_range, transaction_frequency, transaction, bank, cancellation_status, correspondent, transaction_city, currency, voucher, payment_method, receive_method, sent_amount_range, transaction_rank, transaction_status, country, foreign_exchange_rate, voucher_usage_fact, user_date, user_time_perspective, transaction_event_date, transaction_event, transaction_event_time_perspective, transaction_date, transaction_time_perspective, transaction_event_duration, transaction_event_duration_perspective, transaction_duration]

* A Data Warehouse is never finished

Page 9: How to build a data warehouse - code.talks 2014

Real life schemas III

‣ http://www.saatchiart.com/ (exit August 2014)

[Schema diagram with the tables: artwork, artwork_category, artwork_for_sale_as_original, artwork_for_sale_as_print, artwork_in_showdown, artwork_in_weekly_roundup, artwork_is_admin_collection, artwork_is_curated, artwork_is_user_collection, artwork_related, artwork_sale_category, artwork_subject, round, user, product, product_category, substrate, collection_artwork_order_item, collection, order_item, artwork_style_mapping, artwork_style, artwork_in_collection, collection_artwork_order_item_date, collection_artwork_order_item_time_perspective, day, collection_detailed_type, collection_type, campaign_click_date, campaign_click, online_marketing_time_perspective, email_event_date, email_event, email_time_perspective, transactional_mail, transactional_mail_type, sales_event_date, sales_event, sales_event_date_perspective, order_date, order, order_date_perspective, sales_event_duration, duration, sales_event_duration_perspective, order_duration, sales_time_perspective, fulfillment_provider, option, order_item_status, order_process, price_range, refund_reason, hour_of_day, order_source, payment_method, payment_provider, shipping_city, voucher, order_item_status_partition, order_item_refunds, order_item_status_mapping, email_campaign, email_recipient, email_unsubscribe_reason, email_list, email_recipient_location, user_city, user_status, user_type, user_event_date_registration, user_event, user_event_date_event, campaign, referrer, search_phrase, user_date, campaign_click_position, conversion_path_position, conversion_type, relative_conversion_path_position, campaign_click_performance, performance_attribution_model]

Page 10: How to build a data warehouse - code.talks 2014

Optimize for change speed!

Data integration

‣ Visual ETL tools
  • many data source connectors
  • hard to debug
  • slow to change

‣ Start with simple sql queries & batch scripts

‣ Later build something more robust

cat create-tables.sql | psql dwh

cat load-order.sql \
  | mysql --skip-column-names source_db \
  | psql dwh --command="COPY tmp.order FROM STDIN \
      WITH NULL AS 'NULL'"

cat /data/payment.csv \
  | python payment_filter.py \
  | psql dwh --command="COPY tmp.payment FROM STDIN"

cat transform-order.sql | psql dwh
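The talk does not show `payment_filter.py`. A minimal sketch of what such a filter step could look like, assuming a CSV with the hypothetical columns `payment_id`, `order_id` and `amount`; it drops rows that cannot be loaded and emits tab-separated lines, the default text format that `COPY ... FROM STDIN` reads:

```python
import csv
import io


def filter_payments(lines):
    """Yield tab-separated rows, skipping the header and invalid rows.

    The column names are assumptions for illustration, not from the talk.
    """
    for row in csv.DictReader(lines):
        payment_id, order_id = row.get("payment_id"), row.get("order_id")
        try:
            amount = float(row.get("amount") or "")
        except ValueError:
            continue  # no numeric amount: drop the row
        if not payment_id or not order_id:
            continue  # missing key: drop the row
        yield f"{payment_id}\t{order_id}\t{amount:.2f}"


# Hypothetical input; in the pipeline above the lines would come from sys.stdin
# and the results would be printed to stdout.
sample = io.StringIO(
    "payment_id,order_id,amount\n"
    "1,10,19.90\n"
    "2,11,n/a\n"
    "3,,5.00\n"
    "4,12,7\n"
)
rows = list(filter_payments(sample))
```

Keeping the filter a plain stdin-to-stdout step means it can be tested on its own and swapped out without touching the SQL scripts around it.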

Page 11: How to build a data warehouse - code.talks 2014

Data integration in Yves & Zed

‣ Jobs = processing steps with dependencies
  • parallel execution with cost based scheduler
  • robust, transparent, no black boxes

‣ Parallel jobs & incremental processing

‣ Extensive visualisations & monitoring tools
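The scheduler itself is not shown in the talk. A minimal sketch of the idea, assuming jobs are plain names with a set of prerequisites: jobs whose dependencies are done form a batch that could run in parallel, and within a batch the most expensive job is started first. The function `schedule` and the cost figures are hypothetical; the job names are borrowed from the deck's process file:

```python
from graphlib import TopologicalSorter  # Python 3.9+


def schedule(dependencies, cost):
    """Return batches of jobs; all jobs in a batch have their dependencies met."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        # Most expensive job first, job name as a deterministic tie-breaker
        ready = sorted(sorter.get_ready(), key=lambda job: (-cost.get(job, 0), job))
        batches.append(ready)
        sorter.done(*ready)  # a real runner would mark jobs done as they finish
    return batches


# job -> set of jobs it depends on
dependencies = {
    "load-order": set(),
    "load-order-item": set(),
    "load-product": set(),
    "cleanse-member": set(),
    "cleanse-order": {"cleanse-member", "load-order-item", "load-product"},
}
# Assumed run times in seconds, e.g. measured in earlier runs
cost = {"load-order": 30, "load-order-item": 60, "load-product": 5,
        "cleanse-member": 10, "cleanse-order": 20}

batches = schedule(dependencies, cost)
```

This sketch only computes an execution order; running the jobs of a batch concurrently and feeding measured run times back into `cost` would be the next steps.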

Page 12: How to build a data warehouse - code.talks 2014

Plain text files

‣ Very git-friendly

<?xml version="1.0" encoding="UTF-8"?>
<process xmlns="http://project-a.com/dwh-process"
         id="operational-data" ..>

  <initial-job id="initialize-schemas">
    <description>Recreates schemas and writes configs</description>
    <commands>
      ..
    </commands>
  </initial-job>

  <!-- Orders -->
  <job id="load-order">
    <description>Loads orders into tmp.order</description>
    <commands>
      <execute-sql-file file-name="orders/create-order-tmp-table.sql" echo-queries="true"/>
      <load-from-mysql file-name="orders/load-order.sql"
                       target-table="tmp.order" database="app"
                       timezone="UTC"/>
      <execute-sql>SELECT tmp.index_tmp_order();</execute-sql>
    </commands>
  </job>

  <job id="cleanse-order">
    <description>Deletes test orders and other invalid orders</description>
    <dependencies>
      <dependency job="cleanse-member"/>
      <dependency job="load-order-item"/>
      <dependency job="load-product"/>

Page 13: How to build a data warehouse - code.talks 2014

Each KPI is always computed in the same way

MDX = query language for multidimensional data

‣ Developed by Microsoft as part of Analysis Services
  • http://en.wikipedia.org/wiki/MultiDimensional_eXpressions

SELECT
  TopCount([Product].[Product].Members, 2,
           [Measures].[Revenue])
  ON COLUMNS,
  [Measures].[Revenue]
  ON ROWS
FROM [Pet sales]
WHERE [Date].[Month].[Oct]

SELECT [Date].[Month].Members
  ON COLUMNS,
  CrossJoin({[Measures].[Sold items],
             [Measures].[# Orders],
             [Measures].[Revenue]},
            Descendants([Product].[All products]))
  ON ROWS
FROM [Pet sales]

Tables:
  order_item: item_id, order_id, has_voucher, price, day_fk, product_fk
  day: day_id, day_name, month_id, month_name
  product: product_id, product_name

Page 14: How to build a data warehouse - code.talks 2014

Mondrian = engine for executing MDX

‣ Open source analytics processor
  • http://mondrian.pentaho.com
  • http://en.wikipedia.org/wiki/Mondrian_OLAP_server
  • In Java
  • Eclipse Public License
  • Active community
  • https://github.com/pentaho/mondrian/

‣ Part of Pentaho BI platform

[Book cover: William D. Back, Nicholas Goodman, Julian Hyde, “Mondrian in Action: Open source business analytics”, Manning]

Page 15: How to build a data warehouse - code.talks 2014

Mondrian schema I

‣ The relation between fact tables and dimension tables is defined in an XML file

<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>

  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
      <Level name="Month" column="month_id" nameColumn="month_name"
             type="Integer" levelType="TimeMonths" uniqueMembers="true"/>
      <Level name="Day" column="day_id" nameColumn="day_name"
             type="Integer" levelType="TimeDays" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
      <Table schema="dim" name="product"/>
      <Level name="Product" column="product_id" nameColumn="product_name"
             type="Integer" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  ..
</Cube>

Tables:
  order_item: item_id, order_id, has_voucher, price, day_fk, product_fk
  day: day_id, day_name, month_id, month_name
  product: product_id, product_name

Page 16: How to build a data warehouse - code.talks 2014

Each KPI is always computed in the same way

Mondrian schema II

‣ Measures are defined as aggregates on columns

<Cube name="Pet sales" defaultMeasure="# Orders">
  ..
  <Measure name="# Orders" column="order_id" datatype="Integer"
           aggregator="distinct-count" formatString="Standard"/>

  <Measure name="Revenue" column="price" datatype="Integer"
           aggregator="sum" formatString="Currency"/>

  <Measure name="Sold items" column="item_id" datatype="Integer"
           aggregator="count" formatString="Standard"/>

  <CalculatedMember name="Avg cart value" dimension="Measures">
    <Formula>[Measures].[Revenue] / [Measures].[# Orders]</Formula>
  </CalculatedMember>
</Cube>

‣ Mondrian = SQL query generator

Mondrian schema ➞ MDX query ➞ generated SQL

SELECT [Date].[Month].Members
  ON COLUMNS,
  [Measures].[Avg cart value]
  ON ROWS
FROM [Pet sales]

SELECT
  "day"."month_id" AS "c0",
  count(DISTINCT "order_item"."order_id") AS "m0",
  sum("order_item"."price") AS "m1"
FROM
  "dim"."day" AS "day",
  "dim"."order_item" AS "order_item"
WHERE
  "order_item"."day_fk" = "day"."day_id"
GROUP BY
  "day"."month_id"

Tables:
  order_item: item_id, order_id, has_voucher, price, day_fk, product_fk
  day: day_id, day_name, month_id, month_name
  product: product_id, product_name
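The generated SQL only fetches the two base measures; the calculated member is then derived from them. A minimal sketch of that division of work, assuming SQLite with an attached in-memory database named `dim` as a stand-in for the PostgreSQL schema, loaded with the pet sales sample rows. The final division in Python only imitates what Mondrian does internally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second in-memory database named "dim" makes the schema-qualified
# table names of the generated SQL resolve in SQLite.
conn.execute("ATTACH DATABASE ':memory:' AS dim")
conn.executescript("""
    CREATE TABLE dim.day (day_id INTEGER, day_name TEXT,
                          month_id INTEGER, month_name TEXT);
    CREATE TABLE dim.order_item (item_id INTEGER, order_id INTEGER,
                                 has_voucher INTEGER, price INTEGER,
                                 day_fk INTEGER, product_fk INTEGER);
    INSERT INTO dim.day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
    INSERT INTO dim.order_item VALUES
        (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
        (3, 2, 2, 20, 930, 1), (4, 3, NULL, 30, 930, 3),
        (5, 4, 4, 10, 1001, 2), (6, 4, 4, 30, 1001, 3);
""")

# The SQL from the slide: m0 = "# Orders", m1 = "Revenue", grouped by month
GENERATED_SQL = """
    SELECT
      "day"."month_id" AS "c0",
      count(DISTINCT "order_item"."order_id") AS "m0",
      sum("order_item"."price") AS "m1"
    FROM
      "dim"."day" AS "day",
      "dim"."order_item" AS "order_item"
    WHERE
      "order_item"."day_fk" = "day"."day_id"
    GROUP BY
      "day"."month_id"
"""

# [Measures].[Avg cart value] = [Measures].[Revenue] / [Measures].[# Orders]
avg_cart_value = {
    month_id: revenue / number_of_orders
    for month_id, number_of_orders, revenue in conn.execute(GENERATED_SQL)
}
```

Because the formula lives in the schema file, every report that asks for “Avg cart value” gets the same revenue divided by the same distinct order count.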

Page 17: How to build a data warehouse - code.talks 2014

Always draw your Mondrian schema!

Mondrian schema III

‣ Everything about KPIs & dimensions (business) and tables & columns (IT) in one file
  • consistent & explicit semantics
  • transparency is easy

Page 18: How to build a data warehouse - code.talks 2014

Try it out immediately, it’s amazing: http://demo.analytical-labs.com/

Ad-hoc queries with Saiku Analytics

‣ Drag & drop reporting tool on top of Mondrian
  • Open source (Apache 2.0)
  • Talks to Mondrian via MDX
  • http://meteorite.bi/saiku

Page 19: How to build a data warehouse - code.talks 2014

Reports in Yves & Zed I

‣ Own lightweight reporting frontend
  • Bootstrap / Google Charts
  • lacks many features
  • features are easy to implement

Numbers are random!

Page 20: How to build a data warehouse - code.talks 2014

Numbers are random!

Reports in Yves & Zed II

‣ Dashboard-like interactive reports • maintained by developers

• each table / chart is an MDX query


Page 21: How to build a data warehouse - code.talks 2014

XMLA = XML for Analysis = MDX via SOAP

‣ Industry standard originally proposed by Microsoft

• http://en.wikipedia.org/wiki/XML_for_Analysis

• SOAP protocol to discover and query OLAP cubes

• Mondrian has an XMLA server

‣ Request

‣ Response


<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header/>
  <SOAP-ENV:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command>
        <Statement>
          <![CDATA[
SELECT [Date].[Month].Members ON COLUMNS,
  [Measures].[Avg cart value] ON ROWS
FROM [Pet sales]
          ]]>
        </Statement>
      </Command>
      <Properties>
        <PropertyList>
          <Catalog>dwh</Catalog>
          <DataSourceInfo>Monsai</DataSourceInfo>
          <Format>Multidimensional</Format>
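A request like this can be assembled programmatically. The following is a minimal sketch using only the Python standard library. The element names, the XMLA namespace and the property values are taken from the slide; the SOAP 1.1 envelope namespace is an assumption (the slide elides it as ".."), and `build_execute_request` is a hypothetical helper. The text is escaped by the serializer, so no CDATA section is needed. The resulting string would be sent as the body of an HTTP POST to the XMLA endpoint of the server, whose URL depends on the installation.

```python
import xml.etree.ElementTree as ET

# Assumption: standard SOAP 1.1 envelope namespace (elided on the slide).
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"

MDX = (
    "SELECT [Date].[Month].Members ON COLUMNS, "
    "[Measures].[Avg cart value] ON ROWS "
    "FROM [Pet sales]"
)


def build_execute_request(mdx, catalog, data_source):
    """Build an XMLA Execute request for one MDX statement."""
    ET.register_namespace("SOAP-ENV", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")

    execute = ET.SubElement(body, f"{{{XMLA_NS}}}Execute")
    command = ET.SubElement(execute, f"{{{XMLA_NS}}}Command")
    ET.SubElement(command, f"{{{XMLA_NS}}}Statement").text = mdx

    properties = ET.SubElement(execute, f"{{{XMLA_NS}}}Properties")
    property_list = ET.SubElement(properties, f"{{{XMLA_NS}}}PropertyList")
    for name, value in (
        ("Catalog", catalog),
        ("DataSourceInfo", data_source),
        ("Format", "Multidimensional"),
    ):
        ET.SubElement(property_list, f"{{{XMLA_NS}}}{name}").text = value

    return ET.tostring(envelope, encoding="unicode")


request = build_execute_request(MDX, catalog="dwh", data_source="Monsai")
print(request)
```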

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header ../>
  <SOAP-ENV:Body>
    <cxmla:ExecuteResponse xmlns:cxmla="urn:schemas-microsoft-com:xml-analysis">
      <cxmla:return>
        <root>
          <OlapInfo ../>
          <Axes>
            <Axis name="Axis0" ../>
            <Axis name="Axis1">
              <Tuples>
                <Tuple>
                  <Member Hierarchy="Measures" ..>
                </Tuple>
              </Tuples>
            </Axis>
            <Axis name="SlicerAxis" ../>
          </Axes>
          <CellData>
            <Cell CellOrdinal="0">
              <Value xsi:type="xsd:double">26.666666666666668</Value>
              <FmtValue>26,67 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
            <Cell CellOrdinal="1">
              <Value xsi:type="xsd:double">40</Value>
              <FmtValue>40,00 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
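On the client side, the cell values can be pulled out of such a response with a standard XML parser. The following is a minimal sketch, assuming a shortened, hypothetical response that contains only the `CellData` part with the two cells from the slide. Elements are matched by local name so that the sketch does not depend on the namespace declarations the slide elides; `cell_values` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical response: only the CellData part from the slide.
SAMPLE_RESPONSE = """
<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <CellData>
    <Cell CellOrdinal="0">
      <Value xsi:type="xsd:double">26.666666666666668</Value>
      <FmtValue>26,67 €</FmtValue>
    </Cell>
    <Cell CellOrdinal="1">
      <Value xsi:type="xsd:double">40</Value>
      <FmtValue>40,00 €</FmtValue>
    </Cell>
  </CellData>
</root>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def cell_values(response_xml):
    """Return (ordinal, numeric value, formatted value) for every cell."""
    root = ET.fromstring(response_xml)
    cells = []
    for element in root.iter():
        if local_name(element.tag) != "Cell":
            continue
        fields = {local_name(child.tag): child.text for child in element}
        cells.append(
            (
                int(element.get("CellOrdinal")),
                float(fields["Value"]),
                fields.get("FmtValue"),
            )
        )
    return sorted(cells)


cells = cell_values(SAMPLE_RESPONSE)
print(cells)
```

A reporting frontend would combine these cells with the axis tuples to lay out a table or chart; that part is left out here.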

Page 22: How to build a data warehouse - code.talks 2014

Data Warehouse in Yves & Zed


‣ monsai = Mondrian XMLA Server + Saiku in a single war file, https://github.com/project-a/monsai

[Figure: architecture diagram with three columns labelled data integration, monsai and reporting. Labels: application databases, json files, csv files, apis; SQL, DB results; Mondrian, Mondrian schema, database mapping; XMLA / MDX, XMLA response, MDX results.]

Page 23: How to build a data warehouse - code.talks 2014

Search for computer scientists, not business intelligence experts

What kind of people do you need to hire for this?

‣ The “typical BI expert”:

• studied something related to business and learnt VBA programming through Excel

• relies on others to set up databases and tools

‣ Your ideal candidate

• has studied computer science

• masters the basic tools of software development and computer science

• likes to learn new technologies

• understands how databases work

‣ Good profile example: http://www.project-a.com/en/careers/jobs/?yid=332

For our "A-Team" we are looking to fill the following position as soon as possible:

Data Engineer / Data Scientist (m/f)

Your tasks:

• You will help our business intelligence team to build data driven applications for our ventures: data warehouses, recommendation engines and CRM systems (developed in-house, based on open-source technologies)

• You will integrate, transform and index data from various data sources, develop meaningful data representations and visualisations, and provide aggregated data for third-party systems

• You will advance our software architecture and tool set to growing challenges and data amounts (performance, scaling, data quality)

• You will work in an agile software development process in close collaboration with a product management team

Your profile:

• You have a Master's degree in computer science or a comparable degree

• You have a genuine interest in data and algorithms and you are excited about solving difficult problems and strive for efficient and robust solutions

• You master at least these basic tools of computer science: object oriented programming in multiple languages, HTTP and current web technologies, the unix command line and basic server administration, version control systems, a basic understanding of the interplay between software and memory, hard discs and the CPU

• You have profound knowledge about the inner workings of database systems

• You are eager to delve into new technologies and programming languages (our current stack: Mac or Linux, PostgreSQL, Mondrian & MDX, PHP, Java, Python, Solr, ElasticSearch, R)

• You have a basic understanding of mathematics and machine learning

Your chance:

• You will join a highly professional and motivated team

• You will have the unique opportunity to witness the launch of a newly established company and you can contribute your own ideas to its development

• You will benefit from the greatest possible creative freedom to develop your skills further

• You will enjoy a state-of-the-art, top-equipped workplace right in the center of Berlin

• You will benefit from a communicative, stimulating and inspiring environment

We’re looking forward to your online application.


Page 24: How to build a data warehouse - code.talks 2014

Any kind of Scrum / Kanban works, do it

Use a standard software engineering process!

‣ Product managers: what?

• Collection of business requirements

• KPI & report definitions

• QA & analysis


‣ Developers: how?

• Implementation, performance & stability

• Schema & process design

• Consistency checks
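The slides do not say what such consistency checks look like. One common pattern, sketched here under that caveat, is a set of SQL queries that must each return zero rows, run after every load. The table and column names follow the earlier `order_item` / `day` example; the two checks, the sample rows and the `run_checks` helper are hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

# Hypothetical checks: each query must return no rows.
CHECKS = {
    "order_item.day_fk points to an existing day": """
        SELECT item_id
        FROM order_item
        LEFT JOIN day ON day.day_id = order_item.day_fk
        WHERE day.day_id IS NULL
    """,
    "order_item.price is never negative": """
        SELECT item_id FROM order_item WHERE price < 0
    """,
}


def run_checks(conn, checks):
    """Run every check and return the offending rows of those that fail."""
    failures = {}
    for name, sql in checks.items():
        rows = conn.execute(sql).fetchall()
        if rows:
            failures[name] = rows
    return failures


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE day (day_id INTEGER PRIMARY KEY, month_id INTEGER);
    CREATE TABLE order_item (
        item_id INTEGER PRIMARY KEY, order_id INTEGER,
        price INTEGER, day_fk INTEGER);
    INSERT INTO day VALUES (20141001, 201410);
    INSERT INTO order_item VALUES (1, 100, 20, 20141001);
    INSERT INTO order_item VALUES (2, 101, 30, 20140931);  -- no such day
""")

failures = run_checks(conn, CHECKS)
for name, rows in failures.items():
    print(f"FAILED: {name}: {rows}")
```

Keeping the checks as plain SQL means developers can add one whenever a data problem is found during QA.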

[Figure: overview of KPI names: Net revenue, Net voucher cost, Avg net voucher cost per order, Avg net order value, Contribution margin 1, Contribution margin 3a, % Contribution margin 1, % Net revenue second or subsequent orders, Avg net revenue per buying member, Tax shipping amount, Tax amount, % First orders, % Second or subsequent orders, # Orders per buying member, Avg # of items per order, Avg # of unique items per order, Gross revenue, Avg gross item price, Gross price to gross retail price ratio, Price to retail price ratio, Avg gross order value, % Gross voucher cost, Gross invoiced amount, Net invoiced amount, Gross retail price, Avg retail price, Avg net discount, Net price to net purchase price ratio, Net price to net retail price ratio, % Vouchers used, Avg net item price, Avg net purchase cost, % Net discount, Avg net order value first orders, Avg net order value second or subsequent orders, Avg gross invoiced amount, Avg net invoiced amount, HGB net revenue margin, Avg gross voucher cost per buying member, Contribution margin 2a, Contribution margin 2b, Avg contribution margin 2b per buying member, Net cost of sales, Net cost of sales per order, # Orders, # First orders, # Second orders, # Second or subsequent orders, # Buying members, # Returning members, # Items, # Unique items, Net item revenue, Tax item amount, Net purchase cost, Net retail price, Retail tax amount, Gross voucher cost, # Orders with vouchers, # Orders with one unique item, Net shipping revenue, Gross shipping revenue, Net revenue first orders, Net revenue second or subsequent orders, Net payment cost, Net return cost, Net fulfillment cost, Price, Retail price.]

Page 25: How to build a data warehouse - code.talks 2014

http://www.project-a.com/

Thank you


Data integration is easy if you keep things simple!