Lightweight data warehouse development with open source technologies
How to Build a Data Warehouse?
Martin Loetzsch
Project A Ventures, Berlin
http://project-a.com
http://twitter.com/martin-loetzsch
The “typical startup”
‣ Has data in
• application database
• Excel & csv files
• external tools
‣ Excel based reporting chains
• manual sql queries, CSVs
• copy & paste from external data sources
• difficult to debug and test
• sometimes cranky
‣ Everybody pulls their own numbers. # Orders?
‣ Does not have “big data”
‣ Will not have “big data” in the relevant future
-- count rows
SELECT count(*) FROM orders;

-- count everything except test orders
SELECT count(*) FROM orders
WHERE is_test IS NULL;

-- count everything that was once paid
SELECT count(*) FROM orders
JOIN order_history ON order_fk = order_id
WHERE status_id = 17;
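To make the "# Orders?" problem concrete, here is a minimal sketch that runs the three queries against a tiny in-memory SQLite database. The rows are invented for illustration, and the column types are assumptions; only the table and column names come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, is_test INTEGER);
CREATE TABLE order_history (order_fk INTEGER, status_id INTEGER);
-- four made-up orders, order 3 is a test order
INSERT INTO orders VALUES (1, NULL), (2, NULL), (3, 1), (4, NULL);
-- status 17 = paid (as on the slide); orders 1 and 3 were logged as paid twice
INSERT INTO order_history VALUES
  (1, 17), (1, 17), (2, 17), (2, 3), (3, 17), (3, 17), (4, 5);
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


all_rows = scalar("SELECT count(*) FROM orders")
non_test = scalar("SELECT count(*) FROM orders WHERE is_test IS NULL")
once_paid = scalar("""
    SELECT count(*) FROM orders
    JOIN order_history ON order_fk = order_id
    WHERE status_id = 17
""")

# Three people, three different answers to "how many orders do we have?"
print(all_rows, non_test, once_paid)  # 4 3 5
```

With this data the third query even returns more rows than there are orders, because the join counts history rows rather than orders.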
If Excel works for your company, stick to it
Integrate data!
Data driven growth requires integrated data
‣ Integrated data = Data Warehouse
‣ Data in the Data Warehouse is
• the single point of truth
• cleaned up & validated
• easy to access
• embedded in the organisation
‣ Connect data from different domains
[Diagram labels: application databases, json files, csv files, apis · DWH: orders, users, products, stocks, prices, emails, clicks, … · reporting, crm, marketing, …, search, pricing]
‣ 1. Use a BI solution from one of the big vendors
• classic agency business
• takes forever in startup time
• usually too expensive
‣ 2. Use a cloud based DWH solution
• covers only 80% of your business questions
• usually not possible to extend
‣ 3. Build your own, it’s easy!
• with technology that existed in the 1990s
• simple ETL scripts running inside PostgreSQL
• open source Pentaho Mondrian as query processor
• own lightweight reporting frontend
• integrated in own shop system
‣ Keep it simple & pragmatic
‣ Don’t use big data technologies if you don’t have big data
How to build a Data Warehouse?
Invest in own BI infrastructure
Works with Excel, SQL frontends, Elasticsearch, Mondrian & other BI front ends
Basis of any Data Warehouse: fact tables
‣ KPIs: aggregations on single columns
‣ All time orders?
‣ Revenue October 1st?
‣ Sales by product?
‣ Allowed query operations
• Aggregations (count, distinct-count, sum, avg)
• Filtering
• Grouping
item_id  order_id  has_voucher  price  day    product
1        1         -            20     09-30  Cat
2        1         -            10     09-30  Dog
3        2         2            20     09-30  Cat
4        3         -            30     09-30  Cow
5        4         4            10     10-01  Dog
6        4         4            30     10-01  Cow
# Sold items: count(item_id)
# Orders: distinct-count(order_id)
# Orders with vouchers: distinct-count(has_voucher)
Revenue: sum(price)
Avg product price: avg(price)
SELECT count(DISTINCT order_id) FROM order_item;
SELECT sum(price) FROM order_item WHERE day = '10-01';
SELECT product, count(item_id) FROM order_item GROUP BY product;
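The KPI definitions above can be checked with a small sketch that loads the six example rows into an in-memory SQLite table and computes every measure with one aggregation each. It assumes that the empty has_voucher cells are NULL and that a filled cell repeats the order id, which is what makes distinct-count(has_voucher) equal the number of orders with vouchers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_item (
        item_id INTEGER, order_id INTEGER, has_voucher INTEGER,
        price INTEGER, day TEXT, product TEXT)
""")
# The fact table from the slide; None (NULL) marks items without a voucher.
rows = [
    (1, 1, None, 20, "09-30", "Cat"),
    (2, 1, None, 10, "09-30", "Dog"),
    (3, 2, 2, 20, "09-30", "Cat"),
    (4, 3, None, 30, "09-30", "Cow"),
    (5, 4, 4, 10, "10-01", "Dog"),
    (6, 4, 4, 30, "10-01", "Cow"),
]
conn.executemany("INSERT INTO order_item VALUES (?, ?, ?, ?, ?, ?)", rows)

# Each KPI is a plain aggregation on a single column.
kpis = conn.execute("""
    SELECT count(item_id),              -- # Sold items
           count(DISTINCT order_id),    -- # Orders
           count(DISTINCT has_voucher), -- # Orders with vouchers
           sum(price),                  -- Revenue
           avg(price)                   -- Avg product price
    FROM order_item
""").fetchone()

# Filtering and grouping answer the other two questions.
revenue_oct_1 = conn.execute(
    "SELECT sum(price) FROM order_item WHERE day = '10-01'").fetchone()[0]
sales_by_product = conn.execute("""
    SELECT product, count(item_id) FROM order_item
    GROUP BY product ORDER BY product
""").fetchall()

print(kpis)              # (6, 4, 2, 120, 20.0)
print(revenue_oct_1)     # 40
print(sales_by_product)  # [('Cat', 2), ('Cow', 2), ('Dog', 2)]
```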
‣ Move redundant categorical data to “dimension” tables
order_item: item_id, order_id, has_voucher, price, day_fk, product_fk
day: day_id, day_name, month_id, month_name
product: product_id, product_name
Key challenge: finding good keys
Dimensional modelling
order_item
item_id  order_id  has_voucher  price  day_fk  product_fk
1        1         -            20     930     1
2        1         -            10     930     2
3        2         2            20     930     1
4        3         -            30     930     3
5        4         4            10     1001    2
6        4         4            30     1001    3

day
day_id  day_name  month_id  month_name
930     09-30     9         Sep
1001    10-01     10        Oct

product
product_id  product_name
1           Cat
2           Dog
3           Cow
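A minimal sketch of how the dimension tables are used at query time: the fact table is joined to a dimension through its foreign key and grouped by a dimension attribute. It uses SQLite in memory with the example rows from the slide; the column types and the REFERENCES clauses are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE day (
    day_id INTEGER PRIMARY KEY, day_name TEXT,
    month_id INTEGER, month_name TEXT);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE order_item (
    item_id INTEGER PRIMARY KEY, order_id INTEGER, has_voucher INTEGER,
    price INTEGER,
    day_fk INTEGER REFERENCES day (day_id),
    product_fk INTEGER REFERENCES product (product_id));

INSERT INTO day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
INSERT INTO product VALUES (1, 'Cat'), (2, 'Dog'), (3, 'Cow');
INSERT INTO order_item VALUES
    (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
    (3, 2, 2, 20, 930, 1),    (4, 3, NULL, 30, 930, 3),
    (5, 4, 4, 10, 1001, 2),   (6, 4, 4, 30, 1001, 3);
""")

# Revenue per month: group by an attribute that only exists in the dimension.
revenue_by_month = conn.execute("""
    SELECT day.month_name, sum(order_item.price)
    FROM order_item
    JOIN day ON order_item.day_fk = day.day_id
    GROUP BY day.month_id, day.month_name
    ORDER BY day.month_id
""").fetchall()

# Sold items per product, resolved to product names via the product dimension.
items_by_product = conn.execute("""
    SELECT product.product_name, count(order_item.item_id)
    FROM order_item
    JOIN product ON order_item.product_fk = product.product_id
    GROUP BY product.product_id, product.product_name
    ORDER BY product.product_id
""").fetchall()

print(revenue_by_month)  # [('Sep', 80), ('Oct', 40)]
print(items_by_product)  # [('Cat', 2), ('Dog', 2), ('Cow', 2)]
```

The integer day keys (930, 1001) sort chronologically and make the join cheap, which is one reading of "finding good keys".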
order_item_status_mapping: order_item_status_fk, order_item_status_partition_fk
order_item_status: order_item_status_id, order_item_status_sort_id, order_item_status_name
order_item_status_partition: order_item_status_partition_id, order_item_status_perspective_id, order_item_status_perspective_name, order_item_status_process_id, order_item_status_process_name, order_item_status_group_id, order_item_status_group_name
sales_event: sales_event_id, order_item_fk, order_item_current_status_fk, order_item_status_partition_fk, order_timestamp, event_timestamp, hours_since_order, hours_since_last_event, hours_to_next_event, estimated_net_revenue
order_item: order_item_id, order_fk, merchant_fk, product_fk, category_fk, category_tree_fk, order_process_fk, order_item_status_fk, processed_order_item_id, net_shipping_revenue, tax_amount_shipping, gross_voucher_value, net_voucher_value, gross_revenue_before_voucher, net_item_value, gross_item_value, tax_amount_before_voucher, tax_amount_voucher, gross_shipping_revenue, gross_shipping_revenue_before_voucher, net_purchase_cost, gross_purchase_cost, net_revenue_returned, net_revenue_canceled, net_payment_cost, net_return_cost_and_loss_and_fraud, net_shipping_and_fulfillment_cost, net_marketing_expenses
customer_address: customer_address_id, address, zip_code, first_name, last_name, city, country_fk, gender, account_disabled, company, phone, cell_phone
country: country_id, country_name
order: order_id, increment_id, order_type_fk, is_first_order_id, is_follow_up_order_id, is_second_order_id, is_second_or_subsequent_order_id, customer_fk, returning_customer_fk, order_rank_fk, items_per_order_fk, payment_method_fk, payment_provider_fk, zip_code_fk, order_rank_1st_fk
order_rank: order_rank_id, order_rank_name, order_rank_group_id, order_rank_group_name
customer: customer_id, increment_id, customer_name, email, number_of_orders, first_order_date, last_order_date, avg_days_between_orders, number_of_orders_with_vouchers, phone, company, gender_fk, customer_type_fk, customer_group_fk, customer_industry_fk
order_type: order_type_id, order_type_name
order_date: day_fk, hour_of_day_fk, day_of_week_fk, order_fk, order_date_perspective_fk
order_date_perspective: order_date_perspective_id, order_date_perspective_name
hour_of_day: hour_of_day_id, hour_of_day_name
day_of_week: day_of_week_id, day_of_week_name
day: day_id, day_reversed_id, day_name, year_id, year_reversed_id, iso_year_id, iso_year_reversed_id, quarter_id, quarter_name, month_id, month_reversed_id, month_name, week_id, week_reversed_id, week_name
customer_date: day_fk, customer_fk, customer_date_perspective_fk
customer_date_perspective: customer_date_perspective_id, customer_date_perspective_name
gender: gender_id, gender_name
customer_type: customer_type_id, customer_type_name
customer_group: customer_group_id, customer_group_name
sales_event_duration: sales_event_fk, duration_fk, sales_event_duration_perspective_fk
sales_event_duration_perspective: sales_event_duration_perspective_id, sales_event_duration_perspective_name
duration: duration_id, days, days_name, weeks, weeks_name, months, months_name, quarters, quarters_name, years, years_name
sales_event_date: sales_event_fk, day_fk, sales_event_date_perspective_fk
sales_event_date_perspective: sales_event_date_perspective_id, sales_event_date_perspective_name
product: product_id, sku, ean, sales_sku, merchant_sku, product_name
category: category_id, category_parent_fk, category_name
newsletter_event: newsletter_event_id, day_fk, campaign_fk, customer_increment_fk, sent, bounce, bounce_block, bounce_soft, bounce_hard, bounce_reason_fk, open, first_open, click, first_click, url_fk, complaint, subsequent_order_fk, first_order, gross_revenue_before_voucher, gross_voucher_value, net_voucher_value, tax_amount_before_voucher, net_purchase_cost
url: url_id, url_name
campaign: campaign_id, campaign_name, subject, sent_hour, sent_date
cost_per_campaign_and_day: day_fk, campaign_fk, number_of_clicks, imported_cost_mci, imported_cost_api, cost_of_clicks_directly_assigned, cost_of_clicks_campaigns_without_clicks, cost_of_clicks_unknown_campaign
campaign: campaign_id, campaign_name, level_3_id, level_3_name, level_2_id, level_2_name, channel_fk
channel: channel_id, channel_name
Star schema, galaxy schema, nth normal form? Doesn’t matter, do what’s fastest.
Real life schemas I
‣ https://www.contorion.de/ early stage project
campaign_click_performance: campaign_click_fk, performance_attribution_model_fk, attribution_path_segment_fk, number_of_signups, number_of_activations, number_of_transactions, gross_revenue
campaign_click: campaign_click_id, visitor_id, day_fk, campaign_fk, user_fk, path_segment_fk, path_position_fk, reverse_path_position_fk, step_fk, next_step_fk, step_reverse_fk, next_step_reverse_fk, number_of_clicks, number_of_new_visitors, number_of_daily_visitors, number_of_monthly_visitors, duration_fk, time_to_end, marketing_cost
path_segment: path_segment_id, path_segment_name
performance_attribution_model: performance_attribution_model_id, performance_attribution_model_name
campaign: campaign_id, campaign_name, level_3_id, level_3_name, level_2_id, level_2_name, channel_fk, corridor_fk
conversion_path_transition: conversion_path_transition_id, conversion_path_transition_name, channel_with_position_id, channel_with_position_name
path_position: path_position_id, path_position_name
reverse_path_position: reverse_path_position_id, reverse_path_position_name
user: user_id, number_of_users, customer_id, number_of_customers, repeat_customer_id, gender_fk, age_fk, user_city_fk, user_state_fk, user_country_fk, most_freq_corridor_fk, total_transaction_range_fk, referral_source_fk, transaction_frequency_fk, has_sent_cash_id, has_sent_airtime_id, has_sent_cash_and_airtime_id, sent_amount_money_transfer, number_of_transactions, number_of_transactions_with_voucher, fees, fx_gain, sent_amount_airtime, voucher_cost_money_transfer, voucher_cost_airtime, days_between_signup_and_first_transaction, days_between_signup_and_second_transaction, days_between_first_and_second_transaction, days_between_second_and_third_transaction, average_days_between_transactions, days_since_last_transaction, days_since_last_login
day: day_id, day_name, year_id, iso_year_id, quarter_id, quarter_name, month_id, month_name, week_id, week_name, day_of_week_id, day_of_week_name, day_of_month_id, day_of_month_reversed_id, number_of_days_in_month
duration: duration_id, days, days_name, weeks, weeks_name, months, months_name, quarters, quarters_name, years, years_name, conversions, conversions_name
cost_per_campaign_and_day: day_fk, campaign_fk, number_of_clicks, imported_cost_mci, imported_cost_api, cost_of_clicks_directly_assigned, cost_of_clicks_campaigns_without_clicks, cost_of_clicks_unknown_campaign
channel: channel_id, channel_name
corridor: corridor_id, corridor_name, sender_country_fk, sender_country_name, receiver_country_fk, receiver_country_name
campaign_cohort: performance_attribution_model_fk, user_fk, campaign_fk, channel_fk, day_fk, duration_fk, number_of_transactions, gross_revenue
age: age_id, age_name, age_group_id, age_group_name
user_city: user_city_id, user_city_name, user_state_id, user_state_name, user_country_fk, user_country_name
gender: gender_id, gender_name
referral_source: referral_source_id, referral_source_name
total_transaction_range: total_transaction_range_id, total_transaction_range_name
transaction_frequency: transaction_frequency_id, transaction_frequency_name
transaction: transaction_id, number_of_transactions, number_of_first_transactions, number_of_second_transactions, number_of_third_transactions, number_of_subsequent_transactions, number_of_transactions_with_voucher, number_of_first_transactions_with_voucher, number_of_money_transfer_transactions, number_of_airtime_transactions, number_of_on_hold_transactions, number_of_pending_transactions, number_of_paid_transactions, is_repeat_customer_id, transaction_status_fk, cancellation_status_fk, customer_fk, user_fk, sender_city_fk, receiver_city_fk, sender_currency_fk, receiver_currency_fk, correspondent_fk, voucher_fk, corridor_fk, payment_method_fk, receive_method_fk, transaction_rank_fk, bank_fk, sent_amount_range_fk, sent_amount_money_transfer, sent_amount_airtime, receive_amount_creation, receive_amount_payout, total_to_pay, fx_gain, fees, voucher_cost_money_transfer, voucher_cost_airtime, fx_rate_gbp_to_sent_amount, fx_rate_gbp_to_receive_amount_creation_date, fx_rate_gbp_to_receive_amount_payout_date, fx_rate_sent_to_receive
bank: bank_id, bank_name
cancellation_status: cancellation_status_id, cancellation_status_name
correspondent: correspondent_id, correspondent_name
transaction_city: transaction_city_id, transaction_city_name, transaction_state_id, transaction_state_name, transaction_country_fk, transaction_country_name, transaction_country_code, transaction_capital_latitude, transaction_capital_longitude
currency: currency_id, currency_code, currency_name
voucher: voucher_id, voucher_name, voucher_type_id, voucher_type_name, voucher_percentage_id, voucher_percentage_name, voucher_receive_method_group_id, voucher_receive_method_group_name, voucher_start_date_fk, voucher_end_date_fk, voucher_duration_days_id, voucher_duration_days_name, voucher_duration_range_id, voucher_duration_range_name
payment_method: payment_method_id, payment_method_name, payment_method_group_id, payment_method_group_name
receive_method: receive_method_id, receive_method_name, receive_method_group_id, receive_method_group_name, receive_service_id
sent_amount_range: sent_amount_range_id, origin_currency_fk, sent_amount_range_name, range_lower_limit, range_upper_limit
transaction_rank: transaction_rank_id, transaction_rank_name, transaction_rank_group_id, transaction_rank_group_name
transaction_status: transaction_status_id, transaction_status_name
country: country_id, country_code, country_name
foreign_exchange_rate: foreign_exchange_rate_id, day_fk, sender_currency_fk, receiver_currency_fk, foreign_exchange_rate, foreign_exchange_rate_without_markup
voucher_usage_fact: day_fk, voucher_fk, voucher_duration_days_id, voucher_duration_days_name, voucher_duration_range_id, voucher_duration_range_name, voucher_start_date_fk, voucher_end_date_fk, voucher_is_money_transfer_id, voucher_is_airtime_id, voucher_is_valid_id, voucher_is_used_id, voucher_receive_method_id, voucher_receive_method_name, number_of_customers, number_of_transactions, number_of_first_transactions, fees, voucher_cost_money_transfer, voucher_cost_airtime, fx_gain, sent_amount_money_transfer, sent_amount_airtime
user_date: day_fk, user_fk, user_time_perspective_fk
user_time_perspective: user_time_perspective_id, user_time_perspective_name
transaction_event_date: transaction_event_fk, day_fk, transaction_event_time_perspective_fk
transaction_event: transaction_event_id, number_of_transaction_events, transaction_fk, previous_status_fk, current_status_fk, hours_since_transaction, hours_since_last_event, hours_to_next_event, sent_amount_money_transfer, sent_amount_airtime, voucher_cost_money_transfer, voucher_cost_airtime, fx_gain, fees
transaction_event_time_perspective: transaction_event_time_perspective_id, transaction_event_time_perspective_name
transaction_date: day_fk, transaction_fk, transaction_time_perspective_fk
transaction_time_perspective: transaction_time_perspective_id, transaction_time_perspective_name
transaction_event_duration: transaction_event_fk, duration_fk, transaction_event_duration_perspective_fk
transaction_event_duration_perspective: transaction_event_duration_perspective_id, transaction_event_duration_perspective_name
transaction_duration: duration_fk, transaction_fk, transaction_time_perspective_fk
‣ https://www.worldremit.com/ finished soon* project
Real life schemas II
* A Data Warehouse is never finished
artwork: artwork_id, artist_fk, showdown_fk, artwork_category_fk, artwork_subject_fk, artwork_is_curated_fk, artwork_is_user_collection_fk, artwork_is_admin_collection_fk, artwork_related_fk, artwork_sale_category_fk, artwork_for_sale_as_print_fk, artwork_for_sale_as_original_fk, date_uploaded_fk, artwork_in_showdown_fk, artwork_in_weekly_roundup_fk, artwork_is_visible, artwork_is_in_curated, artwork_is_in_user_collection, artwork_is_in_admin_collection, user_collections_per_artwork, admin_collections_per_artwork, url, title, styles, artist_name, artist_first_name, artist_last_name
artwork_category: artwork_category_id, artwork_category_name
artwork_for_sale_as_original: artwork_for_sale_as_original_id, artwork_for_sale_as_original_name
artwork_for_sale_as_print: artwork_for_sale_as_print_id, artwork_for_sale_as_print_name
artwork_in_showdown: artwork_in_showdown_id, artwork_in_showdown_name
artwork_in_weekly_roundup: artwork_in_weekly_roundup_id, artwork_in_weekly_roundup_name
artwork_is_admin_collection: artwork_is_admin_collection_id, artwork_is_admin_collection_name
artwork_is_curated: artwork_is_curated_id, artwork_is_curated_name
artwork_is_user_collection: artwork_is_user_collection_id, artwork_is_user_collection_name
artwork_related: artwork_related_id, artwork_related_name
artwork_sale_category: artwork_sale_category_id, artwork_sale_category_name
artwork_subject: artwork_subject_id, artwork_subject_name
round: round_id, showdown_id, showdown_round, showdown_title_sort_id, showdown_title
user: user_id, user_type_fk, user_status_fk, user_city_fk, artist_with_artwork_for_sale_id, artist_with_artwork_uploaded_id, user_name, user_first_name, user_last_name, email, number_of_weekly_roundup, number_of_showdown, number_of_artwork_comments, number_of_collection_comments, number_of_artworks_in_user_collections, number_of_user_likes, number_of_collection_favourites, number_of_user_logins, number_of_messages_sent, number_of_uploads, hours_to_first_upload, number_of_bought_items, number_of_originals_bought, number_of_prints_bought, number_of_orders_made, net_item_price_bought, net_item_revenue_bought, gross_revenue_after_vouchers_bought, net_revenue_after_vouchers_bought, net_voucher_cost_bought, number_of_sold_items, number_of_originals_sold, number_of_prints_sold, number_of_orders_sold, net_item_price_sold, net_item_revenue_sold, net_voucher_cost_sold
product: product_id, sku, artwork_fk, product_category_fk, substrate_fk
product_category: product_category_id, product_category_name, edition_type
substrate: substrate_id, substrate_name
collection_artwork_order_item: collection_artwork_order_item_id, collection_fk, artwork_fk, order_item_fk
collection: collection_id, collection_name, user_fk, collection_type_fk, collection_detailed_type_fk, date_created_fk, date_initiated_fk
order_item: order_item_id, processed_order_item_id, is_original_id, is_print_id, processed_product_id, order_fk, product_fk, order_item_status_fk, price_range_fk, order_process_fk, option_fk, fulfillment_provider_fk, refund_reason_fk, gross_revenue_item, net_item_price, net_item_price_first_order, vat_amount, net_shipping_revenue, net_shipping_revenue_first_order, duties_amount, gross_revenue_item_option, net_option_price, net_option_price_first_order, net_payment_cost, net_option_cost, net_printing_cost, net_voucher_amount_saatchi_share, net_voucher_amount_artist_share, net_voucher_amount_saatchi_share_first_order, net_voucher_amount_artist_share_first_order, artist_commission, artist_royalties, estimated_net_revenue_after_vouchers, origin_country_iso2, origin_latitude, origin_longitude, destination_latitude, destination_longitude
artwork_style_mapping: artwork_fk, artwork_style_fk
artwork_style: artwork_style_id, artwork_style_name
artwork_in_collection: artwork_fk, collection_fk
collection_artwork_order_item_date: collection_artwork_order_item_fk, day_fk, collection_artwork_order_item_time_perspective_fk
collection_artwork_order_item_time_perspective: collection_artwork_order_item_time_perspective_id, collection_artwork_order_item_time_perspective_name
day: day_id, day_name, year_id, iso_year_id, quarter_id, quarter_name, month_id, month_name, week_id, week_name, day_of_the_month, number_of_days_in_month, iso_date
collection_detailed_type: collection_detailed_type_id, collection_detailed_type_name
collection_type: collection_type_id, collection_type_name
campaign_click_date: campaign_click_fk, day_fk, online_marketing_time_perspective_fk
campaign_click: campaign_click_id, campaign_fk, search_phrase_fk, referrer_fk, user_fk, number_of_clicks, number_of_daily_visits, number_of_monthly_visits, number_of_new_visits, number_of_daily_visitors, number_of_monthly_visitors, subsequent_registration_fk, subsequent_confirmation_fk, subsequent_first_order_fk, subsequent_order_fk, direct_cost, cost_of_campaigns_without_clicks, unmatched_cost, visit_duration
online_marketing_time_perspective: online_marketing_time_perspective_id, online_marketing_time_perspective_name
email_event_date: email_event_fk, day_fk, email_time_perspective_fk
email_event: email_event_id, email_list_fk, email_campaign_fk, email_recipient_fk, subscribe, unsubscribe, email_unsubscribe_reason_fk, sent, bounce_soft, bounce_hard, open, first_open, click, first_click, subsequent_order, subsequent_first_order, items, net_item_price, net_option_price, net_shipping_revenue, net_voucher_amount_saatchi_share, net_voucher_amount_artist_share
email_time_perspective: email_time_perspective_id, email_time_perspective_name
transactional_mail: number_of_mails_sent, transactional_mail_type_fk, day_fk
transactional_mail_type: transactional_mail_type_id, transactional_mail_type_name
sales_event_date: sales_event_fk, day_fk, sales_event_date_perspective_fk
sales_event: sales_event_id, order_item_fk, order_item_current_status_fk, order_item_status_partition_fk, order_timestamp, event_timestamp, hours_since_order, hours_since_last_event, hours_to_next_event, effected_net_revenue_after_vouchers, estimated_net_revenue_after_vouchers
sales_event_date_perspective: sales_event_date_perspective_id, sales_event_date_perspective_name
order_date: day_fk, order_fk, order_date_perspective_fk
order: order_id, order_increment_id, processed_order_id, is_first_order_id, is_second_order_id, is_second_or_subsequent_order_id, order_with_voucher_id, user_fk, returning_buyer_fk, hour_of_day_fk, voucher_fk, payment_method_fk, payment_provider_fk, shipping_city_fk, order_source_fk
order_date_perspective: order_date_perspective_id, order_date_perspective_name
sales_event_duration: sales_event_fk, duration_fk, sales_event_duration_perspective_fk
duration: duration_id, days, days_name, weeks, weeks_name, months, months_name, quarters, quarters_name, five_day_period, five_day_period_name
sales_event_duration_perspective: sales_event_duration_perspective_id, sales_event_duration_perspective_name
order_duration: order_fk, duration_fk, sales_time_perspective_fk
sales_time_perspective: sales_time_perspective_id, sales_time_perspective_name
fulfillment_provider: fulfillment_provider_id, fulfillment_provider_name
option: option_id, option_name
order_item_status: order_item_status_id, order_item_status_sort_id, order_item_status_name
order_process: order_process_id, order_process_name, checkout_type_id, checkout_type, fulfillment_type_id, fulfillment_type
price_range: price_range_id, price_range_name
refund_reason: refund_reason_id, refund_reason_name, refund_code_id
hour_of_day: hour_of_day_id, hour_of_day_name
order_source: order_source_id, order_source_name
payment_method: payment_method_id, payment_method_name
payment_provider: payment_provider_id, payment_provider_name
shipping_city: shipping_city_id, shipping_city_name, shipping_country_id, shipping_country_name
voucher: voucher_id, voucher_name
order_item_status_partition: order_item_status_partition_id, order_item_status_perspective_id, order_item_status_perspective_name, order_item_status_group_id, order_item_status_group_name
order_item_refunds: order_item_refunds_id, order_item_fk, refund_code_id, refund_code, refund_desc, refund_amount, refund_date, refund_comment
order_item_status_mapping: order_item_status_fk, order_item_status_partition_fk
email_campaign: email_campaign_id, email_campaign_name, email_list_fk
email_recipient: email_recipient_id, email, email_recipient_location_fk
email_unsubscribe_reason: email_unsubscribe_reason_id, email_unsubscribe_reason_name
email_list: email_list_id, email_list_name
email_recipient_location: email_recipient_location_id, country_id, country_name, region_id, region_name, latitude, longitude
user_city: user_city_id, user_city, user_country_id, user_country
user_status: user_status_id, user_status_name
user_type: user_type_id, user_type_name
user_event_date_registration: user_event_fk, day_fk, user_event_time_perspective_fk
user_event: user_event_id, user_fk, user_type_fk, user_event_date, registration_date, weekly_roundup, showdown, artwork_comment, collection_comment, artwork, user_likes, collection_favourite, user_login, message_sent, artwork_upload, artwork_for_sale_as_print, artwork_for_sale_as_original, artwork_for_sale_as_both_print_and_original, artwork_for_sale_as_either_print_or_original, signup, verified_signup, user_order, time_since_signup, time_since_last_order
user_event_date_event: user_event_fk, day_fk, user_event_time_perspective_fk
campaign: campaign_id, campaign_name, campaign_code, channel_id, channel_name, is_brand_id, is_brand_name, partner_or_adwords_account_id, partner_or_adwords_account_name, publication_or_adwords_campaign_id, publication_or_adwords_campaign_name, wmc_or_adwords_adgroup_id, wmc_or_adwords_adgroup_name
referrer: referrer_id, referrer_name, referrer_type_name
search_phrase: search_phrase_id, search_phrase_name, search_phrase_type_name
user_date: user_fk, day_fk, sales_time_perspective_fk
campaign_click_position: campaign_click_fk, conversion_type_fk, conversion_entity_fk, relative_conversion_path_position_fk, time_to_conversion_fk, conversion_path_position, conversion_path_length, hours_to_registration, hours_to_order
conversion_path_position: conversion_path_position_id, conversion_path_position_name
conversion_type: conversion_type_id, conversion_type_name
relative_conversion_path_position: relative_conversion_path_position_id, median_id, median_name, quartile_id, quartile_name, decile_id, decile_name, percentile_id, percentile_name
campaign_click_performance: campaign_click_fk, performance_attribution_model_fk, conversion_type_fk, number_of_registrations, number_of_leads, number_of_orders, number_of_received_orders, number_of_first_orders, number_of_orders_with_voucher, net_order_revenue
performance_attribution_model: performance_attribution_model_id, performance_attribution_model_name
‣ http://www.saatchiart.com/ exit August 2014
Real life schemas III
Optimize for change speed!
Data integration
‣ Visual ETL tools
• many data source connectors
• hard to debug
• slow to change
‣ Start with simple sql queries & batch scripts
‣ Later build something more robust
cat create-tables.sql | psql dwh

cat load-order.sql \
  | mysql --skip-column-names source_db \
  | psql dwh --command="COPY tmp.order FROM STDIN \
      WITH NULL AS 'NULL'"

cat /data/payment.csv \
  | python payment_filter.py \
  | psql dwh --command="COPY tmp.payment FROM STDIN"

cat transform-order.sql | psql dwh
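The slides do not show payment_filter.py, so the following is only a hypothetical sketch of what such a filter step could look like: it reads CSV lines, drops rows that fail basic validation, and emits tab-separated lines, which is the default input format of PostgreSQL's COPY ... FROM STDIN. The three-column layout (payment_id, order_id, amount) is an assumption made for this example.

```python
import csv


def filter_payments(lines):
    """Yield tab-separated rows for COPY ... FROM STDIN, skipping invalid ones.

    Assumed (hypothetical) input layout: payment_id,order_id,amount
    """
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # wrong number of columns
        payment_id, order_id, amount = (field.strip() for field in row)
        if not payment_id.isdigit() or not order_id.isdigit():
            continue  # header line or missing / non-numeric ids
        try:
            amount_value = float(amount)
        except ValueError:
            continue  # amount is not a number
        yield f"{payment_id}\t{order_id}\t{amount_value:.2f}"


# Made-up sample input; in the pipeline above the script would instead do:
#     for line in filter_payments(sys.stdin): print(line)
sample = [
    "payment_id,order_id,amount",
    "1,10,12.5",
    "2,11,abc",
    "3,,7",
    "4,12,30",
]
cleaned = list(filter_payments(sample))
print(cleaned)  # ['1\t10\t12.50', '4\t12\t30.00']
```

Keeping the filter a plain stdin-to-stdout step means it can be tested on its own and swapped in or out of the shell pipeline without touching the SQL.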
Data integration in Yves & Zed
‣ Jobs = processing steps with dependencies
• parallel execution with cost based scheduler
• robust, transparent, no black boxes
‣ Parallel jobs & incremental processing
‣ Extensive visualisations & monitoring tools
Plain text files
‣ Very git-friendly
<?xml version="1.0" encoding="UTF-8"?>
<process xmlns="http://project-a.com/dwh-process"
         id="operational-data" ..>

  <initial-job id="initialize-schemas">
    <description>Recreates schemas and writes configs</description>
    <commands>
      ..
    </commands>
  </initial-job>

  <!-- Orders -->
  <job id="load-order">
    <description>Loads orders into tmp.order</description>
    <commands>
      <execute-sql-file file-name="orders/create-order-tmp-table.sql" echo-queries="true"/>
      <load-from-mysql file-name="orders/load-order.sql"
                       target-table="tmp.order" database="app"
                       timezone="UTC"/>
      <execute-sql>SELECT tmp.index_tmp_order();</execute-sql>
    </commands>
  </job>

  <job id="cleanse-order">
    <description>Deletes test orders and other invalid orders</description>
    <dependencies>
      <dependency job="cleanse-member"/>
      <dependency job="load-order-item"/>
      <dependency job="load-product"/>
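The idea of jobs with dependencies can be sketched with Python's standard graphlib module (Python 3.9+). This is not the scheduler used in Yves & Zed: it ignores cost and only groups jobs into batches whose members could run in parallel. The job ids come from the process file above, but apart from the three dependencies of cleanse-order shown there, the edges are assumptions.

```python
from graphlib import TopologicalSorter

# job id -> set of jobs that must finish first.
# Only the cleanse-order dependencies on cleanse-member, load-order-item and
# load-product are taken from the slide; everything else is assumed.
jobs = {
    "initialize-schemas": set(),
    "load-order": {"initialize-schemas"},
    "load-order-item": {"initialize-schemas"},
    "load-product": {"initialize-schemas"},
    "cleanse-member": {"initialize-schemas"},
    "cleanse-order": {
        "load-order", "cleanse-member", "load-order-item", "load-product",
    },
}


def execution_batches(graph):
    """Return lists of jobs; all jobs in one list are independent of each other."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(jobs)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

A cost based scheduler would additionally order the jobs inside each batch, for example by starting the historically slowest job first.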
Each KPI is always computed in the same way
MDX = query language for multidimensional data
‣ Developed by Microsoft as part of Analysis Services
• http://en.wikipedia.org/wiki/MultiDimensional_eXpressions

SELECT
  TopCount([Product].[Product].Members, 2,
           [Measures].[Revenue])
  ON COLUMNS,
  [Measures].[Revenue]
  ON ROWS
FROM [Pet sales]
WHERE [Date].[Month].[Oct]

SELECT [Date].[Month].Members
  ON COLUMNS,
  CrossJoin({[Measures].[Sold items],
             [Measures].[# Orders],
             [Measures].[Revenue]},
            Descendants([Product].[All products]))
  ON ROWS
FROM [Pet sales]
Mondrian = engine for executing MDX
‣ Open source analytics processor
• http://mondrian.pentaho.com
• http://en.wikipedia.org/wiki/Mondrian_OLAP_server
• In Java
• Eclipse Public License
• Active community
• https://github.com/pentaho/mondrian/
‣ Part of Pentaho BI platform
[Book cover: William D. Back, Nicholas Goodman, Julian Hyde, “Mondrian in Action: Open source business analytics”, Manning]
Mondrian schema I
‣ The relation between fact tables and dimension tables is defined in an XML file

<Cube name="Pet sales" defaultMeasure="# Orders">
  <Table schema="dim" name="order_item"/>

  <Dimension name="Date" type="TimeDimension" foreignKey="day_fk">
    <Hierarchy allMemberName="All dates" hasAll="true" primaryKey="day_id">
      <Table schema="dim" name="day"/>
      <Level name="Month" column="month_id" nameColumn="month_name"
             type="Integer" levelType="TimeMonths" uniqueMembers="true"/>
      <Level name="Day" column="day_id" nameColumn="day_name"
             type="Integer" levelType="TimeDays" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  <Dimension name="Product" foreignKey="product_fk">
    <Hierarchy hasAll="true" allMemberName="All products" primaryKey="product_id">
      <Table schema="dim" name="product"/>
      <Level name="Product" column="product_id" nameColumn="product_name"
             type="Integer" uniqueMembers="true"/>
    </Hierarchy>
  </Dimension>

  ..
</Cube>
Each KPI is always computed in the same way
Mondrian schema II
‣ Measures are defined as aggregates on columns
‣ Mondrian = SQL query generator

SELECT [Date].[Month].Members
  ON COLUMNS,
  [Measures].[Avg cart value]
  ON ROWS
FROM [Pet sales]

SELECT
  "day"."month_id" AS "c0",
  count(DISTINCT "order_item"."order_id") AS "m0",
  sum("order_item"."price") AS "m1"
FROM
  "dim"."day" AS "day",
  "dim"."order_item" AS "order_item"
WHERE
  "order_item"."day_fk" = "day"."day_id"
GROUP BY
  "day"."month_id"
<Cube name="Pet sales" defaultMeasure="# Orders">
  ..
  <Measure name="# Orders" column="order_id" datatype="Integer"
           aggregator="distinct-count" formatString="Standard"/>

  <Measure name="Revenue" column="price" datatype="Integer"
           aggregator="sum" formatString="Currency"/>

  <Measure name="Sold items" column="item_id" datatype="Integer"
           aggregator="count" formatString="Standard"/>

  <CalculatedMember name="Avg cart value" dimension="Measures">
    <Formula>[Measures].[Revenue] / [Measures].[# Orders]</Formula>
  </CalculatedMember>
</Cube>
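To illustrate the "SQL query generator" idea, here is a deliberately tiny sketch that turns declarative measure definitions into a grouped SQL query and then evaluates the calculated member on the result. It is not how Mondrian is implemented; the dictionaries and the build_query helper are hypothetical, the query is fixed to the month level, and SQLite stands in for PostgreSQL (so the "dim" schema prefix is dropped).

```python
import sqlite3

# Measure name -> (aggregator, fact table column), mirroring the <Measure> tags.
MEASURES = {
    "# Orders": ("distinct-count", "order_id"),
    "Revenue": ("sum", "price"),
    "Sold items": ("count", "item_id"),
}
SQL_AGGREGATORS = {
    "distinct-count": "count(DISTINCT {col})",
    "sum": "sum({col})",
    "count": "count({col})",
}


def build_query(measure_names):
    """Generate SQL that computes the given measures per month."""
    select = ['"day"."month_id" AS "c0"']
    for index, name in enumerate(measure_names):
        aggregator, column = MEASURES[name]
        expression = SQL_AGGREGATORS[aggregator].format(
            col=f'"order_item"."{column}"')
        select.append(f'{expression} AS "m{index}"')
    return (
        "SELECT " + ", ".join(select)
        + ' FROM "day", "order_item"'
        + ' WHERE "order_item"."day_fk" = "day"."day_id"'
        + ' GROUP BY "day"."month_id" ORDER BY "day"."month_id"'
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE day (day_id INTEGER, day_name TEXT, month_id INTEGER, month_name TEXT);
CREATE TABLE order_item (item_id INTEGER, order_id INTEGER, has_voucher INTEGER,
                         price INTEGER, day_fk INTEGER, product_fk INTEGER);
INSERT INTO day VALUES (930, '09-30', 9, 'Sep'), (1001, '10-01', 10, 'Oct');
INSERT INTO order_item VALUES
    (1, 1, NULL, 20, 930, 1), (2, 1, NULL, 10, 930, 2),
    (3, 2, 2, 20, 930, 1),    (4, 3, NULL, 30, 930, 3),
    (5, 4, 4, 10, 1001, 2),   (6, 4, 4, 30, 1001, 3);
""")

# "Avg cart value" = [Revenue] / [# Orders]: fetch both base measures,
# then apply the formula outside of SQL.
sql = build_query(["# Orders", "Revenue"])
avg_cart_value = {
    month_id: revenue / orders
    for month_id, orders, revenue in conn.execute(sql)
}
print(sql)
print(avg_cart_value)  # {9: 26.666666666666668, 10: 40.0}
```

Because the formula lives in one place, every report that asks for "Avg cart value" gets the same number.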
Always draw your Mondrian schema!
Mondrian schema III
‣ Everything about KPIs & dimensions (business) and tables & columns (IT) in one file
• consistent & explicit semantics
• transparency is easy
Try it out immediately, it’s amazing: http://demo.analytical-labs.com/
Ad-hoc queries with Saiku Analytics
‣ Drag & drop reporting tool on top of Mondrian
• Open source (Apache 2.0)
• Talks to Mondrian via MDX
• http://meteorite.bi/saiku
Reports in Yves & Zed I
‣ Own lightweight reporting frontend
• Bootstrap / Google Charts
• lacks many features
• features are easy to implement
Numbers are random!
Reports in Yves & Zed II
‣ Dashboard-like interactive reports
• maintained by developers
• each table / chart is an MDX query
XMLA = XML for Analysis = MDX via SOAP
‣ Industry standard originally proposed by Microsoft
• http://en.wikipedia.org/wiki/XML_for_Analysis
• SOAP protocol to discover and query OLAP cubes
• Mondrian has an XMLA server
‣ Request

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header/>
  <SOAP-ENV:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command>
        <Statement>
          <![CDATA[
SELECT [Date].[Month].Members
  ON COLUMNS,
  [Measures].[Avg cart value]
  ON ROWS
FROM [Pet sales]
          ]]>
        </Statement>
      </Command>
      <Properties>
        <PropertyList>
          <Catalog>dwh</Catalog>
          <DataSourceInfo>Monsai</DataSourceInfo>
          <Format>Multidimensional</Format>

‣ Response

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope xmlns:SOAP-ENV="..">
  <SOAP-ENV:Header ../>
  <SOAP-ENV:Body>
    <cxmla:ExecuteResponse xmlns:cxmla="urn:schemas-microsoft-com:xml-analysis">
      <cxmla:return>
        <root>
          <OlapInfo ../>
          <Axes>
            <Axis name="Axis0" ../>
            <Axis name="Axis1">
              <Tuples>
                <Tuple>
                  <Member Hierarchy="Measures" ..>
                </Tuple>
              </Tuples>
            </Axis>
            <Axis name="SlicerAxis" ../>
          </Axes>
          <CellData>
            <Cell CellOrdinal="0">
              <Value xsi:type="xsd:double">26.666666666666668</Value>
              <FmtValue>26,67 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
            <Cell CellOrdinal="1">
              <Value xsi:type="xsd:double">40</Value>
              <FmtValue>40,00 €</FmtValue>
              <FormatString>Standard</FormatString>
            </Cell>
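A reporting frontend has to assemble such an Execute request for every table or chart. The sketch below builds the envelope with Python's standard library only and does not send it anywhere. The helper name build_execute_request is hypothetical; the SOAP namespace is the standard SOAP 1.1 envelope namespace, which is an assumption here because the slide elides it; and the MDX is written as escaped text instead of a CDATA section, which is equivalent for an XML parser.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace (assumed; the slide shows only "..").
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# XMLA namespace as shown on the slide.
XMLA_NS = "urn:schemas-microsoft-com:xml-analysis"


def build_execute_request(mdx, catalog, data_source_info):
    """Return an XMLA Execute request for the given MDX statement as a string."""
    ET.register_namespace("SOAP-ENV", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")

    execute = ET.SubElement(body, f"{{{XMLA_NS}}}Execute")
    command = ET.SubElement(execute, f"{{{XMLA_NS}}}Command")
    ET.SubElement(command, f"{{{XMLA_NS}}}Statement").text = mdx

    properties = ET.SubElement(execute, f"{{{XMLA_NS}}}Properties")
    property_list = ET.SubElement(properties, f"{{{XMLA_NS}}}PropertyList")
    for tag, value in (
        ("Catalog", catalog),
        ("DataSourceInfo", data_source_info),
        ("Format", "Multidimensional"),
    ):
        ET.SubElement(property_list, f"{{{XMLA_NS}}}{tag}").text = value

    return ET.tostring(envelope, encoding="unicode")


mdx = (
    "SELECT [Date].[Month].Members ON COLUMNS, "
    "[Measures].[Avg cart value] ON ROWS "
    "FROM [Pet sales]"
)
request_xml = build_execute_request(mdx, catalog="dwh", data_source_info="Monsai")
print(request_xml)
```

The resulting string would be sent as the body of an HTTP POST to the XMLA endpoint of the Mondrian server; the endpoint URL depends on the deployment and is not given in the slides.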
Data Warehouse in Yves & Zed
‣ monsai = Mondrian XMLA Server + Saiku in a single war file, https://github.com/project-a/monsai
[Architecture diagram with three stages: data integration, monsai, reporting. Labels: application databases, json files, csv files, apis; SQL; database; mapping; Mondrian; Mondrian schema; SQL / DB results; XMLA / MDX; XMLA response; MDX results]
Search for computer scientists, not business intelligence experts
What kind of people do you need to hire for this?
‣ The “typical BI expert”:
• studied something related to business and learnt VBA programming through Excel
• relies on others to set up databases and tools
‣ Your ideal candidate:
• has studied computer science
• masters the basic tools of software development and computer science
• likes to learn new technologies
• understands how databases work
‣ Good profile example: http://www.project-a.com/en/careers/jobs/?yid=332
For our "A-Team" we are looking to fill the following position as soon as possible
Data Engineer / Data Scientist (m/f)
Your tasks:
• You will help our business intelligence team to build data driven applications for our ventures: data warehouses, recommendation engines and CRM systems (developed in-house, based on open-source technologies)
• You will integrate, transform and index data from various data sources, develop meaningful data representations and visualisations, and provide aggregated data for third-party systems
• You will advance our software architecture and tool set to growing challenges and data amounts (performance, scaling, data quality)
• You will work in an agile software development process in close collaboration with a product management team
Your profile:
• You have a Master's degree in computer science or a comparable degree
• You have a genuine interest in data and algorithms and you are excited about solving difficult problems and strive for efficient and robust solutions
• You master at least these basic tools of computer science: object oriented programming in multiple languages, HTTP and current web technologies, the unix command line and basic server administration, version control systems, a basic understanding of the interplay between software and memory, hard discs and the CPU
• You have profound knowledge about the inner workings of database systems
• You are eager to delve into new technologies and programming languages (our current stack: Mac or Linux, PostgreSQL, Mondrian & MDX, PHP, Java, Python, Solr, ElasticSearch, R)
• You have a basic understanding of mathematics and machine learning
Your chance:
• You will join a highly professional and motivated team
• You will have the unique opportunity to witness the launch of a newly established company and you can contribute your own ideas to its development
• You will benefit from the greatest possible creative freedom to develop your skills further
• You will enjoy a state-of-the-art, top-equipped workplace right in the center of Berlin
• You will benefit from a communicative, stimulating and inspiring environment
We’re looking forward to your online application.
23 / 25
Any kind of Scrum / Kanban works, do it
Use a standard software engineering process!
‣ Product managers: what?
• Collection of business requirements
• KPI & report definitions
• QA & analysis
‣ Developers: how?
• Implementation, performance & stability
• Schema & process design
• Consistency checks
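A consistency check can be as small as two queries that must return the same number. The sketch below is an assumption-heavy illustration: it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table `dwh_order_fact`, the helper `assert_same_result` and the exception `ConsistencyError` are hypothetical names.

```python
import sqlite3


class ConsistencyError(Exception):
    """Raised when two queries that should agree return different values."""


def assert_same_result(connection, name, source_sql, target_sql):
    """Run two scalar queries and fail loudly if their results differ."""
    source_value = connection.execute(source_sql).fetchone()[0]
    target_value = connection.execute(target_sql).fetchone()[0]
    if source_value != target_value:
        raise ConsistencyError(
            f"{name}: source={source_value!r}, target={target_value!r}"
        )
    return source_value


def build_demo_database():
    """Create a tiny source table and a DWH fact table (hypothetical schema)."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, is_test INTEGER);
        CREATE TABLE dwh_order_fact (order_id INTEGER PRIMARY KEY);
        -- two real orders and one test order in the source system
        INSERT INTO orders VALUES (1, NULL), (2, NULL), (3, 1);
        -- the warehouse should contain only the real orders
        INSERT INTO dwh_order_fact VALUES (1), (2);
        """
    )
    return connection


if __name__ == "__main__":
    demo = build_demo_database()
    count = assert_same_result(
        demo,
        "# Orders",
        "SELECT count(*) FROM orders WHERE is_test IS NULL",
        "SELECT count(*) FROM dwh_order_fact",
    )
    print(f"# Orders is consistent: {count}")
```

Running such checks after every ETL run would stop the pipeline before a wrong number reaches a report.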
24 / 25
[Figure: diagram of the KPIs / measures in the data warehouse. Labels:
Net revenue, Net voucher cost, Avg net voucher cost per order, Avg net order value, Contribution margin 1, Contribution margin 3a, % Contribution margin 1, % Net revenue second or subsequent orders, Avg net revenue per buying member, Tax shipping amount, Tax amount, % First orders, % Second or subsequent orders, # Orders per buying member, Avg # of items per order, Avg # of unique items per order,
Gross revenue, Avg gross item price, Gross price to gross retail price ratio, Price to retail price ratio, Avg gross order value, % Gross voucher cost, Gross invoiced amount, Net invoiced amount, Gross retail price, Avg retail price, Avg net discount, Net price to net purchase price ratio, Net price to net retail price ratio, % Vouchers used, Avg net item price, Avg net purchase cost, % Net discount,
Avg net order value first orders, Avg net order value second or subsequent orders, Avg gross invoiced amount, Avg net invoiced amount, HGB net revenue margin, Avg gross voucher cost per buying member, Contribution margin 2a, Contribution margin 2b, Avg contribution margin 2b per buying member, Net cost of sales, Net cost of sales per order,
# Orders, # First orders, # Second orders, # Second or subsequent orders, # Buying members, # Returning members, # Items, # Unique items, Net item revenue, Tax item amount, Net purchase cost, Net retail price, Retail tax amount, Gross voucher cost, # Orders with vouchers, # Orders with one unique item, Net shipping revenue, Gross shipping revenue, Net revenue first orders, Net revenue second or subsequent orders, Net payment cost, Net return cost, Net fulfillment cost, Price, Retail price]
http://www.project-a.com/
Thank you
25 / 25
Data integration is easy if you keep things simple!