Get the facts, or the facts will get you. And when you get them, get them right, or they will get you wrong. Dr. Thomas Fuller, Gnomologia, 1732 British

Slide 1
Get the facts, or the facts will get you. And when you get them, get them right, or they will get you wrong.
Dr. Thomas Fuller, Gnomologia, 1732. British physician (1654 - 1734)

Slide 2
Data Vault, The New Datawarehouse Supermodel
Martijn Evers
Datawarehouse Architect, Radboud University Nijmegen & President, Dutch Data Vault User group

Slide 3
Introduction
Welcome
Who is ME?
My Job
My Employer
Data Vault

Slide 4
This presentation
Basic Introduction
-Core Concepts
-Not enough for deploying a working Data Vault!
Giving Directions
-Understanding Usability
-Further study
Fun
Alas no demos
Contains bonus slides
Do ask questions!

Slide 5
Agenda
Cosmology of Data warehousing
Data Vault
-Modelling
-Loading
Data Vault Considerations & Comparisons
Example of a Data Vault Analysis & Transformation (METIS)
Example DWH Data Vault Architecture
Conclusion & wrap-up

Slide 6
Star (schemas)
-Aggregates as Planets
Data Marts as Constellations
Galaxies as (Conformed) Data Marts
Where is the Data Vault?

Slide 7

Slide 8
Information paradox
Event Horizon
Holographic Universe
Elementary Particles

Slide 9
Data Vault vs. Black Hole
-Vault: data is retained indefinitely vs. matter is trapped
-Holographic: temporal, accessible information vs. visible and frozen
-Elementary: elementary facts vs. elementary particles
-Integrated: integration points vs. singularity
-Expandable: flexible, extensible vs. expands on matter/information
-Central Point: central EDW vs. spinning point of the galaxy

Slide 10
Historic Overview
(Linstedt, Graziano, & Hultgren, The New Business Supermodel, The Business of Data Vault Modeling, 2008, p. 36)
Created by Dan Linstedt
Released in 2000
Formally introduced in the Netherlands in 2007
First DV book: The Business of Data Vault Modeling, 2008
First (Dutch) User group in 2010
Technical book from Dan Linstedt in 2011

Slide 11
Data Vault Components
Modeling
ETL/Load architecture

Slide 12
ETL/Load Architecture
-100% of the data (within scope) 100% of the time
-Source driven/Auditable
-Fact oriented
-Template/metadata driven
-No Business Rules
Kimball or Inmon ETL
-Complex ETL
-Truth oriented
-Business Rules before EDW
Pictures: Dan Linstedt

Slide 13
Data Vault Architecture
Central EDW
No Business Rules
Incremental/Non destructive Loading
100% of the data (within scope) 100% of the time
Auditable/Source Driven

Slide 14
Dualistic approach for central EDW
DWH source driven or demand driven?
Source driven
Goal oriented
Neither may dominate!

Slide 15
Dualistic approach = realistic approach
No problematic assumptions
Detailed approach
Clear principles
User visible

Slide 16
Modeling a Data Vault
Legend
Based on pictures by Dan Linstedt

Slide 17
Data Vault Modelling
Primary Entity Types
-HUB: unique list of business keys (customer number, order number, part number)
-LINK: unique list of business key combinations
-SATELLITE: tracks associated attributes through time
Secondary Entity Types
-Hierarchical LINK
-Transactional LINK
Helper Tables
-PIT
-Bridge

Slide 18
Loading a Data Vault
Metadata
Load Templates
-Hub
-Link
-Satellite
Loading Phases

Slide 19
Common Minimal Metadata
Load Sequence / Data Vault ID: dv_id, DV_SQN
Load Date Time Stamp: load_dts
Load End Date Time Stamp: load_dts_end (optional)
Record Source: record_src

Slide 20
Loading a HUB
Pictures: Dan Linstedt
    -- Insert only business keys that the hub does not hold yet.
    INSERT INTO customer_hub (customer#, load_dts, record_src)
    SELECT DISTINCT source.customer#, @load_dts, @record_src  -- DISTINCT: one hub row per business key
    FROM source_customer AS source
    WHERE NOT EXISTS (SELECT * FROM customer_hub AS hub
                      WHERE hub.customer# = source.customer#)

Slide 21
Loading a Link
Pictures: Dan Linstedt

Slide 22
Link Load query
    -- Resolve both business keys to hub ids, then insert only new combinations.
    INSERT INTO custcontact_link (cust_id, contact_id, load_dts, record_src)
    SELECT DISTINCT cust.id, contact.id, @load_dts, @record_src
    FROM source_table AS source
    INNER JOIN contact_hub AS contact ON contact.contact# = source.contact#
    INNER JOIN customer_hub AS cust ON cust.customer# = source.customer#
    WHERE NOT EXISTS (SELECT * FROM custcontact_link AS link
                      WHERE link.contact_id = contact.id AND link.cust_id = cust.id)

Slide 23
Loading a Satellite
Pictures: Dan Linstedt

Slide 24
Satellite Load query
    -- Insert a new satellite row when the hub has none yet or the attribute changed.
    INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
    SELECT hub.id, @load_dts, source.cust_name, @record_src
    FROM source_customer AS source
    INNER JOIN customer_hub AS hub ON hub.customer# = source.customer#
    -- LEFT JOIN to the most recent satellite row, so new hubs also get their first row
    LEFT JOIN customer_sat AS sat ON sat.hub_id = hub.id
        AND sat.load_dts = (SELECT MAX(s.load_dts) FROM customer_sat AS s WHERE s.hub_id = hub.id)
    WHERE sat.hub_id IS NULL OR sat.name <> source.cust_name

Slide 25
Data Vault Loading Phases
Pictures: Dan Linstedt
Where possible!

Slide 26
Parallel Loading
Synchronization Points/Dependencies
Staging
Hubs
-Hub Satellites
Links
-Link Satellites
Link on Link
-Link on Link Satellites
Data Mart Feed
-Full/Partial Refresh
-Incremental loads

Slide 27
Geology of a Data Vault
(Batch) Loading
(Daily) Batch
Real Time/Transactional Loading
Micro Batch, Continuous Batch
Pictures: Dan Linstedt

Slide 28
Data Vault Considerations & Comparisons
Pros
Cons
Versus 3NF
Versus Dimensional Modelling

Slide 29
Data Vault Pros
Scalability
-Provides for Multi-Terabyte storage
-Delta Driven Information
-Loading
Auditability
-Easier Detection of Dead Data
-Generation of Audit Trails
-Quality Feedback loops
-Truth vs. Facts
Standardization
-Standard Implementation Architecture
-Restartable, Consistent Loading Patterns
-Generate ETL & Data model (be careful)
Flexible
-Rapid Build of Data Marts
-Handle combinations of different arrival speeds
-Flexible and incremental implementation & deployment (Agile BI)
Robustness
-Isolated Development
-Restartable Loading

Slide 30
Data Vault Cons
End-user Access & aggregation performance
-Not friendly for direct exploration and user access
-Not conducive to today's BI tools
-Not conducive to OLAP processing
Requires firm Architect
-Business Keys
-Truth vs. Facts
-DV Standards
Additional Layer
-Might require additional processing

Slide 31
But
End-user Access & aggregation performance
-Semantical layers & Helper tables/views
-Segregation of storage & access
Requires firm Architect
-Ignore at your own peril
-Business Keys
-Auditability
-Standardization
Additional Layer
-Adds flexibility & robustness

Slide 32
Data Vault vs. 3NF
Many to Many Linkages
Handle lots of information
Tightly integrated information
Highly structured
Reasonably conducive to near-real time loads
Relatively easy to extend
Time Driven PK issues
Parent-Child Complexities
Cascading Change Impacts
Difficult to load
Not conducive to BI tools
Not conducive to Drill-down
Difficult to architect for an Enterprise
Not conducive to Spiral/scope

Slide 33
Data Vault vs. Star Schema
Good for Multi-Dimensional Analysis
Subject Oriented Answers
Excellent for Aggregation Points
Less landing zones
Great for Some Historical Storage
Great for BI Tools
Minimize data landing zones
No Data mining
No Real-time loading
No ODS/Exploration
Expensive updates (type 1, 2 and 3)
Inflexible modelling of basic elements like history, structure and key distribution
Grain issues difficult to resolve
High impact changes
Latency Issues with late or early arriving facts
Complex loading and changing of history
Fails under very heavy loads
Difficult to automate...

Slide 34
Data Vault: Conclusion
Go!
-F: Flexible/Agile approach
-A: Auditable/Historic
-S: Scalable
-S: Standardized/Automatable/Repeatable
-R: Robust/Stable/Dependable
No Go?
-E: Experience/Familiarity
-N: No Direct Access
-E: Extra layer
-D: Data Modelling

Slide 35
University Research Publications Information System (METIS)
EXAMPLE
Picture: Paul Kidby

Slide 36
Transforming a data model to a Data Vault in 5 easy steps
1.Create a working and complete source/business model(s) (Technical-Functional Model)
2.Analyze and classify Keys & Columns
3.Classify Entities and Relationships
4.Combine information of steps 2 & 3
5.Transform to a DV

Slide 37
Slide 38
Slide 39
Slide 40
Slide 41
Slide 42
Slide 43

Slide 44
A Data Vault oriented Datawarehouse Architecture
Staging & CDC/Replication/Real Time/SOA feeds
Central EDW Data Vault Core
Business Rule Layer
Non Source oriented & DV structured
Business Rule results & calculations/aggregations
Virtualized Data Mart Layer
Star Schemas encoded in semantical layers (UDM/BISM/views/Universes)
None/Partial Physical star schemas
EXAMPLE

Slide 45
Advanced Concept: Business Data Vault
Data Vault structured layer
System Driven instead of Source Driven
Centralization
Performance
Picture: Dan Linstedt

Slide 46
Datawarehouse Architecture
Diagram labels: Staging (Optional), Central DWH, Data Vault, Business (Rule) Vault, (Temporal) 3NF views, Dimensional views, (Virtual) Data Marts, OLAP Data Marts, Universe, Reports, Voyager, BI Apps: SAP-BO

Slide 47
MS Fast Track 2.0/3.0
SQL Server 2008 R2 Enterprise Edition
Microsoft Fast Track 2.0/3.0 DWH Architecture with Data Vault
Virtual Data marts
Challenges
Benefits

Slide 48
Questions?
Anchor Oriented Modeling?
Metadata?
Change Data Capture?
Theory?
Fast Track?
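Bonus sketch: the "Virtualized Data Mart Layer" and "Dimensional views" of slides 44 and 46 can be illustrated with a plain SQL view over a hub and its satellite. This is only a minimal sketch under stated assumptions: it runs on Python's built-in sqlite3 instead of SQL Server, and the table layout, the column name customer_nr (standing in for customer#), the view name dim_customer and the sample rows are all hypothetical, not taken from the METIS example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical hub and satellite, shaped like the load queries of slides 20-24.
CREATE TABLE customer_hub (
    id INTEGER PRIMARY KEY,
    customer_nr TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL,
    record_src TEXT NOT NULL
);
CREATE TABLE customer_sat (
    hub_id INTEGER NOT NULL REFERENCES customer_hub(id),
    load_dts TEXT NOT NULL,
    name TEXT,
    record_src TEXT NOT NULL,
    PRIMARY KEY (hub_id, load_dts)
);

-- Invented sample data: customer C001 changed its name once.
INSERT INTO customer_hub VALUES (1, 'C001', '2011-01-01', 'crm');
INSERT INTO customer_hub VALUES (2, 'C002', '2011-01-01', 'crm');
INSERT INTO customer_sat VALUES (1, '2011-01-01', 'Acme', 'crm');
INSERT INTO customer_sat VALUES (1, '2011-02-01', 'Acme Corp', 'crm');
INSERT INTO customer_sat VALUES (2, '2011-01-01', 'Globex', 'crm');

-- Virtual (type 1 style) dimension: each hub key with its most recent satellite row.
CREATE VIEW dim_customer AS
SELECT hub.id AS customer_key, hub.customer_nr, sat.name
FROM customer_hub AS hub
JOIN customer_sat AS sat ON sat.hub_id = hub.id
WHERE sat.load_dts = (SELECT MAX(s2.load_dts)
                      FROM customer_sat AS s2
                      WHERE s2.hub_id = hub.id);
""")

# The data mart is queried like a table, but nothing is physically copied.
rows = conn.execute(
    "SELECT customer_nr, name FROM dim_customer ORDER BY customer_nr"
).fetchall()
print(rows)
```

Because the view is computed from the vault on every query, the full history stays in customer_sat while the data mart shows only the current state; a physical star schema would trade that flexibility for query speed.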

Slide 49
Information about Data Vault
Data Vault Book: www.learndatavault.com
Website creator: www.danlinsted.com

Slide 50
Additional Information
Data Vault Generators
-BIReady: www.biready.com
-Quipu: http://www.datawarehousemanagement.org
-Several others
Blogs & Resources
-www.prudenza.nl
-Facebook: datavaultdirectory
Linkedin groups
-Data Vault Discussions, Temporal Data Modeling
-Dutch Data Vault Subgroup

Slide 51
Contact
MSN/Email: [email protected]
LinkedIn: http://www.linkedin.com/in/dmunseen
Twitter: DM_Unseen
Blog: http://dm-unseen.blogspot.com/
LinkedIn Group: Temporal Data Modeling
Facebook: datavaultdirectory

Slide 52
Dutch Data Vault User group
Twitter: @NLDVGG
-HASHTAGS: #NLDVGG #DDVGG
Email: [email protected]@gmail.com
Website: Http://dvusergroup.com
Windows Live: http://datavault.groups.live.com/
Facebook:
-datavaultdirectory
-Dutch Data Vault User group: ([email protected])
Belgium
-Contact person: Yves Mulkes / BI-community.org
-Email: [email protected]@bi-community.org

Slide 53
Recap & Checklist
1.Understand selling points
-Check out (online) Data Vault Resources
-Training/Coaching/Seminars
2.Evaluate
-Understand architecture requirements
-Prototyping
-Consultancy
3.Implement
-Small increments

Slide 54
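
Bonus sketch: the hub and satellite load templates of slides 20 and 24, together with the "restartable loading" claim of slide 29, can be tried out end to end. What follows is a hedged, minimal sketch and not the presenter's implementation: it uses Python's built-in sqlite3 rather than SQL Server, named parameters :load_dts and :record_src in place of the @ variables, customer_nr in place of customer#, and a hypothetical helper run_load plus invented sample rows. The link template is left out to keep it short.

```python
import sqlite3

DDL = """
CREATE TABLE source_customer (customer_nr TEXT, cust_name TEXT);
CREATE TABLE customer_hub (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_nr TEXT NOT NULL UNIQUE,
    load_dts TEXT NOT NULL,
    record_src TEXT NOT NULL
);
CREATE TABLE customer_sat (
    hub_id INTEGER NOT NULL REFERENCES customer_hub(id),
    load_dts TEXT NOT NULL,
    name TEXT,
    record_src TEXT NOT NULL,
    PRIMARY KEY (hub_id, load_dts)
);
"""

# Hub template: insert only business keys the hub does not hold yet.
LOAD_HUB = """
INSERT INTO customer_hub (customer_nr, load_dts, record_src)
SELECT DISTINCT src.customer_nr, :load_dts, :record_src
FROM source_customer AS src
WHERE NOT EXISTS (SELECT 1 FROM customer_hub AS hub
                  WHERE hub.customer_nr = src.customer_nr)
"""

# Satellite template: insert a row when the hub has no satellite row yet,
# or when the name differs from the most recent one (IS NOT is NULL-safe in SQLite).
LOAD_SAT = """
INSERT INTO customer_sat (hub_id, load_dts, name, record_src)
SELECT hub.id, :load_dts, src.cust_name, :record_src
FROM source_customer AS src
JOIN customer_hub AS hub ON hub.customer_nr = src.customer_nr
LEFT JOIN customer_sat AS sat
       ON sat.hub_id = hub.id
      AND sat.load_dts = (SELECT MAX(s2.load_dts) FROM customer_sat AS s2
                          WHERE s2.hub_id = hub.id)
WHERE sat.hub_id IS NULL OR sat.name IS NOT src.cust_name
"""


def run_load(conn, staged_rows, load_dts, record_src):
    """Stage a full source extract, then load hubs before satellites."""
    params = {"load_dts": load_dts, "record_src": record_src}
    conn.execute("DELETE FROM source_customer")
    conn.executemany("INSERT INTO source_customer VALUES (?, ?)", staged_rows)
    conn.execute(LOAD_HUB, params)   # synchronization point: hubs first
    conn.execute(LOAD_SAT, params)   # satellites depend on the hub ids
    conn.commit()


def counts(conn):
    """Return (number of hub rows, number of satellite rows)."""
    hubs = conn.execute("SELECT COUNT(*) FROM customer_hub").fetchone()[0]
    sats = conn.execute("SELECT COUNT(*) FROM customer_sat").fetchone()[0]
    return hubs, sats


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

day1 = [("C001", "Acme"), ("C002", "Globex")]
day2 = [("C001", "Acme Corp"), ("C002", "Globex"), ("C003", "Initech")]

run_load(conn, day1, "2011-01-01", "crm")
after_day1 = counts(conn)
run_load(conn, day2, "2011-01-02", "crm")
after_day2 = counts(conn)
run_load(conn, day2, "2011-01-02", "crm")   # restart of the same batch
after_rerun = counts(conn)
print(after_day1, after_day2, after_rerun)
```

The second load adds one hub row (C003) and two satellite rows (the renamed C001 and the new C003), while the unchanged C002 adds nothing. Re-running the same batch inserts no rows at all, which is the insert-only, non-destructive behaviour the loading slides rely on.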