DESCRIPTION
Around 80% of the work to create a data warehouse/BI solution is spent on the ETL phase. Although building an ETL solution can be a challenge, you can break down the project into at least two separate processes for easier management. One process is strictly related to business modeling, and therefore cannot be replicated. But the other is made up of purely technical processes that are always the same, regardless of the business environment we operate in, and thus can be highly automated. In this session, we will look at well-known patterns for solving common problems and how they can be automated with the help of specific tools and techniques that use metadata to reduce development time and bugs. Using these engineering techniques, you will be able to adopt an Agile approach to your BI solution.
Automating Data Warehouse Patterns Through Metadata
Davide Mauri
dmauri@solidq.com
Davide Mauri
20 Years of experience on the SQL Server Platform
– Specialized in Data Solution Architecture, Database Design, Performance Tuning, Business Intelligence, Data Warehouse, Big Data & Analytics
Microsoft SQL Server MVP
President of UGISS (Italian SQL Server UG)
Mentor @ SolidQ
– Regular Speaker @ SQL Server events
– Projects, Consulting, Mentoring & Training
Find me here:
– Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
– Twitter: @mauridb
Building a DWH in 2013
Is still an (almost) manual process
A *lot* of repetitive low-value work
No (or very few) standard tools available
How it should be
Semi-automatic process
– “develop by intent”
Define the mapping logic from a semantic perspective
– Source to Dimensions / Measures
• (Metadata anyone?)
Design the model and let the tool build it for you
CREATE DIMENSION Customer
FROM SourceCustomerTable
MAP USING CustomerMetadata

ALTER DIMENSION Customer
ADD ATTRIBUTE LoyaltyLevel
AS TYPE 1

CREATE FACT Orders
FROM SourceOrdersTable
MAP USING OrdersMetadata

ALTER FACT Orders
ADD DIMENSION Customer
The perfect BI process & architecture
AGILE BI
Iterative!
DWH PROCESSES
Is automation possible?
Invest in Automation?
Faster development
– Reduce Costs
– Embrace Changes
Fewer bugs
Increase solution quality and make it consistent throughout the whole product
Automation Pre-Requisites
Split the process into two separate types of processes
– What can be automated
– What can NOT be automated
Create and impose a set of rules that defines
– How to solve common technical problems
– How to implement the identified solutions
No Monkey Work!
Let the people think and let the machines do the «monkey» work.
Design Pattern
“A general reusable solution to a commonly occurring problem within a given context”
Design Pattern
Generic ETL Pattern
– Partition Load
– Incremental/Differential Load
Generic BI Design Pattern
– Slowly Changing Dimension
• SCD1, SCD2, etc.
– Fact Table
• Transactional, Snapshot, Temporal Snapshot
Design Pattern
Specific SQL Server Patterns
– Change Data Capture
– Change Tracking
– Partition Load
– SSIS Parallelism
Engineering the DWH
“Software Engineering allows and requires the formalization of the software building and maintenance process.”
Sample Rules
• Always add a «last_update» column
• Always log Inserted/Updated/Deleted rows to the log.load_info table
• Use MD5 – binary(16) for checksums
• Use views to expose data
– Dimension & Fact views MUST use the same column names for lookup columns
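The MD5 rule above can be illustrated with a short Python sketch. It is not the speaker's implementation: the function name `row_checksum`, the `|` separator and the NULL marker are assumptions made for the example. The only point it makes is that an MD5 digest is exactly 16 bytes, which is why it fits a binary(16) column.

```python
import hashlib

def row_checksum(values, sep="|", null_marker="\x00"):
    """Return a 16-byte MD5 digest over the given column values.

    The separator and NULL marker are illustrative assumptions.
    """
    # Render None (NULL) with a marker so that NULL and '' hash differently.
    parts = [null_marker if v is None else str(v) for v in values]
    return hashlib.md5(sep.join(parts).encode("utf-8")).digest()

gold = row_checksum(["Alice", "Gold", None])
silver = row_checksum(["Alice", "Silver", None])
```

Comparing one 16-byte value per row is cheaper than comparing every attribute column, which is the usual motivation for a rule like this.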
Engineering the DWH
There are two intrinsic processes hidden in the development of a BI solution that must be allowed (or forced) to emerge.
Business Process
Data manipulation, transformation, enrichment & cleansing logic
Specific to every customer. Almost impossible to automate
Technical Process
Application of data extraction and loading techniques
Recurring (pattern) in any solution
Highly Automatable
High-Level Vision
[Diagram: OLTP, STG and DWH connected by ETL steps; the steps are labeled Technical Process, Business Process, Technical Process]
ETL Phases
«E» and «L» must be
– Simple, Easy and Straightforward
– Completely Automated
– Completely Reusable
«E» and «L» have ZERO value in a BI Solution
– Should be done in the most economical way
PATTERN
Well-known solutions to common problems
Source Full Load (E)
Source Incremental Load (E)
In this scenario, “ID” is an IDENTITY/SEQUENCE column, probably the PK.
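A minimal sketch of an incremental extract driven by an ever-increasing ID, written in Python against an in-memory SQLite database so that it runs anywhere. The table `source_orders`, its columns and the function name are hypothetical; the slide's original diagram is not available, so this only illustrates the general "load everything above the last loaded ID" idea.

```python
import sqlite3

def extract_incremental(conn, last_loaded_id):
    """Return source rows whose id is greater than the stored watermark."""
    cur = conn.execute(
        "SELECT id, name FROM source_orders WHERE id > ? ORDER BY id",
        (last_loaded_id,),
    )
    rows = cur.fetchall()
    # The new watermark is the highest id seen, or the old one if nothing arrived.
    new_watermark = rows[-1][0] if rows else last_loaded_id
    return rows, new_watermark

# Hypothetical source table with three rows; id 1 was loaded previously.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")]
)
rows, watermark = extract_incremental(conn, 1)
```

The watermark would be persisted between runs (for example in a load-log table) so the next execution starts where this one stopped.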
Source Differential Load/1 (E)
In this scenario the source table doesn’t offer any specific way to understand what has changed
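When the source gives no hint about changes, one common approach is to compare a per-row checksum against the checksums stored at the previous load. The Python sketch below assumes that approach; the slide's diagram is not available, so this may differ from what was shown. The names `checksum` and `diff_rows` and the dictionary layout are hypothetical.

```python
import hashlib

def checksum(row):
    """MD5 digest of a row tuple (repr-based encoding is an assumption)."""
    return hashlib.md5(repr(row).encode("utf-8")).digest()

def diff_rows(source, staged_checksums):
    """Classify keys as inserted, updated or deleted.

    source: dict mapping key -> tuple of column values
    staged_checksums: dict mapping key -> digest stored at the previous load
    """
    inserted = [k for k in source if k not in staged_checksums]
    updated = [
        k for k in source
        if k in staged_checksums and checksum(source[k]) != staged_checksums[k]
    ]
    deleted = [k for k in staged_checksums if k not in source]
    return inserted, updated, deleted

source = {1: ("Alice", "Gold"), 2: ("Bob", "Silver"), 4: ("Dan", "Bronze")}
staged = {
    1: checksum(("Alice", "Gold")),
    2: checksum(("Bob", "Bronze")),
    3: checksum(("Carl", "Gold")),
}
inserted, updated, deleted = diff_rows(source, staged)
```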
Source Differential Load/2 (E)
In this scenario the source table has a TimeStamp-like column
Source Differential Load (E)
• SQL Server 2012 features that can help with incremental/differential load
– Change Data Capture
• Natively supported in SSIS 2012
• http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/
– Change Tracking
• Underused feature in BI… not as rich as CDC but MUCH simpler and easier
SCD 1 & SCD 2 (L)
Start
1. Look up Dimension Id and MD5 Checksum from Business Key
2. Calculate MD5 Checksum of non-SCD-key columns
3. Dimension Id is Null?
– Yes: insert new members into DWH
– No: are the checksums different?
• Yes: store into temp table
4. Merge data from temp table to DWH
End
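The decision flow above can be sketched in Python as a pure classification step. This is an illustration under assumptions, not the SSIS data flow from the session: `md5_of`, `classify` and the shape of `dimension_lookup` are hypothetical names, and the in-memory dictionary stands in for the lookup against the dimension table.

```python
import hashlib

def md5_of(values):
    """MD5 digest over the non-key attribute values ('|' separator assumed)."""
    return hashlib.md5("|".join(map(str, values)).encode("utf-8")).digest()

def classify(source_rows, dimension_lookup):
    """Split source rows into new members and changed members.

    source_rows: list of (business_key, attrs_tuple)
    dimension_lookup: dict business_key -> (dimension_id, stored_checksum)
    """
    to_insert, to_merge = [], []
    for business_key, attrs in source_rows:
        hit = dimension_lookup.get(business_key)
        if hit is None:
            # Dimension Id is null: a new member, insert it directly.
            to_insert.append((business_key, attrs))
        elif hit[1] != md5_of(attrs):
            # Checksums differ: stage the row for the later merge.
            to_merge.append((hit[0], attrs))
        # Equal checksums: nothing to do for this row.
    return to_insert, to_merge

lookup = {"C1": (10, md5_of(("Alice", "Gold"))), "C2": (11, md5_of(("Bob", "Silver")))}
rows = [("C1", ("Alice", "Gold")), ("C2", ("Bob", "Gold")), ("C3", ("Carl", "Bronze"))]
new_members, changed_members = classify(rows, lookup)
```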
SCD 2 Special Note (L)
• Merge => UPDATE Interval + INSERT New Row
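A minimal Python sketch of what "UPDATE interval + INSERT new row" means for a type 2 change. The column names (`bk`, `valid_from`, `valid_to`) and the open-ended date marker are assumptions chosen for the example; a list of dictionaries stands in for the dimension table.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed marker for "currently valid"

def apply_scd2_change(dimension, business_key, new_attrs, change_date):
    """Close the current version of a member and append the new version."""
    for row in dimension:
        if row["bk"] == business_key and row["valid_to"] == OPEN_END:
            row["valid_to"] = change_date  # UPDATE: close the validity interval
    dimension.append(  # INSERT: the new current version
        {
            "bk": business_key,
            "attrs": new_attrs,
            "valid_from": change_date,
            "valid_to": OPEN_END,
        }
    )
    return dimension

dimension = [
    {"bk": "C1", "attrs": {"level": "Silver"},
     "valid_from": date(2012, 1, 1), "valid_to": OPEN_END}
]
apply_scd2_change(dimension, "C1", {"level": "Gold"}, date(2013, 6, 1))
```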
FACT TABLE LOAD (L)
Partition Load (EL)
Parallel Load (EL)
• Logically split the work into several steps
– E.g.: Load/Process one customer at a time
• Create a «queue» table that stores information for each step
– Step 1 -> Load Customer «A»
– Step 2 -> Load Customer «B»
• Create a Package that
1. Picks the first step not already picked up
2. Does the work
3. Goes back to step 1
• Call the Package «n» times simultaneously
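The queue-driven pattern can be sketched in Python with threads standing in for the «n» simultaneous package executions. This is only an analogy under assumptions: `StepQueue` replaces the «queue» table, the lock replaces whatever locking the database would provide, and all names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class StepQueue:
    """In-memory stand-in for the «queue» table."""

    def __init__(self, steps):
        self._pending = list(steps)
        self._lock = threading.Lock()

    def pick(self):
        # Atomically take the first step nobody has picked yet.
        with self._lock:
            return self._pending.pop(0) if self._pending else None

def worker(queue, do_work, done):
    while True:
        step = queue.pick()          # 1. pick the first step not yet picked
        if step is None:
            return                   # queue is empty: this worker stops
        done.append(do_work(step))   # 2. do the work, then loop back to 1

def run_parallel(steps, do_work, n):
    """Run n workers simultaneously against the same queue."""
    queue, done = StepQueue(steps), []
    with ThreadPoolExecutor(max_workers=n) as pool:
        for _ in range(n):
            pool.submit(worker, queue, do_work, done)
    return done

results = run_parallel(["A", "B", "C"], lambda c: f"loaded customer {c}", 2)
```

Because each worker pulls its next step only when it is free, the load balances itself even when some customers take much longer than others.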
Other SSIS Specific Patterns
• Range Lookup
– Not natively supported
– Matt Masson has the answer in his blog
• http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx
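To show what a range lookup does, independent of SSIS, here is a small Python sketch using binary search. It is not the technique from the linked blog post; the function name and the (low inclusive, high exclusive, value) layout are assumptions, and the ranges are assumed not to overlap.

```python
from bisect import bisect_right

def build_range_lookup(ranges):
    """Build a lookup function over non-overlapping ranges.

    ranges: list of (low_inclusive, high_exclusive, value)
    """
    ordered = sorted(ranges)
    lows = [r[0] for r in ordered]

    def lookup(x):
        # Find the last range whose lower bound is <= x.
        i = bisect_right(lows, x) - 1
        if i >= 0 and x < ordered[i][1]:
            return ordered[i][2]
        return None  # x falls outside every range

    return lookup

loyalty_level = build_range_lookup([(0, 100, "Bronze"), (100, 500, "Silver")])
```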
METADATA
A key ingredient in automation
Metadata
Provide context information
– Which columns are used to build/feed a Dimension?
– Which columns are Business Keys?
– Which table is the Fact Table?
– How are Fact and Dimension connected?
• Which columns are used?
How to manage Metadata?• Naming Convention
• Extended Properties
• Specific, Ad Hoc Database or Tables
• Other (XML, File, etc.)
Naming Convention
• The easiest and cheapest
– No additional (hidden) costs
– No need to be maintained
– Never out-of-sync
– No documentation needed
• Actually, it IS PART of the documentation
– Imposes a Standard
• Very limited in terms of flexibility and usage
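A sketch of how metadata can be read straight from names. The prefixes used here (`bk_`, `scd1_`, `scd2_`) are a hypothetical convention invented for the example, not one stated in the session; the point is only that the column name itself carries the role.

```python
def classify_columns(columns):
    """Derive each column's role from a (hypothetical) naming convention."""
    prefixes = {"bk_": "business_key", "scd1_": "scd1", "scd2_": "scd2"}
    roles = {"business_key": [], "scd1": [], "scd2": [], "other": []}
    for col in columns:
        # First matching prefix wins; unmatched columns are 'other'.
        role = next((r for p, r in prefixes.items() if col.startswith(p)), "other")
        roles[role].append(col)
    return roles

roles = classify_columns(
    ["bk_customer_code", "scd1_email", "scd2_loyalty_level", "last_update"]
)
```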
Extended Properties
Support most metadata needs
No additional software needed
Very verbose usage
– Development of a wrapper to make usage simpler is feasible and encouraged
Metadata Objects
Dedicated Ad-Hoc Database and Tables
As Flexible as you need
Maintenance Overhead to keep metadata in-sync with data
– Development of an automatic check procedure is needed
– DMVs can help a lot here
External Metadata Objects
Really expensive to keep them in-sync
– A tool is needed, otherwise too much manual work
Do not give any specific benefits with respect to Ad-Hoc Database/Tables
DEMO
AUTOMATION
Let’s make it possible!
Automation Scenarios
• Run-Time: «Auto-Configuring» Packages
– Really hard to customize packages
– SSIS limitations must be managed
• E.g.: Data Flow cannot be changed at runtime
• On-the-fly creation of packages may be needed
• Design-Time: Package Generators / Package Templates
– Easy to customize created packages
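The design-time idea (generate the artifact once from metadata, then customize it freely) can be sketched in Python. As a stand-in for an SSIS package, this example generates a plain T-SQL staging load; the metadata keys and the function name are hypothetical.

```python
def generate_stage_load(meta):
    """Generate an INSERT ... SELECT statement from a metadata dictionary."""
    cols = ", ".join(f"[{c}]" for c in meta["columns"])
    return (
        f"INSERT INTO [{meta['target_schema']}].[{meta['target_table']}] ({cols})\n"
        f"SELECT {cols}\n"
        f"FROM [{meta['source_schema']}].[{meta['source_table']}];"
    )

# Hypothetical metadata describing one source-to-staging mapping.
customer_meta = {
    "source_schema": "dbo",
    "source_table": "Customer",
    "target_schema": "stg",
    "target_table": "Customer",
    "columns": ["CustomerCode", "Name", "LoyaltyLevel"],
}
sql = generate_stage_load(customer_meta)
```

Because the output is ordinary text produced at design time, it can be reviewed, versioned and edited by hand, which is the advantage the slide attributes to generators over run-time auto-configuration.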
Automation Solutions
• Specific Tools/Frameworks
– BIML / MIST
• SQL Server Platform
– SQL, PowerShell, .NET
– SMO, AMO
Package Generators
Required Assemblies
Microsoft.SqlServer.ManagedDTS
Microsoft.SqlServer.DTSRuntimeWrap
Microsoft.SqlServer.DTSPipelineWrap
Path:
C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies
DEMO
Useful Resources
• «STOCK» Tasks:
– http://msdn.microsoft.com/en-us/library/ms135956.aspx
• How to set Task properties at runtime:
– http://technet.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
BIML – BI Markup Language
• Developed by Varigence
– http://www.varigence.com
– http://bimlscript.com/
– MIST: BIML Full-Featured IDE
• Free via BIDS Helper
– Support “limited” to SSIS package generation
– http://bidshelper.codeplex.com
THANK YOU!
• For attending this session and PASS SQLRally Nordic 2013, Stockholm