A step-by-step guide to migrating
Microsoft Data Quality Services to Azure
With HEDDA.IO, your data is
optimally prepared for all
your purposes at all times.
Contents

Microsoft Data Quality Services
HEDDA.IO
Advantages of migration to HEDDA.IO
DQS Knowledge Base
Exporting a DQS Knowledge Base
Installing HEDDA.IO
Importing a DQS Knowledge Base to HEDDA.IO
Creating an SSIS Package
Creating an SSIS-IR with HEDDA.IO
Publishing the SSIS Package
Executing the SSIS Package
Conclusion
Microsoft Data Quality Services

SQL Server Data Quality Services (DQS) is a knowledge-driven data quality product. DQS enables you
to build a knowledge base and use it to perform a variety of critical data quality tasks, including
correction, enrichment, standardization, and de-duplication of your data. The service enables you to
perform data cleansing by using cloud-based reference data services provided by reference data
providers. It also provides you with profiling that is integrated into its data-quality tasks, enabling
you to analyze the integrity of your data.
DQS consists of Data Quality Server and Data Quality Client, both of which are installed as part of
SQL Server 2017. Data Quality Server is a SQL Server instance feature that consists of three SQL
Server catalogs with data-quality functionality and storage. Data Quality Client is a SQL Server shared
feature that business users, information workers, and IT professionals can use to perform computer-
assisted data quality analyses and manage their data quality interactively. You can also perform data
quality processes by using the DQS Cleansing component in Integration Services and Master Data
Services (MDS) data quality functionality, both of which are based on DQS.
DQS was introduced with SQL Server 2012 and has been continuously maintained since the first
version. In 2012, oh22information services GmbH developed additional SSIS components as
open-source solutions and published them on CodePlex. These components made it possible to carry
out duplicate matching as well as the loading of domain values within the ETL process.
SSIS DQS Matching Transformation
https://archive.codeplex.com/?p=ssisdqsmatching
SSIS DQS Domain Value Import
https://archive.codeplex.com/?p=domainvalueimport
HEDDA.IO

oh22’s HEDDA.IO is a knowledge-driven data quality product built entirely in Microsoft Azure.
HEDDA.IO enables you to build a knowledge base and use it to perform a variety of critical data
quality tasks, including correction, enrichment and standardization of your data. HEDDA.IO enables
you to perform data cleansing by using cloud-based reference data services provided by reference
data providers or developed and provided by yourself.
HEDDA.IO consists of a Web API, a Web UI, an Excel Add-in and an SSIS component, all fully hosted in
Microsoft Azure, which can be integrated into your cloud and local processes. The HEDDA.IO Excel Add-in is
a local feature that covers the complete scope of DQS. You can also perform data quality processes
by using the HEDDA.IO cleansing component in Integration Services on premises and in the new
Azure Data Factory SSIS Integration Runtime.
Advantages of migration to HEDDA.IO

In recent months, more and more features of the classic on-premises Microsoft SQL Server product
have become available in Azure.
For example, Azure Data Factory SSIS Integration Runtime is a complete PaaS that is fully compatible
with on-premises SQL Server Integration Services. Also, with the current version of the SQL Server
Managed Instance, the Microsoft SQL Server Master Data Services can now be used in Azure.
Microsoft Data Quality Services, on the other hand, cannot be used in Azure except on an Azure VM
as IaaS. Many processes involved in loading a data warehouse or moving data between different
services are increasingly taking place in the cloud. At the same time, the need to carry out pure data quality
processes in the cloud is growing. This leaves a gap: necessary data quality
processes either cannot be integrated into cloud processes, or the corresponding processes cannot be
migrated to Azure.
HEDDA.IO is a DQ service developed entirely for the cloud. With the concepts of knowledge bases,
domains and composite domains, HEDDA.IO is compatible with Microsoft Data Quality Services.
Through an SSIS component that is fully aligned with the SSIS-IR, the validation and cleansing of data
within the ETL processes can be easily performed. Existing on-premises processes with Microsoft
Data Quality Services can thus be migrated quickly and easily to Azure using HEDDA.IO.
With the discontinuation of Azure Data Marketplace, Reference Data Services were removed from
Data Quality Services. This means that various checks and cleanups based on composite domains can
no longer be performed with Microsoft Data Quality Services. HEDDA.IO has an open API with which
Reference Data Services can be quickly and easily integrated into the product. Various services can
be deployed directly with HEDDA.IO from the Marketplace. Some of the services are available as
open source on GitHub so that end users can create their own services.
DQS Knowledge Base

Let's start with a DQS Knowledge Base and a domain in Microsoft Data Quality Services. Open the
SQL Server 2017 Data Quality Client. In the start screen, the Knowledge Base Management area on
the left displays the Knowledge Bases that you have already defined. Click on the Open Knowledge
Base button and select the Knowledge Base DQS Data in the following dialog. You may also find the
DQS Data knowledge base under Recent Knowledge Base.
DQS Data is a standard knowledge base that is automatically created on your system when Data
Quality Services are installed.
Once you have opened the Knowledge Base, you will see the various domains that belong to this
Knowledge Base in the Domain area on the left. In the right area you will see the corresponding
domain properties for the selected domain, in this example the domain Country/Region. In addition
to the name, a more detailed description of the domain and its data type is included here.
Via the Domain Values tab you can switch to the actual values within your domain. Here you see the
individual values for the selected domain as well as the assigned leading values in the column
"Correct To".
As you can see in the selection, the values of a domain can also use completely different character sets.
Exporting a DQS Knowledge Base

To export a Knowledge Base, click the second icon from the right in the Domain Management area
on the left. The icon looks like a small table with an arrow pointing to the right.
You can now select whether you want to export the full Knowledge Base or the selected domain.
Click on Export Knowledge Base to export the entire KB.
In the next dialog, "Export to Data File", select the storage location for the KB and enter a name for it.
Exporting the Knowledge Base may take a few seconds, depending on the amount of data you have
stored in it.
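If you prefer to script the export, DQS also ships the command-line tool DQSInstaller.exe, which can export all knowledge bases of a Data Quality Server at once via the -exportkbs switch. A minimal sketch; the instance path and the target file name below are examples and depend on your installation:

```shell
REM Export all knowledge bases, including domain values, to a single .dqsb file.
REM DQSInstaller.exe is located in the Binn folder of the DQS-enabled instance
REM (the MSSQL14.MSSQLSERVER path is an example for SQL Server 2017).
cd "C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\MSSQL\Binn"
DQSInstaller.exe -exportkbs C:\Temp\DQSKnowledgeBases.dqsb
```

Note that -exportkbs exports every knowledge base on the server; exporting a single KB or domain is done through the Data Quality Client as described above.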
After the export, you can import the Knowledge Base into HEDDA.IO.
Installing HEDDA.IO

HEDDA.IO is a full cloud service, so you do not need to install it on a local server. You can deploy the
application including all resources directly from the Azure Marketplace to your Azure Subscription.
Open the Azure Portal at https://portal.azure.com and click on the button "Create a resource". Then
enter the name HEDDA.IO in the search field and press Enter. In the search result list, select
HEDDA.IO.
In the next screen, select the software plan for HEDDA.IO and click Create.
Now follow the seven steps to deploy HEDDA.IO via the Azure Portal directly into your subscription.
You can find complete instructions on deploying HEDDA.IO at
https://hedda.io/documentation/
After deploying HEDDA.IO, you must install the HEDDA.IO Excel Add-in to manage knowledge bases
and domains. You can download the Excel add-in either from the home page of the HEDDA.IO
service you created or from the HEDDA.IO website. After you have installed the add-in, you will see a
new tab HEDDA.IO the next time you start Excel.
Importing a DQS Knowledge Base to HEDDA.IO

To import the previously created DQS Knowledge Base into HEDDA.IO, start Excel and log on to your
HEDDA.IO service via the HEDDA.IO tab. You can get the URL and the API key from the properties of
your service via the Azure Portal. Further information can be found in the HEDDA.IO documentation
at: https://hedda.io/documentation
Once you are connected to your HEDDA.IO instance, you can select Import from the Configuration
Group.
The Import dialog allows you to import HEDDA.IO exports as well as DQS exports. To import the
previously exported DQS file, select DQS Import from the HEDDA.IO Import dialog.
When you import a DQS file, it is uploaded from your local computer to the HEDDA.IO service.
The HEDDA.IO service then imports the corresponding DQS file and creates a new knowledge base
and the corresponding domains from this file. Depending on the file size, the import may take some time.
During the import, the Import button is disabled on each connected client.
After the backup has been imported from the Data Quality Server, you can access the Knowledge
Base and its corresponding domains.
All members, including synonyms and validation status, have been imported into HEDDA.IO. Both
DQS and HEDDA.IO can handle UTF-8 and can export or import corresponding members.
Creating an SSIS Package

To create an SSIS package that cleanses data using the previously imported DQS backup and the HEDDA.IO
service, you must first install the HEDDA.IO SSIS component. You can download the component
either from the portal of your previously created service or from the HEDDA.IO Web site at
https://hedda.io/download
After installing the component, open SQL Server Data Tools. Create a new SSIS project with a data flow using the HEDDA.IO Domain Cleansing component. Configure the component to use the previously imported DQS domain. Then write the data back to a database.
Creating an SSIS-IR with HEDDA.IO

To use an Azure Data Factory SSIS-IR with HEDDA.IO components, you must specify a custom setup
script when creating the SSIS-IR.
Follow the next steps to create an Azure Data Factory with HEDDA.IO components.
Create a new Azure Data Factory using the Azure Portal.
You can create the Azure Data Factory in the same resource group in which you created the
HEDDA.IO service. You can of course also use a new or an existing resource group. For performance
reasons however, make sure that the Azure Data Factory is created in the same region in which the
HEDDA.IO service was created.
After the Azure Data Factory has been successfully created, go to the Author & Monitor area by
clicking on the corresponding button in the Azure Portal.
Click "Configure SSIS Integration" to create a new SSIS-IR.
In the next dialog you have to configure several parameters to create a new SSIS Integration
Runtime. Since in this step-by-step guide we want to concentrate on the migration from DQS to
HEDDA.IO, we recommend Microsoft Docs for complete information on the individual parameters:
https://docs.microsoft.com/en-us/azure/data-factory/create-self-hosted-integration-runtime
When you create a new SSIS-IR, you must also create an SSIS Catalog. You can use the same SQL
server on which your HEDDA.IO database is hosted. To use the component with the Azure Data
Factory SSIS Integration Runtime, the component must be installed and configured on the
corresponding Azure nodes. The actual installation of the component takes place via a batch file,
which is automatically executed when the Azure SSIS-IR is started.
The batch file or the location of the batch file must be defined when creating the Azure SSIS IR.
The batch file runs the installer of the HEDDA.IO Domain Cleansing SSIS component via msiexec. To
do this, the MSI file must be stored in the same blob storage container as the batch file.
The content of the batch file, called “main.cmd”, is:
msiexec /i oh22is.HeddaDomainCleansing.msi /qn
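For troubleshooting a failed custom setup, the batch file can be extended with msiexec's verbose-logging switch; a sketch (the log file name install.log is our choice, not part of the original setup):

```shell
REM main.cmd - installs the HEDDA.IO Domain Cleansing component on each SSIS-IR node.
REM /qn runs the MSI silently; /L*V writes a verbose installation log for diagnostics.
msiexec /i oh22is.HeddaDomainCleansing.msi /qn /L*V install.log
```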
Create a shared access signature for the corresponding blob storage or container on which the batch
file was saved. You can easily create a Shared Access Signature (SAS) with the Azure Storage
Explorer. Note that an expiry date must be specified when creating a SAS. In general, a date in the near
future is more secure, but this may prevent the Azure SSIS-IR from accessing the blob storage when
it is restarted.
For this reason, choose a date that fits the life cycle of your Azure SSIS-IR and copy the URL of the
SAS before closing the window.
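As an alternative to the Azure Storage Explorer, both the upload of the setup files and the SAS creation can be scripted with the Azure CLI. A sketch, assuming a storage account named mystorageacct and a container named ssis-setup (both placeholder names):

```shell
# Upload the batch file and the MSI to the container (names are placeholders).
az storage blob upload --account-name mystorageacct --container-name ssis-setup \
    --name main.cmd --file main.cmd
az storage blob upload --account-name mystorageacct --container-name ssis-setup \
    --name oh22is.HeddaDomainCleansing.msi --file oh22is.HeddaDomainCleansing.msi

# Create a container-level SAS; choose an expiry that fits the life cycle of your SSIS-IR.
# Read, write and list permissions allow the IR to fetch the setup and write logs back.
az storage container generate-sas --account-name mystorageacct --name ssis-setup \
    --permissions rwl --expiry 2025-12-31 --output tsv
```

The generated SAS token is appended to the container URL to form the "Custom Setup Container SAS URI" used later in the Azure Portal.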
With the new Integration Runtime Setup, you can now easily install third party components via the
Azure portal. Open your already created Azure Data Factory V2 in the Azure Portal and click on
“Author & Monitor”. Then click on “Configure SSIS Integration Runtime” in the Azure Data Factory to
configure a new Azure SSIS-IR.
In the next windows, enter all parameters for the configuration of the Azure SSIS-IR.
In the last window, under “Custom Setup Container SAS URI”, enter the URL of the previously
created Shared Access Signature. The UI then automatically validates the specified SAS. Then click on
the “Validate VNet” button to check your defined VNet.
Once all the information has been entered and validated, the Azure SSIS-IR can be
created using the “Finish” button. When creating the Azure SSIS-IR, the setup is automatically loaded
from the given SAS and executed on the nodes.
For more information on installing HEDDA.IO and the SSIS components, please refer to the
documentation:
https://hedda.io/documentation
https://hedda.io/download
Publishing the SSIS Package

To publish your newly created SSIS package, click Deploy in the context menu of your SSIS solution.
Then follow the steps of the wizard to deploy your SSIS package to an Azure Data Factory SSIS-IR.
The deployment steps do not differ from those of on-premises deployment.
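The same deployment can also be run unattended with the ISDeploymentWizard tool that ships with SSDT and SSMS; a sketch in which the server name, project path and catalog folder are placeholders:

```shell
REM Silent deployment of the .ispac file to the SSIS catalog (all names are placeholders).
ISDeploymentWizard.exe /Silent ^
    /SourcePath:"C:\Projects\HeddaCleansing\bin\Development\HeddaCleansing.ispac" ^
    /DestinationServer:"myserver.database.windows.net" ^
    /DestinationPath:"/SSISDB/HeddaDemo/HeddaCleansing"
```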
As with the on-premises SQL Server Integration Services, the deployment process does not check
whether the necessary SSIS components are installed on the runtime. Make sure that you have done
this as described in the step "Creating an SSIS-IR with HEDDA.IO".
With the current version of SSDT, you can enable your SSIS project directly for
Azure. If you have activated this setting, you can add an Azure SSIS Integration Runtime as a
linked Azure service to your project. You can then debug the packages created in SSDT directly
on the Azure SSIS-IR.
Executing the SSIS Package

To run your SSIS package, create a new pipeline within your Azure Data Factory.
In the Activities toolbox, expand General, then drag & drop an Execute SSIS Package activity to the
pipeline designer surface. Define all necessary task settings so that you can run the previously
created package.
If you are not sure how to configure the task, the following article can help you:
https://docs.microsoft.com/en-us/azure/data-factory/how-to-invoke-ssis-package-ssis-activity
After you have configured the task correctly, you can publish the new pipeline. To do this, click the
Publish All button.
To execute the pipeline, click the Trigger button and then Trigger Now.
In the Pipeline Run window, select Finish.
You can check the execution of the SSIS packages via the integrated monitor within the Azure Data
Factory or via the SSIS reports of the SSISDB, which you can call via the SQL Server Management
Studio.
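Instead of triggering the run in the portal, the pipeline can also be started and monitored from the Azure CLI (this requires the datafactory CLI extension); a sketch with placeholder factory, resource group and pipeline names:

```shell
# Start a pipeline run and capture the run ID (all names are placeholders).
RUN_ID=$(az datafactory pipeline create-run --factory-name MyDataFactory \
    --resource-group MyResourceGroup --name RunHeddaCleansing \
    --query runId --output tsv)

# Check the status of the run (Queued, InProgress, Succeeded or Failed).
az datafactory pipeline-run show --factory-name MyDataFactory \
    --resource-group MyResourceGroup --run-id "$RUN_ID" --query status
```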
Conclusion

As you can see, you can easily export existing DQS Knowledge Bases and import them into
HEDDA.IO. You can easily migrate your SSIS and DQS workloads completely to Azure. All processes
and data remain in Azure and you can completely modernize your processes.
oh22information services GmbH
Otto-Hahn-Str. 22, 65520 Bad Camberg, Germany
Am Turm 34, 53721 Siegburg, Germany
info@oh22.is
https://www.oh22.is
https://www.hedda.io