DQ Analyzer 10 User Guide
Created on: 20 Dec 2016
Read the most up-to-date documentation for the latest Ataccama release online at docs.ataccama.com
Copyright © 2007–2016 by Ataccama. All rights reserved.
Table of Contents
1 Introduction (page 5)
2 DQA User Interface Overview (page 6)
3 Creating a Data Quality Profile (page 8)
  3.1 Step 1 Select Data to Profile (page 8)
  3.2 Step 2 Configure the Profile or Create a Profiling Plan (page 9)
4 Building Custom Profiling Plans (page 11)
  4.1 The Plan Editor (page 11)
  4.2 Creating a Plan File (page 12)
  4.3 Adding Steps to the Canvas (page 13)
  4.4 Connecting Steps (page 14)
  4.5 Editing Step Properties (page 15)
  4.6 Dealing with Errors (page 17)
  4.7 Adding Comments (page 18)
  4.8 Running the Plan (page 19)
    4.8.1 Viewing the Console Output (page 19)
    4.8.2 Viewing the Plan Execution Progress (page 20)
  4.9 Viewing Historical Run Results (page 21)
  4.10 Configuring the Profiling Step (page 21)
    4.10.1 General Category (page 22)
    4.10.2 Input Category (page 26)
    4.10.3 Configuring Business Domain Analysis (page 33)
  4.11 DQA Steps and Functions (page 34)
    4.11.1 Available Steps (page 34)
    4.11.2 Using Functions (page 35)
    4.11.3 Using Regular Expressions (page 36)
  4.12 Debugging Expressions and Steps (page 37)
    4.12.1 Debugging a Function (Expression) (page 37)
    4.12.2 Debugging a Regular Expression (page 39)
    4.12.3 Debugging a Step (page 40)
5 Reading a Data Quality Profile (page 43)
  5.1 The Profile Viewer (page 43)
  5.2 Inputs and Roll Ups (page 43)
  5.3 Understanding The Profile Statistics (page 44)
  5.4 Column Analyses (page 44)
    5.4.1 Basic Analyses (page 44)
    5.4.2 Frequency Analysis (page 46)
    5.4.3 Domain Analysis (page 47)
    5.4.4 Business Domain Analysis (page 47)
    5.4.5 Mask Analysis (page 48)
    5.4.6 Quantiles (page 49)
    5.4.7 Group Frequency Analysis (page 49)
  5.5 Advanced Analyses (page 50)
    5.5.1 Primary Keys (page 50)
    5.5.2 Foreign Keys (page 51)
    5.5.3 Business Rules (page 54)
    5.5.4 Dependency Analysis (page 55)
6 Cleaning up Drillthrough DB Tables (page 58)
7 Working with Data Files (page 59)
  7.1 Opening Data Files (page 59)
    7.1.1 Editing Metadata (page 59)
  7.2 Viewing Data Files (page 61)
    7.2.1 Sorting (page 61)
    7.2.2 Filtering (page 61)
    7.2.3 Data Coloring (page 62)
    7.2.4 Column Visibility (page 64)
    7.2.5 Color-coding Column Headers (page 64)
    7.2.6 Resizing Columns (page 65)
    7.2.7 Mark Groups (page 66)
    7.2.8 Saving Views (page 66)
  7.3 Editing Data Files (page 67)
  7.4 Using Data Files in a Plan (page 69)
8 Working with Databases (page 70)
  8.1 Installing Database Connectivity Drivers (page 70)
  8.2 Connecting to a Database (page 71)
  8.3 Viewing Table Properties (page 73)
  8.4 Executing SQL Queries (page 74)
9 Tips and Tricks (page 76)
  9.1 Cheat Sheets (page 76)
    9.1.1 Commonly Used Functions (page 76)
    9.1.2 Regular Expressions (page 80)
  9.2 IDE Shortcuts (page 82)
    9.2.1 Data Viewer (page 82)
    9.2.2 SQL Editor (page 83)
    9.2.3 Profile Viewer (page 83)
    9.2.4 Plan Editor (page 83)
    9.2.5 Progress Viewer (page 84)
    9.2.6 Common Shortcuts (page 84)
  9.3 Mouse Gestures in the Plan Editor (page 84)
    9.3.1 Basic Concepts (page 85)
    9.3.2 Individual Actions (page 85)
    9.3.3 Drop Step on Another Step (page 87)
    9.3.4 Delete Step from Connection (page 87)
    9.3.5 Drag Step out of Connection (page 88)
    9.3.6 Summary of Actions and Possible Key Modifiers (page 89)
1 Introduction
Ataccama DQ Analyzer is a free high-performance data profiling tool, capable of profiling millions of records fast. You can leverage a number of algorithms to process your data before profiling, and make use of the same powerful business rules engine as Ataccama's full data quality tool, the Data Quality Center (DQC). DQ Analyzer is designed to provide quick insight into your data and to perform advanced analyses iteratively, even on large data sets.
This guide is intended to provide an overview of the basic functionality of the product and describe how to perform common functions.
In this guide:
DQA User Interface Overview (see page 6)
Creating a Data Quality Profile (see page 8)
Building Custom Profiling Plans (see page 11)
Reading a Data Quality Profile (see page 43)
Cleaning up Drillthrough DB Tables (see page 58)
Working with Data Files (see page 59)
Working with Databases (see page 70)
Tips and Tricks (see page 76)
Other Help Resources:
Video Tutorials. These demonstrate how to perform common tasks in DQ Analyzer and are available on the Ataccama YouTube channel:
Creating a Profile
Understanding Profiling Results
Advanced Profiling
Email Analysis
Tutorial Files. DQ Analyzer includes a Tutorials project, which contains pre-built, runnable configurations. Video tutorials will walk you through these.
Help Files. For help on specific functions or features not covered in this guide or the resources mentioned above, extensive documentation is available in the product Help (available in Help > Help Contents in the toolbar).
2 DQA User Interface Overview
DQ Analyzer is built on the Eclipse framework, so it is similar in structure and behavior to many Integrated Development Environments (IDEs). The user interface consists of four main areas:
1. The Explorer Panel. The explorer panel offers quick access to the created DQ Projects, local and network disks from which it is possible to quickly locate data files for analysis, and connected databases with easy access to tables to be used for analysis. Many action shortcuts are available by right-clicking on objects in this panel, such as creating new plans and profiles or connecting to a database.
2. The Main Editing Area. The editing area is where all main activities are performed: plans are constructed and visualized, data files are shown and edited, and profiles are viewed. All of these can be opened in separate tabs next to each other.
3. The Status Panel. The Status Panel contains two tabs by default, plus the Console tab, which appears upon the first plan run:
   Properties. Shows properties of a selected file, plan, profile or database, e.g. path, size or modification date. Also shows problems with the constructed plan.
   Run Results. Stores the results of plans run in the past and shows details in case of errors, including the component that contains the problem.
   Console. Shows the progress of running a plan or generating a profile.
   Other tabs can be added in the Window > Show View > Other... menu and attached (by dragging) to any part of the UI.
4. Configuration Dialogs (second figure below). There are various dialogs used to configure the different components, similar to the one shown below. These dialogs are typically invoked by double-clicking Steps or via the context (right-click) menu.
The User Interface
Example of a Configuration Dialog
3 Creating a Data Quality Profile
A data quality profile is a summary of the state of data. It allows seeing duplicates and dependencies, evaluating business rules, observing patterns in data, and more.
After reading this chapter, you will be able to create a profile and configure it for additional analyses, such as masks, dependencies, and business rules.
3.1 Step 1 Select Data to Profile
1. Launch the Integrated Development Environment.
2. In the File Explorer view, select one or several files (CSV, TXT, XLS, XLSX) or database tables, right-click the selection, and select Create Profile.
Notes:
To profile a database table, you must have a database connection configured. To learn how to do this, see Connecting to a Database (see page 71).
In the case of text files, you may need to assign metadata to them to describe how they are formatted. For more information, see Editing Metadata (see page 59).
If you select several inputs, you will obtain one profile file for all of them, separated by input (one per table/file). See Reading a Data Quality Profile (see page 43) for more information on reading a profile with multiple inputs.
Selecting Several Database Tables to Profile
3.2 Step 2 Configure the Profile or Create a Profiling Plan
In the profile configuration dialog that opens, specify where to create the profile as well as which columns to profile. Drill-through functionality allows you to see the individual records that comprise the statistics that are generated (database connection required). Finally, there is the option to create a profile or a plan file.
If you select the Profile option and click Finish, the profile will be generated immediately using the specified settings and opened in the Profile Viewer. See Reading a Data Quality Profile (see page 43) to learn how to read the data contained in the profile.
If you select the Plan file option, a plan for generating a profile will be created. This option is useful if you wish to modify or filter the data before profiling it, or if you want to do some advanced configuration of the profiling algorithm (such as adding business rules or performing primary key analysis). See Configuring the Profiling Step (see page 21) in the next section for more information.
Profile Configuration
4 Building Custom Profiling Plans
Sometimes it is necessary to cleanse and standardize data to get more accurate profiling results. In this case, it is necessary to build a plan.
A plan file defines the logic and rules to be applied to the input data in order to produce the desired output. Plans are created by placing Steps onto a canvas and connecting them together. Steps are data processing algorithms that can be used to read, transform and analyze data, among other actions.
Examples of complex profiling plans are available in the Tutorials project in DQ Analyzer.
4.1 The Plan Editor
The image below shows a plan in the Plan Editor, which is launched every time you open or create a plan. The Plan Editor consists of the (1) Canvas, where the plan logic is defined (by connecting Steps together), and the (2) Palette, where the various steps and actions are listed.
The Plan Editor
4.2 Creating a Plan File
To create a new plan file:
1. Select New > Plan by right-clicking on a project or folder in the explorer panel. Alternatively, use the toolbar. Both options are shown below.
Creating a New Plan
2. Specify the Name of the plan and the place (Container) for storing it.
Creating a New Plan File
4.3 Adding Steps to the Canvas
To add steps to the canvas, do one of the following:
Drag needed steps from the Palette to the Canvas.
Dragging a Step to the Canvas
Press CTRL + I or Insert and select the step from a filterable list.
Insert Step Dialog
To learn how particular steps work, go through plans in the tutorial project: DQ Projects > Tutorials.
4.4 Connecting Steps
To connect two steps, drag from the out endpoint of one step to the in endpoint of another step.
Connecting Steps
4.5 Editing Step Properties
Most steps require (or benefit from) some configuration to perform their functions, which is done by accessing the step properties.
To edit step properties, double-click the step or right-click it and select Edit Properties:
Calling the Properties Dialog
In the image below, a regular expression is defined in the Regex Matching step:
Editing Regex Matching Properties
In the image below, the Column Assigner step is edited: a column is created and an expression is defined for it:
Editing Column Assigner Properties
Press Ctrl+Space to get a list of available functions and input columns. Press Ctrl+Space+Space to get a list of available input columns.
4.6 Dealing with Errors
Errors that may arise when constructing the plan are reported in the Properties tab of the Status Panel:
Errors in the Constructed Plan
Selecting an individual step will show only the warnings and errors for that step. Double-clicking on an error in the Properties panel will open the step properties dialog to the field which contains the error.
4.7 Adding Comments
To add a comment to your plan to explain its logic, select Comment from the Palette and click anywhere on the Canvas.
Adding a Comment
To edit the comment, double-click on it. The image below shows the comment editor, which allows changing the text color, background color, and border color, as well as the text itself:
Editing a Comment
4.8 Running the Plan
When the plan is built and contains no errors, it can be run. To do that, simply click the Run button as seen below:
Running a Plan
When the plan is finished running, a message will appear:
Plan Run Successful Message
4.8.1 Viewing the Console Output
During and after plan execution, you can see plan execution logs in the Console tab of the Status Panel:
Plan Run Progress Monitoring
4.8.2 Viewing the Plan Execution Progress
To open plan execution progress while the plan is being executed, click the Show Progress icon in the Status Panel.
Monitoring Plan Progress
A new tab that opens shows the total number of records passing to each step.
Plan Execution Progress in a Separate Tab
4.9 Viewing Historical Run Results
To view all plan executions in the current session, switch to the Run Results tab in the Status Panel and select a particular run. You will be able to review the errors that occurred.
Run Results
4.10 Configuring the Profiling Step
If you choose to create a profile using a plan, a Profiling step will appear on the canvas, connected to a data source.
Example of a Simple Profiling Plan
This simple plan contains two steps: one for reading the data and the other to generate the Profile.
This section deals with all possible configurations of the Profiling step.
Double-click the Profiling step to open the Profiling Step Editor. Here you can modify the existing configuration and add additional analyses to run.
There are two categories of settings in the Profiling Step Editor: General settings, which apply to all inputs, and Input-specific settings, which apply separately to each input.
4.10.1 General Category
Basic Tab
The Basic tab contains fields for specifying the step name, output file name and location and default locale for the generated files.
Basic Profile Settings
Masks Tab
The Masks tab contains the masks that have been defined and is where new masks can be created and edited. A mask is a way of showing the structure of the data rather than the content of the data. User-defined codes are used instead of the actual characters in the data to describe these patterns. For example, the mask "D" could be used to represent a digit, while "L" could be used to represent a letter. The codes and rules for the masks can be defined here. You can assign any of the created masks to the columns in your data in the Data tab for each of your inputs. See Data Tab below (see page 27).
Masks Tab of the Profiling Step Properties
Each mask configuration contains the following settings:
Characters. The type (or types) of characters that should be masked. The pre-defined types are: [:all:], [:letter:], [:lowercase:], [:uppercase:], [:digit:], and [:white:], where [:white:] means all whitespace characters (i.e., spaces). Multiple character types can be used in the same mask (e.g., [:digit:][:white:]). Characters that are not masked will be shown as they are in the data when Copy others is checked. For example, if only numbers are masked, the Mask results could show "# main street" in an address field, where # is a numerical mask for a sequence of digits.
Symbol. The symbol that is used to represent a single Character. "L" for letter, for example.
Repeated Symbol. This is the character that is used to represent a sequence of characters defined in the Characters field. "W" for a sequence of more than 2 letters, for example.
Repeated Threshold. The minimum number of Characters that represents a sequence. For example, the minimum number of letters in a row to be considered a word.
The Repeated Symbol and Repeated Threshold fields may be left blank to mask all characters individually, regardless of the sequence length.
To create a new mask, click the + button at the bottom of the list of masks. To delete a mask, click the - button. A mask may contain multiple masking rules (called character groups), as shown above. To add a new character group, click the Add button. The order of character groups might be important when creating a mask with several character groups, one of which is a subset of another, especially when the [:all:] character type is used in conjunction with other types. For example, if you want to mark all letters with "L," all digits with "D" and any other symbol with "$," the rule containing [:all:] should be placed last in the list. Use the To Top, Up, Down, and To Bottom buttons to change the placing of rules relative to each other.
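The effect of character-group order can be tried outside the tool. The following Python sketch is not DQ Analyzer's implementation: the function mask_value, the tuple layout of a character group, and the rule that the first matching group wins are all assumptions made for illustration, based on the ordering advice above. Dropping unmasked characters when Copy others is off is likewise an assumption.

```python
from itertools import groupby

# Hypothetical character groups: (predicate, symbol, repeated_symbol, repeated_threshold).
# repeated_symbol and repeated_threshold may be None to mask every character individually.
LETTER = (str.isalpha, "L", "W", 3)      # "W" for a run of 3 or more letters
DIGIT = (str.isdigit, "D", "#", 2)       # "#" for a run of 2 or more digits
ALL = (lambda ch: True, "$", None, None)  # stands in for [:all:]


def first_matching_group(ch, groups):
    """Return the index of the first group that accepts ch, or None (assumed rule)."""
    for index, (matches, _, _, _) in enumerate(groups):
        if matches(ch):
            return index
    return None


def mask_value(value, groups, copy_others=True):
    """Mask a string with ordered character groups; unmatched characters are copied or dropped."""
    out = []
    # Consecutive characters that fall into the same group form one run.
    for index, run in groupby(value, key=lambda ch: first_matching_group(ch, groups)):
        run = "".join(run)
        if index is None:
            out.append(run if copy_others else "")
            continue
        _, symbol, repeated_symbol, threshold = groups[index]
        if repeated_symbol and threshold and len(run) >= threshold:
            out.append(repeated_symbol)    # one symbol for the whole sequence
        else:
            out.append(symbol * len(run))  # one symbol per character
    return "".join(out)


print(mask_value("221 Main Street", [DIGIT]))     # only digits masked, the rest copied
print(mask_value("AB-12", [LETTER, DIGIT, ALL]))  # catch-all group placed last
print(mask_value("AB-12", [ALL, LETTER, DIGIT]))  # catch-all group placed first
```

With the catch-all group placed first, every character is claimed by it before the letter and digit groups are consulted, which is why the text recommends placing the [:all:] rule last.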
Drill-through Tab
The Drill-through tab specifies whether drill-through functionality should be used or not. Enabling drill-through allows you to inspect the individual records that comprise the generated statistics and other measures shown in the Profile Viewer. Because drill-through stores additional fields, it requires a database connection. To allow drill-through in the generated profile, click Enable Drill-through.
Drill-through Configuration
Parameters:
Database Name. Name of a connected database to use for storing drill-through tables. For a list of available databases, use the content-assist functionality (invoked by Ctrl + Space).
Table Prefix. Optional prefix which will be added to the drill-through tables created for this profile.
Display Limit. Optional parameter specifying how many rows of data will be written to the drill-through tables. This is a useful parameter for large inputs, since limiting its value to a representative number of rows will lead to considerable space savings. When Display Limit is specified, the application selects the records to be written into drill-through tables based on their "value added" to the general picture of the data or, in other words, their uniqueness. This way, given the Display Limit value, the application tries to create as representative a sample of the data as possible.
When drill-through is enabled, it can be used by right-clicking on many of the statistics shown in the Profile Viewer (e.g. Median value) and selecting Drill-through.
With time, drill-through tables can take up a lot of DB space. To learn how to clean up obsolete drill-through tables, see Cleaning up Drillthrough DB Tables (see page 58).
Foreign Keys Tab
Foreign Key analysis can be performed when there are two or more different inputs connected to the same Profiling step. Adding several inputs to the Profiling step is described in Input Category below (see page 26).
To create a new Foreign Key analysis:
1. Click the + button at the bottom of the list, similar to creating a new Mask analysis.
2. Enter the names of the inputs to analyze in the Left Input Name and Right Input Name fields. Content-assist (Ctrl + Space) can be used to list the inputs of the step.
3. Enter columns from each input to analyze in the Components section.
Foreign Key Analysis Configuration
Reading foreign key analysis results is discussed in Foreign Keys (see page 51).
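As a rough illustration of what comparing key columns across two inputs involves, the Python sketch below lists key values of a left input that have no counterpart in a right input. This is one common way to think about a foreign key check and is not necessarily what the Profiling step computes; the function orphan_keys, the sample inputs, and the column names are hypothetical.

```python
def orphan_keys(left_rows, right_rows, left_cols, right_cols):
    """Return left key tuples that have no matching key tuple in the right input."""
    right_keys = {tuple(row[c] for c in right_cols) for row in right_rows}
    left_keys = {tuple(row[c] for c in left_cols) for row in left_rows}
    return sorted(left_keys - right_keys)


# Hypothetical inputs: orders reference customers through customer_id.
orders = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 7}]
customers = [{"id": 1}, {"id": 2}, {"id": 3}]

# The column lists play the role of the components entered for each input.
print(orphan_keys(orders, customers, ["customer_id"], ["id"]))
```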
4.10.2 Input Category
The Input category contains settings that are specific to each Profiling step input. To add a new input to the Profiling step, click the button for adding an input above the category list. This will add a new connection point to the step icon in the Plan Editor so that a new input (e.g. Text File Reader or JDBC Reader step) can be connected. After this, all additional inputs should be configured at least in terms of data to profile (Data tab) and optionally in other tabs, as described in the sections below.
The adjacent button can be used to delete an input.
Multiple Profiling Inputs
Data Tab
The Data tab shows all the data that will be profiled. It also contains the individual settings for each column to be profiled. By default, the settings defined in the Create Profile wizard apply to all columns, but this tab allows configuring each column separately.
Data Tab
The Data tab has the following configuration possibilities:
Expression - the name of the column or expression to profile
Type - the data type defined in the originating input step (e.g. Text File Reader); for reference only. To change the data type of a column, use the Metadata Editor or input step.
Mask - specifies which mask to use for the analysis (choose from those defined in the Masks tab under the General category)
Analyze Domain - specifies whether the Profiling step should output the domain for the given column
Standard Stats - specifies whether the Profiling step should output the standard statistics, like Minimum value, Maximum value, Extremes, etc. (the statistics under the Basics tab of profiling results)
Frequency - specifies settings for frequency analysis of values
  None - do not perform frequency analysis
  Discrete - perform frequency analysis without any range limitations
  Range - perform frequency analysis on the first n characters; the value of n is specified in this field
Group Size - specifies settings for group size analysis (the analysis of the distribution of value duplication intensity: how many values have one, two, three, and so forth duplicates). A group size of 1 represents unique values.
  None - do not perform group size analysis
  Discrete - perform group size analysis without any range or interval limitations
  Range
    From - the beginning value for the first bucket
    To - the beginning value for the last bucket: [<to_value>; infinity)
    Steps - the number of buckets
  Interval - sets bucket width
Locale - specifies the locale for the given column; if left empty, the default locale is used
Comment - free text describing the data
When connected to an incoming step (such as Text File Reader), the Fill Columns… button can be used to automatically add columns that are connected to the input, rather than manually typing the incoming column names. The Type column uses the types defined in the originating input step (e.g. Text File Reader) and is for reference only. To change the data type of a column, use the Metadata Editor or input step.
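The Frequency and Group Size options can be pictured with a few lines of Python. This is a minimal sketch under stated assumptions, not the Profiling step's implementation: the sample values and the helper names frequency and group_sizes are hypothetical, and the bucketing options (From, To, Steps, Interval) are left out.

```python
from collections import Counter

# Hypothetical column values.
values = ["Prague", "Praha", "Boston", "Prague", "Berlin", "Boston", "Prague"]


def frequency(values, n=None):
    """Count values; if n is given, count only their first n characters (Range)."""
    return Counter(v if n is None else v[:n] for v in values)


def group_sizes(values):
    """Map each group size to the number of distinct values that occur that many times."""
    return Counter(frequency(values).values())


print(frequency(values))       # Discrete: every distinct value is counted
print(frequency(values, n=3))  # Range with n = 3: counts per 3-character prefix
print(group_sizes(values))     # group size 1 corresponds to unique values
```

In this sample, two values occur once, one value occurs twice, and one occurs three times, so the group size distribution has entries for sizes 1, 2, and 3.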
Dependencies Tab
The Dependencies tab can be used to define an analysis to test the dependency of the fields in one column on the fields in other columns, such as whether birth number is related in some way to birth date.
To add a new dependency, click +. To delete a dependency, click -.
Dependency Configuration Dialog
A dependency configuration consists of four parts:
Name - identifier of the dependency
Determinant - one or several expressions (or columns) that define the key
Dependants - one or several expressions (or columns) for which the dependency is tested (these are not combined)
Threshold - the most frequent dependent value (mode) has to occur in at least [threshold] percent of records with a given determinant to be considered dependent
The table below demonstrates several dependency scenarios.
# | Determinant | Dependant | Notes
1 | 123456 | John Smith | The whole group of records with Determinant value '123456' is considered dependent only if Threshold is set to 80% or lower (only 8 of 10 records with this Determinant value have the same Dependant value)
2 | 123456 | John Smith |
3 | 123456 | John Smith |
4 | 123456 | John Smith |
5 | 123456 | John Smith |
6 | 123456 | John Smith |
7 | 123456 | John Smith |
8 | 123456 | John Smith |
9 | 123456 | Jill Barber |
10 | 123456 | Steve Goldberg |
11 | 111111 | Laura Mitchel | Violates the dependency if Threshold is set to more than 50% (only 50% of records with the same Determinant value have the same, most frequent, Dependant value)
12 | 111111 | Miriam Green |
13 | (null) | any value | Null Determinant
14 | 222222 | Jim White | Unique Determinant value
15 | 333333 | Jim Bush | Unique Determinant value
16 | 444444 | Linda Smith | Dependent
17 | 444444 | Linda Smith |
18 | 555555 | Berry Perry | Dependent
19 | 555555 | Berry Perry |
20 | 555555 | Berry Perry |
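The threshold rule illustrated in the table can be sketched in a few lines of Python. This is only an illustration of the rule as described in this section, not the product's implementation; the function name dependency_groups and the (determinant, dependant) record layout are assumptions made for the example.

```python
from collections import Counter, defaultdict

def dependency_groups(records, threshold):
    """Classify each determinant group as 'dependent' or 'violation'.

    records   -- iterable of (determinant, dependant) pairs
    threshold -- minimum share, in percent, of the most frequent dependant value
    """
    groups = defaultdict(list)
    for determinant, dependant in records:
        if determinant is None:
            continue  # records with a null determinant are reported separately
        groups[determinant].append(dependant)

    result = {}
    for determinant, dependants in groups.items():
        # Share of the most frequent dependant value (the mode) within the group
        top_count = Counter(dependants).most_common(1)[0][1]
        share = 100.0 * top_count / len(dependants)
        result[determinant] = "dependent" if share >= threshold else "violation"
    return result

records = (
    [("123456", "John Smith")] * 8
    + [("123456", "Jill Barber"), ("123456", "Steve Goldberg")]
    + [("111111", "Laura Mitchel"), ("111111", "Miriam Green")]
    + [("444444", "Linda Smith")] * 2
)
print(dependency_groups(records, threshold=80))
```

With a threshold of 80, the '123456' group counts as dependent; any higher threshold turns it into a violation, which is the behavior the Notes column describes.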
For more details on dependency analysis, go to the product Help (Help > Help Contents) and search for the Profiling step.
Reading dependency analysis results is discussed in Dependency Analysis (see page 55).
Roll Ups Tab
A Roll Up is a way to look at a specific subset of the profiled data. Entering a column name in the Roll Ups Expression field will create a separate profile analysis for each value of that column. For example, if the data contains a column of genders, creating a roll up of this column will allow viewing profile results for each gender value in that column, including empty rows, which might reveal new patterns in the data. When a roll up is defined, the Inputs and Roll Ups panel will be shown in the Profile Viewer.
Roll Ups Tab
Reading results per roll up is discussed in Inputs and Roll Ups (see page 43).
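Conceptually, a roll up splits the input into one subset per value of the chosen column, and each subset is then profiled on its own. The Python sketch below shows only that grouping idea; the function name roll_up and the sample rows are hypothetical and are not taken from the product.

```python
from collections import defaultdict

def roll_up(rows, column):
    """Split rows into one subset per distinct value of `column`."""
    subsets = defaultdict(list)
    for row in rows:
        # Empty or missing values form their own subset
        subsets[row.get(column) or ""].append(row)
    return dict(subsets)

rows = [
    {"name": "John Smith", "gender": "M"},
    {"name": "Rebecca Davis", "gender": "F"},
    {"name": "Paul Adams", "gender": "M"},
    {"name": "Jim White", "gender": ""},
]
for value, subset in roll_up(rows, "gender").items():
    print(repr(value), len(subset))
```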
Business Rules Tab
A business rule is a Boolean expression that is evaluated and its results are presented in the Profile Viewer. Some examples are shown below.
Business Rules Tab
See the Expressions Help page for a description of available expressions and their usage. Go to Help > Help Contents and search for Expressions.
Reading business rule results is discussed in Business Rules (see page 54).
Primary Keys Tab
To analyze the uniqueness of a particular column and determine whether it is a primary key, add a new primary key analysis and enter the name of the column to analyze in the Expression field of the Components section. Several columns can be analyzed in conjunction, as seen in the figure below:
Configuring Primary Key Analysis
Reading primary key results is discussed in Primary Keys (see page 50).
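The idea behind the uniqueness test can be sketched as follows: the values of the selected columns are combined into one key per record, and the key is a primary key candidate only if no combination occurs twice. This is a simplified illustration in Python; the function name is_candidate_key and the sample columns are hypothetical, and the product's analysis reports more detail than a single true/false answer.

```python
def is_candidate_key(rows, columns):
    """Return True if the combined values of `columns` are unique across rows."""
    keys = [tuple(row.get(c) for c in columns) for row in rows]
    return len(set(keys)) == len(keys)

rows = [
    {"first": "John", "last": "Smith"},
    {"first": "John", "last": "Adams"},
    {"first": "Paul", "last": "Adams"},
]
print(is_candidate_key(rows, ["first"]))          # "John" occurs twice
print(is_candidate_key(rows, ["first", "last"]))  # every combination is different
```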
4.10.3 Configuring Business Domain Analysis
Business domain analysis tries to determine the kind of data stored in the analyzed column in the business context, e.g., name, address, postal code, SWIFT code, etc. It is possible to configure two thresholds that determine how many domains will be displayed as "matched" in the resulting profile.
The settings are:
Name | Default | Description
Strict Threshold | 25 | At least one domain has to match the percentage of records set here to be displayed in the results.
Loose Threshold | 20 | If at least one domain satisfies the Strict Threshold setting, other domains are shown in the results if the percentage of records that match them is at least the number set here.
To change the default settings:
1. Switch the layout of the Profiling step.
2. Select the Business Domains node.
3. Change settings and save changes.
See Column Analyses (see page 44) for information on business domain analysis results.
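The interplay of the two thresholds can be expressed as a small Python sketch. It follows the description of the settings given in this section and is not the product's code; the function name matched_domains and all domain names other than Gender are made up for the example.

```python
def matched_domains(match_percentages, strict=25, loose=20):
    """Return the domains to display, given the percent of records matching each."""
    # Nothing is shown unless at least one domain reaches the strict threshold
    if not any(p >= strict for p in match_percentages.values()):
        return []
    # Once that holds, every domain reaching the loose threshold is listed
    shown = [d for d, p in match_percentages.items() if p >= loose]
    return sorted(shown, key=lambda d: match_percentages[d], reverse=True)

print(matched_domains({"Gender": 79.21, "Name": 12.0}))
print(matched_domains({"A": 24.0, "B": 22.0}))
print(matched_domains({"A": 30.0, "B": 21.0, "C": 5.0}))
```

In the second call no domain reaches 25%, so nothing is displayed even though both domains exceed the loose threshold.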
4.11 DQA Steps and Functions
DQ Analyzer offers various steps and functions inside them for constructing plan files. The algorithms and logic used for creating a plan file will vary from project to project; an introduction to steps and functions is provided below.
4.11.1 Available Steps
Steps can perform many types of functions, such as transforming data, filtering and categorizing data, and reading data. Below is an overview of the steps available with DQ Analyzer.
A complete description of steps and their usage can be found in Product Help (Help > Help Contents in the main menu) under Steps.
Step Name | Step Description
Column Assigner | Assigns the result of an expression to a column.
Condition | Directs the data flow (true -> right, false -> left).
Text File Reader | Reads data from a text file.
Fixed Width File Reader | Reads data from a fixed-width text file.
JDBC Reader | Reads data from a JDBC (database) data source.
Excel File Reader | Reads files created by Microsoft Excel.
Profiling | Comprehensive analysis written to a .profile file.
Regex Matching | Parses the input string based on the capturing groups of a regular expression.
Trash | Discards data flow.
Union Same | Like an SQL table union, but applies only if the flows are exactly the same.
Nearly 180 steps are available with the purchase of Data Quality Center or other commercial Ataccama products.
4.11.2 Using Functions
There are many functions available that can be used inside steps. Some of the common functions are listed below.
Function | Description | Return Value(s)
matches | Full match: the whole input string is tested against a regular expression | True/false
find | Partial match: the regular expression is searched for within the input string | True/false
substr | Gets a substring of the input string; positions start at zero | String
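The difference between a full and a partial match, and zero-based substring positions, can be tried out with Python equivalents. The helpers below are Python analogs written for illustration only; their argument order and the optional length parameter are assumptions and do not document the signatures of the product's functions (see Product Help for those).

```python
import re

def matches(pattern, text):
    """Full match: the whole string must fit the pattern."""
    return re.fullmatch(pattern, text) is not None

def find(pattern, text):
    """Partial match: the pattern may occur anywhere in the string."""
    return re.search(pattern, text) is not None

def substr(text, start, length=None):
    """Substring beginning at the zero-based index `start`."""
    return text[start:] if length is None else text[start:start + length]

print(matches(r"\d{3}", "123"))   # the whole string is three digits
print(matches(r"\d{3}", "a123"))  # the leading letter breaks a full match
print(find(r"\d{3}", "a123"))     # three digits occur somewhere in the string
print(substr("profile", 3))       # characters from index 3 onward
```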
4.11.3 Using Regular Expressions
Ataccama products support regular expressions for pattern matching. Some of the basic regular expressions are listed below.
Regular Expression | Matches
\d | Digit
[A-Z] | Uppercase letter
[a-z] | Lowercase letter
\s | Whitespace
. (dot) | Any character
? | Once or none
+ | Once or more
* | Zero or more times
{2,6} | At least 2 times, maximum 6 times
^ | Beginning of string
$ | End of string
Two regular expressions and their uses are shown below.
Regular expression: [A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]
Sample usage: Canadian postal code (e.g., A3A 9S9)
Regular expression: (\d{3} \d{2} \d{4}|\d{9}|\d{3}-\d{2}-\d{4})
Sample usage: US Social Security Number (123 45 6789 or 123456789 or 123-45-6789)
A full description of all functions and regular expressions is available in Product Help (Help > Help Contents) under Expressions.
Read how to debug functions and regular expressions in Debugging Expressions and Steps (see page 37).
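The two sample patterns can also be checked outside the product, for instance with Python's re module, whose syntax agrees with the basic constructs listed in this section. The sketch below applies each pattern as a full match; the constant names and the sample strings are chosen for the example only.

```python
import re

CANADIAN_POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]")
US_SSN = re.compile(r"(\d{3} \d{2} \d{4}|\d{9}|\d{3}-\d{2}-\d{4})")

samples = [
    "A3A 9S9", "A3A9S9", "a3a 9s9",
    "123 45 6789", "123456789", "123-45-6789", "123-456789",
]
for text in samples:
    is_postal = CANADIAN_POSTAL_CODE.fullmatch(text) is not None
    is_ssn = US_SSN.fullmatch(text) is not None
    print(f"{text!r}: postal code={is_postal}, SSN={is_ssn}")
```

The lowercase sample fails the postal code pattern because [A-Z] covers uppercase letters only, and "123-456789" fails the SSN pattern because it fits none of the three alternatives.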
4.12 Debugging Expressions and Steps
When working with functions and regular expressions, it is convenient to check their results on a small sample of data without the need to run the plan. This can be done by debugging an expression or a step as a whole.
Regular expressions or functions can be implemented and debugged in different parts of different steps, e.g., the Profiling step (in the Data or Business Rules tab), the Column Assigner step, or the Regex Matching step.
4.12.1 Debugging a Function (Expression)
Debugging a function is demonstrated here using the Business Rules tab of the Profiling step.
Accessing the Expression Debugger
To debug the function (expression), click Debug..., after which you will see the following dialog:
Expression Debugger Dialog
You can edit the expression at the top, enter the data next to the column name, and see the result below after clicking Evaluate. The expression tree to the right of the Data sources section shows the evaluation/transformation steps the value goes through before reaching the final transformation. It also shows the result of each transformation in brackets.
4.12.2 Debugging a Regular Expression
Regular expressions can be debugged in the Regex Matching step.
To debug a regular expression, select it in the navigation panel on the left of the Properties of Regex Matching dialog and click Debug... next to the Pattern field.
Locating the Debug Button
You will see the following screen:
Regular Expression Debugging
Enter the text into the Input text field and press Evaluate. You will see a few substitution options and their results.
In case you spot an error in your regular expression, you can edit it in the same dialog and repeat the evaluation procedure.
4.12.3 Debugging a Step
If one step contains several transformations (regular expressions or functions), it is useful to debug the step as a whole to see how the output column values are affected.
To do that, right-click on the step and select Debug. You will see the following window:
Step Debugger
The Step Debugger window is divided into two parts, in and out, each listing the same columns, even if some of them are not present in the source data. You can choose which columns you want to see in both the input and output parts by clicking the Filter Columns button (a small table icon, as pointed out in the figure above). This feature is shown below together with the Filter feature, which is useful for narrowing down the selection when dealing with many columns.
Selecting the Columns to Display
The difference from debugging a single expression is that you cannot edit any expressions in this dialog, but you can test your expressions on real data. To do this, right-click anywhere in the in part and select Import Data.... Then locate the file with input data and click OK. The table will be populated with the first 500 rows of data from the selected file.
The Import Data functionality is not available for database tables.
The imported data should have the same structure (number of columns and column names) as configured in the step, so the import is most likely to succeed when performed on the actual data that will be used in the step.
5 Reading a Data Quality Profile
After you have run a profiling plan or created a one-time profile (see Creating a Data Quality Profile (see page 8)), a file with a .profile extension will appear in the File Explorer (it will be opened immediately in the latter case). Double-clicking such a file will open the Profile Viewer.
5.1 The Profile Viewer
The Profile Viewer contains several tabs and windows, which are described in further sections. The data can be exported to XML or HTML format by using the button in the top right part of the Viewer.
Profile Reader
5.2 Inputs and Roll Ups
The Profiling Step may take any number of inputs, which are shown in this panel (if there is more than one input). Additionally, each input may have any number of "roll ups" assigned to it - ways of grouping the data by specific parameters. For example, roll ups could be used to view data profiles by gender, department or country.
See Input Category (see page 26) and Roll Ups Tab (see page 31) for information on adding multiple inputs and configuring roll ups.
For information on how to create a profile, see Creating a Data Quality Profile (see page 8).
5.3 Understanding The Profile Statistics
Depending on the configuration upon its creation, the profile will contain a set of statistics, which can be divided into two categories:
Column Analyses (see page 44) - learn how to read Basic, Frequency, Domain, Mask, Quantiles, and Group Frequency analyses
Advanced Analyses (see page 50) - learn how to read Primary Keys, Foreign Keys, Business Rules, and Dependency analyses
5.4 Column Analyses
The Column analyses tab presents statistical analyses and pattern information about the columns that have been profiled. Each column in the input data is listed as a row in the table, which presents information such as data type, value counts, and minimum/maximum values.
5.4.1 Basic Analyses
The Basic tab provides simple statistics about the data that has been profiled and shows a chart of duplicate and distinct data as a percentage of the whole.
Example of Basic Analyses Results
Interpreting Counts
The Counts table lists the following values:
Null: all records that are empty or have "Null" as their value.
Non-null: all records that are not empty or null (Duplicate + Distinct)
Duplicate: the number of values that are the same as other values in the list
Distinct: the number of non-null values that are different from each other (Non-unique + Unique)
Non-unique: the number of values that have at least one duplicate in the list
Unique: the number of values that have no duplicates
To illustrate the meaning of these values, take the following data as an example.
Record No. | Value
1 | John Smith
2 | John Smith
3 | Rebecca Davis
4 | Paul Adams
5 | (empty)
The Counts table for this data would be as follows:
Type | Count | Records | Explanation
Null | 1 | Record 5 | The last record is empty
Non-null | 4 | Records 1-4 | The first 4 records contain data
Duplicate | 1 | Record 2 | There is one duplicate of the John Smith record (Record 1)
Distinct | 3 | Records 1, 3, 4 | These records contain distinct values
Non-unique | 1 | Record 1 | John Smith has a duplicate record, so it is not unique
Unique | 2 | Records 3 and 4 | Rebecca Davis and Paul Adams appear only once in the list; they have no duplicates
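The relationships between these counts (Non-null = Duplicate + Distinct, Distinct = Non-unique + Unique) can be verified with a short Python sketch that recomputes the table for the five example records. The function name basic_counts is hypothetical, and treating both None and the empty string as null is an assumption made for the example.

```python
from collections import Counter

def basic_counts(values):
    """Compute the Counts figures for the values of a single column."""
    non_null = [v for v in values if v not in (None, "")]
    frequency = Counter(non_null)  # occurrences of each distinct value
    distinct = len(frequency)
    unique = sum(1 for n in frequency.values() if n == 1)
    return {
        "Null": len(values) - len(non_null),
        "Non-null": len(non_null),
        "Duplicate": len(non_null) - distinct,
        "Distinct": distinct,
        "Non-unique": distinct - unique,
        "Unique": unique,
    }

values = ["John Smith", "John Smith", "Rebecca Davis", "Paul Adams", ""]
print(basic_counts(values))
```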
5.4.2 Frequency Analysis
The Frequency Analysis tab shows the number of times each value in the data occurs (shown as both an absolute count and as a percentage of the whole).
Example of Frequency Analysis Results
5.4.3 Domain Analysis
This is an analysis to determine the likely type of the data in each column (whether the data is text, a number or a date, for example). The probable types are listed, along with exceptions (such as a text string found in a list of dates).
Example of Domain Analysis Results
5.4.4 Business Domain Analysis
Business domain analysis tries to determine the kind of data stored in the analyzed column in the business context, e.g., name, address, postal code, SWIFT code, etc.
The result shows all matched domains and their probabilities, i.e., the percentage of records that match each given domain. By default, at least 25% of records have to match one domain for that domain to appear in the list of matched domains. For any other domain to appear in the list, it needs to match at least 20% of records. These settings can be changed in the Profiling step. See Configuring the Profiling Step (see page 21).
In the example below, 79.21% of records match the Gender domain for the selected column. No other domains are shown since the percentage of records that they match is lower than the loose threshold.
Example of Business Domain Analysis Results
5.4.5 Mask Analysis
The Mask Analysis tab shows the syntactic patterns of the data, i.e. the structure of the data rather than the content of the data. Codes ("masks") are used to describe these patterns. For example, the code "W" is used by default to represent a word (the number of letters required to make a word can be defined in the Profiling Step properties), while "L" is used to represent a letter.
This type of analysis can be useful when, for example, looking at a column of names, where one or two words are common, but single letters and numbers are not. Finding unexpected patterns in the data can provide information about the overall level of data quality.
Example of Mask Analysis Results
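To make the idea of a mask concrete, the following Python sketch derives a simple mask from a value. Only the codes "W" and "L" come from this guide; the code "D" for a digit, the default minimum word length of 2, and the function name mask are assumptions made for the illustration, and the product's actual mask alphabet may differ.

```python
import re

def mask(value, min_word_length=2):
    """Replace runs of letters with W or L codes and each digit with D (assumed code)."""
    def encode_letters(match):
        run = match.group(0)
        # A run of letters that is long enough counts as one word,
        # shorter runs are reported letter by letter
        return "W" if len(run) >= min_word_length else "L" * len(run)

    # Letters are encoded first so that the inserted codes are not re-encoded
    masked = re.sub(r"[^\W\d_]+", encode_letters, value)
    return re.sub(r"\d", "D", masked)

for value in ["John Smith", "J Smith", "John 42"]:
    print(repr(value), "->", repr(mask(value)))
```

In a column of names, a mask such as "W W" would be expected, while "L W" or "W DD" would stand out as unusual patterns.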
5.4.6 Quantiles
The Quantiles tab displays the data values that occur at designated intervals in the ordered data set. The first value in the list is at 0% and the last value is at 100%. The median value is at the 50% marker.
Example of Quantile Analysis Results
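The principle can be sketched in Python by sorting the values and picking the ones found at evenly spaced positions. How the product selects a value when a position falls between two records is not described in this guide, so the rounding used here, the function name quantiles, and the steps parameter are assumptions.

```python
def quantiles(values, steps=4):
    """Return the values found at evenly spaced positions of the sorted data."""
    ordered = sorted(values)
    last = len(ordered) - 1
    result = {}
    for i in range(steps + 1):
        percent = 100 * i // steps
        # 0% maps to the first value and 100% to the last one
        result[percent] = ordered[round(last * i / steps)]
    return result

print(quantiles([7, 1, 5, 3, 9]))
```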
5.4.7 Group Frequency Analysis
The Groups tab shows the number of times that each non-null frequency count is repeated in the selected column. If all values are unique, the group size will be 1, as there are no duplicate values. Each time a value is repeated, it forms a new group. In the example below, there are 130 464 values that have no duplicates, 2 033 values that occur twice, 73 values that occur three times, and so on.
Example of Group Frequency Analysis Results
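Group frequency is a "count of counts": first the occurrences of each value are counted, then the number of values sharing each occurrence count. A minimal Python sketch, with a hypothetical function name and sample data, could look like this:

```python
from collections import Counter

def group_frequency(values):
    """Map each occurrence count to the number of distinct values having it."""
    per_value = Counter(v for v in values if v not in (None, ""))
    return dict(sorted(Counter(per_value.values()).items()))

values = ["a", "b", "b", "c", "c", "d", "d", "d", None]
print(group_frequency(values))
```

Here one value occurs once, two values occur twice, and one value occurs three times; the null record is ignored.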
5.5 Advanced Analyses
This section explains how to read the results of advanced analyses: primary keys, foreign keys, business rules, and dependencies.
5.5.1 Primary Keys
The Primary Keys tab will be shown in the main panel of the Profile Viewer if primary key analysis was configured in the Profiling Step. This tab shows whether the specified columns could be used as primary keys (their uniqueness is tested).
Locate the Primary Keys tab and select one of the primary keys to see its detailed results.
Primary Keys Tab Location in the Profile
Primary Key Analysis Results
See Primary Keys Tab (see page 32) for information on configuring this analysis.
5.5.2 Foreign Keys
Foreign Keys will be shown below the main input(s) in the Inputs and Roll Ups Section of the Profile Viewer if their analysis was configured in the Profiling step. The results of this analysis will show whether the key from one input can be considered a foreign key in relation to the other (parent) entity coming from a second input.
Locate the Foreign Keys node and select one of the previously defined foreign keys to see its detailed results.
Location of the Foreign Keys Node in the Profile
The first chart, a Venn diagram, shows the overall match between the left and right inputs.
The second diagram shows the results of matching, treating the left input as the parent and the right input as the child.
The third diagram shows the result of matching, treating the right input as the parent and the left input as the child.
Foreign Key Analysis Results
See Foreign Keys Tab (see page 25) for information on configuring this analysis.
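The three views described in this section correspond to simple set comparisons between the key values of the two inputs. The Python sketch below illustrates that reading of the results; the function name foreign_key_match, the result labels, and the sample keys are hypothetical, and the product reports record counts and percentages rather than sets.

```python
def foreign_key_match(left_keys, right_keys):
    """Compare the key values of two inputs using set operations."""
    left, right = set(left_keys), set(right_keys)
    return {
        "both": left & right,        # overlap shown in the Venn diagram
        "left_only": left - right,
        "right_only": right - left,
        # A child key is valid only if every one of its values exists in the parent
        "right_is_child_of_left": right <= left,
        "left_is_child_of_right": left <= right,
    }

customers = [1, 2, 3, 4]  # key values from the left input
orders = [1, 1, 3]        # key values from the right input
print(foreign_key_match(customers, orders))
```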
5.5.3 Business Rules
When configured in the Profiling Step properties, this tab is shown and displays the results of the evaluation of any number of Boolean expressions relating to the input data. The example below (taken from the Advanced Profiling sample) shows a business rule that checks the length of each SIN number and tests whether it is 9 digits in length. It evaluates to true if the length is 9 digits and false otherwise.
Locate the Business Rules tab and select one of the defined business rules to see its detailed results.
Business Rules Tab Location in the Profile
Example of Business Rule Testing Results
See Business Rules Tab (see page 31) for information on configuring this analysis.
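What such a rule computes can be mimicked in Python: a Boolean function is evaluated for every record, and the true and false outcomes are tallied. The sketch mirrors the SIN length rule described in this section, but it is written in Python rather than in the product's expression language; the function names, the handling of null values, and the sample values are assumptions.

```python
def sin_has_length_nine(sin):
    """Business rule: true if the SIN is exactly 9 characters long."""
    return sin is not None and len(sin) == 9

def evaluate_rule(rule, values):
    """Apply a Boolean rule to every value and count the outcomes."""
    results = [rule(v) for v in values]
    passed = sum(results)
    return {"true": passed, "false": len(results) - passed}

sins = ["123456789", "12345678", "", "987654321"]
print(evaluate_rule(sin_has_length_nine, sins))
```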
5.5.4 Dependency Analysis
If configured in the Profiling step, Dependency Analysis discovers whether values of Dependants (selected columns or expressions) depend on the value of a Determinant (one or more columns or expressions combined into a single key). Each group of records with the same determinant value is examined, and if the most frequent dependant value is present in at least a certain percent of records (defined upon configuration in the Threshold field), the whole group is considered to be dependent. Otherwise, the group is considered not to be dependent.
Locate the Dependency tab for the given input or roll up and select one of the previously defined dependency analyses to see its detailed results.
Dependency Tab Location in the Profile
Results of dependency analysis present records divided into the following groups:
Null - records whose determinant value is null
Violations - records for which the dependency level is lower than the threshold
Dependencies - records for which the dependency level is equal to or higher than the threshold, further divided into:
Trivial - records which have one dependant per determinant
Non-trivial - records with two or more dependants per determinant
Non-trivial null - records with two or more dependants per determinant, with dependant values always equal to null
Dependency Analysis Results
See Dependencies Tab (see page 28) for more information on the configuration and logic of dependency analysis.
6 Cleaning up Drillthrough DB Tables
Every time you run a plan which contains a Profiling Step with enabled Drill-through functionality, a table will be created in the specified database. These tables will have a PROF_ prefix in their names unless a different prefix has been specified in the Profiling Step Properties. These tables will also get a _0_x suffix, where x stands for the number of the input used in the Profiling Step, beginning with 0. For example, you might see a table with the following name: TEST.PROF_34C76810_0_1, which means that this table is used for the second input of the corresponding Profile. You will also see the same table with the _0_0 suffix, which corresponds to the first input used in the profile.
One more table that is always created is called PROF_META. This table stores the connections between profiles and drill-through tables in the database.
Drill-through tables accumulate and are not automatically cleaned up when profiles are deleted, which, for some users, might lead to a lack of space.
To clean up unused tables, right-click the used database and select Drop Obsolete Profiles.... You will see the following screen:
Dropping Obsolete Profiles
You will see the profiles which no longer exist, the user that created them, the date of creation, and the corresponding table in the database, which will be dropped.
If you share the database schema with others, the dialog window above will treat the tables created by others as obsolete since you will not have the corresponding profiles in your DQ project. Make sure to drop only your own obsolete profiles.
As you can see in the figure above, the eeg1.profile profile has four tables associated with it (_0_(0...3) suffix), which will all be deleted. The other two profiles have one table each in the database (_0_0 suffix). Select the tables you want to drop (click Check All to select all) and click OK.
7 Working with Data Files
Existing files can be used, for example, as input data for a plan. Files can be added by dragging and dropping them from the file system to the desired project in the Navigator panel, or by copying them in the file system from their original folder to the desired project folder inside the workspace folder.
7.1 Opening Data Files
To open a data file, locate it in the File or Model Explorer and double-click it. When a data file is opened for the first time (the file does not have the letter 'M' in the bottom right corner), the Metadata Editor is launched.
A Data File without Metadata in the File Explorer
7.1.1 Editing Metadata
Metadata describes the format of the data and needs to be assigned for the file (primarily CSV, TXT, and XLS(X)) to be properly used in plans. To edit metadata, right-click on the file and select Edit Metadata...
Advanced Metadata Settings
In Section 1 of the initial Format tab, the rules for reading and interpreting the raw data can be specified. The main ones are:
File type. Choose whether the file is delimited or fixed width.
Encoding. Specify the encoding.
Field Separator. Choose which symbol separates your data into fields (columns). The setting is applicable if the delimited File Type has been chosen.
Data starts at line. Specify which line the data starts at. The value is usually "2" for data with field names or "1" for data without.
Section 2 of the Edit File Metadata dialog shows the columns resulting from the specifications in Section 1 and lets you change their type and specify their format where applicable. For example, if you choose the integer or long data format, you will be able to choose the thousands separator in the Format column.
Section 3 shows the final data in the Result data tab and lets you compare it to the original data in the Decoded original data tab.
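The effect of the main Section 1 settings can be reproduced with Python's csv module, which may help when deciding on the right values before opening the Metadata Editor. This is a sketch only: the function name read_delimited, the semicolon separator, and the sample data are assumptions, and treating the line just before the data as the field names is a simplification.

```python
import csv
import io

def read_delimited(stream, separator=";", data_starts_at_line=2):
    """Read rows from a delimited text stream, skipping the lines before the data."""
    rows = list(csv.reader(stream, delimiter=separator))
    # With field names, the line just before the data is taken as the header
    header = rows[data_starts_at_line - 2] if data_starts_at_line >= 2 else None
    return header, rows[data_starts_at_line - 1:]

sample = io.StringIO("name;gender\nJohn Smith;M\nRebecca Davis;F\n")
header, data = read_delimited(sample)
print(header)
print(data)
```

For a real file, the stream would be opened with the encoding chosen in the dialog, for example open(path, encoding="utf-8", newline="").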
7.2 Viewing Data Files
Double-clicking a text file will open it for viewing in the CSV Viewer. The CSV Viewer displays the data in rows and columns, as defined in the file metadata.
CSV Viewer
7.2.1 Sorting
To sort a column, click the name of the column in the header row. Clicking once will sort the data in ascending order (i.e. smallest-to-largest/A-to-Z), indicated by an up arrow. Clicking again will sort the data in descending order, indicated by a down arrow. Clicking a third time will remove all sorting and revert to the original ordering of the data, indicated by no arrow.
7.2.2 Filtering
To show only a subset of data, click the Switch View button in the toolbar as shown below:
Current View
This will open the View Settings dialog, which contains a tab named Filter.
Creating Data Filters
To define a filter, click Add. Use the drop-down controls to select a column to filter and a condition to apply (e.g. =, <, contains). Then specify the matching criteria. It is also possible to specify whether the filter should be case sensitive or not. Only rows matching the filter criteria will be displayed. Multiple filters can be defined to further refine the data that is shown. To disable a filter without deleting it, uncheck the Enabled checkbox.
7.2.3 Data Coloring
By default, all data will be shown in black text on a white background using the default font settings. However, rules can be configured so that certain data values or ranges are colored or formatted differently. This can be useful for visually scanning for key values in a large data file. The conditions are defined similarly to Filters, but there are additional options for coloring and text formatting (available via the Coloring column):
Background. Defines the highlighting to apply to the cells
Foreground. Defines the text color
Bold. Tick to make the text bold
Italic. Tick to italicize the text
Additionally, there is the option to define whether the coloring rule should be applied only to the specific cell which matches the coloring rule or to all (or any subset) of the columns (in the Apply to column).
Defining the Coloring Rules
After the two coloring rules above have been defined and applied, the data in the CSV Viewer will look like this:
CSV Viewer with Coloring Applied
Coloring rules will be applied in the order in which they appear. For example, as visible in the screenshot above, the last line contains a cell with "SIN" in the src_sin column, but the text inside is not painted red as has been defined in the Colors tab of View Settings. This is because this rule is followed by the src_birth_date >= 'NULL' rule. The order can be changed using the buttons on the right, below the Add button. If we change the order of the two rules above, the src_sin column in the last row now has red font color instead of black (see below). Since the background color was not defined for this rule, it is inherited from the other one and stays yellow.
CSV Viewer with Coloring Applied and Coloring Rules Order Switched
7.2.4 Column Visibility
For data files with many columns it may be useful to hide certain columns to focus on specific data. This can be done in the Columns tab of the View Settings dialog. Uncheck a column to hide it from view. When columns are hidden, a note appears at the bottom of the CSV Viewer indicating the number of columns that are hidden and providing a quick link to show them all. A column can also be hidden by right-clicking on it and selecting Hide Column.
7.2.5 Color-coding Column Headers
Many data files use standardized naming conventions to group similar columns. The View Settings dialog allows specifying different colors for column headers based on all or part of the column name. In the Heading tab, a column mask can be defined (e.g. "src*"), which will color all headings whose names start with the specified text. A different background color can be set for each mask that is used.
Defining Coloring Rules for Headings
The following figure displays the outcome of the settings above.
Example of Colored Headers
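The way a mask such as "src*" selects headings can be imitated with Python's fnmatch module, which implements the same kind of wildcard matching. In this sketch the function name header_color, the second mask, the color names, and the rule that the first matching mask wins are all assumptions; the guide does not state how overlapping masks are resolved.

```python
from fnmatch import fnmatchcase

def header_color(column_name, masks):
    """Return the color of the first mask matching the column name, if any."""
    for pattern, color in masks:
        if fnmatchcase(column_name, pattern):
            return color
    return None  # no mask matches: the header keeps its default color

masks = [("src*", "yellow"), ("cln*", "green")]
for name in ["src_sin", "src_gender", "cln_name", "id"]:
    print(name, header_color(name, masks))
```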
7.2.6 Resizing Columns
Column widths can be resized by dragging left or right when the mouse is placed over the column divider in the column header. Columns can be automatically sized to fit their contents by double-clicking the column divider.
Resizing Columns
Additionally, right-clicking on a column header will bring up a menu which offers, among other functions, the ability to "autofit" the selected column or all columns.
7.2.7 Mark Groups
Another feature available in the column header context menu is the ability to visually mark changes (or identify groups) in data. This can be useful for scanning a specific column visually for changes in the data.
Data with Marked Groups in the Column src_gender
The column whose groups are marked is indicated by an icon showing three parallel horizontal lines next to the column name (as shown above).
7.2.8 Saving Views
To preserve the view settings (including sorting, column widths, and marked groups) for later use, click the View Settings drop-down arrow to open the options menu.
View Options
Click Save As… (for creating a new saved view) or Save (for saving changes to an existing view) to store view settings. A list of recently used views will be shown at the top of the menu. A complete list of all saved views is shown in a submenu at the bottom labeled All Saved Views. An asterisk (*) next to the view name indicates that unsaved changes have been made to the current view. To delete a view, select Manage Views…. It is also possible to import and export views for use with other copies of the product using this menu option.
The view called "Default" is a stored view with no settings applied. It cannot be changed or deleted. When changes have been made to the default view, the toolbar button label changes to <custom>, which indicates an unsaved view based on the default view. Click Save As… to name and store the new view. The Edit… option is the same as clicking the toolbar button with the view name, which opens the View Settings dialog.
7.3 Editing Data Files
Besides viewing, it is also possible to edit .txt and .csv files directly in the Ataccama IDE. Excel files can be edited only in MS Excel.
To edit a data file, right-click it and select Open With > Text Editor.
Opening a Data File in the Text Editor
The file will open as plain text (instead of a table) without its metadata applied.
Text Editor
You can also select Open With > System Editor to open the file in the default editor of your operating system.
7.4 Using Data Files in a Plan
The easiest way to add a data file into a plan is by dragging it from the explorer panel directly onto the canvas, where a corresponding step will be generated (e.g., Text File Reader or Excel File Reader).
Dragging a Text File to the Canvas
Text File Reader Appearing on the Canvas
8 Working with Databases
All Ataccama solutions use databases in one way or another. They can be used as data sources and repositories for storing data. Profiles can be created from data stored in databases too.
8.1 Installing Database Connectivity Drivers
The Ataccama IDE uses the Java Database Connectivity (JDBC) API for connecting to databases. JDBC drivers are available for most database engines and are distributed as a component of the database engine or separately as connectivity components. The licensing terms do not always allow distribution of these drivers with Ataccama products. Therefore only a basic set of drivers for the most common databases is shipped. You may download and install additional drivers on your own.
The following drivers are shipped with the product and are installed automatically when the IDE is started for the first time:
Apache Derby. JDBC driver for the Apache Derby database.
HSQLDB. JDBC driver for the HSQLDB database.
IBM DB2. JDBC driver for the IBM DB2 database.
jTDS. jTDS open-source driver for connecting to both MS SQL and Sybase servers.
Oracle. JDBC driver version 11g for Oracle databases.
Microsoft SQL Server. JDBC driver for Microsoft SQL Server 2000, SQL Server 2005, and SQL Server 2008. Requires JRE version 1.6.
PostgreSQL. JDBC driver for the PostgreSQL database. Supports PostgreSQL 7.3 or newer.
Teradata. JDBC driver version 13.10.00.18 for Teradata databases.
Additional drivers may be installed with the following procedure:
1. Download the required driver (usually an archive).
2. Extract the archive to a desired location.
3. Go to Window > Preferences > [Your Product] > Database.
4. Add or edit a database driver:
   a. For a pre-configured driver that is already in the list (e.g., MySQL), double-click it.
   b. To add a new database driver, click Add... and provide the name of the new database.
5. Click Add to Classpath... and locate the driver .jar in the extraction folder.
6. Click OK to finish.
7. The driver will have a green dot and YES in the Configured column of the table.
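Outside the IDE, a downloaded driver .jar can be sanity-checked by testing whether its driver class loads from the classpath. The following is a minimal sketch only, not part of the product: the class DriverCheck and its method are hypothetical names, and org.postgresql.Driver is used merely as an example driver class name.

```java
import java.sql.Driver;

// Hypothetical helper, not part of the Ataccama IDE.
final class DriverCheck {

    /** Returns true if className can be loaded and implements java.sql.Driver. */
    static boolean isJdbcDriverAvailable(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return Driver.class.isAssignableFrom(cls);
        } catch (ClassNotFoundException | LinkageError e) {
            // The .jar is not on the classpath or cannot be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // Example class name; pass another driver class as the first argument.
        String name = args.length > 0 ? args[0] : "org.postgresql.Driver";
        System.out.println(name + " available: " + isJdbcDriverAvailable(name));
    }
}
```

Run it with the extracted .jar on the classpath (java -cp with the path to the driver .jar) to see whether the class resolves.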
8.2 Connecting to a Database
To connect to a database, right-click the Databases node in the File Explorer and select New Database Connection.
Creating a Database Connection
This will invoke the following dialog:
Configuring a Database Connection
A detailed explanation of the New Database Connection parameters is provided below.
Name                     Required  Description
Database type            Yes       Specifies the database connection type.
Connection name          Yes       Defines a name for the new database connection.
Host                     Yes       Specifies the host name or IP address of the database host.
Port                     Yes       Specifies the port used for the database connection.
Database name | Catalog  Yes       Specifies the name of the database instance located on the host.
Username                 No        The username used to connect to the database.
Password                 No        The password used to connect to the database.
After the database connection has been created, the database will be shown under the Databases node in the explorer panel. Right-click the connection name and select Connect to establish a connection.
Connecting a Database
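For orientation, the Host, Port, and Database name parameters typically end up in a JDBC connection URL, while Username and Password are passed as connection properties. The sketch below assumes the PostgreSQL URL layout (jdbc:postgresql://host:port/database); the class ConnectionSketch and its methods are hypothetical and do not show how the IDE builds its connections internally.

```java
import java.util.Properties;

// Hypothetical illustration of how the dialog fields map to JDBC inputs.
final class ConnectionSketch {

    /** Builds a PostgreSQL-style JDBC URL from host, port and database name. */
    static String postgresUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    /** Username and password are optional, so null values are simply skipped. */
    static Properties credentials(String username, String password) {
        Properties props = new Properties();
        if (username != null) {
            props.setProperty("user", username);
        }
        if (password != null) {
            props.setProperty("password", password);
        }
        return props;
    }

    public static void main(String[] args) {
        // The values below are placeholders, not defaults of the product.
        System.out.println(postgresUrl("localhost", 5432, "sales"));
        System.out.println(credentials("analyst", null).stringPropertyNames());
    }
}
```

With a driver on the classpath, the URL and properties could be handed to java.sql.DriverManager.getConnection(url, props).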
8.3 Viewing Table Properties
Clicking on the table name will show its metadata in the Properties tab.
Table Properties in the Status Panel
8.4 Executing SQL Queries
To view the results of an SQL query on a table, double-click the table.
SQL Editor Mode
A default query will be shown, listing all table entries (grouped in batches if the number of rows is large). To change the query, edit the query text and click Execute. To retrieve more results from the query, click Next batch or Read Rest (to show all results).
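The batching behaviour can be pictured as a cursor over the result rows that hands out a fixed number of rows per request. The following is an illustrative sketch only: the class BatchedResults and the batch size are assumptions, and the SQL Editor's actual implementation is not documented here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical model of "Next batch" and "Read Rest" over a row cursor.
final class BatchedResults<T> {
    private final Iterator<T> rows;
    private final int batchSize;

    BatchedResults(Iterator<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.rows = rows;
        this.batchSize = batchSize;
    }

    /** Returns up to batchSize further rows; empty when nothing is left. */
    List<T> nextBatch() {
        List<T> batch = new ArrayList<>();
        while (batch.size() < batchSize && rows.hasNext()) {
            batch.add(rows.next());
        }
        return batch;
    }

    /** Returns all remaining rows in one go. */
    List<T> readRest() {
        List<T> rest = new ArrayList<>();
        while (rows.hasNext()) {
            rest.add(rows.next());
        }
        return rest;
    }
}
```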
Refer to the documentation for the JDBC Reader step to learn how to use data from a database inside a Plan file.
9 Tips and Tricks
Cheat Sheets (see section 9.1)
IDE Shortcuts (see section 9.2)
Mouse Gestures in the Plan Editor (see section 9.3)
9.1 Cheat Sheets
The following materials cover standard and proprietary languages for interacting with data:
Regular Expressions (see section 9.1.2) - regular expressions let users extract information from a string or parse it into multiple elements. The regex engine is based on java.util.regex, following industry standards.
Commonly Used Functions (see section 9.1.1) - a powerful rule language that includes a set of built-in functions for the most common data transformations and algorithms in the data quality realm. The language is designed to be welcoming to newcomers with heterogeneous backgrounds and can be easily extended.
9.1.1 Commonly Used Functions
Basic String Functions
Function  Description
left(srcStr, count)  Returns a new string that is a substring of srcStr containing count characters taken from the start of srcStr.
right(srcStr, count)  Returns a new string that is a substring of srcStr containing count characters taken from the end of srcStr.
substr(srcStr, beginIndex), substr(srcStr, beginIndex, strLen)  Returns a new string that is a substring of the string srcStr.
transliterate(srcStr, charsFrom, charsTo)  Transforms characters of the string srcStr.
containsWord(srcStr, srcWord)  Searches for the occurrence of the word srcWord in the string srcStr.
indexOf(srcStr, subStr), indexOf(srcStr, subStr, fromIndex)  Returns the index within the string srcStr of the first occurrence of the specified substring subStr.
lower(srcStr)  Transforms all characters of the string srcStr to lower case.
upper(srcStr)  Transforms all characters of the string srcStr to upper case.
capitalize(srcStr)  Transforms the first character of each word to upper case and all following characters to lower case.
replicate(srcStr, n)  Returns n copies of the string srcStr concatenated together without any separator.
replace(srcStr, what, withWhat)  Replaces occurrences of the string what with the string withWhat in the string srcStr.
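To make the semantics of left, right, and replicate concrete, here is a rough Java analogue. It is a sketch rather than the product's implementation: the class name is hypothetical, and clamping count to the string length is an assumption, since the table does not say how out-of-range values are handled.

```java
// Hypothetical Java analogues of three functions from the table above.
final class StringFunctionSketch {

    /** First count characters of s (assumption: count is clamped to the valid range). */
    static String left(String s, int count) {
        int n = Math.max(0, Math.min(count, s.length()));
        return s.substring(0, n);
    }

    /** Last count characters of s (assumption: count is clamped to the valid range). */
    static String right(String s, int count) {
        int n = Math.max(0, Math.min(count, s.length()));
        return s.substring(s.length() - n);
    }

    /** n copies of s concatenated without any separator. */
    static String replicate(String s, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append(s);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(left("Ataccama", 3));
        System.out.println(right("Ataccama", 4));
        System.out.println(replicate("ab", 3));
    }
}
```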
Regular Expression Functions
Function  Description
find(srcRegex, srcStr [, caseInsensitive])  Verifies whether the string srcStr or its parts match the regular expression srcRegex.
matches(srcRegex, srcStr [, caseInsensitive])  Verifies whether the string srcStr matches exactly the pattern of the regular expression srcRegex.
substituteAll(srcPattern, srcReplacement, srcStr [, caseInsensitiveFlag])  Replaces all occurrences of srcPattern in the string srcStr with srcReplacement.
substituteMany(srcPattern, srcReplacement, srcStr, srcVolume [, caseInsensitiveFlag])  Replaces occurrences of srcPattern in the string srcStr with srcReplacement; the maximum number of replacements is defined by the parameter srcVolume.
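Since the regex engine is based on java.util.regex, the difference between partial and exact matching, and between unlimited and limited substitution, can be sketched directly in Java. The class and method bodies below are an approximation under that assumption, not the product's code; note that Java interprets $ and \ specially in replacement strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java approximations of the four functions above.
final class RegexFunctionSketch {

    private static Pattern compile(String regex, boolean caseInsensitive) {
        return Pattern.compile(regex, caseInsensitive ? Pattern.CASE_INSENSITIVE : 0);
    }

    /** True if any part of src matches the regex. */
    static boolean find(String regex, String src, boolean caseInsensitive) {
        return compile(regex, caseInsensitive).matcher(src).find();
    }

    /** True only if the whole of src matches the regex. */
    static boolean matches(String regex, String src, boolean caseInsensitive) {
        return compile(regex, caseInsensitive).matcher(src).matches();
    }

    /** Replaces every match of pattern in src. */
    static String substituteAll(String pattern, String replacement, String src,
                                boolean caseInsensitive) {
        return compile(pattern, caseInsensitive).matcher(src).replaceAll(replacement);
    }

    /** Replaces at most max matches of pattern in src, scanning left to right. */
    static String substituteMany(String pattern, String replacement, String src,
                                 int max, boolean caseInsensitive) {
        Matcher m = compile(pattern, caseInsensitive).matcher(src);
        StringBuffer out = new StringBuffer();
        int done = 0;
        while (done < max && m.find()) {
            m.appendReplacement(out, replacement);
            done++;
        }
        m.appendTail(out); // copy the unprocessed remainder unchanged
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(find("[0-9]+", "abc123", false));
        System.out.println(matches("[0-9]+", "abc123", false));
        System.out.println(substituteMany("a", "x", "banana", 2, false));
    }
}
```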
Conversion Functions
Function  Description
toDate(expr, dateFormat[, dateLocale])  Returns the date specified in expression expr converted to date type DAY.
toString(expr, strFormat[, strLocale])  Converts the expression expr to a STRING value.
Datetime Functions
Function  Description
datePart(srcDate, fieldName)  Returns the value of the field fieldName (YEAR, MONTH or DAY) of srcDate.
dateDiff(startDate, endDate, fieldName)  Returns the difference between endDate and startDate expressed in fieldName units.
today()  Returns the current date in type DAY.
dateAdd(srcDate, srcValue, fieldName)  Adds the specified srcValue of the type specified by fieldName (YEAR, MONTH or DAY) to the srcDate.
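The datetime functions map naturally onto java.time. The sketch below is a hedged analogue, assuming the field names YEAR, MONTH, and DAY from the table; the class name is hypothetical, and counting only complete units in dateDiff is an assumption (the table does not say how partial units are treated).

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical java.time analogues of datePart, dateDiff and dateAdd.
final class DateFunctionSketch {

    private static ChronoUnit unit(String fieldName) {
        switch (fieldName) {
            case "YEAR":  return ChronoUnit.YEARS;
            case "MONTH": return ChronoUnit.MONTHS;
            case "DAY":   return ChronoUnit.DAYS;
            default: throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
    }

    /** Value of the given field of the date. */
    static int datePart(LocalDate date, String fieldName) {
        switch (fieldName) {
            case "YEAR":  return date.getYear();
            case "MONTH": return date.getMonthValue();
            case "DAY":   return date.getDayOfMonth();
            default: throw new IllegalArgumentException("Unknown field: " + fieldName);
        }
    }

    /** Complete units from start to end (assumption: partial units are dropped). */
    static long dateDiff(LocalDate start, LocalDate end, String fieldName) {
        return unit(fieldName).between(start, end);
    }

    /** Adds value units of the given field to the date. */
    static LocalDate dateAdd(LocalDate date, long value, String fieldName) {
        return date.plus(value, unit(fieldName));
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2016, 12, 20);
        System.out.println(datePart(d, "MONTH"));
        System.out.println(dateAdd(d, 1, "MONTH"));
        System.out.println(dateDiff(d, LocalDate.of(2016, 12, 25), "DAY"));
    }
}
```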
String Distance Functions
Function  Description
hamming(srcStr1, srcStr2 [, caseInsensitive])  Returns the Hamming distance between strings srcStr1 and srcStr2.
levenshtein(srcStr1, srcStr2 [, maxDistance][, caseInsensitive])  Returns the Levenshtein distance between strings srcStr1 and srcStr2.
editDistance(srcStr1, srcStr2 [, maxDistance] [, caseInsensitive])  Returns the edit distance between strings srcStr1 and srcStr2.
soundex(srcStr)  Returns the soundex value of the parameter srcStr.
doubleMetaphone(srcStr), doubleMetaphone(srcStr, isAlternate)  Encodes srcStr to a double metaphone primary string. Encodes srcStr to a double metaphone secondary string if the parameter isAlternate is true.
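For readers unfamiliar with the two basic distances, the following sketch shows the textbook definitions: Hamming distance counts differing positions in equal-length strings, and Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions. The class is hypothetical and omits the optional maxDistance and caseInsensitive parameters; it is not the product's implementation.

```java
// Textbook implementations for illustration; names are hypothetical.
final class StringDistanceSketch {

    /** Number of positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal lengths");
        }
        int distance = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                distance++;
            }
        }
        return distance;
    }

    /** Classic dynamic-programming Levenshtein distance using two rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(hamming("karolin", "kathrin"));
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```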
Set Distance Functions
Function  Description
intersection(set1, set2, separator [, multiset])  Returns the cardinality of the intersection of sets.
difference(set1, set2, separator [, multiset] [, singularity])  Returns the cardinality of the difference of sets (set1 \ set2).
symmetricDifference(set1, set2, separator [, multiset] [, singularity])  Returns the cardinality of the symmetric difference of sets.
union(set1, set2, separator [, multiset])  Returns the cardinality of the union of sets.
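The set functions take each set as a single string whose members are delimited by separator, and return a count rather than the set itself. A minimal sketch of that idea follows; the class is hypothetical, the optional multiset and singularity parameters are not modelled, and treating members as exact, case-sensitive strings is an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration of set cardinalities over separated strings.
final class SetDistanceSketch {

    /** Splits s on the literal separator and collects distinct members. */
    static Set<String> parse(String s, String separator) {
        return new HashSet<>(Arrays.asList(s.split(Pattern.quote(separator))));
    }

    static int intersection(String set1, String set2, String separator) {
        Set<String> result = parse(set1, separator);
        result.retainAll(parse(set2, separator));
        return result.size();
    }

    /** Members of set1 that are not in set2. */
    static int difference(String set1, String set2, String separator) {
        Set<String> result = parse(set1, separator);
        result.removeAll(parse(set2, separator));
        return result.size();
    }

    static int union(String set1, String set2, String separator) {
        Set<String> result = parse(set1, separator);
        result.addAll(parse(set2, separator));
        return result.size();
    }

    /** Members that are in exactly one of the two sets. */
    static int symmetricDifference(String set1, String set2, String separator) {
        return union(set1, set2, separator) - intersection(set1, set2, separator);
    }

    public static void main(String[] args) {
        System.out.println(intersection("a,b,c", "b,c,d", ","));
        System.out.println(symmetricDifference("a,b,c", "b,c,d", ","));
    }
}
```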
Conditionals
Function  Description
iif(ifExpr, trueExpr, elseExpr)  Returns trueExpr if ifExpr is TRUE; otherwise returns elseExpr.
case(expr, exprValue[, expr, exprValue]...[, defaultExpr])  Returns the value of expression exprValue immediately following the first expression expr whose value is TRUE.
decode(decodeExpr, expr, exprValue[, expr, exprValue]...[, defaultExpr])  Returns the value of expression exprValue immediately following the first expression expr whose value is equal to decodeExpr.
nvl(expr[, expr]...)  Returns the value of the first expression expr whose value is not NULL.
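As a rough mental model, nvl picks the first non-NULL argument and decode walks through (expr, exprValue) pairs until one equals decodeExpr. The Java sketch below illustrates this under stated assumptions: the class is hypothetical, the default value is passed as the second argument (because Java varargs must come last), and treating NULL as equal to NULL in decode is an assumption the table does not confirm.

```java
import java.util.Objects;

// Hypothetical analogues of nvl and decode for illustration only.
final class ConditionalSketch {

    /** Returns the first non-null argument, or null if there is none. */
    @SafeVarargs
    static <T> T nvl(T... values) {
        for (T value : values) {
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    /**
     * pairs holds expr, exprValue, expr, exprValue, ...; the exprValue after
     * the first expr equal to decodeExpr is returned, else defaultValue.
     */
    static Object decode(Object decodeExpr, Object defaultValue, Object... pairs) {
        if (pairs.length % 2 != 0) {
            throw new IllegalArgumentException("pairs must contain expr/value pairs");
        }
        for (int i = 0; i < pairs.length; i += 2) {
            if (Objects.equals(decodeExpr, pairs[i])) {
                return pairs[i + 1];
            }
        }
        return defaultValue;
    }

    public static void main(String[] args) {
        System.out.println(nvl(null, "fallback"));
        System.out.println(decode("F", "unknown", "M", "male", "F", "female"));
    }
}
```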
9.1.2 Regular Expressions
Metacharacters Defined
MChar  Definition  Pattern  Sample Matches
^  Start of a string.  ^abc  abc, abcdefg, abc123, ...
$  End of a string.  abc$  abc, endsinabc, 123abc, ...
.  Any character (except \n newline).  a.c  abc, aac, acc, adc, aec, ...
|  Alternation.  bill|ted  ted, bill
{...}  Explicit quantifier notation.  ab{2}c  abbc
[...]  Explicit set of characters to match.  a[bB]c  abc, aBc
(...)  Logical grouping of part of an expression.  (abc){2}  abcabc
*  0 or more of previous expressions.  ab*c  ac, abc, abbc, abbbc, ...
+  1 or more of previous expressions.  ab+c  abc, abbc, abbbc, ...
?  0 or 1 of previous expressions; also forces minimal matching when an expression might match several strings within a search string.  ab?c  ac, abc
\  Preceding one of the above, it makes it a literal instead of a special character. Preceding a special matching character, see below.  a\sc  a c
Character Escapes
Escaped Char  Description
ordinary characters  Characters other than . $ ^ { [ ( | ) ] } * + ? \ match themselves.
\t  Matches a tab \u0009.
\r  Matches a carriage return \u000D.
\f  Matches a form feed \u000C.
\n  Matches a new line \u000A.
\040  Matches an ASCII character as an octal (up to three digits); numbers with no leading zero are backreferences if they have only one digit or if they correspond to a capturing group number. For example, the character \040 represents a space.
\x20  Matches an ASCII character using hexadecimal representation (exactly two digits).
\u0020  Matches a Unicode character using a hexadecimal representation (exactly four digits).
\*  When followed by a character that is not recognized as an escaped character, matches that character. For example, \* is the same as \x2A.
Character Classes
Char Class  Description
.  Matches any character except \n.
[aeiou]  Matches any single character included in the specified set of characters.
[^aeiou]  Matches any single character not in the specified set of characters.
[0-9a-fA-F]  Use of a hyphen (-) allows specification of contiguous character ranges.
\w  Matches any word character. \w is equivalent to [a-zA-Z_0-9].
\W  Matches any non-word character. \W is equivalent to [^a-zA-Z_0-9].
\s  Matches any white-space character. \s is equivalent to [ \f\n\r\t\v].
\S  Matches any non-white-space character. \S is equivalent to [^ \f\n\r\t\v].
\d  Matches any decimal digit.
\D  Matches any non-digit.
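Because the engine is based on java.util.regex, several rows of the cheat sheet can be tried out in plain Java. The helper below is a small sketch with hypothetical names; found tests for a partial match and whole tests that the entire input matches. Remember that backslashes must be doubled inside Java string literals.

```java
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the patterns listed above.
final class RegexCheatSheetCheck {

    /** True if the regex matches somewhere inside the input. */
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    /** True if the regex matches the entire input. */
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Anchors: the first holds, the second does not.
        System.out.println(found("^abc", "abcdefg"));
        System.out.println(found("^abc", "xabc"));
        // Quantifier and grouping samples from the metacharacter table.
        System.out.println(whole("ab{2}c", "abbc"));
        System.out.println(whole("(abc){2}", "abcabc"));
        // Three ways of writing a space: octal, hexadecimal, white-space class.
        System.out.println(whole("\\040", " "));
        System.out.println(whole("\\x20", " "));
        System.out.println(whole("a\\sc", "a c"));
    }
}
```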
9.2 IDE Shortcuts
Below are Ataccama IDE shortcuts organized into categories by particular viewer or editor.
A full list of keyboard shortcuts can be found and called by pressing Ctrl+Shift+L at any time.
To edit shortcuts, go to Window > Preferences > General > Keys.
9.2.1 Data Viewer
Ctrl+F - Find - Opens Find dialog.
Ctrl+L - Show Columns - Opens dialog where user selects which columns should be visible and which columns should be hidden.
Ctrl+Home - Selects and reveals the top left cell.
Ctrl+End - Selects and reveals the bottom right cell.
9.2.2 SQL Editor
Ctrl+F - Find - Opens Find/Replace dialog if SQL query field has focus (from v8).
Ctrl+Enter - Execute - Executes current query.
Ctrl+R - Read Rest - Reads the rest of the query execution.
Ctrl+B - Next Batch - Reads the next batch of the query execution.
Esc - Stop - Stops the query execution.
9.2.3 Profile Viewer
Ctrl+Shift+D - Drill-through - Displays appropriate drill-through data.
9.2.4 Plan Editor
Search and Navigate
Ctrl+F - Find - Opens Find dialog.
Ctrl+G - Go to Step - Goes to step.
Ctrl+double click - Opens component definition (if selected step is a component). From v8 it also opens the file associated with the given step.
Edit
Insert, Ctrl+I - Insert Step - Inserts step.
Drop step to connection - Attaches step to a connection.
Alt+drop step to connection - Attaches step to connection and makes space for it.
Ctrl+drop step to connection - Attaches step to connection through Multiplicator.
Ctrl+Alt+drop step to connection - Attaches step to connection through Multiplicator and makes space for the Multiplicator.
Ctrl+Delete - Removes step from connection.
Ctrl+drag step out of connection - Detaches step from a connection.
Ctrl+Shift+drag step out of connection - Detaches step from connection and disables it.
Layout
Ctrl+Shift+H - Sets horizontal layout to selected step(s).
Ctrl+Shift+V - Sets vertical layout to selected step(s).
Ctrl+Alt+V - Distribute Widths - Distributes selected steps/includes horizontally.
Ctrl+Alt+H - Distribute Heights - Distributes selected steps/includes vertically.
Ctrl+Alt+Left - Align Left - Aligns selected steps/includes left.
Ctrl+Alt+Right - Align Right - Aligns selected steps/includes right.
Ctrl+Alt+Up - Align Top - Aligns selected steps/includes top.
Ctrl+Alt+Down - Align Bottom - Aligns selected steps/includes bottom.
Run
Ctrl+Shift+D - Debug - Debugs selected step.
Ctrl+F11 - Run - Runs the plan/component.
9.2.5 Progress Viewer
Ctrl+G - Go to Step - Goes to step.
Ctrl+double click - Opens component definition (if selected step is a component).
9.2.6 Common Shortcuts
This section contains some common shortcuts (not all these shortcuts are supported by all editors).
Ctrl+C - Copy
Ctrl+V - Paste
Ctrl+A - Select All
Ctrl+Z - Undo
Ctrl+Y - Redo
Ctrl+S - Save
Ctrl+P - Print
Shift+mouse wheel - Scrolls editor content horizontally.
Ctrl+mouse wheel - Zooms in/out editor content.
9.3 Mouse Gestures in the Plan Editor
Creating/reconnecting steps in DQC plans involves, in addition to choosing/positioning the steps, a lot of reconnecting. The aim of mouse gestures is to simplify this, thus improving the speed of creating and modifying plans.
9.3.1 Basic Concepts
The only connection that may change during a mouse gesture operation (except for operations that delete steps) is the connection onto which the step is being dragged, created, or removed.
Unconnected endpoints are preferred for connecting. If none exist, endpoints that are capable of accepting a new connection are used.
When connecting a single unconnected step (insert or drag), the first endpoint from the left that matches the condition is used.
When the dragged step belongs to a subgraph, the whole subgraph is searched for endpoints for the output connection, and the first matching endpoint is used.
The subgraph is searched in depth-first preorder: parents before children, the left branch before the right, and a node's left endpoint before its right.
Dragging a step onto a connection is not enabled when no input endpoint is found.
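The depth-first preorder search described above can be sketched as a short recursive routine. This is an illustration under simplifying assumptions, not the Plan Editor's code: the Step class, its fields, and the method name are hypothetical, the subgraph is modelled as a tree without cycles, and each step lists only its free output endpoints from left to right.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model of the endpoint search; not the Plan Editor's code.
final class EndpointSearch {

    static final class Step {
        final String name;
        final List<String> freeOutputs; // free output endpoints, left to right
        final List<Step> children;      // downstream steps, left to right

        Step(String name, List<String> freeOutputs, List<Step> children) {
            this.name = name;
            this.freeOutputs = freeOutputs;
            this.children = children;
        }
    }

    /**
     * Depth-first preorder: the step itself is checked before its children,
     * the left branch before the right, the left endpoint before the right.
     */
    static Optional<String> firstFreeOutput(Step step) {
        if (!step.freeOutputs.isEmpty()) {
            return Optional.of(step.name + "." + step.freeOutputs.get(0));
        }
        for (Step child : step.children) {
            Optional<String> hit = firstFreeOutput(child);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }
}
```

In this model, a free endpoint deep in the left branch is chosen before any free endpoint in the right branch, which matches the ordering stated in the list.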
9.3.2 Individual Actions
Create Step/Reader/Component onto a Connection
By inserting a step or dragging data source/component definition onto a connection, both a new element and connections are created.
Holding the Ctrl key during step creation creates a "fork" with the Multiplicator step.
Drag Step/Reader/Component onto a Connection
By dragging an existing step onto a connection, the step will be seamlessly integrated into it.
By Ctrl+dragging, a "fork" is created with the Multiplicator step.
9.3.3 Drop Step on Another Step
By inserting a new step/component on an existing step with unconnected output, the new step is connected and positioned automatically.
9.3.4 Delete Step from Connection
Pressing Ctrl+Delete on the selected step deletes it while preserving the actual connection.
9.3.5 Drag Step out of Connection
When Ctrl+dragging a connected step, the step is detached from the connection. When Ctrl+Shift is pressed while dragging, the detached step will also be disabled.
9.3.6 Summary of Actions and Possible Key Modifiers
Action  No key modifier  CTRL pressed  SHIFT pressed  ALT pressed
Drag step on connection  connect step into the connection  make a fork and connect  -  shift nodes to make space
Create step on connection  create and connect to the existing connection  make a fork, create and connect  -  shift nodes to make space
Drop step on step  create and connect with the existing step  -  -  shift nodes to make space
Drag connected step  normal drag  detach from the connection  disable the detached step (together with CTRL)  -
Delete connected step  delete and break the connection  delete and preserve the connection  -  -