
DQ Analyzer 10 User Guide

Created on: 20 Dec 2016

Read the most up-to-date documentation for the latest Ataccama release online at docs.ataccama.com

Copyright © 2007–2016 by Ataccama. All rights reserved.

https://docs.ataccama.com

Table of Contents

1 Introduction ... 5

2 DQA User Interface Overview ... 6

3 Creating a Data Quality Profile ... 8
3.1 Step 1 Select Data to Profile ... 8
3.2 Step 2 Configure the Profile or Create a Profiling Plan ... 9

4 Building Custom Profiling Plans ... 11
4.1 The Plan Editor ... 11
4.2 Creating a Plan File ... 12
4.3 Adding Steps to the Canvas ... 13
4.4 Connecting Steps ... 14
4.5 Editing Step Properties ... 15
4.6 Dealing with Errors ... 17
4.7 Adding Comments ... 18
4.8 Running the Plan ... 19
4.8.1 Viewing the Console Output ... 19
4.8.2 Viewing the Plan Execution Progress ... 20
4.9 Viewing Historical Run Results ... 21
4.10 Configuring the Profiling Step ... 21
4.10.1 General Category ... 22
4.10.2 Input Category ... 26
4.10.3 Configuring Business Domain Analysis ... 33
4.11 DQA Steps and Functions ... 34
4.11.1 Available Steps ... 34
4.11.2 Using Functions ... 35
4.11.3 Using Regular Expressions ... 36
4.12 Debugging Expressions and Steps ... 37
4.12.1 Debugging a Function (Expression) ... 37
4.12.2 Debugging a Regular Expression ... 39
4.12.3 Debugging a Step ... 40

5 Reading a Data Quality Profile ... 43
5.1 The Profile Viewer ... 43
5.2 Inputs and Roll Ups ... 43
5.3 Understanding The Profile Statistics ... 44
5.4 Column Analyses ... 44
5.4.1 Basic Analyses ... 44
5.4.2 Frequency Analysis ... 46
5.4.3 Domain Analysis ... 47
5.4.4 Business Domain Analysis ... 47
5.4.5 Mask Analysis ... 48
5.4.6 Quantiles ... 49
5.4.7 Group Frequency Analysis ... 49
5.5 Advanced Analyses ... 50
5.5.1 Primary Keys ... 50
5.5.2 Foreign Keys ... 51
5.5.3 Business Rules ... 54
5.5.4 Dependency Analysis ... 55

6 Cleaning up Drillthrough DB Tables ... 58

7 Working with Data Files ... 59
7.1 Opening Data Files ... 59
7.1.1 Editing Metadata ... 59
7.2 Viewing Data Files ... 61
7.2.1 Sorting ... 61
7.2.2 Filtering ... 61
7.2.3 Data Coloring ... 62
7.2.4 Column Visibility ... 64
7.2.5 Color-coding Column Headers ... 64
7.2.6 Resizing Columns ... 65
7.2.7 Mark Groups ... 66
7.2.8 Saving Views ... 66
7.3 Editing Data Files ... 67
7.4 Using Data Files in a Plan ... 69

8 Working with Databases ... 70
8.1 Installing Database Connectivity Drivers ... 70
8.2 Connecting to a Database ... 71
8.3 Viewing Table Properties ... 73
8.4 Executing SQL Queries ... 74

9 Tips and Tricks ... 76
9.1 Cheat Sheets ... 76
9.1.1 Commonly Used Functions ... 76
9.1.2 Regular Expressions ... 80
9.2 IDE Shortcuts ... 82
9.2.1 Data Viewer ... 82
9.2.2 SQL Editor ... 83
9.2.3 Profile Viewer ... 83
9.2.4 Plan Editor ... 83
9.2.5 Progress Viewer ... 84
9.2.6 Common Shortcuts ... 84
9.3 Mouse Gestures in the Plan Editor ... 84
9.3.1 Basic Concepts ... 85
9.3.2 Individual Actions ... 85
9.3.3 Drop Step on Another Step ... 87
9.3.4 Delete Step from Connection ... 87
9.3.5 Drag Step out of Connection ... 88
9.3.6 Summary of Actions and Possible Key Modifiers ... 89

1 Introduction

Ataccama DQ Analyzer is a free high-performance data profiling tool, capable of profiling millions of records fast. You can leverage a number of algorithms to process your data before profiling, and make use of the same powerful business rules engine as Ataccama's full data quality tool, the Data Quality Center (DQC). DQ Analyzer is designed to provide quick insight into your data and to perform advanced analyses iteratively, even on large data sets.

This guide is intended to provide an overview of the basic functionality of the product and describe how to perform common functions.

In this guide:

DQA User Interface Overview (see page 6)

Creating a Data Quality Profile (see page 8)

Building Custom Profiling Plans (see page 11)

Reading a Data Quality Profile (see page 43)

Cleaning up Drillthrough DB Tables (see page 58)

Working with Data Files (see page 59)

Working with Databases (see page 70)

Tips and Tricks (see page 76)

Other Help Resources:

Video Tutorials. These demonstrate how to perform common tasks in DQ Analyzer and are available on the Ataccama YouTube channel:

Creating a Profile

Understanding Profiling Results

Advanced Profiling

Email Analysis

Tutorial Files. DQ Analyzer includes a Tutorials project, which contains pre-built, runnable configurations. The video tutorials will walk you through these.

Help Files. For help on specific functions or features not covered in this guide or the resources mentioned above, extensive documentation is available in the product Help (available in Help > Help Contents in the toolbar).

https://www.youtube.com/ataccama

https://www.youtube.com/watch?v=WC_VZ5z5q3E

https://www.youtube.com/watch?v=2L35k080ovQ

https://www.youtube.com/watch?v=nQGBYkTXNPc

https://www.youtube.com/watch?v=u45G0yo9sE4


2 DQA User Interface Overview

DQ Analyzer is built on the Eclipse framework, so it is similar in structure and behavior to many Integrated Development Environments (IDEs). The user interface is comprised of four main areas:

1. The Explorer Panel. The Explorer panel offers quick access to the created DQ Projects, to local and network disks from which you can quickly locate data files for analysis, and to connected databases with easy access to the tables to be used for analysis. Many action shortcuts are available by right-clicking objects in this panel, such as creating new plans and profiles or connecting to a database.

2. The Main Editing Area. The editing area is where all main activities are performed: plans are constructed and visualized, data files are shown and edited, and profiles are viewed. All of these can be opened in separate tabs next to each other.

3. The Status Panel. The Status panel contains two tabs by default, plus the Console tab, which appears upon the first plan run:

Properties. Shows properties of a selected file, plan, profile or database, e.g. path, size or modification date. Also shows problems with the constructed plan.

Run Results. Stores the results of plans run in the past and shows details in case of errors, including the component that contains the problem.

Console. Shows the progress of running a plan or generating a profile.

Other tabs can be added via the Window > Show View > Other... menu and attached (by dragging) to any part of the UI.

4. Configuration Dialogs (second figure below). There are various dialogs used to configure the different components, similar to the one shown below. These dialogs are typically invoked by double-clicking Steps or via the context (right-click) menu.


The User Interface

Example of a Configuration Dialog


3 Creating a Data Quality Profile

A data quality profile is a summary of the state of the data. It lets you see duplicates and dependencies, evaluate business rules, observe patterns in the data, and more.

After reading this chapter, you will be able to create a profile and configure it for additional analyses, such as masks, dependencies, and business rules.

3.1 Step 1 Select Data to Profile

1. Launch the Integrated Development Environment.

2. In the File Explorer view, select one or several files (CSV, TXT, XLS, XLSX) or database tables, right-click the selection and select Create Profile.

Notes:

In order to profile a database table, you must have a database connection configured. To learn how to do this, see Connecting to a Database (see page 71).

In the case of text files, you may need to assign metadata to them to describe how they are formatted. For more information, see Editing Metadata (see page 59).

If you select several inputs, you will obtain one profile file for all of them, separated by input (one per table/file). See Reading a Data Quality Profile (see page 43) for more information on reading a profile with multiple inputs.


Selecting Several Database Tables to Profile

3.2 Step 2 Configure the Profile or Create a Profiling Plan

In the profile configuration dialog that opens, specify where to create the profile as well as which columns to profile. Drill-through functionality allows you to see the individual records that comprise the statistics that are generated (database connection required). Finally, there is the option to create a profile or a plan file.

If you select the Profile option and click Finish, the profile will be generated immediately using the specified settings and opened in the Profile Viewer. See Reading a Data Quality Profile (see page 43) to learn how to read the data contained in the profile.

If you select the Plan file option, a plan for generating a profile will be created. This option is useful if you wish to modify or filter the data before profiling it, or if you want to do some advanced configuration of the profiling algorithm (such as adding business rules or performing primary key analysis). See the next section and Configuring the Profiling Step (see page 21) for more information.


Profile Configuration


4 Building Custom Profiling Plans

Sometimes it is necessary to cleanse and standardize data to get more accurate profiling results. In this case, it is necessary to build a plan.

A plan file defines the logic and rules to be applied to the input data in order to produce the desired output. Plans are created by placing Steps onto a canvas and connecting them together. Steps are data processing algorithms that can be used to read, transform and analyze data, among other actions.

Examples of complex profiling plans are available in the Tutorials project in DQ Analyzer.

4.1 The Plan Editor

The image below shows a plan in the Plan Editor, which is launched every time you open or create a plan. The Plan Editor consists of the (1) Canvas, where the plan logic is defined (by connecting Steps together), and the (2) Palette, where the various steps and actions are listed.

The Plan Editor


4.2 Creating a Plan File

To create a new plan file:

1. Select New > Plan by right-clicking on a project or folder in the Explorer panel. Alternatively, use the toolbar. Both options are shown below.

Creating a New Plan

2. Specify the Name of the plan and the place (Container) for storing it.


Creating a New Plan File

4.3 Adding Steps to the Canvas

To add steps to the canvas, do one of the following:

Drag needed steps from the Palette to the Canvas.

Dragging a Step to the Canvas


Press CTRL + I or Insert and select the step from a filterable list.

Insert Step Dialog

To learn how particular steps work, go through plans in the tutorial project: DQ Projects > Tutorials.

4.4 Connecting Steps

To connect two steps, drag from the out endpoint of one step to the in endpoint of another step.


Connecting Steps

4.5 Editing Step Properties

Most steps require (or benefit from) some configuration to perform their functions, which is done by accessing the step properties.

To edit step properties, double-click the step or right-click it and select Edit Properties:


Calling the Properties Dialog

In the image below, a regular expression is defined in the Regex Matching step:

Editing Regex Matching Properties

In the image below, the Column Assigner step is edited: a column is created and an expression is defined for it:


Editing Column Assigner Properties

Press Ctrl+Space to get a list of available functions and input columns. Press Ctrl+Space+Space to get a list of available input columns.

4.6 Dealing with Errors

Errors that may arise when constructing the plan are reported in the Properties tab of the Status Panel:

Errors in the Constructed Plan

Selecting an individual step will show only the warnings and errors for that Step. Double-clicking on an error in the Properties panel will open the step properties dialog at the field that contains the error.


4.7 Adding Comments

To add a comment to your plan to explain its logic, select Comment from the Palette and click anywhere on the Canvas.

Adding a Comment

To edit the comment, double-click on it. The image below shows the comment editor, which allows changing the text, background, and border colors as well as the text itself:

Editing a Comment


4.8 Running the Plan

When the plan is built and contains no errors, it can be run. To do that, simply click the Run button as seen below:

Running a Plan

When the plan is finished running, a message will appear:

Plan Run Successful Message

4.8.1 Viewing the Console Output

During and after plan execution, you can see plan execution logs in the Console tab of the Status Panel:


Plan Run Progress Monitoring

4.8.2 Viewing the Plan Execution Progress

To open plan execution progress while the plan is being executed, click the Show Progress icon in the Status Panel.

Monitoring Plan Progress

A new tab that opens shows the total number of records passing to each step.


Plan Execution Progress in a Separate Tab

4.9 Viewing Historical Run Results

To view all plan executions in the current session, switch to the Run Results tab in the Status Panel and select a particular run. You will be able to review the errors that occurred.

Run Results

4.10 Configuring the Profiling Step

If you choose to create a profile using a plan, a Profiling step will appear on the canvas, connected to a data source.

Example of a Simple Profiling Plan

This simple plan contains two steps: one for reading the data and the other to generate the Profile.

This section deals with all possible configurations of the Profiling step.

Double-click the Profiling step to open the Profiling Step Editor. Here you can modify the existing configuration and add additional analyses to run.

There are two categories of settings in the Profiling Step Editor: General settings, which apply to all inputs; and Input-specific settings, which apply separately to each input.


4.10.1 General Category

Basic Tab

The Basic tab contains fields for specifying the step name, output file name and location and default locale for the generated files.

Basic Profile Settings

Masks Tab

The Masks tab contains the masks that have been defined and is where new masks can be created and edited. A mask is a way of showing the structure of the data rather than the content of the data. User-defined codes are used instead of the actual characters in the data to describe these patterns. For example, the mask "D" could be used to represent a digit, while "L" could be used to represent a letter. The codes and rules for the masks can be defined here. You can assign any of the created masks to the columns in your data in the Data tab for each of your inputs. See Data Tab (see page 27) below.


Masks Tab of the Profiling Step Properties

Each mask configuration contains the following settings:

Characters. This is the type (or types) of characters that should be masked. The pre-defined types are: [:all:], [:letter:], [:lowercase:], [:uppercase:], [:digit:], and [:white:], where [:white:] means all whitespace characters (i.e., spaces). Multiple character types can be used in the same mask (e.g., [:digit:][:white:]). Characters that are not masked will be shown as they are in the data when Copy others is checked. For example, if only numbers are masked, the Mask results could show "# main street" in an address field, where # is a numerical mask for a sequence of digits.

Symbol. The symbol that is used to represent a single Character. "L" for letter, for example.

Repeated Symbol. This is the character that is used to represent a sequence of characters defined in the Characters field. "W" for a sequence of more than 2 letters, for example.

Repeated Threshold. The minimum number of Characters that represents a sequence. For example, the minimum number of letters in a row to be considered a word.

The Repeated Symbol and Repeated Threshold fields may be left blank to mask all characters individually regardless of the sequence length.

To create a new mask, click the + button at the bottom of the list of masks. To delete a mask, click the - button. A mask may contain multiple masking rules (called character groups), as shown above. To add a new character group, click the Add button. The order of character groups might be important when creating a mask with several character groups, one of which is a subset of another, especially when the [:all:] character type is used in conjunction with other types. For example, if you want to mark all letters with "L," all digits with "D" and any other symbol with "$," the rule containing [:all:] should be placed last in the list. Use the To Top, Up, Down, and To Bottom buttons to change the placement of rules relative to each other.
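The masking rules above can be illustrated with a short Python sketch. This is not DQ Analyzer's implementation: the names `CharacterGroup` and `apply_mask` are hypothetical, the character types are approximated with Python string predicates, and dropping unmasked characters when Copy others is unchecked is an assumption the guide does not confirm.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional

# Approximations of the pre-defined character types (assumed semantics).
CHARACTER_TYPES = {
    "[:all:]": lambda ch: True,
    "[:letter:]": str.isalpha,
    "[:lowercase:]": str.islower,
    "[:uppercase:]": str.isupper,
    "[:digit:]": str.isdigit,
    "[:white:]": str.isspace,
}


@dataclass
class CharacterGroup:
    """One masking rule: which characters to mask and which symbols to use."""
    characters: List[str]                     # e.g. ["[:digit:]", "[:white:]"]
    symbol: str                               # replaces a single character
    repeated_symbol: Optional[str] = None     # replaces a whole sequence
    repeated_threshold: Optional[int] = None  # minimum sequence length

    def matches(self, ch: str) -> bool:
        return any(CHARACTER_TYPES[name](ch) for name in self.characters)


def apply_mask(value: str, groups: List[CharacterGroup], copy_others: bool = True) -> str:
    """Mask a value; the first matching group in list order wins."""
    def first_match(ch: str) -> Optional[int]:
        for index, group in enumerate(groups):
            if group.matches(ch):
                return index
        return None

    out = []
    # Consecutive characters claimed by the same group form one sequence.
    for index, run in groupby(value, key=first_match):
        chars = list(run)
        if index is None:
            # Unmasked characters: copied as-is, or dropped (assumption).
            if copy_others:
                out.extend(chars)
            continue
        group = groups[index]
        is_sequence = (
            group.repeated_symbol is not None
            and group.repeated_threshold is not None
            and len(chars) >= group.repeated_threshold
        )
        out.append(group.repeated_symbol if is_sequence else group.symbol * len(chars))
    return "".join(out)


digits = CharacterGroup(["[:digit:]"], symbol="D", repeated_symbol="#", repeated_threshold=2)
print(apply_mask("12 main street", [digits]))

ordered = [
    CharacterGroup(["[:letter:]"], "L"),
    CharacterGroup(["[:digit:]"], "D"),
    CharacterGroup(["[:all:]"], "$"),  # catch-all rule placed last
]
print(apply_mask("AB-12", ordered))
```

With the catch-all `[:all:]` group placed first instead of last, every character would be claimed by it, which is why the order of character groups matters.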

Drill-through Tab

The Drill-through tab specifies whether drill-through functionality should be used. Enabling drill-through allows you to inspect the individual records that comprise the generated statistics and other measures shown in the Profile Viewer. Because it stores additional fields, it requires a database connection. To allow drill-through in the generated profile, click Enable Drill-through.

Drill-through Configuration

Parameters:

Database Name. Name of a connected database to use for storing drill-through tables. For a list of available databases, use the content-assist functionality (invoked by Ctrl + Space).

Table Prefix. Optional prefix which will be added to the drill-through tables created for this profile.

Display Limit. Optional parameter specifying how many rows of data will be written to the drill-through tables. This is a useful parameter for large inputs, since limiting its value to a representative number of rows will lead to considerable space savings. When Display Limit is specified, the application selects the records to be written into drill-through tables based on their "value added" to the general picture of the data or, in other words, their uniqueness. This way, given the Display Limit value, the application tries to create as representative a sample of the data as possible.

When drill-through is enabled, it can be used by right-clicking on many of the statistics shown in the Profile Viewer (e.g. Median value) and selecting Drill-through.

With time, drill-through tables can take up a lot of DB space. To learn how to clean up obsolete drill-through tables, see Cleaning up Drillthrough DB Tables (see page 58).

Foreign Keys Tab

Foreign Key analysis can be performed when there are two or more different inputs connected to the same Profiling step. Adding several inputs to the Profiling step is described in Input Category (see page 26) below.

To create a new Foreign Key analysis:

1. Click the + button at the bottom of the list, similar to creating a new Mask analysis.

2. Enter the names of the inputs to analyze in the Left Input Name and Right Input Name fields. Content-assist (Ctrl + Space) can be used to list the inputs of the step.

3. Enter the columns from each input to analyze in the Components section.


Foreign Key Analysis Configuration

Reading foreign key analysis results is discussed in Foreign Keys (see page 51).

4.10.2 Input Category

The Input category contains settings that are specific to each Profiling step input. To add a new input to the Profiling step, click the add button above the category list. This will add a new connection point to the step icon in the Plan Editor so that a new input (e.g. Text File Reader or JDBC Reader step) can be connected. After this, every additional input should be configured at least in terms of the data to profile (Data tab) and optionally in the other tabs, as described in the sections below.

The adjacent button can be used to delete an input.


Multiple Profiling Inputs

Data Tab

The Data tab shows all the data that will be profiled. It also contains the individual settings for each column to be profiled. By default, the settings defined in the Create Profile wizard are applied to all columns, but this tab allows configuring each column separately.

Data Tab

The Data tab has the following configuration possibilities:

Expression - the name of the column or expression to profile

Type - the data type defined in the originating input step (e.g. Text File Reader); for reference only. To change the data type of a column, use the Metadata Editor or input step.


Mask - specifies which mask to use for the analysis (choose from those defined in the Masks tab under the General category)

Analyze Domain - specifies whether the Profiling step should output the domain for the given column

Standard Stats - specifies whether the Profiling step should output the standard statistics, like Minimum value, Maximum value, Extremes, etc. (the statistics under the Basics tab of profiling results)

Frequency - specifies settings for frequency analysis of values

None - do not perform frequency analysis

Discrete - perform frequency analysis without any range limitations

Range - frequency analysis will be performed on the first n characters; the value of n is specified in this field.

Group Size - specifies settings for group size analysis (the analysis of the distribution of value duplication intensity: how many values have one, two, three and so forth duplicates). Group size of 1 represents unique values.

None - do not perform group size analysis

Discrete - perform group size analysis without any range or interval limitations

Range

From - the beginning value for the first bucket

To - the beginning value for the last bucket: [<to_value>; infinity)

Steps - the number of buckets

Interval - sets bucket width

Locale - specifies the locale for the given column; if left empty, the default locale is used

Comment - free text describing the data

When connected to an incoming step (such as Text File Reader), the Fill Columns… button can be used to automatically add the columns that are connected to the input, rather than manually typing the incoming column names. The Type column uses the types defined in the originating input step (e.g. Text File Reader) and is for reference only. To change the data type of a column, use the Metadata Editor or input step.
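To make the Frequency and Group Size settings more concrete, here is a minimal Python sketch of what the two analyses count. The function names `frequency` and `group_sizes` are hypothetical, and the sketch covers only the Discrete mode and the "first n characters" Range mode of frequency analysis; the bucketed group size settings (From, To, Steps, Interval) are not modeled.

```python
from collections import Counter


def frequency(values, first_n=None):
    """Count how often each value occurs.

    With first_n set, only the first n characters of each value are
    counted, mirroring the Range setting of frequency analysis.
    """
    keys = (v[:first_n] if first_n is not None else v for v in values)
    return Counter(keys)


def group_sizes(values):
    """Map each group size to the number of distinct values of that size.

    A group size of 1 means the value is unique, 2 means it occurs
    twice, and so on.
    """
    return Counter(Counter(values).values())


cities = ["Prague", "Prague", "Paris", "Pisa", "Oslo", "Oslo", "Oslo"]

print(frequency(cities))             # discrete frequency of full values
print(frequency(cities, first_n=1))  # frequency of the first character only
print(group_sizes(cities))           # two unique values, one pair, one triple
```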

Dependencies Tab

The Dependencies tab can be used to define an analysis to test the dependency of the fields in one column on the fields in other columns, such as whether birth number is related in some way to birth date.

To add a new dependency, click +. To delete a dependency, click -.


Dependency Configuration Dialog

A dependency configuration consists of four parts:

Name - identifier of the dependency

Determinant - one or several expressions (or columns) that define the key

Dependants - one or several expressions (or columns) for which the dependency is tested (these are not combined)

Threshold - the most frequent dependent value (the mode) has to occur in at least [threshold] percent of records with a given determinant for the group to be considered dependent

The table below demonstrates several dependency scenarios:

| # | Determinant | Dependant | Notes |
|---|---|---|---|
| 1 | 123456 | John Smith | Rows 1-10: the whole group of records with Determinant value '123456' is considered dependent only if Threshold is set to 80% or lower (only 8 of 10 records with this Determinant value have the same Dependant value) |
| 2 | 123456 | John Smith | |
| 3 | 123456 | John Smith | |
| 4 | 123456 | John Smith | |
| 5 | 123456 | John Smith | |
| 6 | 123456 | John Smith | |
| 7 | 123456 | John Smith | |
| 8 | 123456 | John Smith | |
| 9 | 123456 | Jill Barber | |
| 10 | 123456 | Steve Goldberg | |
| 11 | 111111 | Laura Mitchel | Rows 11-12: violates the dependency if Threshold is set to more than 50 (only 50% of records with the same Determinant value have the same, most frequent, Dependant value) |
| 12 | 111111 | Miriam Green | |
| 13 | (null) | any value | Null Determinant |
| 14 | 222222 | Jim White | Unique Determinant value |
| 15 | 333333 | Jim Bush | Unique Determinant value |
| 16 | 444444 | Linda Smith | Rows 16-17: dependent |
| 17 | 444444 | Linda Smith | |
| 18 | 555555 | Berry Perry | Rows 18-20: dependent |
| 19 | 555555 | Berry Perry | |
| 20 | 555555 | Berry Perry | |
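
The threshold logic in the scenarios above can be illustrated with a short Python sketch. It is not the product's implementation; the function name `classify_dependency` and the labels it returns are hypothetical and simply mirror the notes in the table.

```python
from collections import Counter, defaultdict


def classify_dependency(records, threshold_pct):
    """Label each determinant group (labels are illustrative, not product output)."""
    groups = defaultdict(list)
    for determinant, dependant in records:
        groups[determinant].append(dependant)

    result = {}
    for determinant, dependants in groups.items():
        if determinant is None:
            result[determinant] = "null"
        elif len(dependants) == 1:
            result[determinant] = "unique"
        else:
            # Share of the most frequent dependant value (the mode) in this group.
            top_count = Counter(dependants).most_common(1)[0][1]
            share = 100.0 * top_count / len(dependants)
            result[determinant] = "dependent" if share >= threshold_pct else "violation"
    return result


# The 20 records from the scenario table.
records = (
    [("123456", "John Smith")] * 8
    + [("123456", "Jill Barber"), ("123456", "Steve Goldberg")]
    + [("111111", "Laura Mitchel"), ("111111", "Miriam Green")]
    + [(None, "any value"), ("222222", "Jim White"), ("333333", "Jim Bush")]
    + [("444444", "Linda Smith")] * 2
    + [("555555", "Berry Perry")] * 3
)

print(classify_dependency(records, threshold_pct=80))
```

With a threshold of 80, the '123456' group is reported as dependent (8 of 10 records share the same dependant) and the '111111' group as a violation (only 1 of 2).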

For more details on dependency analysis, go to the product Help (Help > Help Contents) and search for the Profiling step.

Reading dependency analysis results is discussed in Dependency Analysis (see page 55).


Roll Ups Tab

A Roll Up is a way to look at a specific subset of the profiled data. Entering a column name in the Roll Ups Expression field will create a list of separate profile analyses for each value of that column. For example, if the data contains a column of genders, creating a roll up of this column will allow viewing profile results for each gender value in that column, including empty rows, which might reveal new patterns in the data. When a roll up is defined, the Inputs and Roll Ups panel will be shown in the Profile Viewer.

Roll Ups Tab

Reading results per roll up is discussed in Inputs and Roll Ups (see page 43).

Business Rules Tab

A business rule is a Boolean expression that is evaluated and its results are presented in the Profile Viewer. Some examples are shown below.


Business Rules Tab

See the Expressions Help page for a description of available expressions and their usage. Go to Help > Help Contents and search for Expressions.

Reading business rule results is discussed in Business Rules (see page 54).

Primary Keys Tab

To analyze the uniqueness of a particular column and determine whether it is a primary key, add a new primary key analysis and enter the name of the column to analyze in the Expression field of the Components section. Several columns can be analyzed in conjunction, as seen in the figure below:


Configuring Primary Key Analysis

Reading primary key analysis results is discussed in Primary Keys (see page 50).
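
As a rough illustration of what a uniqueness test over one or several columns involves, the Python sketch below counts how often each key combination occurs. The function name `primary_key_stats` and the sample column names are hypothetical; the product's own analysis may report more detail.

```python
from collections import Counter


def primary_key_stats(rows, key_columns):
    """Check whether the combination of key_columns is unique across rows."""
    keys = [tuple(row[column] for column in key_columns) for row in rows]
    counts = Counter(keys)
    duplicated = {key: n for key, n in counts.items() if n > 1}
    return {"is_unique": not duplicated, "duplicated_keys": duplicated}


rows = [
    {"first_name": "John", "last_name": "Smith", "birth_date": "1980-01-01"},
    {"first_name": "John", "last_name": "Smith", "birth_date": "1975-06-30"},
    {"first_name": "Paul", "last_name": "Adams", "birth_date": "1980-01-01"},
]

# A single column is not unique here, but the three columns combined are.
print(primary_key_stats(rows, ["last_name"]))
print(primary_key_stats(rows, ["first_name", "last_name", "birth_date"]))
```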

4.10.3 Configuring Business Domain Analysis

Business domain analysis tries to determine the kind of data stored in the analyzed column in the business context, e.g., name, address, postal code, SWIFT code, etc. It is possible to configure two thresholds that determine how many domains will be displayed as "matched" in the resulting profile.

The settings are:

| Name | Default | Description |
|---|---|---|
| Strict Threshold | 25 | At least one domain has to match the percentage of records set here to be displayed in the results. |
| Loose Threshold | 20 | If at least one domain satisfies the Strict Threshold setting, other domains are shown in the results if the percentage of records that match them is at least the number set here. |

To change the default settings:

1. Switch the layout of the Profiling step.

2. Select the Business Domains node.

3. Change settings and save changes.

See Column Analyses (see page 44) for information on business domain analysis results.
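
The interplay of the two thresholds can be sketched in a few lines of Python. This is only an illustration of the rule as described above; the function name `matched_domains` and the input format (domain name mapped to the percentage of matching records) are assumptions.

```python
def matched_domains(match_pct, strict=25.0, loose=20.0):
    """Return the domains to display, given per-domain match percentages."""
    # Nothing is shown unless at least one domain reaches the strict threshold.
    if not any(pct >= strict for pct in match_pct.values()):
        return []
    # Once that holds, every domain at or above the loose threshold is shown.
    shown = [(name, pct) for name, pct in match_pct.items() if pct >= loose]
    return sorted(shown, key=lambda item: item[1], reverse=True)


print(matched_domains({"Gender": 79.21, "Name": 5.0}))
print(matched_domains({"Address": 24.0, "Name": 22.0}))
```

In the second call no domain reaches 25%, so the result is empty even though both domains exceed the loose threshold.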

4.11 DQA Steps and Functions

DQ Analyzer offers various steps and functions inside them for constructing plan files. The algorithms and logic used for creating a plan file will vary from project to project; an introduction to steps and functions is provided below.

4.11.1 Available Steps

Steps can perform many types of functions, such as transforming data, filtering and categorizing data, and reading data. Below is an overview of the steps available with DQ Analyzer.

A complete description of steps and their usage can be found in Product Help (Help > Help Contents in the main menu) under Steps.

| Step Name | Step Description |
|---|---|
| Column Assigner | Assigns the result of an expression to a column. |
| Condition | Directs the data flow (true -> right, false -> left). |
| Text File Reader | Reads data from a text file. |
| Fixed Width File Reader | Reads data from a fixed-width text file. |
| JDBC Reader | Reads data from a JDBC (database) data source. |
| Excel File Reader | Reads files created by Microsoft Excel. |
| Profiling | Comprehensive analysis written to a file (.profile). |
| Regex Matching | Parses the input string based on the capturing groups of a regular expression. |
| Trash | Discards data flow. |
| Union Same | Like an SQL table union, but applies only if the flows are exactly the same. |

Nearly 180 steps are available with the purchase of Data Quality Center (https://www.ataccama.com/products/dqc) or other commercial Ataccama products.

4.11.2 Using Functions

There are many functions available that can be used inside steps. Some of the common functions are listed below.

| Function | Description | Return Value(s) |
|---|---|---|
| matches | Full match: the whole input must match the regular expression | True/false |
| find | Partial match: the regular expression is found somewhere in the input string | True/false |
| substr | Gets a substring of the input string. Positions start at zero. | String |
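
The difference between a full match, a partial match, and zero-based substring positions can be tried out with the Python analogues below. These are not the DQ Analyzer functions themselves: the Python helpers merely borrow the names, and their argument order and the optional `length` parameter are assumptions made for illustration.

```python
import re


def matches(pattern, text):
    """Full match: the whole text must match the pattern."""
    return re.fullmatch(pattern, text) is not None


def find(pattern, text):
    """Partial match: the pattern may occur anywhere in the text."""
    return re.search(pattern, text) is not None


def substr(text, start, length=None):
    """Substring starting at a zero-based position."""
    return text[start:] if length is None else text[start:start + length]


print(matches(r"\d{3}", "a123"))  # False: the leading 'a' is not covered
print(find(r"\d{3}", "a123"))     # True: '123' occurs inside the string
print(substr("profile", 0, 3))    # 'pro': position 0 is the first character
```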

4.11.3 Using Regular Expressions

Ataccama products support regular expressions for pattern matching. Some of the basic regular expressions are listed below.

| Regular Expression | Matches |
|---|---|
| \d | Digit |
| [A-Z] | Uppercase letter |
| [a-z] | Lowercase letter |
| \s | Whitespace |
| . (dot) | Any character |
| ? | Once or not at all |
| + | Once or more |
| * | Zero or more times |
| {2,6} | At least 2 times, at most 6 times |
| ^ | Beginning of string |
| $ | End of string |

Two regular expressions and their uses are shown below.

Regular expression string: [A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]
Sample usage: Canadian postal code (e.g., A3A 9S9)

Regular expression string: (\d{3} \d{2} \d{4}|\d{9}|\d{3}-\d{2}-\d{4})
Sample usage: US Social Security Number (123 45 6789 or 123456789 or 123-45-6789)
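
Both sample expressions can be checked outside the product with Python's `re` module, as sketched below. The postal code pattern is written here as `[A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]`, and the helper name `is_full_match` is hypothetical; minor syntax differences between regular expression engines are possible.

```python
import re

POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z]\s?[0-9][A-Z][0-9]")
SSN = re.compile(r"(\d{3} \d{2} \d{4}|\d{9}|\d{3}-\d{2}-\d{4})")


def is_full_match(pattern, value):
    """True if the whole value matches the compiled pattern."""
    return pattern.fullmatch(value) is not None


for value in ["A3A 9S9", "A3A9S9", "a3a 9s9"]:
    print(value, is_full_match(POSTAL_CODE, value))

for value in ["123 45 6789", "123456789", "123-45-6789", "123-45 6789"]:
    print(value, is_full_match(SSN, value))
```

The lowercase postal code and the SSN with mixed separators are rejected, because neither pattern allows them.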

A full description of all functions and regular expressions is available in Product Help (Help > Help Contents) under Expressions.

Read how to debug functions and regular expressions in Debugging Expressions and Steps (see page 37).

4.12 Debugging Expressions and Steps

When working with functions and regular expressions, it is convenient to check their results on a small sample of data without the need to run the plan. This can be done by debugging an expression or a step as a whole.

Regular expressions or functions can be implemented and debugged in different parts of different steps, e.g., the Profiling step (in the Data or Business Rules tab), the Column Assigner step, or the Regex Matching step.

4.12.1 Debugging a Function (Expression)

Debugging a function is demonstrated using the example of Business Rules in the Profiling step.


Accessing the Expression Debugger

To debug the function (expression), click Debug..., after which you will see the following dialog:

Expression Debugger Dialog

You can edit the expression at the top, enter the data next to the column name, and see the result below after clicking Evaluate. The expression tree to the right of the Data sources section shows the evaluation/transformation steps the value goes through before reaching the final transformation. It also shows the result of each transformation in brackets.


4.12.2 Debugging a Regular Expression

Regular expressions can be debugged in the Regex Matching step.

To debug a regular expression, select it in the navigation panel on the left of the Properties of Regex Matching dialog and click Debug... next to the Pattern field.

Locating the Debug Button

You will see the following screen:


Regular Expression Debugging

Enter the text into the Input text field and press Evaluate. You will see a few substitution options and their results.

If you spot an error in your regular expression, you can edit it in the same dialog and repeat the evaluation procedure.

4.12.3 Debugging a Step

If one step contains several transformations (regular expressions or functions), it is useful to debug the step as a whole to see how the output column values are affected.

To do that, right-click on the step and select Debug. You will see the following window:


Step Debugger

The Step Debugger window is divided into two parts, in and out, each listing the same columns, even if some of them are not present in the source data. You can choose which columns you want to see in both the input and output parts by clicking the Filter Columns button (a small table icon, as pointed out in the figure above). This feature is shown below together with the Filter feature, which is useful for narrowing down the selection when dealing with many columns.

Selecting the Columns to Display


The difference from debugging a single expression is that you cannot edit any expressions in this dialog, but you can test your expressions on real data. To do this, right-click anywhere in the in part and select Import Data.... Then locate the file with input data and click OK. The table will be populated with the first 500 rows of data from the selected file.

The Import Data functionality is not available for database tables.

The imported data should have the same structure (number of columns and column names) as configured in the step, so for a successful import it is best to perform this action on the actual data that will be used in the step.


5 Reading a Data Quality Profile

After you have run a profiling plan or created a one-time profile (see Creating a Data Quality Profile (see page 8)), a file with a .profile extension will appear in the File Explorer (it will be immediately opened in the latter case). Double-clicking such a file will open the Profile Viewer.

5.1 The Profile Viewer

The Profile Viewer contains several tabs and windows, which are described in further sections. The data can be exported to XML or HTML format by using the button in the top right part of the Viewer.

Profile Reader

5.2 Inputs and Roll Ups

The Profiling Step may take any number of inputs, which are shown in this panel (if there is more than one input). Additionally, each input may have any number of "roll ups" assigned to it - ways of grouping the data by specific parameters. For example, roll ups could be used to view data profiles by gender, department or country.

See Input Category (see page 26) and Roll Ups Tab (see page 31) for information on adding multiple inputs and configuring roll ups.

For information on how to create a profile, see Creating a Data Quality Profile (see page 8).

5.3 Understanding The Profile Statistics

Depending on the configuration upon its creation, the profile will contain a set of statistics, which can be divided into two categories:

Column Analyses (see page 44) - learn how to read Basic, Frequency, Domain, Mask, Quantiles, and Group Frequency analyses

Advanced Analyses (see page 50) - learn how to read Primary Keys, Foreign Keys, Business Rules, and Dependency analyses

5.4 Column Analyses

The Column analyses tab presents statistical analyses and pattern information about the columns that have been profiled. Each column in the input data is listed as a row in the table, which presents information such as data type, value counts, and minimum/maximum values.

5.4.1 Basic Analyses

The Basic tab provides simple statistics about the data that has been profiled and shows a chart of duplicate and distinct data as a percentage of the whole.


Example of Basic Analyses Results

Interpreting Counts

The Counts table lists the following values:

Null: all records that are empty or have "Null" as their value.

Non-null: all records that are not empty or null (Duplicate + Distinct)

Duplicate: the number of values that are the same as other values in the list

Distinct: the number of non-null values that are different from each other (Non-unique + Unique)

Non-unique: the number of values that have at least one duplicate in the list

Unique: the number of values that have no duplicates

To illustrate the meaning of these values, take the following data as an example.

| Record No. | Value |
|---|---|
| 1 | John Smith |
| 2 | John Smith |
| 3 | Rebecca Davis |
| 4 | Paul Adams |
| 5 | (empty) |

The Counts table for this data would be as follows:

| Type | Count | Records | Explanation |
|---|---|---|---|
| Null | 1 | Record 5 | The last record is empty |
| Non-null | 4 | Records 1-4 | The first 4 records contain data |
| Duplicate | 1 | Record 2 | There is one duplicate of the John Smith record (Record 1) |
| Distinct | 3 | Records 1, 3, 4 | These records contain distinct values |
| Non-unique | 1 | Record 1 | John Smith has a duplicate record, so it is not unique |
| Unique | 2 | Records 3 and 4 | Rebecca Davis and Paul Adams appear only once in the list; they have no duplicates |
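
The relationships between these counts (Non-null = Duplicate + Distinct, Distinct = Non-unique + Unique) can be reproduced with a small Python sketch. The function name `basic_counts` is hypothetical, and treating `None` and the empty string as null is an assumption made for this example.

```python
from collections import Counter


def basic_counts(values):
    """Compute the Counts figures for a list of column values."""
    non_null = [v for v in values if v not in (None, "")]
    frequencies = Counter(non_null)
    distinct = len(frequencies)
    unique = sum(1 for n in frequencies.values() if n == 1)
    return {
        "null": len(values) - len(non_null),
        "non_null": len(non_null),
        "duplicate": len(non_null) - distinct,
        "distinct": distinct,
        "non_unique": distinct - unique,
        "unique": unique,
    }


sample = ["John Smith", "John Smith", "Rebecca Davis", "Paul Adams", ""]
print(basic_counts(sample))
```

For the five sample records this yields 1 null, 4 non-null, 1 duplicate, 3 distinct, 1 non-unique, and 2 unique values.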

5.4.2 Frequency Analysis

The Frequency Analysis tab shows the number of times each value in the data occurs (shown both as an absolute count and as a percentage of the whole).


Example of Frequency Analysis Results

5.4.3 Domain Analysis

This is an analysis to determine the likely type of the data in each column (whether the data is text, a number or a date, for example). The probable types are listed, along with exceptions (such as a text string found in a list of dates).

Example of Domain Analysis Results

5.4.4 Business Domain Analysis

Business domain analysis tries to determine the kind of data stored in the analyzed column in the business context, e.g., name, address, postal code, SWIFT code, etc.


The result shows all matched domains and their probabilities, i.e., the percentage of records that match each given domain. By default, at least 25% of records have to match one domain for that domain to appear in the list of matched domains. For any other domain to appear in the list, it needs to match at least 20% of records. These settings can be changed in the Profiling step. See Configuring the Profiling Step (see page 21).

In the example below, 79.21% of records match the Gender domain for the selected column. No other domains are shown since the percentage of records that they match is lower than the loose threshold.

Example of Business Domain Analysis Results

5.4.5 Mask Analysis

The Mask Analysis tab shows the syntactic patterns of the data, i.e. the structure of the data rather than the content of the data. Codes ("masks") are used to describe these patterns. For example, the code "W" is used by default to represent a word (the number of letters required to make a word can be defined in the Profiling Step properties), while "L" is used to represent a letter.

This type of analysis can be useful when, for example, looking at a column of names, where one or two words are common, but single letters and numbers are not. Finding unexpected patterns in the data can provide information about the overall level of data quality.
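
To make the idea of a mask concrete, here is a minimal Python sketch that turns values into patterns. Only the codes "W" (word) and "L" (letter) come from the description above; the code "N" for a run of digits, the `min_word_length` parameter, and the function name `mask` are assumptions for illustration and may differ from the product's defaults.

```python
import re


def mask(value, min_word_length=2):
    """Replace runs of letters with W or L and runs of digits with N (assumed code)."""
    def letters_to_code(match):
        return "W" if len(match.group()) >= min_word_length else "L"

    # Letters are replaced first so that the codes themselves are not re-masked.
    masked = re.sub(r"[A-Za-z]+", letters_to_code, value)
    return re.sub(r"[0-9]+", "N", masked)


for name in ["John Smith", "J Smith", "John 42"]:
    print(repr(name), "->", repr(mask(name)))
```

In a column of names, "W W" would be the expected pattern, while "L W" or "W N" would stand out as candidates for review.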

Example of Mask Analysis Results


5.4.6 Quantiles

The Quantiles tab displays the data values that occur at designated intervals in the ordered data set. The first value in the list is at 0% and the last value is at 100%. The median value is at the 50% marker.

Example of Quantile Analysis Results

5.4.7 Group Frequency Analysis

The Groups tab shows the number of times that each non-null frequency count is repeated in the selected column. If all values are unique, the only group size will be 1, as there are no duplicate values. Each distinct number of occurrences forms its own group. In the example below, there are 130 464 values that have no duplicates, 2 033 values that occur twice, 73 values that occur three times, and so on.
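
Put differently, group frequency is a "count of counts". The Python sketch below shows the idea on a small list; the function name `group_frequency` is hypothetical, and treating `None` and the empty string as null is an assumption.

```python
from collections import Counter


def group_frequency(values):
    """Map each occurrence count to the number of distinct values having that count."""
    value_counts = Counter(v for v in values if v not in (None, ""))
    return dict(sorted(Counter(value_counts.values()).items()))


# 'a' occurs once, 'b' and 'c' twice, 'd' three times; the empty value is ignored.
print(group_frequency(["a", "b", "b", "c", "c", "d", "d", "d", ""]))
```

The result maps group size 1 to one value, group size 2 to two values, and group size 3 to one value.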

Example of Group Frequency Analysis Results


5.5 Advanced Analyses

This section explains how to read the results of advanced analyses: primary keys, foreign keys, business rules, and dependencies.

5.5.1 Primary Keys

The Primary Keys tab is shown in the main panel of the Profile Viewer if primary key analysis was configured in the Profiling Step. This tab shows whether the specified columns could be used as primary keys (their uniqueness is tested).

Locate the Primary Keys tab and select one of the primary keys to see its detailed results.

Primary Keys Tab Location in the Profile


Primary Key Analysis Results

See Primary Keys Tab (see page 32) for information on configuring this analysis.

5.5.2 Foreign Keys

Foreign Keys will be shown below the main input(s) in the Inputs and Roll Ups Section of the Profile Viewer if their analysis was configured in the Profiling step. The results of this analysis will show whether the key from one input can be considered a foreign key in relation to the other (parent) entity coming from a second input.

Locate the Foreign Keys node and select one of the previously defined foreign keys to see its detailed results.

Location of the Foreign Keys Node in the Profile

The first chart, a Venn diagram, shows the overall match between the left and right inputs.


The second diagram shows the results of matching, treating the left input as the parent and the right input as the child.

The third diagram shows the result of matching, treating the right input as the parent and the left input as the child.
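
The three views described above correspond to simple set comparisons between the key values of the two inputs. The Python sketch below is an illustration only; the function name `foreign_key_overlap` and the result keys are hypothetical, and the product's diagrams may count records rather than distinct key values.

```python
def foreign_key_overlap(left_keys, right_keys):
    """Compare the distinct key values of two inputs."""
    left, right = set(left_keys), set(right_keys)
    return {
        "both": left & right,               # overall match (Venn intersection)
        "left_only": left - right,          # left keys missing on the right
        "right_only": right - left,         # right keys missing on the left
        "right_is_child_of_left": right <= left,  # left as parent, right as child
        "left_is_child_of_right": left <= right,  # right as parent, left as child
    }


parent_ids = [1, 2, 3, 4]
child_parent_ids = [2, 3, 3]
print(foreign_key_overlap(parent_ids, child_parent_ids))
```

Here every key in the right input exists in the left input, so the right input could reference the left one as its parent, but not the other way round.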


Foreign Key Analysis Results

See Foreign Keys Tab (see page 25) for information on configuring this analysis.

5.5.3 Business Rules

When configured in the Profiling Step properties, this tab is shown and displays the results of the evaluation of any number of Boolean expressions relating to the input data. The example below (taken from the Advanced Profiling sample) shows a business rule that checks the length of each SIN number and tests whether it is 9 digits in length. It evaluates to true if the length is 9 digits and false otherwise.
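
The kind of rule described here can be mimicked in Python as a Boolean function applied to every record, with the true and false outcomes tallied. This is a sketch, not the product's expression syntax; the column name `src_sin`, the helper names, and the additional digits-only check are assumptions.

```python
def evaluate_rule(records, rule):
    """Apply a Boolean rule to every record and count the outcomes."""
    outcomes = [bool(rule(record)) for record in records]
    passed = sum(outcomes)
    return {"true": passed, "false": len(outcomes) - passed}


def sin_has_nine_digits(record):
    """True if the SIN consists of exactly 9 digits."""
    sin = record.get("src_sin") or ""
    return len(sin) == 9 and sin.isdigit()


records = [
    {"src_sin": "123456789"},
    {"src_sin": "12345678"},
    {"src_sin": None},
]
print(evaluate_rule(records, sin_has_nine_digits))
```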

Locate the Business Rules tab and select one of the defined business rules to see its detailed results.

Business Rules Tab Location in the Profile


Example of Business Rule Testing Results

See Business Rules Tab (see page 31) for information on configuring this analysis.

5.5.4 Dependency Analysis

If configured in the Profiling step, Dependency Analysis discovers whether values of Dependants (selected columns or expressions) depend on the value of a Determinant (one or more columns or expressions combined into a single key). Each group of records with the same determinant value is examined, and if the most frequent dependant value is present in at least a certain percent of records (defined upon configuration in the Threshold field), the whole group is considered to be dependent. Otherwise, the group is considered not to be dependent.

Locate the Dependency tab for the given input or roll up and select one of the previously defined dependency analyses to see its detailed results.

Dependency Tab Location in the Profile

Results of dependency analysis present records divided into the following groups:

Null - records whose determinant value is null

Violations - records for which the dependency level is lower than the threshold

Dependencies - records for which the dependency level is equal to or higher than the threshold, further divided into:

    Trivial - records which have one dependant per determinant

    Non-trivial - records with two or more dependants per determinant

    Non-trivial null - records with two or more dependants per determinant, with dependant values always equal to null


Dependency Analysis Results

See Dependencies Tab (see page 28) for more information on the configuration and logic of dependency analysis.


6 Cleaning up Drillthrough DB Tables

Every time you run a plan which contains a Profiling Step with enabled Drill-through functionality, a table will be created in the specified database. These tables will have a PROF_ prefix in their names unless a different prefix has been specified in the Profiling Step Properties. These tables will also get a _0_x suffix, where x stands for the number of the input used in the Profiling Step, beginning with 0. For example, you might see a table with the following name: TEST.PROF_34C76810_0_1, which means that this table is used for the second input of the corresponding Profile. You will also see the same table with the _0_0 suffix, which corresponds to the first input used in the profile.

One more table that is always created is called PROF_META. This table stores the connections between profiles and drill-through tables in the database.
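
When inspecting such a schema by hand, the input number can be read off the table name. The Python sketch below does this for names following the _0_x pattern described above; the function name `input_index` is hypothetical, and the sketch assumes nothing about the naming beyond that suffix.

```python
import re

# Matches the trailing "_0_x" suffix, where x is the zero-based input number.
SUFFIX = re.compile(r"_0_(\d+)$")


def input_index(table_name):
    """Return the zero-based input number encoded in a drill-through table name."""
    match = SUFFIX.search(table_name)
    return int(match.group(1)) if match else None


print(input_index("TEST.PROF_34C76810_0_1"))  # 1, i.e. the second input
print(input_index("TEST.PROF_34C76810_0_0"))  # 0, i.e. the first input
print(input_index("PROF_META"))               # None, not a drill-through table
```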

Drill-through tables accumulate and are not automatically cleaned up when profiles are deleted, which, for some users, might lead to a lack of space.

To clean up unused tables, right-click the used database and select Drop Obsolete Profiles.... You will see the following screen:

Dropping Obsolete Profiles

You will see the profiles which no longer exist, the user that created them, the date of creation, and the corresponding table in the database, which will be dropped.

If you share the database schema with others, the dialog window above will treat the tables created by others as obsolete since you will not have the corresponding profiles in your DQ project. Make sure to drop only your own obsolete profiles.

As you can see in the figure above, the eeg1.profile profile has four tables associated with it (_0_(0...3) suffix), which will all be deleted. The other two profiles have one table each in the database (_0_0 suffix). Select the tables you want to drop (click Check All to select all) and click OK.


7 Working with Data Files

Existing files can be used, for example, as input data for a plan. Files can be added by dragging and dropping them from the file system to the desired project in the Navigator panel, or by copying them from their source folder to the desired project folder inside the workspace folder in the file system.

7.1 Opening Data Files

To open a data file, locate it in the File Explorer or Model Explorer and double-click it. When a data file is opened for the first time (the file does not have the letter 'M' in the bottom right corner), the Metadata Editor is launched.

A Data File without Metadata in the File Explorer

7.1.1 Editing Metadata

Metadata describes the format of the data and needs to be assigned for the file (primarily CSV, TXT, and XLS(X)) to be properly used in plans. To edit metadata, right-click on the file and select Edit Metadata....


Advanced Metadata Settings

In Section 1 of the initial Format tab, the rules for reading and interpreting the raw data can be specified. The main ones are:

File type. Choose whether the file is delimited or fixed width.

Encoding. Specify the encoding.

Field Separator. Choose which symbol separates your data into fields (columns). The setting is applicable if the delimited File Type has been chosen.

Data starts at line. Specify which line the data starts at. The value is usually "2" for data with field names or "1" for data without.

Section 2 of the Edit File Metadata dialog shows the columns resulting from the specifications in Section 1 and lets you change their type and specify their format where applicable. For example, if you choose the integer or long data type, you will be able to choose the thousands separator in the Format column.

Section 3 shows the final data in the Result data tab so that you can compare it to the original data in the Decoded original data tab.
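
The effect of the Field Separator and Data starts at line settings can be pictured with the following Python sketch, which reads delimited text in a comparable way. It is an analogy rather than the product's parser; the function name `read_delimited` and its parameters are hypothetical. When reading from a real file, the encoding would be passed to `open()`, e.g. `open(path, encoding="utf-8", newline="")`.

```python
import csv
import io


def read_delimited(text_stream, field_separator=",", data_starts_at_line=2):
    """Split delimited text into an optional header row and data rows."""
    rows = list(csv.reader(text_stream, delimiter=field_separator))
    first_data_index = data_starts_at_line - 1  # line numbers are 1-based
    # The line just before the data, if any, is taken as the field names.
    header = rows[first_data_index - 1] if first_data_index >= 1 else None
    return header, rows[first_data_index:]


raw = "name;age\nJohn Smith;42\nRebecca Davis;37\n"
header, data = read_delimited(io.StringIO(raw), field_separator=";", data_starts_at_line=2)
print(header)
print(data)
```

With "Data starts at line" set to 1 instead, the first line would be treated as data and no field names would be read.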


7.2 Viewing Data Files

Double-clicking a text file will open it for viewing in the CSV Viewer. The CSV Viewer displays the data in rows and columns, as defined in the file metadata.

CSV Viewer

7.2.1 Sorting

To sort a column, click the name of the column in the header row. Clicking once will sort the data in ascending order (i.e. smallest-to-largest/A-to-Z), indicated by an up arrow. Clicking again will sort the data in descending order, indicated by a down arrow. Clicking a third time will remove all sorting and revert to the original ordering of the data, indicated by no arrow.

7.2.2 Filtering

To show only a subset of data, click the Switch View button in the toolbar as shown below:

Current View

This will open the View Settings dialog, which contains a tab named Filter.


Creating Data Filters

To define a filter, click Add. Use the drop-down controls to select a column to filter and a condition to apply (e.g. =, <, contains). Then specify the matching criteria. It is also possible to specify whether the filter should be case sensitive or not. Only rows matching the filter criteria will be displayed. Multiple filters can be defined to further refine the data that is shown. To disable a filter without deleting it, uncheck the Enabled checkbox.

7.2.3 Data Coloring

By default, all data will be shown in black text on a white background using the default font settings. However, rules can be configured so that certain data values or ranges are colored or formatted differently. This can be useful for visually scanning for key values in a large data file. The conditions are defined similarly to Filters, but there are additional options for coloring and text formatting (available via the Coloring column):

Background. Defines the highlighting to apply to the cells

Foreground. Defines the text color

Bold. Tick to make the text bold

Italic. Tick to italicize the text

Additionally, there is the option to define whether the coloring rule should be applied only to the specific cell which matches the coloring rule or to all (or any subset) of the columns (in the Apply to column).


Defining the Coloring Rules

After the two coloring rules above have been defined and applied, the data in the CSV Viewer will look like this:

CSV Viewer with Coloring Applied

Coloring rules will be applied in the order in which they appear. For example, as visible in the screenshot above, the last line contains a cell with "SIN" in the src_sin column, but the text inside is not painted red as has been defined in the Colors tab of View Settings. This is because this rule is followed by the src_birth_date >= 'NULL' rule. The order can be changed using the buttons on the right, below the Add button. If we change the order of the two rules above, the src_sin column in the last row now has a red font color instead of black (see below). Since the background color was not defined for this rule, it is inherited from the other one and stays yellow.
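
The ordering behavior can be modeled as rules that are applied one after another, each overriding only the attributes it defines. The Python sketch below is a simplified model under that assumption; the function name `apply_coloring`, the exact rule conditions, and the assumption that the birth date rule sets a black foreground explicitly are all hypothetical.

```python
def apply_coloring(row, rules):
    """Apply matching rules in order; later rules override earlier attributes."""
    style = {"foreground": "black", "background": "white"}
    for condition, attributes in rules:
        if condition(row):
            style.update(attributes)
    return style


# Hypothetical stand-ins for the two rules discussed above.
sin_rule = (lambda row: row["src_sin"] == "SIN", {"foreground": "red"})
birth_rule = (
    lambda row: row["src_birth_date"] >= "NULL",
    {"foreground": "black", "background": "yellow"},
)

last_row = {"src_sin": "SIN", "src_birth_date": "NULL"}
print(apply_coloring(last_row, [sin_rule, birth_rule]))  # black text on yellow
print(apply_coloring(last_row, [birth_rule, sin_rule]))  # red text on yellow
```

In both orders the background stays yellow, because only the birth date rule defines a background color.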


CSV Viewer with Coloring Applied and Coloring Rules Order Switched

7.2.4 Column Visibility

For data files with many columns, it may be useful to hide certain columns to focus on specific data. This can be done in the Columns tab of the View Settings dialog. Uncheck a column to hide it from view. When columns are hidden, a note appears at the bottom of the CSV Viewer indicating the number of columns that are hidden and providing a quick link to show them all. A column can also be hidden by right-clicking on it and selecting Hide Column.

7.2.5 Color-coding Column Headers

Many data files use standardized naming conventions to group similar columns. The View Settings dialog allows specifying different colors for column headers based on all or part of the column name. In the Heading tab, a column mask can be defined (e.g. "src*"), which will color all headings whose names start with the specified text. A different background color can be set for each mask that is used.
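
A mask such as "src*" behaves like a wildcard pattern. The Python sketch below uses `fnmatch` to show which headers a set of masks would select; the function name `header_color`, the first-match-wins order, and the sample masks and colors are assumptions, and the product's mask syntax may support more than the `*` wildcard shown here.

```python
from fnmatch import fnmatchcase


def header_color(column_name, mask_colors, default=None):
    """Return the color of the first mask that matches the column name."""
    for mask, color in mask_colors:
        if fnmatchcase(column_name, mask):
            return color
    return default


mask_colors = [("src*", "yellow"), ("std*", "green")]
for column in ["src_sin", "src_gender", "std_name", "id"]:
    print(column, header_color(column, mask_colors))
```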


Defining Coloring Rules for Headings

The following figure displays the outcome of the settings above.

Example of Colored Headers

7.2.6 Resizing Columns

Column widths can be resized by dragging left or right when the mouse is placed over the column divider in the column header. Columns can be automatically sized to fit their contents by double-clicking the column divider.

Resizing Columns


Additionally, right-clicking on a column header will bring up a menu which offers, among other functions, the ability to "autofit" the selected column or all columns.

7.2.7 Mark Groups

Another feature available in the column header context menu is the ability to visually mark changes (or identify groups) in data. This can be useful for scanning a specific column visually for changes in the data.

Data with Marked Groups in the Column src_gender

The column whose groups are marked is indicated by an icon showing three parallel horizontal lines next to the column name (as shown above).

7.2.8 Saving Views

To preserve the view settings (including sorting, column widths, and marked groups) for later use, click the View Settings drop-down arrow to open the options menu.


View Options

Click Save As… (for creating a new saved view) or Save (for saving changes to an existing view) to store view settings. A list of recently used views will be shown at the top of the menu. A complete list of all saved views is shown in a submenu at the bottom labeled All Saved Views. An asterisk (*) next to the view name indicates that unsaved changes have been made to the current view. To delete a view, select Manage Views…. It is also possible to import and export views for use with other copies of the product using this menu option.

The view called "Default" is a stored view with no settings applied. It cannot be changed or deleted. When changes have been made to the default view, the toolbar button label changes to <custom>, which indicates an unsaved view based on the default view. Click Save As… to name and store the new view. The Edit… option is the same as clicking the toolbar button with the view name, which opens the View Settings dialog.

7.3 Editing Data Files

Besides viewing, it is also possible to edit .txt and .csv files directly in the Ataccama IDE. Excel files can be edited only in MS Excel.

To edit a data file, right-click it and select Open With > Text Editor.


Opening a Data File in the Text Editor

The file will open as plain text (instead of a table) without its metadata applied.

Text Editor

You can also select Open With > System Editor to open the file in the default editor in your operating system.


7.4 Using Data Files in a Plan

The easiest way to add a data file into a plan is by dragging it from the explorer panel directly onto the canvas, where a corresponding step will be generated (e.g., Text File Reader or Excel File Reader).

Dragging a Text File to the Canvas

Text File Reader Appearing on the Canvas


8 Working with Databases

All Ataccama solutions use databases in one way or another. They can be used as data sources and repositories for storing data. Profiles can be created from data stored in databases too.

8.1 Installing Database Connectivity Drivers

The Ataccama IDE uses the Java Database Connectivity (JDBC) API for connecting to databases. JDBC drivers are available for most database engines and are distributed as a component of the database engine or separately as connectivity components. The licensing terms do not always allow distribution of these drivers with Ataccama products. Therefore only a basic set of drivers for the most common databases is shipped. You may download and install additional drivers on your own.

The following drivers are shipped with the product and are installed automatically when the IDE is started for the first time:

Apache Derby. JDBC driver for the Apache Derby database.

HSQLDB. JDBC driver for the HSQLDB database.

IBM DB2. JDBC driver for the IBM DB2 database.

jTDS. jTDS open-source driver for connecting to both MS SQL and Sybase servers.

Oracle. JDBC driver version 11g for Oracle databases.

Microsoft SQL Server. JDBC driver for Microsoft SQL Server 2000, SQL Server 2005, and SQL Server 2008. Requires JRE version 1.6.

PostgreSQL. JDBC driver for the PostgreSQL database. Supports PostgreSQL 7.3 or newer.

Teradata. JDBC driver version 13.10.00.18 for Teradata databases.

Additional drivers may be installed with the following procedure:

1. Download the required driver (usually an archive).

2. Extract the archive to a desired location.

3. Go to Window > Preferences > [Your Product] > Database.

4. Add or edit a database driver:

For a pre-configured driver that is already in the list (e.g., MySQL), double-click it.

To add a new database driver, click Add... and provide the name of the new database.

5. Click Add to Classpath... and locate the driver .jar file in the extraction folder.

6. Click OK to finish.

7. The driver will have a green dot and YES in the Configured column of the table.
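
Before clicking Add to Classpath..., it can help to check which .jar files the extracted archive actually contains. The following is a minimal Python sketch and is not part of the product; the function name find_driver_jars and the example folder layout are assumptions made for illustration only.

```python
from pathlib import Path


def find_driver_jars(extraction_folder):
    """Return the .jar files found anywhere under the extraction folder, sorted by path."""
    root = Path(extraction_folder)
    # rglob also walks subdirectories, since archives often nest jars in folders such as lib/.
    return sorted(p for p in root.rglob("*.jar") if p.is_file())
```

Calling find_driver_jars on the extraction folder returns a list of paths, any of which can then be selected in the Add to Classpath... dialog.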

8.2 Connecting to a Database

To connect to a database, right-click the Databases node in the File Explorer and select New > Database Connection.

Creating a Database Connection

This will invoke the following dialog:


Configuring a Database Connection

A detailed explanation of the New Database Connection parameters is provided below.

Name Required Description

Database type Yes Specifies the database connection type.

Connection name Yes Defines a name for the new database connection.

Host Yes Specifies the host name or IP of the database host.

Port Yes Specifies the port on which the database server accepts connections.

Database name | Catalog Yes Specifies the name of the database instance located on the host.

Username No The username to connect to the database.

Password No The password to connect to the database.
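
Because the IDE connects through JDBC, the Host, Port, and Database name values correspond to the parts of a JDBC connection URL. The Python sketch below shows the URL shapes commonly documented for three of the drivers listed in section 8.1. Whether the IDE assembles exactly these strings is an assumption; the function name jdbc_url and the dictionary keys are hypothetical labels, not values from the Database type list.

```python
# URL templates as commonly documented by the respective JDBC drivers.
# The keys are hypothetical labels chosen for this example.
JDBC_URL_TEMPLATES = {
    "postgresql": "jdbc:postgresql://{host}:{port}/{database}",
    "jtds-sqlserver": "jdbc:jtds:sqlserver://{host}:{port}/{database}",
    "mssql": "jdbc:sqlserver://{host}:{port};databaseName={database}",
}


def jdbc_url(db_type, host, port, database):
    """Build a JDBC URL from the host, port, and database name of a connection."""
    try:
        template = JDBC_URL_TEMPLATES[db_type]
    except KeyError:
        raise ValueError(f"no URL template for database type {db_type!r}") from None
    return template.format(host=host, port=port, database=database)


print(jdbc_url("postgresql", "localhost", 5432, "sales"))
# jdbc:postgresql://localhost:5432/sales
```

The username and password are normally passed to the driver separately rather than embedded in the URL.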

After the database connection has been created, the database will be shown in the Databases node in the explorer panel. Right-click the connection name and select Connect to establish a connection.

Connecting a Database

8.3 Viewing Table Properties

Clicking the table name shows its metadata in the Properties tab.


Table Properties in the Status Panel

8.4 Executing SQL Queries

To view the results of an SQL query on a table, double-click the table.

SQL Editor Mode

A default query will be shown, listing all table entries (grouped in batches if the number of rows is large). To change the query, edit the query text and click Execute. To retrieve more results from the query, click Next batch or Read Rest (to show all results).
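
Next batch and Read Rest follow the usual pattern of reading a query result incrementally. As an analogy only (this is not how the IDE is implemented), the same idea can be sketched with Python's built-in sqlite3 module. The table name customers and the batch size of 3 are made up for the example.

```python
import sqlite3

# In-memory database with a hypothetical table of 8 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name) VALUES (?, ?)",
    [(i, f"customer_{i}") for i in range(1, 9)],
)

cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")

first_batch = cursor.fetchmany(3)  # first batch shown after the query is executed
next_batch = cursor.fetchmany(3)   # comparable to clicking Next batch
rest = cursor.fetchall()           # comparable to clicking Read Rest

print(len(first_batch), len(next_batch), len(rest))  # 3 3 2
```

Rows that have not been requested yet are never transferred, which is why batching keeps large tables responsive.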


Refer to the documentation for the JDBC Reader step to learn how to use data from a database inside a Plan file.


9 Tips and Tricks

Cheat Sheets

IDE Shortcuts

Mouse Gestures in the Plan Editor

9.1 Cheat Sheets

The following materials cover standard and proprietary languages for interacting with data:

Regular Expressions - regular expressions let users extract information or parse a string into multiple elements. The regex engine is based on java.util.regex, the standard Java regular expression package.

Commonly Used Functions - the rule language includes a set of built-in functions for the most common data transformations and algorithms in the data quality realm. The language is designed to be approachable for newcomers with varied backgrounds and can be easily extended.

9.1.1 Commonly Used Functions

Basic String Functions

Function Description

left(srcStr, count) Returns a new string that is a substring of srcStr containing count chars taken from the start of srcStr.

right(srcStr, count) Returns a new string that is a substring of srcStr containing count chars taken from the end of srcStr.

substr(srcStr, beginIndex), substr(srcStr, beginIndex, strLen) Returns a new string that is a substring of the string srcStr.

transliterate(srcStr, charsFrom, charsTo) Transforms characters of the string srcStr.

containsWord(srcStr, srcWord) Searches for the occurrence of the word srcWord in the string srcStr.

indexOf(srcStr, subStr), indexOf(srcStr, subStr, fromIndex) Returns the index within the string srcStr of the first occurrence of the specified substring subStr.

lower(srcStr) Transforms all characters of the string srcStr to lower case.

upper(srcStr) Transforms all characters of the string srcStr to upper case.

capitalize(srcStr) Transforms the first character of each word to upper case and all following characters to lower case.

replicate(srcStr, n) Returns n copies of the string srcStr concatenated together without any separator.

replace(srcStr, what, withWhat) Replaces occurrences of the string what with the string withWhat in the string srcStr.
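
To make the descriptions concrete, here is a minimal Python sketch of what left, right, replicate, and capitalize compute. These are Python analogues written for illustration, not the product's expression language. Behavior for edge cases the table does not describe (negative counts, NULL inputs, non-space word separators) is an assumption.

```python
def left(src_str, count):
    """First `count` characters of src_str (assumes count >= 0)."""
    return src_str[:count]


def right(src_str, count):
    """Last `count` characters of src_str (assumes count >= 0)."""
    # src_str[-0:] would return the whole string, so zero is handled explicitly.
    return src_str[-count:] if count > 0 else ""


def replicate(src_str, n):
    """n copies of src_str joined without a separator."""
    return src_str * n


def capitalize(src_str):
    """Upper-case the first character of each space-separated word, lower-case the rest."""
    return " ".join(word[:1].upper() + word[1:].lower() for word in src_str.split(" "))


print(left("Ataccama", 3))             # Ata
print(right("Ataccama", 4))            # cama
print(replicate("ab", 3))              # ababab
print(capitalize("jOHN von NEUMANN"))  # John Von Neumann
```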

Regular Expression Functions


find(srcRegex, srcStr [, caseInsensitive]) Verifies whether the string srcStr or its parts match the regular expression srcRegex.

matches(srcRegex, srcStr [, caseInsensitive]) Verifies whether the string srcStr matches exactly the pattern of the regular expression srcRegex.

substituteAll(srcPattern, srcReplacement, srcStr [, caseInsensitiveFlag]) Replaces all occurrences of srcPattern in the string srcStr with srcReplacement.

substituteMany(srcPattern, srcReplacement, srcStr, srcVolume [, caseInsensitiveFlag]) Replaces occurrences of srcPattern in the string srcStr with srcReplacement; the maximum number of replacements is defined by the parameter srcVolume.
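
The difference between find (any part of the string may match) and matches (the whole string must match) is easiest to see side by side. The sketch below uses Python's re module as a stand-in for the Java-based engine, keeping the argument order of the table. The handling of a srcVolume of zero and the replacement-string syntax are assumptions, since the table does not specify them.

```python
import re


def _flags(case_insensitive):
    return re.IGNORECASE if case_insensitive else 0


def find(src_regex, src_str, case_insensitive=False):
    """True if any part of src_str matches src_regex."""
    return re.search(src_regex, src_str, _flags(case_insensitive)) is not None


def matches(src_regex, src_str, case_insensitive=False):
    """True only if the whole of src_str matches src_regex."""
    return re.fullmatch(src_regex, src_str, _flags(case_insensitive)) is not None


def substitute_all(src_pattern, src_replacement, src_str, case_insensitive=False):
    """Replace every match of src_pattern in src_str."""
    return re.sub(src_pattern, src_replacement, src_str, flags=_flags(case_insensitive))


def substitute_many(src_pattern, src_replacement, src_str, src_volume, case_insensitive=False):
    """Replace at most src_volume matches, scanning from the left."""
    if src_volume <= 0:
        # re.sub treats count=0 as "replace all", so zero is handled here (assumed semantics).
        return src_str
    return re.sub(src_pattern, src_replacement, src_str,
                  count=src_volume, flags=_flags(case_insensitive))


print(find(r"\d+", "order 42"))                   # True
print(matches(r"\d+", "order 42"))                # False
print(substitute_all(r"\s+", " ", "a   b    c"))  # a b c
print(substitute_many("a", "_", "banana", 2))     # b_n_na
```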


Conversion Functions


toDate(expr, dateFormat[, dateLocale]) Returns the date specified in expression expr converted to date type DAY.

toString(expr, strFormat[, strLocale]) Converts the expression expr to a STRING value.

Datetime Functions


datePart(srcDate, fieldName) Returns the value of the field fieldName (YEAR, MONTH or DAY) of srcDate.

dateDiff(startDate, endDate, fieldName) Returns the difference between endDate and startDate expressed in fieldName units.

today() Returns the current date in type DAY.

dateAdd(srcDate, srcValue, fieldName) Adds the specified srcValue of the type specified by fieldName (YEAR, MONTH or DAY) to the srcDate.
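
The following Python sketch illustrates one plausible reading of datePart, dateDiff, and dateAdd. The table does not say how partial months are counted or how a day such as the 31st is handled when the target month is shorter. The choices below (difference of calendar fields, clamping to the last day of the month) are assumptions, not documented product behavior.

```python
import calendar
from datetime import date, timedelta


def date_part(src_date, field_name):
    """Value of the YEAR, MONTH or DAY field of src_date."""
    return {"YEAR": src_date.year, "MONTH": src_date.month, "DAY": src_date.day}[field_name]


def date_diff(start_date, end_date, field_name):
    """end_date minus start_date in field_name units (difference of calendar fields)."""
    if field_name == "DAY":
        return (end_date - start_date).days
    if field_name == "MONTH":
        return (end_date.year - start_date.year) * 12 + (end_date.month - start_date.month)
    if field_name == "YEAR":
        return end_date.year - start_date.year
    raise ValueError(f"unsupported field: {field_name}")


def date_add(src_date, src_value, field_name):
    """Add src_value YEAR, MONTH or DAY units to src_date."""
    if field_name == "DAY":
        return src_date + timedelta(days=src_value)
    if field_name not in ("MONTH", "YEAR"):
        raise ValueError(f"unsupported field: {field_name}")
    months = src_value * 12 if field_name == "YEAR" else src_value
    total = src_date.year * 12 + (src_date.month - 1) + months
    year, month = divmod(total, 12)
    month += 1
    # Clamp the day so that 31 January + 1 MONTH lands on the last day of February.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(src_date.day, last_day))


print(date_diff(date(2016, 12, 20), date(2017, 1, 5), "DAY"))  # 16
print(date_add(date(2016, 1, 31), 1, "MONTH"))                 # 2016-02-29
```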

String Distance Functions


hamming(srcStr1, srcStr2 [, caseInsensitive]) Returns the Hamming distance between strings srcStr1 and srcStr2.

levenshtein(srcStr1, srcStr2 [, maxDistance] [, caseInsensitive]) Returns the Levenshtein distance between strings srcStr1 and srcStr2.

editDistance(srcStr1, srcStr2 [, maxDistance] [, caseInsensitive]) Returns the edit distance between strings srcStr1 and srcStr2.

soundex(srcStr) Returns the soundex value of the parameter srcStr.

doubleMetaphone(srcStr), doubleMetaphone(srcStr, isAlternate) Encodes srcStr to a double metaphone primary string, or to a double metaphone secondary string if the parameter isAlternate is true.
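
For readers unfamiliar with these metrics, the sketch below gives textbook Python implementations of the Hamming and Levenshtein distances. They are not the product's implementations: the optional maxDistance and caseInsensitive parameters are omitted, and raising an error for strings of unequal length in hamming is an assumption, since the table does not say how that case is handled.

```python
def hamming(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("this sketch defines Hamming distance only for equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute (free if the characters are equal)
            ))
        previous = current
    return previous[-1]


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```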

Set Distance Functions


intersection(set1, set2, separator [, multiset]) Returns the cardinality of the intersection of sets.

difference(set1, set2, separator [, multiset] [, singularity]) Returns the cardinality of the difference of sets (set1 \ set2).

symmetricDifference(set1, set2, separator [, multiset] [, singularity]) Returns the cardinality of the symmetric difference of sets.

union(set1, set2, separator [, multiset]) Returns the cardinality of the union of sets.
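
Each set function takes two delimited strings, treats the tokens as set members, and returns a count. A minimal Python sketch of that idea follows. It ignores the optional multiset and singularity parameters and assumes tokens are compared exactly as written (no trimming or case folding), which the table does not confirm.

```python
def _to_set(value, separator):
    """Split a delimited string into a set of tokens; an empty string yields an empty set."""
    return set(value.split(separator)) if value else set()


def intersection(set1, set2, separator):
    return len(_to_set(set1, separator) & _to_set(set2, separator))


def difference(set1, set2, separator):
    return len(_to_set(set1, separator) - _to_set(set2, separator))


def symmetric_difference(set1, set2, separator):
    return len(_to_set(set1, separator) ^ _to_set(set2, separator))


def union(set1, set2, separator):
    return len(_to_set(set1, separator) | _to_set(set2, separator))


a, b = "red,green,blue", "green,blue,black"
print(intersection(a, b, ","), difference(a, b, ","),
      symmetric_difference(a, b, ","), union(a, b, ","))  # 2 1 2 4
```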

Conditionals


iif(ifExpr, trueExpr, elseExpr) Returns trueExpr if ifExpr is TRUE; otherwise returns elseExpr.

case(expr, exprValue[, expr, exprValue]...[, defaultExpr]) Returns the value of expression exprValue immediately following the first expression expr whose value is TRUE.

decode(decodeExpr, expr, exprValue[, expr, exprValue]...[, defaultExpr]) Returns the value of expression exprValue immediately following the first expression expr whose value is equal to decodeExpr.

nvl(expr[, expr]...) Returns the value of the first expression expr whose value is not NULL.
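
The argument layout of case and decode (alternating condition/value pairs with an optional trailing default) is easier to follow in code. The Python sketch below mirrors that layout. The name case_ carries a trailing underscore only to keep it distinct from Python's own case keyword, None stands in for NULL, and returning None when nothing matches and no default is given is an assumption. Unlike a real expression engine, Python evaluates all arguments before the call.

```python
def _split_pairs(args):
    """Separate (expr, value) pairs from an optional trailing default."""
    has_default = len(args) % 2 == 1
    default = args[-1] if has_default else None
    paired = args[:-1] if has_default else args
    return list(zip(paired[0::2], paired[1::2])), default


def iif(if_expr, true_expr, else_expr):
    return true_expr if if_expr else else_expr


def case_(*args):
    """case_(expr, value[, expr, value]...[, default]): value after the first true expr."""
    pairs, default = _split_pairs(args)
    for expr, value in pairs:
        if expr:
            return value
    return default


def decode(decode_expr, *args):
    """decode(x, expr, value[, expr, value]...[, default]): value after the first expr equal to x."""
    pairs, default = _split_pairs(args)
    for expr, value in pairs:
        if expr == decode_expr:
            return value
    return default


def nvl(*exprs):
    """First argument that is not None, or None if there is none."""
    return next((e for e in exprs if e is not None), None)


age = 17
print(case_(age < 13, "child", age < 18, "teen", "adult"))                 # teen
print(decode("CZ", "US", "United States", "CZ", "Czech Republic", "n/a"))  # Czech Republic
print(nvl(None, None, "fallback"))                                         # fallback
```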


9.1.2 Regular Expressions

Metacharacters Defined

MChar Definition Pattern Sample Matches

^ Start of a string. ^abc abc, abcdefg, abc123, ...

$ End of a string. abc$ abc, endsinabc, 123abc, ...

. Any character (except \n newline). a.c abc, aac, acc, adc, aec, ...

| Alternation. bill|ted ted, bill

{...} Explicit quantifier notation. ab{2}c abbc

[...] Explicit set of characters to match. a[bB]c abc, aBc

(...) Logical grouping of part of an expression. (abc){2} abcabc

* 0 or more of previous expressions. ab*c ac, abc, abbc, abbbc, ...

+ 1 or more of previous expressions. ab+c abc, abbc, abbbc, ...

? 0 or 1 of previous expressions; also forces minimal matching when an expression might match several strings within a search string. ab?c ac, abc

\ Preceding one of the above, it makes it a literal instead of a special character. Preceding a special matching character, see below. a\sc a c
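
The sample matches in the metacharacter table can be checked mechanically. The sketch below does so with Python's re module, used here as a stand-in for java.util.regex; for these basic patterns the two engines are expected to agree, but that equivalence is an assumption of the example. It also shows the "minimal matching" role of ? after a quantifier.

```python
import re

# (pattern, sample strings from the metacharacter table)
SAMPLES = [
    (r"^abc", ["abc", "abcdefg", "abc123"]),
    (r"abc$", ["abc", "endsinabc", "123abc"]),
    (r"a.c", ["abc", "aac", "acc", "adc", "aec"]),
    (r"bill|ted", ["ted", "bill"]),
    (r"ab{2}c", ["abbc"]),
    (r"a[bB]c", ["abc", "aBc"]),
    (r"(abc){2}", ["abcabc"]),
    (r"ab*c", ["ac", "abc", "abbc", "abbbc"]),
    (r"ab+c", ["abc", "abbc", "abbbc"]),
    (r"ab?c", ["ac", "abc"]),
    (r"a\sc", ["a c"]),
]

# Collect every (pattern, sample) pair that unexpectedly fails to match.
unmatched = [(p, s) for p, samples in SAMPLES for s in samples if not re.search(p, s)]
print(unmatched)  # []

# "?" after a quantifier switches it from greedy to minimal (lazy) matching.
greedy = re.search(r"a.*c", "abcabc").group()
lazy = re.search(r"a.*?c", "abcabc").group()
print(greedy, lazy)  # abcabc abc
```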

Character Escapes

Escaped Char Description

ordinary characters Characters other than . $ ^ { [ ( | ) ] } * + ? \ match themselves.

\t Matches a tab \u0009.

\r Matches a carriage return \u000D.

\f Matches a form feed \u000C.

\n Matches a new line \u000A.

\040 Matches an ASCII character as an octal (up to three digits); numbers with no leading zero are backreferences if they have only one digit or if they correspond to a capturing group number. For example, the character \040 represents a space.

\x20 Matches an ASCII character using hexadecimal representation (exactly two digits).

\u0020 Matches a Unicode character using a hexadecimal representation (exactly four digits).

\* When followed by a character that is not recognized as an escaped character, matches that character. For example, \* is the same as \x2A.
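
The octal, hexadecimal, and Unicode escapes are interchangeable ways of naming the same character. The short Python sketch below demonstrates this for the space character and for a literal asterisk. Python's re module is used as a stand-in for the Java-based engine; it accepts the same escape forms shown in the table, but treating the two engines as equivalent here is an assumption.

```python
import re

# Four ways to write a pattern that matches exactly one space character.
space_patterns = [r"\040", r"\x20", r"\u0020", " "]
all_match_space = all(re.fullmatch(p, " ") for p in space_patterns)
print(all_match_space)  # True

# Control-character escapes and the code points they stand for.
control_escapes = {r"\t": "\u0009", r"\r": "\u000D", r"\f": "\u000C", r"\n": "\u000A"}
controls_ok = all(re.fullmatch(p, ch) for p, ch in control_escapes.items())
print(controls_ok)  # True

# A backslash turns a metacharacter into a literal: \* matches "*", like \x2A.
escaped = re.fullmatch(r"2\*3", "2*3") is not None
as_hex = re.fullmatch(r"2\x2A3", "2*3") is not None
unescaped = re.fullmatch(r"2*3", "2*3") is not None  # here * is a quantifier
print(escaped, as_hex, unescaped)  # True True False
```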


Character Classes

Char Class Description

. Matches any character except \n.

[aeiou] Matches any single character included in the specified set of characters.

[^aeiou] Matches any single character not in the specified set of characters.

[0-9a-fA-F] Use of a hyphen (-) allows specification of contiguous character ranges.

\w Matches any word character. \w is equivalent to [a-zA-Z_0-9].

\W Matches any non-word character. \W is equivalent to [^a-zA-Z_0-9].

\s Matches any white-space character. \s is equivalent to [ \f\n\r\t\v].

\S Matches any non-white-space character. \S is equivalent to [^ \f\n\r\t\v].

\d Matches any decimal digit.

\D Matches any non-digit.
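
A few short examples show how the character classes behave on real input. As before, this is a Python sketch using the re module in place of the Java-based engine. The re.ASCII flag is passed so that \w and \d are limited to the ASCII ranges given in the table, because Python's defaults are Unicode-aware. The sample strings are invented.

```python
import re

vowels = re.findall(r"[aeiou]", "profile")
non_vowels = re.findall(r"[^aeiou]", "profile")
hex_runs = re.findall(r"[0-9a-fA-F]+", "#7Fa3 #00ff")
words = re.findall(r"\w+", "data_quality, v10!", flags=re.ASCII)
digits = re.findall(r"\d+", "2016-12-20", flags=re.ASCII)
non_digits = re.findall(r"\D+", "2016-12-20", flags=re.ASCII)
fields = re.split(r"\s+", "a \t b\nc")

print(vowels)      # ['o', 'i', 'e']
print(non_vowels)  # ['p', 'r', 'f', 'l']
print(hex_runs)    # ['7Fa3', '00ff']
print(words)       # ['data_quality', 'v10']
print(digits)      # ['2016', '12', '20']
print(non_digits)  # ['-', '-']
print(fields)      # ['a', 'b', 'c']
```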

9.2 IDE Shortcuts

Below are Ataccama IDE shortcuts organized into categories by particular viewer or editor.

A full list of keyboard shortcuts can be found and called by pressing Ctrl+Shift+L at any time.

To edit shortcuts, go to Window > Preferences > General > Keys.

9.2.1 Data Viewer

Ctrl+F - Find - Opens Find dialog.

Ctrl+L - Show Columns - Opens a dialog where the user selects which columns should be visible and which columns should be hidden.

Ctrl+Home - Selects and reveals the top left cell.

Ctrl+End - Selects and reveals the bottom right cell.


9.2.2 SQL Editor

Ctrl+F - Find - Opens Find/Replace dialog if SQL query field has focus (from v8).

Ctrl+Enter - Execute - Executes current query.

Ctrl+R - Read Rest - Reads the rest of the query execution.

Ctrl+B - Next Batch - Reads the next batch of the query execution.

Esc - Stop - Stops the query execution.

9.2.3 Profile Viewer

Ctrl+Shift+D - Drill-through - Displays appropriate drill-through data.

9.2.4 Plan Editor

Search and Navigate

Ctrl+F - Find - Opens Find dialog.

Ctrl+G - Go to Step - Goes to step.

Ctrl+double click - Opens component definition (if selected step is a component). From v8 it also opens the file associated with the given step.

Edit

Insert, Ctrl+I - Insert Step - Inserts step.

Drop step to connection - Attaches step to a connection.

Alt+drop step to connection - Attaches step to connection and makes space for it.

Ctrl+drop step to connection - Attaches step to connection through Multiplicator.

Ctrl+Alt+drop step to connection - Attaches step to connection through Multiplicator and makes space for the Multiplicator.

Ctrl+Delete - Removes step from connection.

Ctrl+drag step out of connection - Detaches step from a connection.

Ctrl+Shift+drag step out of connection - Detaches step from connection and disables it.

Layout

Ctrl+Shift+H - Sets horizontal layout to selected step(s).

Ctrl+Shift+V - Sets vertical layout to selected step(s).

Ctrl+Alt+V - Distribute Widths - Distributes selected steps/includes horizontally.

Ctrl+Alt+H - Distribute Heights - Distributes selected steps/includes vertically.

Ctrl+Alt+Left - Align Left - Aligns selected steps/includes left.

Ctrl+Alt+Right - Align Right - Aligns selected steps/includes right.

Ctrl+Alt+Up - Align Top - Aligns selected steps/includes top.

Ctrl+Alt+Down - Align Bottom - Aligns selected steps/includes bottom.

Run

Ctrl+Shift+D - Debug - Debugs selected step.

Ctrl+F11 - Run - Runs the plan/component.

9.2.5 Progress Viewer

Ctrl+G - Go to Step - Goes to step.

Ctrl+double click - Opens component definition (if selected step is a component).

9.2.6 Common Shortcuts

This section contains some common shortcuts (not all these shortcuts are supported by all editors).

Ctrl+C - Copy

Ctrl+V - Paste

Ctrl+A - Select All

Ctrl+Z - Undo

Ctrl+Y - Redo

Ctrl+S - Save

Ctrl+P - Print

Shift+mouse wheel - Scrolls editor content horizontally.

Ctrl+mouse wheel - Zooms in/out editor content.

9.3 Mouse Gestures in the Plan Editor

Creating/reconnecting steps in DQC plans involves, in addition to choosing/positioning the steps, a lot of reconnecting. The aim of mouse gestures is to simplify this, thus improving the speed of creating and modifying plans.


9.3.1 Basic Concepts

The only connection that may change during any mouse gesture operation (except for the operation where some steps are deleted) is the connection onto which the step is being dragged/created/removed.

Unconnected endpoints are preferred for connecting. If none exist, endpoints that are capable of attaching a new connection are used.

When connecting a single unconnected step (insert or drag), the first endpoint from the left that matches the condition is used.

When the dragged step belongs to a subgraph, the whole subgraph is searched for endpoints for the output connection, and the first matching endpoint is used.

The subgraph is searched in depth-first preorder: first parents, then children; first the left branch, then the right; first the left node's endpoint, then the right.

Dragging a step onto a connection is not enabled when no input endpoint is found.
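
The depth-first preorder search described above can be sketched in a few lines of Python. The Step class, its fields, and the function name first_free_endpoint are hypothetical and exist only for this illustration; the editor's real data model is not documented here. The sketch also assumes the subgraph is a tree with no shared or cyclic connections.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Step:
    """Hypothetical stand-in for a plan step."""
    name: str
    free_endpoints: List[str] = field(default_factory=list)  # unconnected outputs, left to right
    children: List["Step"] = field(default_factory=list)     # downstream steps, left to right


def first_free_endpoint(step: Step) -> Optional[Tuple[str, str]]:
    """Depth-first preorder: the step itself first, then each child branch from left to right."""
    if step.free_endpoints:
        return step.name, step.free_endpoints[0]  # leftmost endpoint of this step
    for child in step.children:
        found = first_free_endpoint(child)
        if found is not None:
            return found
    return None


leaf = Step("leaf", free_endpoints=["out"])
left = Step("left", children=[leaf])
right = Step("right", free_endpoints=["out_left", "out_right"])
root = Step("root", children=[left, right])

print(first_free_endpoint(root))  # ('leaf', 'out')
```

In this example the left branch is exhausted before the right one is visited, so the endpoint on leaf wins even though right sits closer to the root.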

9.3.2 Individual Actions

Create Step/Reader/Component onto a Connection

By inserting a step or dragging data source/component definition onto a connection, both a new element and connections are created.


By additionally holding the Ctrl key during step creation, a "fork" is created with the Multiplicator step.

Drag Step/Reader/Component onto a Connection

By dragging an existing step onto a connection, the step will be seamlessly integrated into it.

By Ctrl-dragging, a "fork" is created with the Multiplicator step.

9.3.3 Drop Step on Another Step

By inserting a new step/component on an existing step with unconnected output, the new step is connected and positioned automatically.

9.3.4 Delete Step from Connection

Pressing Ctrl+Delete on the selected step deletes it while preserving the connection.

9.3.5 Drag Step out of Connection

When Ctrl+dragging a connected step, the step is detached from the connection. When Ctrl+Shift is pressed while dragging, the detached step will also be disabled.

9.3.6 Summary of Actions and Possible Key Modifiers

Action | No key modifier | CTRL pressed | SHIFT pressed | ALT pressed

Drag step on connection | connect step into the connection | make a fork and connect | - | shift nodes to make space

Create step on connection | create and connect to the existing connection | make a fork, create and connect | - | -

Drop step on step | create and connect with the existing step | - | - | -

Drag connected step | normal drag | detach from the connection | disable the detached step (together with CTRL) | -

Delete connected step | delete and break the connection | delete and preserve the connection | - | -
