

Enterprise Discovery Best Practices

© 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other company and product names may be trade names or trademarks of their respective owners and/or copyrighted materials of such owners.


Abstract

This article describes the best practice guidelines that you can follow when you perform enterprise discovery for different use cases. An enterprise discovery profile runs multiple data discovery tasks on many data sources and generates a consolidated summary of the profile results.

Supported Versions

• Data Quality 9.6.1

Table of Contents

Introduction
Sampling Options
Profile Functions
Conclusion

Introduction

Enterprise discovery is a process that finds column profile statistics, data domains, primary keys, and foreign keys in many data sources spread across multiple connections or schemas.

You can use enterprise discovery to solve use cases ranging from analyzing tables for specific properties to performing the entire data discovery analysis. Consider the cost impact and benefits before you extend the profile operation from a single table to a schema, especially because the tables in the schema might vary considerably in the number of columns and rows. To meet this challenge, choose the design-time and run-time parameters for enterprise discovery carefully.

Enterprise discovery has the following steps:

1. Column profile that discovers basic column statistics.

2. Data domain discovery that discovers the functional meaning or semantics of column data.

3. Primary key profile that discovers the key structure of a table.

4. Foreign key profile that discovers table relationships in the schema.

You can choose to run one or more steps when you run the enterprise discovery profile.

How you choose the right parameters for each step of the enterprise discovery depends on the use case. The two primary use cases are screening, where enterprise discovery derives summary profile statistics from a sample, and complete analysis, where the profile statistics come from the entire data set.

Screening Versus Complete Analysis

Use the screening use case to scan the tables in the schema for specific statistics and take the required actions. The most common example is to screen a schema for specific data domains or patterns. Generally, you set the parameter values to reduce the total time for the profile run, and in some use cases you apply aggressive sampling to prioritize profile run time over accuracy. Aggressive sampling is effective only when the data in each table is consistent. If this assumption of data consistency does not hold, you must run a column profile on the entire data set.

The complete analysis use case prioritizes complete and accurate information over the profile run time. One way to reduce the profile run time is to provide more hardware resources for profiling, such as additional cores or machines. This approach of adding more resources works because profiling is scalable.


Sampling Options

Profiling supports several sampling techniques with different trade-offs. Not all profiling steps support all the sampling options. The sampling options are as follows:

• Complete – Selects all the rows in the data set for analysis.

• First N Rows – Selects the first N rows of the data set, or the entire data set if the number of rows is fewer than N. This option is the fastest method because the Data Integration Service stops processing source rows after reading the first N rows from the source. However, if the data source changes over time, this option might not select a representative source sample.

• Random N Rows – Selects the specified number of rows randomly throughout the data set (see the sketch after this list). First, profiling determines the number of rows in the data set to compute the percentage of rows to sample. Determining the number of rows might be costly for data sources that do not track this metric. Then, profiling reads the entire data set and discards the rows that are not in the sample. This step generally takes more time than sampling the first N rows.

• Random Sample (Auto) – Uses the database to return a sample of rows based on more efficient sampling techniques that are specific to each database type. If a database does not support sampling push down, profiling uses a random sample of about 100,000 rows.
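The following is a minimal Python sketch, not Informatica's implementation, of the two-pass Random N Rows approach described above: count the rows, choose N random row positions, then stream the data set and keep only those positions.

    import random

    def random_n_rows(rows, total_rows, n):
        # Pass 2 of the two-pass approach: 'total_rows' comes from an
        # earlier count, 'rows' is an iterable that streams the data set.
        keep = set(random.sample(range(total_rows), min(n, total_rows)))
        # Read the entire data set, discarding rows not in the sample.
        return [row for i, row in enumerate(rows) if i in keep]

    # Example: sample 3 of 10 rows.
    print(random_n_rows(range(10), 10, 3))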

When you run an enterprise discovery profile, only the Complete and First N Rows sampling options are available at the EDD job level. The profile applies the selected option to all the data objects when it creates the EDD job. If you require the Random N Rows or the Random Sample (Auto) option, change the sampling option on each source object in the EDD profile. When the EDD profile runs, it picks up the manually changed sampling options.

The following table describes the sampling options that each profile type supports:

Sampling Option        Column Profile   Data Domain Discovery   Primary Key Profile   Foreign Key Profile   EDD Default
Complete               Yes              Yes                     -                     Yes                   Column profiling, foreign key profiling
First N Rows           Yes              Yes                     Yes                   -                     Data domain discovery, primary key profiling
Random N Rows          Yes              -                       -                     -                     -
Random Sample (Auto)   Yes              -                       -                     -                     -

Sample Size

After you decide to use a sampling option, the next logical question is what the right sample size is. If the data set contains the full population, you can use any of the available sample size calculators, which determine the random sample size based on the acceptable level of error.

You can select the number of sample rows based on the data set size and the target accuracy level (95% or 99%). A general guideline is to select 1000 rows if 5% error is acceptable and 17,000 rows for 1% error.
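The article does not name a specific calculator. One standard choice, shown here as an assumption rather than the article's method, is Cochran's formula for a random sample size:

    import math

    def sample_size(z, margin_of_error, population=None, p=0.5):
        # Cochran's formula: n0 = z^2 * p * (1 - p) / e^2.
        # z = 1.96 for 95% confidence, z = 2.576 for 99% confidence.
        n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
        if population is not None:
            # Finite population correction for small data sets.
            n0 = n0 / (1 + (n0 - 1) / population)
        return math.ceil(n0)

    # 99% confidence with 1% error: about 16,600 rows, in line with
    # the 17,000-row guideline above.
    print(sample_size(2.576, 0.01))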


When you use the First N Rows sampling technique, you can treat the result as a random sample if the row order is not correlated with the data values. In this case, the general recommendation of a 1000-row sample for 5% error and a 17,000-row sample for 1% error applies. Otherwise, use your best judgment to determine an appropriate sample size. Note that column profiling and data domain discovery are optimized for processing sample sizes of 100,000 rows or fewer.

Filtering

Filtering is another way to sample the data source. You must apply filtering to individual tables because filtering is specific to the columns in the table. Use filtering to select the rows that meet specific criteria or goals.

Note: It is outside the scope of this document to make filter recommendations.

Profile Functions

The specific recommendations for each profiling function within an enterprise discovery profile vary based on the screening and complete analysis use cases.

Data Domain Discovery

Data domain discovery uses data rules to discover the semantic content of a column. If a data value matches the data rule for a data domain, it increments the count of values that conform to the data domain. A null represents a missing value and does not add to the count because it gives no information about the data domain.
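A minimal sketch of this conformance calculation, assuming (as the text suggests, though it does not state the denominator) that nulls are excluded from both the conforming count and the denominator:

    def domain_conformance(values, matches_rule):
        # Nulls carry no information about the data domain, so they
        # are excluded from the conformance calculation.
        non_null = [v for v in values if v is not None]
        if not non_null:
            return 0.0
        matching = sum(1 for v in non_null if matches_rule(v))
        return 100.0 * matching / len(non_null)

    # Example: a hypothetical 5-digit ZIP code rule.
    zips = ["10001", "94105", None, "ABCDE"]
    print(domain_conformance(zips, lambda v: v.isdigit() and len(v) == 5))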

For most columns, the data is consistent throughout the table. In these situations, if a data domain matches the column, it matches all parts of the table, including the initial rows. Therefore, the recommendation is to sample the first 1000 rows. This recommendation also applies to the screening use case.

You might have a use case that mandates a data domain to be inferred if a single row matches its data rule. For these use cases, set the minimum conformance percent to zero and the sampling option to All Rows.

In the Analyst and Developer tools, you have a Verify option for data domain discovery. You can click Verify to get the counts of the conforming and non-conforming rows. You can also drill down into either set of rows. For interactive use cases, you can follow this approach with a small sample size for a quick and effective analysis. The small sample size provides for a quick profile. Then, use the Verify option only for those inferred data domains that require further investigation based on any unusual results.


Column Profiling

Most column profiling use cases require you to select all rows for the analysis because you need the exact row count of the data source and exact aggregate statistics. The general recommendation is to run a column profile on all rows of all the tables in the enterprise discovery profile.

There are a few use cases where you do not need to compute the aggregate statistics on the entire data source. These include the screening and interactive use cases. For screening, you might want to compute and analyze aggregates such as the null percentage, patterns, and data types. If the profile results for a column display unusual values, you can rerun the profile on that column against the entire data set.

For the interactive use case, you might know the approximate row count of a table. If the table has a high row count, instead of profiling all the rows, use a sample to identify the problems in the data set. After the initial inspection, you can decide whether to run a profile for the entire data set or not.

If you apply any of the sampling options, the recommendation is to use a sample size of 100,000 rows or fewer. This size ensures that the Data Integration Service (DIS) processes the sample entirely in one pass, which is faster than splitting the work between the DIS and the database.

Primary Key Profiling

Primary key profiling requires a sample of the data set because the algorithm consumes many resources when the profile runs on the entire data set. Even with aggressive sampling, the primary key profiling algorithm can run for a long time based on the complexity of the data. The general recommendation is to use a sample size equal to the square of the number of columns. For example, if the table has 50 columns, you can use a sample size of 2500 (50²).
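As a sketch of that guideline (the helper name is hypothetical):

    def pk_sample_size(num_columns, total_rows):
        # Guideline above: sample about (number of columns) squared
        # rows, never more than the table actually contains.
        return min(num_columns ** 2, total_rows)

    print(pk_sample_size(50, 1_000_000))  # 2500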

Primary key profile results might display too many candidate primary keys or false positives. You can reduce the false positives in the following ways:

• Increasing the number of rows reduces false positives because the extra rows are more likely to violate a false-positive key that a small sample might otherwise support. When you increase the number of rows, verify that there are enough resources to accommodate the extra computation that the algorithm requires. It is best to run primary key profiling first to get a baseline for the table before you increase the sample size.

• Increasing the Minimum Percent Conformance or decreasing the Maximum Violation Rows reduces false positives when the table has a strong primary key. The smaller violation threshold aggressively eliminates weak candidate keys.

• Decreasing the Max Key Columns might reduce the number of false positives by reducing the number of potential column combinations. When you allow more columns in a key, the probability that the source data supports a false-positive key increases. Follow this method if the schema does not contain any table with many primary key columns.

• Setting the Exclude data objects with parameter excludes data objects with documented, user-defined, or approved keys. You can use this parameter if the data source enforces primary keys.

When you review the primary key profile results, you can use the Verify option to get the exact conformance of the key. All duplicate rows, and rows whose key columns contain nulls, count toward the number of violating rows.
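A minimal sketch of that violation count, assuming duplicate rows and rows with null key columns each count in full:

    from collections import Counter

    def pk_violations(rows, key_columns):
        # Rows whose key contains a null violate the candidate key.
        nulls = sum(1 for r in rows if any(r[c] is None for c in key_columns))
        # Every row that shares a duplicated key value also violates it.
        keys = Counter(tuple(r[c] for c in key_columns) for r in rows
                       if all(r[c] is not None for c in key_columns))
        dups = sum(n for n in keys.values() if n > 1)
        return nulls + dups

    rows = [{"id": 1}, {"id": 1}, {"id": None}, {"id": 2}]
    print(pk_violations(rows, ["id"]))  # 3: two duplicates plus one null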

Foreign Key Profiling

Foreign key profiling does not require sampling because the profile uses all of the data for each source. When you run a foreign key profile for the first time, the profile computes the signatures, which is an expensive operation. Subsequent profile runs compute signatures only for new tables. Therefore, after the initial computation, foreign key profiles run faster because the signature computation is complete.

The exception to the reuse of signatures is when you change certain foreign key profile parameters: the data type classification, case sensitivity, or whitespace trimming. When you change any of these parameters, or when you select the Regenerate Signature option, the profile recomputes the signatures because the signatures take these parameters into account.
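Informatica does not document its signature algorithm. As an illustration of the general idea only, a MinHash-style signature summarizes a column's value set so that value overlap between columns can be estimated without re-reading the data; note how the case and trim parameters feed into the signature, which is why changing them forces recomputation:

    import hashlib

    def column_signature(values, k=64, case_sensitive=False, trim=True):
        # One minimum hash value per hash function; matching minimums
        # across two columns indicate overlapping value sets.
        sigs = [float("inf")] * k
        for v in values:
            if v is None:
                continue
            s = str(v).strip() if trim else str(v)
            if not case_sensitive:
                s = s.lower()
            for i in range(k):
                h = int(hashlib.md5(f"{i}:{s}".encode()).hexdigest(), 16)
                sigs[i] = min(sigs[i], h)
        return sigs

    def overlap_estimate(sig_a, sig_b):
        # Estimated Jaccard similarity of the two value sets.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)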


Conclusion

Enterprise discovery, by default, is configured for the screening use case. You can use sampling to reduce the overall cost of profiling a schema.

You can tune the default parameters to compute exact results for all the profiling functions. If you must use sampling options in profiles, you can verify specific results to compute the exact values.

Authors

Jeff Millman
Development Architect
