PowerPoint Presentation
download.microsoft.com/documents/hk/technet/techdays2013/Day …

Data Quality Services 101

Knowledge Base Driven Data Quality

Matching

Integration: DQS with MDS and SSIS

DQ Issues and DQ Dimensions

Before:

Name | Gender | Street | House # | Zip code | City | State | D.O.B
John Doe | Male | 60th street | 45 | | New York | New York | 08/12/64
Jane Doe | Male | Jonathan ln | 36 | 10023 | Poughkeepsy | NY | 21-dec-1954

After:

Name | Gender | Street | House # | Zip code | City | State | D.O.B
John Doe | Male | E 60th St | 45W | 10022 | New York | NY | 08/12/64
Jane Doe | Female | Jonathan Lane | 36 | 10023 | Poughkeepsie | NY | 12/21/54

Before:

Name | Address | Postal Code | City | State
John Smith | 545 S Valley View Drive # 136 | 34563 | Anytown | New York
Margaret & John smith | 545 Valley View ave unit 136 | 34563-2341 | Anytown | New York
Maggie Smith | 545 S Valley View Dr | | Anytown | New York
John Smith | 545 Valley Drive St. | 34253 | NY | NY

After:

Name | Address | Zip Code | City | State | Cluster
John Smith | 545 S Valley View Drive # 136 | 34563 | Anytown | New York | 1
Margaret & John smith | 545 Valley View ave unit 136 | 34563-2341 | Anytown | New York | 1
Maggie Smith | 545 S Valley View Dr | | Anytown | New York | 1
John Smith | 545 Valley Drive St. | 34253 | NY | NY | 2

DQ dimensions: Completeness, Accuracy, Conformity, Consistency, Uniqueness
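The first Before/After pair above can be approximated with a very small, hand-written knowledge base. The sketch below is only an illustration in Python, not DQS itself: the dictionaries `CORRECTIONS` and `STREET_TERMS`, the date formats, and the function names are all assumptions made for this example. It does not attempt the corrections that need reference data or name-based rules (the added "E", "45W", the missing zip code, or Jane's gender).

```python
from datetime import datetime

# Hypothetical mini knowledge base: known-wrong value -> corrected value, per domain.
CORRECTIONS = {
    "City": {"Poughkeepsy": "Poughkeepsie"},
    "State": {"New York": "NY"},
}

# Hypothetical term replacements applied to individual words of a street value.
STREET_TERMS = {"ln": "Lane", "street": "St"}

# Input date formats seen in the sample rows (assumed, not exhaustive).
DATE_FORMATS = ("%m/%d/%y", "%d-%b-%Y")


def standardize_date(value: str) -> str:
    """Rewrite a date in MM/DD/YY form; return it unchanged if no format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%m/%d/%y")
        except ValueError:
            continue
    return value


def standardize_street(value: str) -> str:
    """Replace known street terms word by word, ignoring case."""
    return " ".join(STREET_TERMS.get(tok.lower(), tok) for tok in value.split())


def cleanse(record: dict) -> dict:
    """Return a corrected copy of the record; the input is left untouched."""
    cleaned = dict(record)
    for domain, fixes in CORRECTIONS.items():
        if domain in cleaned:
            cleaned[domain] = fixes.get(cleaned[domain], cleaned[domain])
    if "Street" in cleaned:
        cleaned["Street"] = standardize_street(cleaned["Street"])
    if "D.O.B" in cleaned:
        cleaned["D.O.B"] = standardize_date(cleaned["D.O.B"])
    return cleaned


jane = {"Name": "Jane Doe", "Street": "Jonathan ln", "City": "Poughkeepsy",
        "State": "NY", "D.O.B": "21-dec-1954"}
print(cleanse(jane))
```

Keeping the corrections in data rather than in code mirrors the idea of a knowledge base: new known errors can be added without changing the cleansing logic.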

[Diagram labels: Build; Use; DQ Projects; Knowledge Management; Connect; Knowledge Base; Integrated Profiling]

Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, enrichment and standardization.

Matching: Identifying, linking or merging related entries within or across sets of data.

Profiling: Analysis of the data source to provide insight into the quality of the data and help to identify data quality issues.

Monitoring: Tracking and monitoring the state of Quality activities and Quality of Data.
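As a rough illustration of what profiling can report, the sketch below computes two simple per-column figures on the two sample rows from the cleansing example: the share of non-empty values (completeness) and the share of distinct values among the non-empty ones (uniqueness). These formulas and the function names are assumptions chosen for the example; the slides do not state how DQS calculates its profiling statistics.

```python
def completeness(rows: list, column: str) -> float:
    """Fraction of rows whose value in `column` is present and non-blank."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if str(row.get(column) or "").strip())
    return filled / len(rows)


def uniqueness(rows: list, column: str) -> float:
    """Fraction of distinct values among the non-blank values of `column`."""
    values = [str(row.get(column) or "").strip() for row in rows]
    values = [v for v in values if v]
    return len(set(values)) / len(values) if values else 0.0


rows = [
    {"Name": "John Doe", "Gender": "Male", "Zip code": "", "City": "New York"},
    {"Name": "Jane Doe", "Gender": "Male", "Zip code": "10023", "City": "Poughkeepsy"},
]

for col in ("Name", "Gender", "Zip code", "City"):
    print(col, completeness(rows, col), uniqueness(rows, col))
```

On these rows the missing zip code shows up as a completeness of 0.5 for that column, which is the kind of signal profiling is meant to surface before cleansing or matching.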

[Architecture diagram. Labels: DQ Clients; DQS UI; SSIS DQ Component; MDS Excel Add-in; Future Clients: Excel, Dynamics; DQ Server; DQ Engine; Knowledge Discovery and Management; Interactive DQ Projects; Data Exploration; Knowledge Discovery; Data Profiling & Exploration; Cleansing; Matching; Reference Data; DQ Projects Store; Common Knowledge Store; Knowledge Base Store; DQ Active Projects; MS Data Domains; Local Data Domains; Published KBs; MS DQ Domains Store; 3rd Party / Internal; Reference Data Services; Reference Data Sets; Azure Market Place; Categorized Reference Data; Categorized Reference Data Services; Reference Data API (Browse, Get, Update…); RD Services API (Browse, Set, Validate…)]

[Diagram labels: 15010 NE 36th Street; RDS: Reference Data; In order; Knowledge Base; Parsing]

When to Use

• When you don’t have enough knowledge in your knowledge base

• Sample: Melissa Data

Advantage

• Handing over the dirty job

Disadvantage

• Paying a subscription fee

• Large volumes of data may cause performance issues in the cloud

Data Issues

There are different ways to represent the same person or address in a database:

15010 NE 36th Street, Redmond, WA, USA

USA, 15010 NE 36th Street, Redmond, WA

15010 NE 36th Street, Redmond, WA

Data is ‘fuzzy’ in nature (spelling mistakes, abbreviations, etc.).

A matching policy is prepared in the knowledge base.

A matching policy consists of matching rules that assess how well one record matches another.

Specify in the rule whether records’ values have to be an exact match, similar, or a prerequisite.

Train your policy by running and tuning each rule separately.

Similarity: select Similar if field values can be similar. Select Exact if field values must be identical.

Weight: determines the contribution of each domain in the rule to the overall matching score for two records.

Prerequisite: validates whether field values return a 100% match; otherwise the records are not considered a match.

Minimum matching score: the threshold at or above which two records are considered to be a match.
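The four rule parameters can be combined into a single score roughly as follows. This is a minimal sketch under stated assumptions, not the DQS matching algorithm: the rule layout `RULE`, the threshold value, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all stand-ins chosen for illustration, since the slides do not say how DQS measures similarity.

```python
from difflib import SequenceMatcher

# Hypothetical rule: (domain, kind, weight). The weights of the scored
# domains add up to 1.0; the prerequisite domain carries no weight.
RULE = [
    ("State", "prerequisite", 0.0),
    ("Name", "similar", 0.6),
    ("City", "exact", 0.4),
]

# Assumed threshold for this example only.
MIN_MATCHING_SCORE = 0.8


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def matching_score(r1: dict, r2: dict, rule=RULE) -> float:
    """Weighted score for two records; 0.0 if any prerequisite differs."""
    score = 0.0
    for domain, kind, weight in rule:
        a, b = r1.get(domain, ""), r2.get(domain, "")
        if kind == "prerequisite":
            if a != b:
                return 0.0  # prerequisite must be a 100% match
        elif kind == "exact":
            score += weight * (1.0 if a == b else 0.0)
        else:  # "similar"
            score += weight * similarity(a, b)
    return score


def is_match(r1: dict, r2: dict) -> bool:
    """Two records match when the score reaches the minimum matching score."""
    return matching_score(r1, r2) >= MIN_MATCHING_SCORE


john = {"Name": "John Smith", "City": "Anytown", "State": "New York"}
maggie = {"Name": "Maggie Smith", "City": "Anytown", "State": "New York"}
other = {"Name": "John Smith", "City": "NY", "State": "NY"}

print(matching_score(john, maggie), is_match(john, maggie))
print(matching_score(john, other), is_match(john, other))
```

Running each rule on sample pairs like these and adjusting weights or the threshold is one way to read the advice to train the policy by tuning each rule separately.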

Uniqueness | Usage | Description | Domains
Low | Define as Prerequisite; define with lower weights | Provides discriminatory information | Gender, City, State
High | Define as Similar or Exact; define with higher weights | Provides highly identifiable information and is highly discriminatory | Names (First, Last, Company), Address Line 1

Completeness | Usage | Description
Low | Do not use or define with low weight | High level of missing values
High | Include for matching if the column provides highly identifiable information | Low level of missing values

In overlapping clusters, a record may appear more than once in various clustered results. This structure may be harder to read since the same record exists in multiple clusters.

In non-overlapping clusters, the system unifies clusters containing the same record. This structure is easier to read as you won't repeat the same observation twice.

Overlapping clusters: (A~B), (B~C)

Non-overlapping cluster: (A~B~C)
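Unifying clusters that share a record, as in the (A~B), (B~C) example, can be sketched with a union-find structure. This is one common way to merge overlapping pairs and is offered only as an illustration; the slides do not describe how DQS builds its clusters, and the function name `non_overlapping_clusters` is made up for this example.

```python
def non_overlapping_clusters(pairs):
    """Merge matched pairs that share a record into disjoint clusters."""
    parent = {}

    def find(x):
        # Register unseen records as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two clusters

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    # Sort members and clusters so the output order is deterministic.
    return sorted(sorted(members) for members in clusters.values())


print(non_overlapping_clusters([("A", "B"), ("B", "C")]))  # [['A', 'B', 'C']]
print(non_overlapping_clusters([("A", "B"), ("C", "D")]))  # [['A', 'B'], ['C', 'D']]
```

The overlapping view is simply the input list of pairs; the non-overlapping view is what remains after every pair sharing a record has been merged.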

DQS Component Overview

[Diagram labels: Reference Data Definition; Values/Rules; Source + Mapping; DQS Cleansing Component; SSIS Package; Destination; Design; Run; Activity Monitoring; Interactive Cleansing Project]

http://social.technet.microsoft.com/wiki/contents/articles/14065.tsql-script-to-delete-dqs-projects-leftover-from-ssis-dqs-cleansing-component.aspx

Thank you
