
Duplicate record detection


Page 1: Duplicate record detection

DUPLICATE RECORD DETECTION

AHMED K. ELMAGARMID, PURDUE UNIVERSITY, WEST LAFAYETTE, IN (Senior Member, IEEE)
PANAGIOTIS G. IPEIROTIS, LEONARD N. STERN SCHOOL OF BUSINESS, NEW YORK, NY (Member, IEEE Computer Society)
VASSILIOS S. VERYKIOS, UNIVERSITY OF THESSALY, VOLOS, GREECE (Member, IEEE Computer Society)

PRESENTED BY SHILPA MURTHY

Page 2: Duplicate record detection

INTRODUCTION TO THE PROBLEM

Databases play an important role in today's IT-based economy.

Many businesses and organizations depend on the quality of the data stored in their databases (or suffer from the lack thereof).

Any discrepancy in the data can have significant cost implications for a system that relies on that information to function.

Page 3: Duplicate record detection

DATA QUALITY

Data are often not carefully controlled for quality, nor defined in a consistent way across different data sources; data quality is therefore compromised by many factors. Examples:

Data entry errors, e.g., "Microsft" instead of "Microsoft"
Integrity errors, e.g., EmployeeAge = 567
Multiple conventions for the same information, e.g., "44 W. 4th Street" and "44 West Fourth Street"

Page 4: Duplicate record detection

DATA HETEROGENEITY

While integrating data from different sources into a warehouse, organizations become aware of potential systematic differences; these problems and conflicts fall under the umbrella term "data heterogeneity".

Two types of heterogeneity can be distinguished: structural heterogeneity and lexical heterogeneity.

Page 5: Duplicate record detection

DATA QUALITY

Data cleaning refers to the process of resolving identification problems in the data.

Structural heterogeneity: records have different structures, e.g., a single Addr field versus separate City, State, and Zip code fields [1]

Lexical heterogeneity: records have identical structures, but the data are represented differently, e.g., 44 W. 4th St. versus 44 West Fourth Street [1]

Page 6: Duplicate record detection

TERMINOLOGY

Duplicate record detection
Record linkage
Record matching
Data deduplication
Merge/purge
Instance identification
Database hardening
Name matching
Coreference resolution
Identity uncertainty

Page 7: Duplicate record detection

DATA PREPARATION

A step before duplicate record detection.
Improves the quality of the data.
Makes data more comparable and more usable.
The data preparation stage includes three steps.

Page 8: Duplicate record detection

STEPS IN DATA PREPARATION

Parsing
Data transformation
Standardization

Page 9: Duplicate record detection

PARSING

Locates, identifies and isolates individual data elements

Makes it easier to correct, standardize and match data

Enables comparison of individual components rather than entire complex strings

For example, the appropriate parsing of the name and address components into consistent packets is a very important step.
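As a rough illustration, the sketch below parses a raw string into name and address components with a regular expression. The input format, the pattern, and the field names are assumptions made up for this example, not something prescribed by the paper.

```python
import re

# Hypothetical input format: "LASTNAME, FIRSTNAME, NUMBER STREET"
ADDRESS_PATTERN = re.compile(
    r"(?P<last>[A-Za-z'-]+),\s*(?P<first>[A-Za-z'-]+),\s*"
    r"(?P<number>\d+)\s+(?P<street>.+)"
)

def parse_record(raw: str) -> dict:
    """Locate, identify, and isolate individual data elements from a raw string."""
    match = ADDRESS_PATTERN.match(raw.strip())
    if match is None:
        return {"unparsed": raw}          # keep the raw value if parsing fails
    return match.groupdict()

print(parse_record("Smith, John, 44 W. 4th Street"))
# {'last': 'Smith', 'first': 'John', 'number': '44', 'street': 'W. 4th Street'}
```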

Page 10: Duplicate record detection

DATA TRANSFORMATION

Simple conversions of data type
Field renaming
Decoding of field values
Range checking: examining the data in a field to ensure that it falls within the expected range, usually a numeric or date range

Dependency checking: a slightly more complex kind of data transformation, in which the values in one field are checked against the values in another field to ensure a minimal level of consistency in the data
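A minimal sketch of range checking and dependency checking on a toy employee record; the field names and the specific consistency rule are illustrative assumptions.

```python
def range_check(record: dict, field: str, low, high) -> bool:
    """Range checking: the value must fall within an expected numeric range."""
    value = record.get(field)
    return value is not None and low <= value <= high

def dependency_check(record: dict) -> bool:
    """Dependency checking: values in one field must be consistent with another.
    Illustrative rule: an employee cannot be hired before turning 16."""
    return record["hire_year"] >= record["birth_year"] + 16

employee = {"EmployeeAge": 567, "hire_year": 2001, "birth_year": 1975}
print(range_check(employee, "EmployeeAge", 16, 100))  # False -> integrity error
print(dependency_check(employee))                     # True  -> fields are consistent
```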

Page 11: Duplicate record detection

DATA STANDARDIZATION

Represent certain fields in a standard format:

Addresses (e.g., the US Postal Service address verification tool)
Date and time formatting
Names (first, last, middle, prefix, suffix)
Titles

Page 12: Duplicate record detection

LAST STEP IN DATA PREPARATION

Store data in tables having comparable fields.
Identify fields suitable for comparison.
Not foolproof: data may still contain inconsistencies due to misspellings and different conventions for representing data.

Page 13: Duplicate record detection

FIELD MATCHING TECHNIQUES

o The most common source of mismatches in database entries is typographical errors.

o The field matching metrics that have been designed to overcome this problem are:

o Character-based similarity metrics
o Token-based similarity metrics
o Phonetic similarity metrics
o Numeric similarity metrics

Page 14: Duplicate record detection

CHARACTER BASED SIMILARITY

Works best on typographical errors.

Edit distance: the shortest sequence of edit operations that can transform a string s into a string t.

There are three types of edit operations: insert, delete, and replace. If each operation has cost 1, this version of edit distance is referred to as the Levenshtein distance.

Example: s1 = "tin", s2 = "tan". We need to replace "i" with "a" to convert s1 into s2, so the edit distance here is 1, because only one operation was needed.
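A short sketch of the Levenshtein distance described above (unit cost for insert, delete, and replace), using the standard dynamic-programming recurrence with a rolling row:

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit cost for insert, delete, and replace."""
    prev = list(range(len(t) + 1))           # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete cs
                curr[j - 1] + 1,             # insert ct
                prev[j - 1] + (cs != ct),    # replace (free if characters match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("tin", "tan"))  # 1, a single replace turns "tin" into "tan"
```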

Page 15: Duplicate record detection

CHARACTER BASED SIMILARITY

• Affine gap distance
Handles strings that have been truncated or shortened.
Example: John R. Smith versus Jonathan Richard Smith

• Smith-Waterman distance
Substring matching that ignores mismatching prefixes and suffixes.
Example: Prof. John R. Smith and John R. Smith, Prof.

• Jaro distance
Mainly used for comparing short strings such as first and last names.

• Q-grams
Divides a string into a series of overlapping substrings of length q.
Example: NELSON and NELSEN are phonetically similar but spelled differently. Their bigrams (q = 2) are NE EL LS SO ON and NE EL LS SE EN, most of which are shared.
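A minimal q-gram sketch for the NELSON/NELSEN example. Scoring the overlap with the Jaccard coefficient is one simple choice made here for illustration; the survey also discusses weighted q-gram variants.

```python
def qgrams(s: str, q: int = 2) -> set:
    """All overlapping substrings of length q (no padding characters used here)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

print(sorted(qgrams("NELSON")))                        # ['EL', 'LS', 'NE', 'ON', 'SO']
print(round(qgram_similarity("NELSON", "NELSEN"), 2))  # 0.43 (3 shared bigrams out of 7)
```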

Page 16: Duplicate record detection

TOKEN BASED SIMILARITY

Works best when words (tokens) are transposed.

Atomic strings
Field similarity is computed from the number of matching atomic strings, normalized by the average number of atomic strings in the two fields.

WHIRL
Weights words based on their frequency to determine similarity: each word in the database has a weight, and fields are compared using the cosine similarity of their weighted word vectors.
Example: in a database of company names, the words "AT&T" and "IBM" are less frequent than the word "Inc.", so they carry more weight.
The similarity of "John Smith" and "Mr. John Smith" is close to 1, but the similarity of "Comptr Department" and "Deprtment of Computer" is zero, since WHIRL does not handle misspelled words.

Q-grams with weighting
Extends WHIRL to handle spelling errors.
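The sketch below illustrates the general idea of frequency-weighted token comparison using a plain tf-idf cosine similarity; it is not WHIRL's exact weighting scheme, and the toy company-name "database" is invented for the example.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Weight each token by term frequency times inverse document frequency."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

def cosine(v, w):
    dot = sum(v[t] * w.get(t, 0.0) for t in v)
    norm = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
    return dot / norm if norm else 0.0

# Toy "database" of company names; frequent tokens such as "inc" receive low weight.
names = [["att", "inc"], ["ibm", "inc"], ["acme", "inc"], ["att", "labs"]]
df = Counter(t for name in names for t in set(name))
v1 = tfidf_vector(["att", "inc"], df, len(names))
v2 = tfidf_vector(["att", "labs"], df, len(names))
print(round(cosine(v1, v2), 2))   # similarity driven mainly by the shared rare token "att"
```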

Page 17: Duplicate record detection

PHONETIC SIMILARITY

Comparison based on how words sound rather than how they are spelled (e.g., Soundex, NYSIIS, Metaphone).
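For instance, Soundex maps a name to a short phonetic code so that similar-sounding names collide. The sketch below is a simplified version of the algorithm (it omits the special handling of H and W in the full specification):

```python
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"), "L": "4", **dict.fromkeys("MN", "5"), "R": "6",
}

def soundex(name: str) -> str:
    """Simplified Soundex: first letter plus up to three consonant codes."""
    name = name.upper()
    codes = [SOUNDEX_CODES.get(c, "") for c in name]  # vowels and similar map to ""
    result = name[0]
    prev = codes[0]
    for code in codes[1:]:
        if code and code != prev:                     # skip repeats of the same code
            result += code
        prev = code
    return (result + "000")[:4]

print(soundex("NELSON"), soundex("NELSEN"))  # both map to N425
```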

Page 18: Duplicate record detection

NUMERIC SIMILARITY

Considers fields that contain only numbers.
One option is to convert the numbers to text data; another is to use simple range queries.
The authors provide little insight in this area.

Page 19: Duplicate record detection

SUMMARY OF METRICS

Comparison metrics:

Character-based: Edit distance (Levenshtein), Affine gap, Smith-Waterman, Jaro distance
Token-based: Atomic strings, WHIRL, Q-grams
Phonetic: Soundex, NYSIIS, Oxford Name Compression, Metaphone, Double Metaphone
Numeric

Page 20: Duplicate record detection

DUPLICATE RECORD DETECTION

The methods described so far measure similarity between individual fields.

Real-life records consist of multiple fields, so entire records must be checked for duplicates.

Page 21: Duplicate record detection

CATEGORIZING METHODS

• Probabilistic approaches and supervised machine learning techniques

• Approaches that rely on domain knowledge or on generic distance metrics

Page 22: Duplicate record detection

PROBABILISTIC MATCHING MODELS

Models derived from Bayes' theorem use prior knowledge to make decisions about the current data set.

A tuple pair is assigned to one of two classes, M or U: the M class represents a match (same entity), and the U class represents a non-match (different entities).

The assignment is determined by calculating the probability distribution of the comparison vector under each class.

Rule-based decision trees can also be used (if-then-else traversal).
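A minimal sketch of this decision rule in the style of the Fellegi-Sunter model, assuming conditional independence between fields; the per-field m- and u-probabilities and the threshold are made-up values, whereas real systems estimate them from data.

```python
import math

# Hypothetical per-field probabilities:
#   m[f] = P(field f agrees | pair is a true match, class M)
#   u[f] = P(field f agrees | pair is a non-match, class U)
m = {"last_name": 0.95, "first_name": 0.90, "zip": 0.85}
u = {"last_name": 0.01, "first_name": 0.05, "zip": 0.10}

def match_weight(agreement: dict) -> float:
    """Log-likelihood ratio log P(comparison vector | M) / P(comparison vector | U)."""
    w = 0.0
    for field, agrees in agreement.items():
        if agrees:
            w += math.log(m[field] / u[field])
        else:
            w += math.log((1 - m[field]) / (1 - u[field]))
    return w

# Pair agrees on last name and zip, disagrees on first name.
weight = match_weight({"last_name": True, "first_name": False, "zip": True})
decision = "M (match)" if weight > 3.0 else "U (non-match)"   # illustrative threshold
print(round(weight, 2), decision)
```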

Page 23: Duplicate record detection

SUPERVISED LEARNING

Relies on the existence of training data in the form of record pairs, labeled as matching or not.

The SVM approach outperforms the simpler approaches.

A post-processing step creates a graph over all the records, linking the matching records.

Records in the same connected component are considered identical, by applying the transitivity relation.
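A small sketch of that post-processing step: given the pairs labeled as matches by the classifier, a union-find structure groups records into connected components. The record IDs and matched pairs are hypothetical.

```python
def connected_components(records, matched_pairs):
    """Group records via transitivity: a simple union-find over matched pairs."""
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:              # edges produced by the classifier
        parent[find(a)] = find(b)

    groups = {}
    for r in records:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

# Hypothetical classifier output: pairs the SVM labeled as matches.
print(connected_components(["r1", "r2", "r3", "r4"], [("r1", "r2"), ("r2", "r3")]))
# [['r1', 'r2', 'r3'], ['r4']]
```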

Page 24: Duplicate record detection

ACTIVE LEARNING

Page 25: Duplicate record detection

DISTANCE BASED TECHNIQUES

o This method can be used when there is no training data, or no human effort available to create matching models.

o Treat a record as one long field and use a distance metric; the best matches are then ranked using a weighting algorithm (a sketch follows below).

o Alternatively, use a single field for matching, but that field must be highly discriminating.
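A minimal sketch of the weighted-field variant, assuming two illustrative fields and weights; difflib's SequenceMatcher stands in for whatever field-level distance metric is actually chosen.

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_distance(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted combination of per-field distances (1 - similarity)."""
    total = sum(weights.values())
    return sum(w * (1 - field_similarity(r1[f], r2[f])) for f, w in weights.items()) / total

# Hypothetical records and field weights; the name field is more discriminating.
weights = {"name": 0.7, "city": 0.3}
query = {"name": "John R. Smith", "city": "New York"}
candidates = [{"name": "Jonathan Richard Smith", "city": "New York"},
              {"name": "Jane Doe", "city": "Boston"}]
best = sorted(candidates, key=lambda c: record_distance(query, c, weights))[0]
print(best["name"])   # the closest candidate is ranked first
```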

Page 26: Duplicate record detection

RULE BASED TECHNIQUES

Relies on business rules to derive a matching key.
Must determine functional dependencies in the data.
Requires a subject matter expert to build the matching rules.

Page 27: Duplicate record detection

RULE BASED TECHNIQUES

The accompanying figure depicts an equational theory that dictates the logic of domain equivalence.

It specifies inferences about the similarity of the records.
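As an illustration of such rules, the sketch below encodes a hand-built matching rule of the kind an equational theory would express; the specific rule, the threshold, and the fields are made up for this example.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def same_person(r1: dict, r2: dict) -> bool:
    """Hand-built rule: exact SSN match, or matching name plus similar address."""
    if r1["ssn"] and r1["ssn"] == r2["ssn"]:
        return True
    if similar(r1["name"], r2["name"]) and similar(r1["address"], r2["address"]):
        return True
    return False

a = {"ssn": "", "name": "John Smith", "address": "44 W. 4th St."}
b = {"ssn": "", "name": "Jon Smith",  "address": "44 West 4th St."}
print(same_person(a, b))   # True: names and addresses are similar enough under this rule
```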

Page 28: Duplicate record detection

UNSUPERVISED LEARNING

Classify data as matched or unmatched without a training set.

The comparison vector generally indicates which category a pair belongs to; if it does not, the pair has to be labeled manually.

One way to avoid manual labeling is to use clustering algorithms: group together similar comparison vectors, so that each cluster contains vectors with similar characteristics.

By knowing the real class of only a few vectors, we can infer the class of all the vectors in the same cluster.

Page 29: Duplicate record detection

TECHNIQUES TO IMPROVE EFFICIENCY

Reduce the number of record comparisons
Improve the efficiency of record comparison

Page 30: Duplicate record detection

COMPARATIVE METRICS

Elementary nested loop: compare every record in one table to every record in the other table.

Requires A × B comparisons (a Cartesian product), which is very expensive.

The cost of a single comparison must also be considered; it depends on the number of fields per record.

Page 31: Duplicate record detection

REDUCE RECORD COMPARISONS

Blocking
Sorted Neighborhood
Clustering and Canopies
Set Joins

Page 32: Duplicate record detection

BLOCKING

Basic idea: subdivide the files into subsets (blocks), e.g., by computing a hash value for each record, and only compare records that fall into the same bucket.

The blocking key can also be a phonetic encoding such as Soundex, NYSIIS, or Metaphone.

Drawback: the increase in speed may increase the number of false mismatches, since true duplicates can land in different blocks.

A compromise is to perform multiple runs using different blocking fields. A minimal blocking sketch follows.
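A minimal blocking sketch, assuming a toy key built from a last-name prefix and a zip-code prefix (a phonetic code such as Soundex could be used instead); records are only paired when they fall into the same block.

```python
from collections import defaultdict
from itertools import combinations

def block_key(record: dict) -> str:
    """Cheap blocking key; a phonetic code could be substituted here."""
    return record["last_name"][:3].upper() + record["zip"][:2]

def candidate_pairs(records):
    """Only records that share a blocking key are compared."""
    blocks = defaultdict(list)
    for r in records:
        blocks[block_key(r)].append(r["id"])
    for members in blocks.values():
        yield from combinations(members, 2)

people = [{"id": 1, "last_name": "Nelson", "zip": "10003"},
          {"id": 2, "last_name": "Nelsen", "zip": "10012"},
          {"id": 3, "last_name": "Smith",  "zip": "10003"}]
print(list(candidate_pairs(people)))   # [(1, 2)]: only the two NEL/10 records are paired
```

Note how a typo in the first letters of the blocking field would send a true duplicate to a different block, which is exactly the false-mismatch drawback mentioned above.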

Page 33: Duplicate record detection

SORTED NEIGHBORHOOD

Create a composite key, sort the data on that key, then merge by comparing records within a sliding window.

Assumption: duplicate records will be close to each other in the sorted order.

The method is highly dependent on the quality of the comparison key. A minimal sketch follows.
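A minimal sorted-neighborhood sketch, assuming an illustrative composite key and a fixed-size window; only records that end up inside the same window after sorting are paired.

```python
def composite_key(r: dict) -> str:
    """Composite key built from fragments of several fields."""
    return r["last_name"][:3].upper() + r["first_name"][:1].upper() + r["zip"][:3]

def sorted_neighborhood_pairs(records, window: int = 3):
    """Slide a fixed-size window over the sorted records and pair records inside it."""
    ordered = sorted(records, key=composite_key)
    for i in range(len(ordered)):
        for j in range(i + 1, min(i + window, len(ordered))):
            yield ordered[i]["id"], ordered[j]["id"]

people = [{"id": 1, "last_name": "Nelson", "first_name": "Ann",  "zip": "10003"},
          {"id": 2, "last_name": "Smith",  "first_name": "John", "zip": "10012"},
          {"id": 3, "last_name": "Nelsen", "first_name": "Anne", "zip": "10003"}]
print(list(sorted_neighborhood_pairs(people, window=2)))
# [(1, 3), (3, 2)]: the Nelson/Nelsen records sort next to each other and get compared
```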

Page 34: Duplicate record detection

CLUSTERING AND CANOPIES

Clustering: duplicate records are kept in a cluster, and only a representative of each cluster is kept for future comparisons. This reduces the total number of record comparisons without compromising accuracy.

Canopies: the records are grouped into overlapping clusters called "canopies", and records are then compared only within each canopy, which leads to better-quality results. A sketch of the canopy idea follows.
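A rough sketch of the canopy idea, using a cheap token-overlap distance and two illustrative thresholds (loose and tight); records may fall into more than one canopy, and the expensive comparison would then run only inside each canopy.

```python
def canopies(records, cheap_distance, loose=0.6, tight=0.3):
    """Group records into overlapping canopies using only a cheap distance."""
    remaining = list(range(len(records)))
    result = []
    while remaining:
        center = remaining[0]
        canopy = {center}
        still_remaining = []
        for i in remaining[1:]:
            d = cheap_distance(records[center], records[i])
            if d < loose:
                canopy.add(i)               # i joins this canopy (maybe others too)
            if d >= tight:
                still_remaining.append(i)   # only very close points stop being centers
        remaining = still_remaining
        result.append(sorted(canopy))
    return result

# Cheap distance: 1 - Jaccard overlap of word tokens (an illustrative choice).
def token_distance(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1 - len(ta & tb) / len(ta | tb)

names = ["john smith", "john r smith", "jane doe"]
print(canopies(names, token_distance))
# [[0, 1], [1], [2]]: record 1 falls into two overlapping canopies
```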

Page 35: Duplicate record detection

SOFTWARE TOOLS

Febrl (Freely Extensible Biomedical Record Linkage), an open-architecture toolkit - Python
TAILOR - MLC++, DBGen
WHIRL - C++
Flamingo Project - C
BigMatch - C

Page 36: Duplicate record detection

DATABASE TOOLS

Commercial RDBMSs:

SQL Server 2005 implements "fuzzy matching"
Oracle 11g implements these techniques in its UTL_MATCH package: Levenshtein distance, Soundex, Jaro-Winkler

Page 37: Duplicate record detection

CONCLUSIONS

Lack of a standardized, large-scale benchmarking data set.

Training data is needed to produce matching models.

Divergence in research directions:
The database community emphasizes simple, fast, and efficient techniques.
The machine learning and statistics communities rely on sophisticated techniques and probabilistic models.

More synergy is needed among the various communities.

Detection systems need to be adaptive over time.