The de-duplication problem
Given a list of semi-structured records, find all records that refer to the same entity.
Example applications:
- Data warehousing: merging name/address lists. Entity: (a) person, (b) household
- Automatic citation databases (Citeseer): references. Entity: paper
Example: link together people from several data sources

Taxpayers:    SOUZA ,D ,D ,,GORA VILLA ,,VIMAN NAGAR ,,,411014 ,
Land records: CHAFEKAR ,RAMCHANDRA ,…
Passport:     DERYCK ,D ,SOZA ,03 ,GERA VILLA ,VIMAN NAGAR PUNE ,411014 ,1
Transport, Telephone, …

Duplicates:
SOUZA ,D ,D ,,GORA VILLA ,,VIMAN NAGAR ,,,411014 ,
DERYCK ,D ,SOZA ,03 ,GERA VILLA ,,VIMAN NAGAR PUNE, 411014

Non-duplicates:
CHAFEKAR ,RAMCHANDRA ,DAMODAR ,SHOP 8 ,H NO 509 NARAYAN PETH PUNE 411030
CHITRAV ,RAMCHANDRA ,D ,FLAT 5 ,H NO 2105 SADASHIV PETH PUNE 411 030
Challenges
- Errors and inconsistencies in the data
- Spotting duplicates can be hard: they may be spread far apart and may not be group-able using obvious keys
- Domain-specific: existing manual approaches require retuning with every new domain
How are such problems created?
- Human factors: incorrect data entry; ambiguity during data transformations
- Application factors: erroneous applications populating databases; faulty database design (constraints not enforced)
- Obsolescence: the real world is dynamic
Database-level linkage
Example sources with probabilistic links between them:
- Patent database (XML): Inventor, Title, Assignee, Abstract
- Computer science publications (DBLP): Authors, Titles
- Medical publications: Authors, Titles, Abstracts
Links connect, e.g., the inventor list to the author lists.
Exploit the various kinds of information in decreasing order of information content and increasing computational overhead.
Multi-table data
Example schema of interlinked tables:
- Parts(part-id, description, importance, replacement frequency)
- Replacement(part-id, ship-id, date)
- Orders(part-id, vendor-id, order date, price)
- Trip-log(ship-id, trip start date, trip length, destination)
- Vendor(vendor-id, name, address, joining-date)
- Ship(ship-id, name, weight, capacity)
Duplicates can occur in all of the interlinked tables. Goal: resolving them simultaneously could be better.
Outline
Part I: Motivation, similarity measures (90 min)
- Data quality, applications
- Linkage methodology, core measures
- Learning core measures
- Linkage-based measures
Part II: Efficient algorithms for approximate join (60 min)
Part III: Clustering/partitioning algorithms (30 min)
String similarity measures
- Token-based: examples are Jaccard and TF-IDF cosine similarity. Suitable for large documents.
- Character-based: examples are edit distance and variants (Levenshtein, Jaro-Winkler), and Soundex. Suitable for short strings with spelling mistakes.
- Hybrids of the two.
Token-based measures
Tokens/words: ‘AT&T Corporation’ -> {‘AT&T’, ‘Corporation’}
Similarity: various measures of overlap of two token sets S, T.
Jaccard(S,T) = |S ∩ T| / |S ∪ T|
Example:
  S = ‘AT&T Corporation’ -> {‘AT&T’, ‘Corporation’}
  T = ‘AT&T Corp’ -> {‘AT&T’, ‘Corp’}
  Jaccard(S,T) = 1/3
Variants attach a weight to each token.
Useful for large strings, e.g. web documents.
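As a minimal sketch of the token-based Jaccard measure above (whitespace tokenization; the function name is mine):

```python
def jaccard(s: str, t: str) -> float:
    """Jaccard similarity over whitespace tokens: |S & T| / |S | T|."""
    S, T = set(s.split()), set(t.split())
    if not S and not T:
        return 1.0          # two empty strings: treat as identical
    return len(S & T) / len(S | T)
```

For the slide's example, `jaccard('AT&T Corporation', 'AT&T Corp')` gives 1/3: one shared token out of three distinct tokens.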
Cosine similarity with TF-IDF weights
- Sets are transformed to vectors, with each term as a dimension
- Similarity: dot-product of the two vectors, each normalized to unit length (the cosine of the angle between them)
- Term weight = TF-IDF: log(tf + 1) * log(idf), where
  tf = frequency of the term in document d
  idf = (number of documents) / (number of documents containing the term)
- Intuitively, rare terms are more important. Widely used in traditional IR.
Example: ‘AT&T Corporation’ vs ‘AT&T Corp’ or ‘AT&T Inc’:
low weights for ‘Corporation’, ‘Corp’, ‘Inc’; higher weight for ‘AT&T’.
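A minimal sketch of TF-IDF cosine over tokenized documents, using the slide's weighting log(tf+1) * log(idf); function names and the toy corpus are mine:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to TF-IDF weighted vectors.
    weight(term, d) = log(tf + 1) * log(N / df), as on the slide."""
    N = len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequency
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: math.log(c + 1) * math.log(N / df[t])
                     for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Dot product of the two vectors after normalizing to unit length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

In a corpus where ‘Corp’ is common and ‘AT&T’ is rare, a pair sharing ‘AT&T’ scores higher than a pair sharing only ‘Corp’, matching the slide's intuition.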
Edit distance [G98]
Given two strings S, T: edit(S,T) = minimum-cost sequence of operations transforming S into T.
Character operations: I (insert), D (delete), R (replace).
Example: edit(Error, Eror) = 1; edit(great, grate) = 2.
Folklore dynamic-programming algorithm to compute edit(): O(m²), versus O(2m log m) for token-based measures.
Several variants (gaps, weights) --- easily becomes NP-complete. Varying operation costs can be learnt [RY97].
Observations: suitable for common typing mistakes on short strings (Comprehensive vs Comprenhensive); problematic for specific domains.
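The folklore dynamic program can be sketched as follows (space-optimized to two rows; unit costs for insert, delete, replace):

```python
def edit_distance(s: str, t: str) -> int:
    """Classic O(|s|*|t|) dynamic program for unit-cost
    insert/delete/replace (Levenshtein) edit distance."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))             # edit(s[:0], t[:j]) = j inserts
    for i in range(1, m + 1):
        cur = [i] + [0] * n               # edit(s[:i], t[:0]) = i deletes
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete s[i-1]
                         cur[j - 1] + 1,      # insert t[j-1]
                         prev[j - 1] + cost)  # replace (or match)
        prev = cur
    return prev[n]
```

This reproduces the slide's examples: edit(Error, Eror) = 1 and edit(great, grate) = 2.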
Edit distance with affine gaps
Differences between duplicates are often due to abbreviations or whole-word insertions:
‘IBM Corp.’ is closer to ‘ATT Corp.’ than to ‘IBM Corporation’; John Smith vs John Edward Smith vs John E. Smith.
Allow sequences of mismatched characters (gaps) in the alignment of the two strings.
Penalty uses the affine cost model: cost(g) = s + e·l, where s = cost of opening a gap, e = cost of extending the gap, l = length of the gap.
A similar dynamic-programming algorithm applies. The parameters are domain-dependent and learnable, e.g. [BM03, MBP05].
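A minimal Gotoh-style dynamic program for the affine-gap model above, with a gap of length l costing s + e·l as on the slide; the particular cost values are illustrative assumptions, not from the tutorial:

```python
def affine_gap_distance(s, t, sub_cost=1.0, gap_open=2.0, gap_extend=0.5):
    """Edit distance with affine gaps: a gap of length l costs
    gap_open + l * gap_extend. Three DP matrices:
    M: alignment ends in a (mis)match; X: gap in t; Y: gap in s."""
    INF = float('inf')
    m, n = len(s), len(t)
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    X = [[INF] * (n + 1) for _ in range(m + 1)]
    Y = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0.0
    for i in range(1, m + 1):
        X[i][0] = gap_open + i * gap_extend       # delete s[:i] as one gap
    for j in range(1, n + 1):
        Y[0][j] = gap_open + j * gap_extend       # insert t[:j] as one gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else sub_cost
            M[i][j] = cost + min(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
            X[i][j] = min(M[i-1][j] + gap_open + gap_extend,  # open gap
                          X[i-1][j] + gap_extend)             # extend gap
            Y[i][j] = min(M[i][j-1] + gap_open + gap_extend,
                          Y[i][j-1] + gap_extend)
    return min(M[m][n], X[m][n], Y[m][n])
```

With these costs, one contiguous two-character gap (2 + 2·0.5 = 3.0) is cheaper than two separate one-character gaps (2 · 2.5 = 5.0), which is exactly why whole-word insertions are penalized less.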
Approximate edit distance: the Jaro rule [J89]
Given strings s = a1…aK and t = b1…bL, a character ai in s is common with t if there is a bj in t such that ai = bj and i − H ≤ j ≤ i + H, where H = min(|s|,|t|)/2.
Let s' = a1'…aK'' be the characters in s common with t, t' = b1'…bL'' the characters in t common with s, and Ts',t' the number of transpositions in s' and t'. Then

  Jaro(s,t) = (1/3) · ( |s'|/|s| + |t'|/|t| + (|s'| − 0.5·Ts',t') / |s'| )

Example: Martha vs Marhta: H = 3, s' = Martha, t' = Marhta, Ts',t' = 2,
  Jaro(Martha, Marhta) = (1 + 1 + 5/6)/3 ≈ 0.94
Example: Jonathan vs Janathon: H = 4, s' = Jnathn, t' = Jnathn, Ts',t' = 0,
  Jaro(Jonathan, Janathon) = (6/8 + 6/8 + 1)/3 ≈ 0.83
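The Jaro rule can be sketched in Python as below. The matching is greedy left-to-right, so on strings with repeated characters it may find a slightly different common-character set than a hand-worked example; the function name is mine:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: a_i in s is common with t if some unmatched
    b_j = a_i with i-H <= j <= i+H, where H = min(|s|,|t|) // 2."""
    if not s or not t:
        return 0.0
    H = min(len(s), len(t)) // 2
    used = [False] * len(t)
    s_common, matched = [], []
    for i, a in enumerate(s):
        for j in range(max(0, i - H), min(len(t), i + H + 1)):
            if not used[j] and t[j] == a:     # greedy one-to-one match
                used[j] = True
                s_common.append(a)
                matched.append(j)
                break
    if not s_common:
        return 0.0
    t_common = [t[j] for j in sorted(matched)]
    # T = number of positions where the two common sequences disagree
    T = sum(a != b for a, b in zip(s_common, t_common))
    c = len(s_common)
    return (c / len(s) + c / len(t) + (c - 0.5 * T) / c) / 3
```

For Martha vs Marhta this yields (1 + 1 + 5/6)/3 = 17/18.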
Hybrids [CRF03]
Example: ‘Edward, John’ vs ‘Jon Edwerd’.
Let S = {a1,…,aK}, T = {b1,…,bL} be sets of terms and sim'() some other (secondary) similarity function. Then

  Sim(S,T) = (1/K) · Σ_{i=1..K} max_{j=1..L} sim'(ai, bj)

Soft TF-IDF variant: let C(t,S,T) = { w ∈ S : ∃ v ∈ T with sim'(w,v) > t } and D(w,T) = max_{v∈T} sim'(w,v) for w ∈ C(t,S,T). With W() the TF-IDF weights,

  sTFIDF(S,T) = Σ_{w ∈ C(t,S,T)} W(w,S) · W(w,T) · D(w,T)
Soundex encoding
A phonetic algorithm that indexes names by their sound when pronounced in English. A code consists of the first letter of the name followed by three numbers; the numbers encode similar-sounding consonants.
- Remove all W, H
- Encode consonants: B, F, P, V as 1; C, G, J, K, Q, S, X, Z as 2; D, T as 3; L as 4; M, N as 5; R as 6
- Remove vowels
- Concatenate the first letter of the string with the first 3 numerals
Example: ‘great’ and ‘grate’ become G6EA3 and G6A3E, and then both G63.
More recent variants: Metaphone, Double Metaphone, etc.
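A sketch following the steps above. Two details beyond the slide are standard-Soundex conventions I have assumed: adjacent letters with the same code collapse to one digit, and codes shorter than three digits are zero-padded (so ‘great’ and ‘grate’ both come out as G630):

```python
# Consonant groups -> digits, as listed on the slide.
CODES = {**dict.fromkeys('BFPV', '1'), **dict.fromkeys('CGJKQSXZ', '2'),
         **dict.fromkeys('DT', '3'), 'L': '4',
         **dict.fromkeys('MN', '5'), 'R': '6'}

def soundex(name: str) -> str:
    """First letter + first three consonant-group digits, zero-padded."""
    name = name.upper()
    first = name[0]
    digits = []
    prev = CODES.get(first, '')          # code of the kept first letter
    for c in name[1:]:
        if c in 'WH':                    # W, H removed, do not reset prev
            continue
        d = CODES.get(c, '')             # vowels map to '' (removed)
        if d and d != prev:              # collapse repeated codes
            digits.append(d)
        prev = d
    return (first + ''.join(digits) + '000')[:4]
```

For example, soundex('Robert') is R163, and misspellings like ‘great’/‘grate’ collide by design.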
Learning similarity functions
Per attribute:
- Term-based (vector space)
- Edit-based: learning the constants in character-level distance measures such as Levenshtein distance. Useful for short strings with systematic errors (e.g. OCR output) or domain-specific errors (e.g. ‘St.’ vs ‘Street’).
Multi-attribute records:
- Useful when the relative importance of a match along different attributes is highly domain-dependent.
- Example: a comparison-shopping website. A match on title is more indicative for books than for electronics; a difference in price is less indicative for books than for electronics.

Machine-learning approach: given examples of duplicate and non-duplicate pairs, learn to predict whether a new pair is a duplicate.
Input features: various kinds of similarity functions between attributes --- edit distance, Soundex, n-grams on text attributes; absolute difference on numeric attributes. These capture domain-specific knowledge about comparing data.
The learning approach
[Figure: labeled record pairs (e.g. (Record 1, Record 2): D, (Record 1, Record 3): N) are mapped through similarity functions f1, f2, …, fn to feature vectors with 0/1 labels (e.g. 1.0 0.4 … 0.2 → 1); a classifier is trained on these mapped examples and then applied to the feature vectors of unlabeled pairs to label each pair as duplicate or not.]
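The pipeline can be sketched end to end with toy records. Everything here is illustrative: the attribute names, the two similarity features, and a tiny perceptron standing in for the decision-tree classifier used in the tutorial:

```python
def jaccard_tokens(a: str, b: str) -> float:
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B) if A | B else 1.0

def features(r1: dict, r2: dict) -> list:
    """Map a record pair to a similarity feature vector: one entry per
    comparison function (stand-ins for edit distance, Soundex, etc.)."""
    return [jaccard_tokens(r1['title'], r2['title']),
            1.0 if r1['year'] == r2['year'] else 0.0]

def predict(w: list, x: list) -> int:
    """1 = duplicate, 0 = non-duplicate; last weight is the bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x + [1.0])) > 0 else 0

def train_perceptron(X, y, epochs=50, lr=0.1):
    """Mistake-driven perceptron over mapped example pairs."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = predict(w, x)
            if pred != label:
                for i, xi in enumerate(x + [1.0]):
                    w[i] += lr * (label - pred) * xi
    return w
```

Training on mapped labeled pairs and applying `predict` to the feature vectors of unlabeled pairs mirrors the figure's flow.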
Example learned classifier:
[Figure: a decision tree over similarity features --- e.g. AuthorTitleNgrams vs 0.4, AuthorEditDist vs 0.8, YearDifference > 1, All-Ngrams vs 0.48, TitleIsNull < 1, PageMatch vs 0.5 --- with Duplicate / Non-Duplicate leaves.]
Experiences with the learning approach
- Too much manual search goes into preparing training data: it is hard to spot a challenging and covering set of duplicates in large lists, and even harder to find the close non-duplicates that capture the nuances.
- One heuristic: examine instances that are similar on one attribute but dissimilar on another. Active learning is a generalization of this.
The active learning approach
[Figure: starting from a small labeled seed (e.g. (Record 1, Record 2): D, (Record 3, Record 4): N) mapped through similarity functions f1, f2, …, fn, a classifier is trained; an active learner selects the most informative unlabeled pairs (e.g. feature vectors 0.7 0.1 … 0.6 and 0.3 0.4 … 0.4), asks the user to label them, adds them to the labeled set, and retrains.]
The ALIAS deduplication system
- Interactive discovery of the deduplication function using active learning
- Efficient active learning on large lists using novel indexing mechanisms
- Efficient application of the learnt function on large lists using a novel cluster-based evaluation engine and a cost-based optimizer
Experimental analysis
- 250 references from Citeseer: 32,000 pairs, of which only 150 are duplicates
- Citeseer's script used to segment references into author, title, year, page, and the rest
- 20 text and integer similarity functions; average of 20 runs; default classifier: decision tree; initial labeled set: just two pairs
Benefits of active learning
- Active learning is much better than random selection: with only 100 actively selected instances, 97% accuracy, versus only 30% for random selection
- Committee-based selection is close to optimal

Analyzing the selected instances
- Fraction of duplicates among the selected instances: 44%, starting from only 0.5% in the data
- Is the gain just due to the increased fraction of duplicates? No: replacing the non-duplicates in the selected set with random non-duplicates drops accuracy to only 40%.
Finding all duplicate pairs in large lists
Input: a large list of records R with string attributes.
Output: all pairs (S,T) of records in R that satisfy a similarity criterion, e.g.:
- Jaccard(S,T) > 0.7
- Overlapping tokens(S,T) > 5
- TF-IDF cosine(S,T) > 0.8
- Edit distance(S,T) < k
More complicated similarity functions use these as filters (high recall, low precision).
Naive method: compute the similarity score for each record pair --- I/O- and CPU-intensive, not scalable to millions of records.
Goal: reduce the O(n²) cost to O(n·w), where w << n, by reducing the number of pairs on which similarity is computed.
General template for set-similarity functions: given two sets r, s, count their common tokens and compare the count against a threshold.
Approximating edit distance with q-grams [GIJ+01]
Q-grams: sequences of q characters of a field. Example, the 3-grams of ‘AT&T Corporation’:
{‘AT&’, ‘T&T’, ‘&T ’, ‘T C’, ‘ Co’, ‘Cor’, ‘orp’, ‘rpo’, ‘por’, ‘ora’, ‘rat’, ‘ati’, ‘tio’, ‘ion’}
Count filter: EditDistance(s,t) ≤ d implies |q-grams(s) ∩ q-grams(t)| ≥ max(|s|,|t|) − (d−1)·q − 1.
Typically q = 3. Q-gram sets are large; large q-gram sets can be approximated by smaller sets.
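The count filter can be sketched as below. I assume padded q-grams (the string extended with q−1 sentinel characters on each side, as in [GIJ+01]), which is what makes the bound above hold; the filter is a necessary condition only, so survivors still need the exact edit-distance check:

```python
from collections import Counter

def qgrams(s: str, q: int = 3) -> Counter:
    """Bag of padded q-grams: q-1 sentinels on each side of s.
    (Sketch: assumes '#' and '$' do not occur in the data.)"""
    padded = '#' * (q - 1) + s + '$' * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def may_be_within(s: str, t: str, d: int, q: int = 3) -> bool:
    """Count filter: if edit(s,t) <= d, the padded q-gram bags share
    at least max(|s|,|t|) - 1 - (d-1)*q grams. Cheap necessary test
    to run before the O(n^2) edit-distance dynamic program."""
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())  # bag intersection
    return shared >= max(len(s), len(t)) - 1 - (d - 1) * q
```

For example, ‘Error’ and ‘Eror’ (edit distance 1) pass the filter for d = 1, while dissimilar strings are pruned without running the DP.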
Pair-count approach
Step 1: pass over the data; for each token, create the (inverted-index) list of sets that contain it.
Step 2: from each list, generate pairs of sets, count pair occurrences, and output the pairs with count > T.
Caveats: the pair-counting table can itself be large and memory-intensive, and the approach works poorly when list lengths are highly skewed.
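The two steps above can be sketched as follows (the function name is mine; per the slide, a pair is output when its shared-token count exceeds the threshold T):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, T):
    """records: list of token lists. Returns id pairs (a, b), a < b,
    that share more than T tokens."""
    # Step 1: inverted index, token -> ids of records containing it.
    index = defaultdict(list)
    for rid, tokens in enumerate(records):
        for tok in set(tokens):
            index[tok].append(rid)
    # Step 2: count co-occurrences per pair over all index lists.
    # Note: this table is the memory bottleneck the slide warns about.
    pair_count = defaultdict(int)
    for rids in index.values():
        for a, b in combinations(rids, 2):
            pair_count[(a, b)] += 1
    return {pair for pair, c in pair_count.items() if c > T}
```

Only pairs sharing at least one token are ever counted, which is how the O(n²) all-pairs cost is avoided on sparse data.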
Probe-count approach for self-joining lists (Broder et al, WWW 1997)
Step 1: create an inverted index from tokens to record ids.
Step 2: using each record, probe the index and merge the lists of its tokens (e.g. w1, w4, …, wk, via a heap) to find the record ids that occur in at least T of them.
Threshold-sensitive list merge (SK04, CGK06)
- Sort the lists to be merged by increasing size; except for the T−1 largest, organize the rest in a heap (example: T = 3)
- Pop record ids from the heap successively, counting occurrences
- Search the large lists in increasing order of size; use lower bounds to terminate early
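A sketch of the idea (after [SK04], without the early-termination lower bounds): all but the T−1 longest lists go through a heap-based merge, and the long lists are only probed by binary search. This is sound because any id occurring in at least T lists must occur in at least one of the merged (short) lists:

```python
import heapq
from bisect import bisect_left

def threshold_merge(lists, T):
    """lists: sorted lists of record ids. Returns the ids that occur
    in at least T of the input lists."""
    if len(lists) < T:
        return []
    lists = sorted(lists, key=len)
    split = len(lists) - (T - 1)
    small, large = lists[:split], lists[split:]   # keep T-1 longest out
    # Heap-merge the short lists and count each id's occurrences.
    counts = {}
    for rid in heapq.merge(*small):
        counts[rid] = counts.get(rid, 0) + 1
    result = []
    for rid, c in sorted(counts.items()):
        for L in large:                           # probe long lists only
            i = bisect_left(L, rid)
            if i < len(L) and L[i] == rid:
                c += 1
        if c >= T:
            result.append(rid)
    return result
```

Skipping the longest lists in the merge is exactly what makes the method robust to skewed list lengths.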
Summary of the pair-creation step
- Can be extended to the weighted case fitting the general framework
- More complicated similarity functions use set-similarity functions as filters
- Set sizes can be reduced through techniques like MinHash (weighted versions also exist)
- Small sets (average set size < 20), e.g. most database entities with word tokens: use as-is
- Large sets, e.g. web documents or sets of q-grams: use MinHash or random projections
Creating partitions: transitive closure
Danger: unrelated records are collapsed into a single cluster through chains of pairwise matches.
[Figure: a similarity graph on 10 records; taking the transitive closure merges weakly connected records into one cluster, while the alternative partition shown has 3 disagreements with the pairwise decisions.]
Correlation clustering (Bansal et al, 2002)
Partition to minimize the total number of disagreements:
1. Edges across partitions
2. Missing edges within a partition
More appealing than standard clustering: no magic constants (number of clusters, similarity thresholds, diameter, etc.), and it extends to real-valued scores. NP-hard; many approximation algorithms exist.
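Transitive closure itself is easy to compute with union-find, which also makes the danger above concrete: a single spurious pairwise match chains two otherwise-unrelated clusters together. A minimal sketch (class and function names are mine):

```python
class DSU:
    """Union-find over n record ids, with path halving."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def clusters(n, duplicate_pairs):
    """Partition records 0..n-1 by the transitive closure of the
    pairwise duplicate decisions."""
    dsu = DSU(n)
    for a, b in duplicate_pairs:
        dsu.union(a, b)
    groups = {}
    for i in range(n):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())
```

Adding one bad pair, say (2, 3), to the decisions below would merge the two clusters into one, which is what correlation clustering's disagreement objective guards against.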
Empirical results on data partitioning
Setup: online comparison shopping; fields: name, model, description, price. Learner: online perceptron.
Complete-link clustering >> single-link clustering (transitive closure). An open issue: when to stop merging clusters.
Datasets: digital cameras, camcorders, luggage. (From Bilenko et al, 2005.)
References
[CGGM04] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani: Robust and Efficient Fuzzy Match for Online Data Cleaning. SIGMOD Conference 2003: 313-324.
[CGG+05] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis: Data Cleaning in Microsoft SQL Server 2005. SIGMOD Conference 2005: 918-920.
[CGK06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
[CRF03] William W. Cohen, Pradeep Ravikumar, Stephen E. Fienberg: A Comparison of String Distance Metrics for Name-Matching Tasks. IIWeb 2003: 73-78.
[J89] M. A. Jaro: Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84: 414-420.
[ME97] Alvaro E. Monge, Charles Elkan: An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. DMKD 1997.
[RY97] E. Ristad, P. Yianilos: Learning String Edit Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[SK04] Sunita Sarawagi, Alok Kirpal: Efficient Set Joins on Similarity Predicates. SIGMOD Conference 2004: 743-754.
[W99] William E. Winkler: The State of Record Linkage and Current Research Problems. IRS publication R99/04 (http://www.census.gov/srd/www/byname.html).
[Y02] William E. Yancey: BigMatch: A Program for Extracting Probable Matches from a Large File for Record Linkage. RRC 2002-01, Statistical Research Division, U.S. Bureau of the Census.