Entity Resolution: Introduction
Data Cleaning & Integration, CompSci 590.01, Spring 2017
Based on: Getoor & Machanavajjhala’s VLDB 2012 tutorial slides, Cohen’s record linkage tutorial, and Elsner & Schudy’s ILP-NLP slides
What’s ER?
Entity Resolution: identifying and linking/grouping different manifestations of the same real-world object, e.g.:
• Different ways of addressing (names, emails, Facebook accounts) the same person in text
• Web pages with different descriptions of the same business
• Different photos taken of the same object
• etc.
Ironically, ER has duplicate names…
• Record linkage
• Duplicate detection
• Deduplication
• Reference reconciliation
• Reference matching
• Object consolidation
• Fuzzy matching
• Entity clustering
• Hardening soft databases
• …
Example: IP aliasing problem
… when measuring Internet topology
IP Aliasing Problem [Willinger et al. 2009]
Willinger et al. Notices of the AMS, 2009
Example cont’d
Figure 3. The IP alias resolution problem in practice. This is reproduced from [48] and shows a comparison between the Abilene/Internet2 topology inferred by Rocketfuel (left) and the actual topology (top right). Rectangles represent routers with interior ovals denoting interfaces. The histograms of the corresponding node degrees are shown in the bottom right plot. © 2008 ACM, Inc. Included here by permission.
(IP)-speaking) routers encountered en route from the source to the destination. Instead, since IP routers have multiple interfaces, each with its own IP address, what traceroute really generates is the list of (input interface) IP addresses, and a very common property of traceroute-derived routes is that one and the same router can appear on different routes with different IP addresses. Unfortunately, faithfully mapping interface IP addresses to routers is a difficult open problem known as the IP alias resolution problem [51, 28], and despite continued research efforts (e.g., [48, 9]), it has remained a source of significant errors. While the generic problem is illustrated in Figure 2, its impact on inferring the (known) router-level topology of an actual network (i.e., Abilene/Internet2) is highlighted in Figure 3: the inability to solve the alias resolution problem renders in this case the inferred topology irrelevant and produces statistics (e.g., node degree distribution) that have little in common with their actual counterparts.
Another commonly ignored problem is that traceroute, being strictly limited to IP or layer-3, is incapable of tracing through opaque layer-2 clouds that feature circuit technologies such as Asynchronous Transfer Mode (ATM) or Multiprotocol Label Switching (MPLS). These technologies have the explicit and intended purpose of hiding the network’s physical infrastructure from IP, so from the perspective of traceroute, a network that runs these technologies will appear to provide direct connectivity between routers that are separated by local, regional, national, or even global physical network infrastructures. The result is that when traceroute encounters one of these opaque layer-2 clouds, it falsely “discovers” a high-degree node that is really a logical entity (a network potentially spanning many hosts or great distances) rather than a physical node of the Internet’s router-level topology. Thus, reports of high-degree hubs in the core of the router-level Internet, which defy common engineering sense, can often be easily identified as simple artifacts of
Other examples
• Name/attribute ambiguity, data entry errors, missing data, formatting differences, changing attributes…
Traditional Challenges in ER
• Name/attribute ambiguity: e.g., two different people both named Thomas Cruise, or which Michael Jordan?
Traditional Challenges in ER
• Name/attribute ambiguity
• Errors due to data entry
• Missing values
• Changing attributes
• Data formatting
• Abbreviations / data truncation
“Big-data” ER challenges
• Larger + more datasets
• More heterogeneity
  • E.g., not just name matching any more, but matching Amazon profiles with Google browsing history or Facebook friend lists
• More linked
  • Links crucial to ER; e.g., authors + papers + citations
• More complex structures
  • E.g., Walmart = Walmart Pharmacy?
• Diverse domains
  • No one-size-fits-all method
• Diverse applications
  • Different accuracy requirements; e.g., web search vs. comparison shopping
Outline
• Data preparation and matching features
• Pairwise ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture
Normalization
• Schema normalization
  • Schema matching: e.g., contact# vs. phone
  • Compound attributes: e.g., addr vs. (street, city, st, zip)
  • Nested or set-valued attributes: e.g., properties for rent with a set of tags, multiple phone numbers
• Data normalization
  • Capitalization, white-space normalization
  • Correcting typos; replacing abbreviations, variations, nicknames
  • Usually done by employing “dictionaries”: e.g., lists of businesses, postal addresses, etc.
Matching features
Given two records, compute a “comparison” vector of similarity scores for corresponding features
• E.g., to match two bibliographic references, compute ⟨1st-author-match-score, title-match-score, venue-match-score, year-match-score, …⟩
• Scores can be Boolean (match or mismatch) or real-valued (based on some distance function)
Quick tour of matching features
• Difference between numeric values
• Domain-specific, like Jaro (for names)
• Edit distance: good for typos in strings
  • Levenshtein, Smith-Waterman, affine gap
• Phonetic-based
  • Soundex
• Translation-based
• Set similarity
  • Jaccard, Dice
  • For text fields (sets of words) or relational features (e.g., the set of authors of a paper)
• Vector-based
  • Cosine similarity, TF/IDF (good for text)
Jaro
Specifically designed for names by the U.S. Census
• Given s and t, a character c is common if s_i = t_j = c and |i − j| ≤ max(|s|, |t|) / 2
• c_1 and c_2 are a transposition if c_1 and c_2 are common but appear in different orders in s and t
• Jaro similarity = (1/3) · ( m/|s| + m/|t| + (m − x)/m ), where m = # commons and x = some measure of # transpositions
• Jaro-Winkler further weighs errors early in the strings more heavily
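The definition above can be turned into a short sketch (illustrative Python; real implementations differ slightly in the matching-window and transposition conventions):

```python
def jaro(s, t):
    """Jaro similarity per the definition above: characters are 'common'
    if equal and within a window of max(|s|, |t|) / 2 positions; x counts
    half the common characters that appear out of order."""
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2
    used = [False] * len(t)
    s_common = []
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not used[j] and t[j] == c:
                used[j] = True
                s_common.append(c)
                break
    t_common = [t[j] for j in range(len(t)) if used[j]]
    m = len(s_common)
    if m == 0:
        return 0.0
    x = sum(a != b for a, b in zip(s_common, t_common)) / 2  # transpositions
    return (m / len(s) + m / len(t) + (m - x) / m) / 3

print(round(jaro("martha", "marhta"), 4))  # → 0.9444 (one transposed pair)
```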
Levenshtein
• Distance between strings s and t = cost of the shortest sequence of edit commands that transforms s into t
  • Copy a character from s over to t (cost 0)
  • Delete a character in s (cost 1)
  • Insert a character in t (cost 1)
  • Substitute one character for another (cost 1)
s:    W I L L - I A M _ C O H E N
t:    W I L L L I A M _ C O H O N
op:   C C C C I C C C C C C C S C
cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2
Computing Levenshtein
D(i, j) = score of best alignment between s_1 s_2 ⋯ s_i and t_1 t_2 ⋯ t_j
= min of:
  D(i−1, j−1) + d(s_i, t_j)   (sub/copy)
  D(i−1, j) + 1               (delete)
  D(i, j−1) + 1               (insert)
where d(s_i, t_j) = 1[s_i ≠ t_j], and D(0, 0) = 0, D(i, 0) = i, D(0, j) = j
• Can then normalize using the lengths of s and t: 1 − D(|s|, |t|) / max(|s|, |t|)
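The recurrence translates directly into a dynamic program (a minimal sketch, O(|s|·|t|) time and space):

```python
def levenshtein(s, t):
    """D(|s|, |t|) from the recurrence above."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i                       # delete all of s[:i]
    for j in range(len(t) + 1):
        D[0][j] = j                       # insert all of t[:j]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(D[i-1][j-1] + (s[i-1] != t[j-1]),  # sub/copy
                          D[i-1][j] + 1,                     # delete
                          D[i][j-1] + 1)                     # insert
    return D[len(s)][len(t)]

def levenshtein_sim(s, t):
    """Normalized similarity 1 - D(|s|, |t|) / max(|s|, |t|)."""
    return 1 - levenshtein(s, t) / max(len(s), len(t))

print(levenshtein("WILLIAM_COHEN", "WILLLIAM_COHON"))  # → 2 (one insert + one substitution)
```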
Smith-Waterman
• Find longest “soft matching” subsequence
S(i, j) = max of:
  0                           (start over)
  S(i−1, j−1) − d(s_i, t_j)   (sub/copy)
  S(i−1, j) − G               (delete)
  S(i, j−1) − G               (insert)
where d(s_i, t_j) = 1[s_i ≠ t_j] − 2·1[s_i = t_j], (linear) gap penalty G = 1, and S(0, 0) = 0, S(i, 0) = 0, S(0, j) = 0
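A sketch of the same recurrence, tracking the best cell anywhere in the table (so a match scores +2, a mismatch −1, a gap −G, as above):

```python
def smith_waterman(s, t, G=1):
    """Best local alignment score: match +2, mismatch -1, gap -G."""
    S = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    best = 0
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = 1 if s[i-1] != t[j-1] else -2     # d(s_i, t_j)
            S[i][j] = max(0,                      # start over
                          S[i-1][j-1] - d,        # sub/copy
                          S[i-1][j] - G,          # delete
                          S[i][j-1] - G)          # insert
            best = max(best, S[i][j])
    return best

print(smith_waterman("MCCOHN", "COHEN"))  # → 7, as in the example table
```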
Smith-Waterman example

     C   O   H   E   N
M    0   0   0   0   0
C   +2  +1   0   0   0
C   +2  +1   0   0   0
O   +1  +4  +3  +2  +1
H    0  +3  +6  +5  +4
N    0  +2  +5  +5  +7
Affine gap distance
• Smith-Waterman fails on some pairs that seem quite similar:
  William W. Cohen vs. William W. “Don’t call me Dubya” Cohen
  • Intuitively, a single long insert should be “cheaper” than a lot of short inserts
• Idea: instead of charging nG for a gap of n characters, charge A + (n − 1)B, where A is the cost of opening a gap and B is the cost of continuing it
Dynamic programming, again
• S(i, j) = max of:
    S(i−1, j−1) − d(s_i, t_j)
    I_s(i−1, j−1) − d(s_i, t_j)
    I_t(i−1, j−1) − d(s_i, t_j)
• I_s(i, j) = max of:   (best score in which s_i is aligned with a gap)
    S(i−1, j) − A
    I_s(i−1, j) − B
• I_t(i, j) = max of:   (best score in which t_j is aligned with a gap)
    S(i, j−1) − A
    I_t(i, j−1) − B
(The three scores form a small state machine: stay in S at cost d(s_i, t_j), open a gap into I_s or I_t at cost A, or extend a gap at cost B.)
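A sketch of the three-matrix dynamic program, combined with Smith-Waterman’s local-alignment floor of 0; the match/mismatch scores and the gap costs A = 2, B = 0.5 used in the demo are illustrative choices, not from the slides:

```python
NEG_INF = float("-inf")

def affine_gap(s, t, A=2, B=1):
    """Local alignment with affine gaps: opening a gap costs A, each
    continuation costs B. Matches score +2, mismatches -1."""
    n, m = len(s), len(t)
    S  = [[0] * (m + 1) for _ in range(n + 1)]
    Is = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # s_i aligned with a gap
    It = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # t_j aligned with a gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1 if s[i-1] != t[j-1] else -2
            S[i][j] = max(0,
                          S[i-1][j-1] - d,
                          Is[i-1][j-1] - d,
                          It[i-1][j-1] - d)
            Is[i][j] = max(S[i-1][j] - A, Is[i-1][j] - B)
            It[i][j] = max(S[i][j-1] - A, It[i][j-1] - B)
            best = max(best, S[i][j])
    return best

# One long gap scores better than many scattered one-character gaps:
print(affine_gap("abcd", "abXXXXcd", A=2, B=0.5))  # single 4-char gap
print(affine_gap("abcd", "aXbXcXd", A=2, B=0.5))   # three separate gaps
```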
Set similarity
Given two sets 𝐴 and 𝐵
• Jaccard distance: 1 − |A ∩ B| / |A ∪ B|
• Dice distance: 1 − 2|A ∩ B| / (|A| + |B|)
  • Not a distance metric (the triangle inequality doesn’t hold)
  • Note the connection to the F1 measure, which is the harmonic mean of
    • Precision: TP/(TP+FP)
    • Recall: TP/(TP+FN)
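As Python sets, the two definitions transcribe directly:

```python
def jaccard_dist(A, B):
    """1 - |A ∩ B| / |A ∪ B|"""
    return 1 - len(A & B) / len(A | B)

def dice_dist(A, B):
    """1 - 2|A ∩ B| / (|A| + |B|)"""
    return 1 - 2 * len(A & B) / (len(A) + len(B))

a = set("the quick brown fox".split())
b = set("the quick brown dog".split())
print(jaccard_dist(a, b))  # 1 - 3/5 = 0.4
print(dice_dist(a, b))     # 1 - 6/8 = 0.25
```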
Cosine similarity and TF/IDF
• Let U = {x_1, x_2, …, x_n} be the universe of all elements (e.g., possible words in English)
• A multiset D with elements drawn from U (e.g., a document) can be represented as an n-dimensional vector ⟨w_1, w_2, …, w_n⟩
  • Each w_i can be as simple as c(D, x_i), the count of x_i in D
• Cosine similarity between D_1 and D_2 is (D_1 · D_2) / (‖D_1‖ ‖D_2‖), where ‖·‖ is the L_2 (Euclidean) norm
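A sketch with raw counts as the weights w_i (a `Counter` plays the role of the sparse vector):

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Cosine similarity of two bags of elements, with w_i = c(D, x_i)."""
    c1, c2 = Counter(d1), Counter(d2)
    dot = sum(c1[x] * c2[x] for x in c1)               # D1 · D2
    norm1 = math.sqrt(sum(v * v for v in c1.values()))  # ||D1||
    norm2 = math.sqrt(sum(v * v for v in c2.values()))  # ||D2||
    return dot / (norm1 * norm2)

print(round(cosine("a b a".split(), "a b c".split()), 4))  # 3/sqrt(15) ≈ 0.7746
```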
TF/IDF
Alternatively, if you have a corpus 𝒟 of D’s, define
• Term frequency TF(D, x) = log_10(1 + c(D, x)), where c(D, x) is x’s number of occurrences in D
• Inverse document frequency IDF(𝒟, x) = log_10( |𝒟| / DF(𝒟, x) ), where DF(𝒟, x) is the number of D’s in 𝒟 containing x
• Let w_i = TF(D, x_i) · IDF(𝒟, x_i)
• Idea: elements that don’t serve to distinguish a D within 𝒟 (e.g., stop words) are weighed down
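The same weighting in code, over a toy corpus (log base 10, as in the definitions above):

```python
import math
from collections import Counter

def tfidf_vectors(corpus):
    """w_i = TF(D, x_i) * IDF(corpus, x_i) with the log10 definitions."""
    N = len(corpus)
    df = Counter()                      # DF: number of docs containing x
    for doc in corpus:
        df.update(set(doc))
    vectors = []
    for doc in corpus:
        counts = Counter(doc)
        vectors.append({x: math.log10(1 + c) * math.log10(N / df[x])
                        for x, c in counts.items()})
    return vectors

corpus = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran"]]
vecs = tfidf_vectors(corpus)
print(vecs[0]["the"])  # 0.0: 'the' is in every doc, so IDF = log10(1) = 0
```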
Tokenizing and shingling
What are the “elements” in text? Do we lose sequencing information by treating text as a bag of elements?
• Simply split by non-alphanumeric characters?
  • How about “San Francisco”?
• Can use a language model to find sequences of words that appear “more than random”
• Or additionally treat n-grams (all subsequences of length n) as your “elements” (shingling)
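Character and word n-grams as sets, ready for Jaccard-style comparison (a small sketch):

```python
def shingles(text, n=3):
    """All character n-grams of the text, as a set for set-similarity."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def word_ngrams(text, n=2):
    """Word-level n-grams, which keep phrases like 'San Francisco' intact."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

print(sorted(shingles("hello")))          # ['ell', 'hel', 'llo']
print(sorted(word_ngrams("I love San Francisco")))
```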
Outline
• Data preparation and matching features
• Pairwise ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture
Pairwise-ER
Given a vector of component-wise similarity scores for records x and y, compute P(x and y match)
Possible solutions:
• Check the weighted sum of component-wise scores against a threshold to determine match/non-match
  • E.g., 0.5 × 1st-author-match-score + 0.2 × venue-match-score + 0.3 × title-match-score ≥ 0.8
• Formulate rules about what constitutes a match
  • E.g., (1st-author-match-score > 0.7 AND venue-match-score > 0.8) OR (title-match-score > 0.9 AND venue-match-score > 0.9)
Hard to come up with weights, thresholds, and rules!
Fellegi & Sunter (Science, 1969)
• Given record pair r = (x, y) to match, with γ as the score vector
• Let M denote matches and U non-matches
• Decision rule: R = P(γ | r ∈ M) / P(γ | r ∈ U)
  • Non-match if R ≤ t_L; match if R ≥ t_U; uncertain otherwise
• Naïve Bayes assumption: P(γ | r ∈ M) = ∏_i P(γ_i | r ∈ M)
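A toy sketch of the decision rule under the naïve Bayes assumption; the per-feature probabilities P(γ_i | M), P(γ_i | U) and the thresholds below are invented for illustration (in practice they are estimated from data, e.g., via EM):

```python
# Illustrative (invented) probabilities for three Boolean comparisons:
# P(gamma_i = 1 | M) and P(gamma_i = 1 | U).
P_M = {"name": 0.95, "venue": 0.80, "year": 0.90}
P_U = {"name": 0.05, "venue": 0.20, "year": 0.30}

def fellegi_sunter(gamma, t_lower=0.5, t_upper=100.0):
    """R = P(gamma | r in M) / P(gamma | r in U), factored feature by
    feature under the naive Bayes assumption, then thresholded."""
    R = 1.0
    for feat, matched in gamma.items():
        p_m = P_M[feat] if matched else 1 - P_M[feat]
        p_u = P_U[feat] if matched else 1 - P_U[feat]
        R *= p_m / p_u
    if R >= t_upper:
        return R, "match"
    if R <= t_lower:
        return R, "non-match"
    return R, "uncertain"

print(fellegi_sunter({"name": True, "venue": True, "year": True}))
# R = 19 * 4 * 3 = 228 >= t_upper, so "match"
```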
Supervised ML for pairwise ER
• Naïve Bayes, decision trees (Cochinwala et al., IS 2001), support vector machines (Bilenko & Mooney, KDD 2003; Christen, KDD 2008), ensembles of classifiers (Chen et al., SIGMOD 2009), conditional random fields (Gupta & Sarawagi, VLDB 2009), etc.
• Imbalanced classes: typically many more negatives (O(|R|²)) than positives (O(|R|))
• Pairs/matches are not i.i.d.
  • E.g., (x, y) ∈ M and (y, z) ∈ M implies (x, z) ∈ M
• Constructing a training set is hard
  • Most pairs are “easy non-matches”
  • Some pairs are inherently ambiguous (e.g., is Paris Hilton a person or a business?); others have missing attributes (e.g., Starbucks, Durham, NC)
Active learning
• Focus labeling efforts to reduce the “confusion region” of classifiers
• To assess uncertainty, use the classifier’s output (e.g., posterior probabilities of a Bayesian classifier) or votes by a “committee” (multiple weak classifiers)
• Again, beware of the evaluation metric: 0-1 loss is no good; need to maximize recall with acceptable precision
• Arasu et al., SIGMOD 2010; Bellare et al., KDD 2012
Outline
• Data preparation and matching features
• Pairwise ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture
Constraint under record linkage
• Record linkage: link records between two databases (each of which has been deduplicated independently)
• Exclusivity constraint: a record in one database can match at most one record in the other database
  • Pairwise ER may well match a record with multiple records!
Weighted bipartite matching
• Nodes in N_1 and N_2 are records from the two respective databases
• For each r_1 ∈ N_1 and r_2 ∈ N_2, draw an edge (r_1, r_2) and assign it a weight based on the pairwise similarity score (e.g., log odds of match)
• Find a matching (i.e., a set of edges without common nodes) that maximizes the sum of weights
  • Can be done in O(|R|³) time using the Hungarian algorithm
  • In practice, no need to generate all O(|R|²) edges, because some pairs are obviously non-matches (Gupta & Sarawagi, VLDB 2009)
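To make the objective concrete, here is a brute-force sketch that enumerates all permutations; it is O(n!) and purely illustrative, whereas the Hungarian algorithm solves the same problem in O(|R|³). The weight matrix is invented for the example:

```python
from itertools import permutations

def best_matching(weights):
    """Max-weight bipartite matching by brute force over all
    permutations of an n x n weight matrix (rows: N1, columns: N2)."""
    n = len(weights)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm          # perm[i]: N2 record matched to i

# Illustrative log-odds-style weights for 3 records in each database:
w = [[5, 1, 0],
     [2, 4, 1],
     [0, 1, 3]]
print(best_matching(w))  # → (12, (0, 1, 2))
```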
Outline
• Data preparation and matching features
• Pairwise ER
• Leveraging constraints in ER
  • Record linkage: exclusivity
  • Deduplication: transitivity
  • “Collective” ER: general
• Next lecture
Constraint under deduplication
• Deduplication: given a database containing potential duplicate mentions of the same entities, partition the mentions into equivalence classes
• Transitivity constraint:
  • If (x, y) ∈ M and (y, z) ∈ M, we must have (x, z) ∈ M
  • Pairwise ER may or may not give us (x, z) in this case
• A quick fix: compute the transitive closure on the inferred match relationships?
  • Bad idea in some cases: graphs resulting from pairwise ER can have diameter > 20 (Rastogi et al., CoRR 2012), so the closure can link very dissimilar records
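Computing the transitive closure of pairwise matches is just connected components, e.g., with union-find (a minimal sketch):

```python
def cluster_by_closure(n, matches):
    """Partition records 0..n-1 into the equivalence classes induced by
    the pairwise matches, using union-find with path halving."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for x, y in matches:
        parent[find(x)] = find(y)

    clusters = {}
    for x in range(n):
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())

# (0,1) and (1,2) matched pairwise; transitivity forces 0 and 2 together:
print(cluster_by_closure(4, [(0, 1), (1, 2)]))  # → [[0, 1, 2], [3]]
```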
Clustering-based ER
• Resolution decisions are not made independently for each pair of records (good)
• Unsupervised (good), although it often still needs pairwise similarity as input
• Existing clustering algorithms may be used, but
  • The number of clusters is not known in advance
  • There are many, many small (possibly singleton) clusters, which is not what most existing clustering algorithms expect
Possible clustering approaches
• Hierarchical clustering
  • Bilenko et al., ICDM 2005
• Nearest-neighbor-based methods
  • Chaudhuri et al., ICDE 2005
• Correlation clustering
  • Soon et al., CL 2001; Ng et al., ACL 2002; Bansal et al., ML 2004; Elsner et al., ACL 2008; Ailon et al., JACM 2008; etc.
Correlation clustering
• Key advantage: no need to give the number of clusters in advance; finds the optimal number automatically
• Key idea: maximize the sum of
  • Similarities between nodes within the same cluster
  • Dissimilarities between nodes in different clusters
Integer linear program formulation
• Constants
  • w⁺_xy ∈ [0, 1]: cost of clustering x and y together
  • w⁻_xy ∈ [0, 1]: cost of putting x and y in different clusters
• Variables
  • r_xy = 1 if x and y are in the same cluster, 0 otherwise
• Minimize ∑_xy ( r_xy w⁺_xy + (1 − r_xy) w⁻_xy )
  subject to ∀ x, y, z ∈ R: r_xy + r_yz + r_xz ≠ 2
  • The constraint is basically transitivity
  • Note that what matters is the net weight w±_xy = w⁻_xy − w⁺_xy
• Setting up weights using pairwise similarity p_xy
  • Additive: w⁺_xy = 1 − p_xy; w⁻_xy = p_xy
  • Or logarithmic: w⁺_xy = log(1 − p_xy); w⁻_xy = log(p_xy)
• Problem is known to be NP-hard
Greedy algorithms
• Step through the records in some random order
• To label the next record x, use a heuristic rule to pick an existing cluster
  • Or start a new cluster with x by itself
• In practice, run the algorithm multiple times and take the best answer
Greedy algorithms
Step through the nodes in random order; use a linking rule to place each unlabeled node among the previously assigned ones.
(Figure legend: red arc = negative w±, prefer separate; green arc = positive w±, prefer together)
FIRST rule
• Soon et al., CL 2001
• First link: place the next node in the cluster containing the most recent positive arc, or start a new cluster if all arcs are negative
BEST rule
• Ng & Cardie, ACL 2002
• Best link: place the next node in the cluster containing the highest-scoring arc, or start a new cluster if all arcs are negative
VOTE rule
• Elsner et al., ACL 2008
• Voted link: place the next node in the cluster with the highest arc sum, or start a new cluster if all arc sums are negative
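The three linking rules can be sketched in one greedy routine; the pairwise match probabilities below are invented for illustration, with the net weight w± taken as the log odds log(p / (1 − p)):

```python
import math

def greedy_cluster(nodes, w, rule="vote"):
    """Greedy correlation clustering. w(x, y) is the net weight w±:
    positive prefers the same cluster, negative prefers separate.
    Rules: 'first' (most recent positive arc), 'best' (highest arc),
    'vote' (highest arc sum); new cluster if no positive choice."""
    clusters, order = [], []            # order: (node, cluster), oldest first
    for x in nodes:
        target = None
        if rule == "first":
            for y, c in reversed(order):           # most recent labels first
                if w(x, y) > 0:
                    target = c
                    break
        else:
            agg = max if rule == "best" else sum
            scored = [(agg(w(x, y) for y in c), c) for c in clusters]
            if scored:
                s, c = max(scored, key=lambda p: p[0])
                if s > 0:
                    target = c
        if target is None:                          # all arcs negative
            target = [x]
            clusters.append(target)
        else:
            target.append(x)
        order.append((x, target))
    return clusters

# Illustrative pairwise probabilities, symmetric:
p = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.1,
     ("a", "d"): 0.2, ("b", "d"): 0.3, ("c", "d"): 0.1}

def w(x, y):
    pr = p.get((x, y), p.get((y, x)))
    return math.log(pr / (1 - pr))

print(greedy_cluster(list("abcd"), w, rule="vote"))  # → [['a', 'b'], ['c'], ['d']]
```

Note how the rules can disagree: BEST happily adds c to {a, b} because of the single strong a-c arc, while VOTE keeps c out because the negative b-c arc outvotes it.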
PIVOT rule
• Ailon et al., JACM 2008
• Create each whole cluster at once: take the first node as the pivot, and add all nodes with positive arcs to the pivot’s cluster
• Then choose the next unlabeled node as the new pivot, and repeat
Comparison of heuristics
Ailon et al., JACM 2008:
• PIVOT has approximation guarantees
  • 5-approximation if w⁻_xy + w⁺_xy = 1 (probability constraints)
  • 2-approximation if the weights satisfy the triangle inequality
Elsner & Schudy, ILP-NLP 2009:
• VOTE works well in practice
• Local improvement can always be used in post-processing
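A minimal sketch of PIVOT, with a seeded random order for reproducibility (the ±1 weight function is an invented toy example):

```python
import random

def pivot_cluster(nodes, w, seed=0):
    """PIVOT: pick a random unlabeled node as the pivot and put every
    unlabeled node with a positive arc to it into the pivot's cluster."""
    rng = random.Random(seed)
    unlabeled = list(nodes)
    rng.shuffle(unlabeled)
    clusters = []
    while unlabeled:
        pivot = unlabeled.pop(0)
        cluster = [pivot] + [y for y in unlabeled if w(pivot, y) > 0]
        unlabeled = [y for y in unlabeled if y not in cluster]
        clusters.append(cluster)
    return clusters

# Two ground-truth groups {0, 1, 2} and {3, 4, 5}: +1 within, -1 across.
def same_group(x, y):
    return 1 if x // 3 == y // 3 else -1

print(pivot_cluster(range(6), same_group))
```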