
Reuse and Adaptation for Entity Resolution through Transfer Learning

Saravanan Thirumuruganathan‡   Shameem A Puthiya Parambath‡   Mourad Ouzzani‡   Nan Tang‡   Shafiq Joty†

‡Qatar Computing Research Institute, HBKU, Qatar   †Nanyang Technological University, Singapore
{sthirumuruganathan, spparambath, mouzzani, ntang}@hbku.edu.qa   [email protected]

ABSTRACT

Entity resolution (ER) is one of the fundamental problems in data integration, where machine learning (ML) based classifiers often provide the state-of-the-art results. Considerable human effort goes into feature engineering and training data creation. In this paper, we investigate a new problem: Given a dataset D_T for ER with limited or no training data, is it possible to train a good ML classifier on D_T by reusing and adapting the training data of a dataset D_S from the same or a related domain? Our major contributions include (1) a distributed representation based approach to encode each tuple from diverse datasets into a standard feature space; (2) identification of common scenarios where the reuse of training data can be beneficial; and (3) five algorithms for handling each of the aforementioned scenarios. We have performed comprehensive experiments on 12 datasets from 5 different domains (publications, movies, songs, restaurants, and books). Our experiments show that our algorithms provide significant benefits such as providing superior performance for a fixed training data size.

1. INTRODUCTION

Entity resolution (ER) – which identifies pairs of duplicate entities – is a fundamental problem in data integration. Several studies [29, 37, 19] show that machine learning (ML)-based methods often provide state-of-the-art results for ER.

The seedy underbelly of such superior performance is the considerable human effort that underlies these ML methods. Critical prerequisites to obtain the aforementioned state-of-the-art results often include the availability of enough labeled data in the form of matching and non-matching tuple pairs, good feature engineering, and fine-tuning of the models. Each of these steps requires considerable human effort, which is obviously expensive. Furthermore, many of the sophisticated ML models require large training data to achieve good results [30]. In fact, it has been reported [18] that achieving F-measures of ∼99% with random forests can require up to 1.5M labels even for relatively clean datasets.

Large organizations often have 100s of datasets to be integrated and deduplicated [51]. Furthermore, organizations continuously produce or acquire new datasets, and each time they have to identify duplicates before feeding these datasets to downstream processes. Unfortunately, they need to spend a lot of human effort with each new dataset if the previous steps are all repeated from scratch.

Reuse and Adaptation for ER. In this paper, we investigate the following problem for ER:

Given a target dataset D_T to be deduplicated with limited or no training data, is it possible to train a good ML classifier for D_T by reusing and adapting the training data from a related dataset D_S?

An affirmative answer to the above problem has the potential to dramatically reduce the human effort needed. We would like to note that our work is orthogonal to, and can be used in conjunction with, other approaches that reduce labeling effort such as active learning [48] and weak supervision [46].

Prior Art and Their Limitations. Consider source D_S and target D_T datasets, with labeled training data D_S^L and D_T^L, respectively, where "+/−" means a matching/non-matching tuple pair, as shown in Figure 1. Consider three classical methods for designing an ML classifier, M(D_T), for D_T. (1) No Transfer (NoT) only uses the training data D_T^L from the target to train M(D_T), while ignoring the training data, D_S^L, from the source. (2) Naive Transfer (NvT). The other end of the spectrum is to blindly apply the classifier built on D_S to D_T. (3) Transfer Learning (TL). A smarter approach is to use the training data not only from the target (D_T^L), but also from the source (D_S^L), which is the chief goal of transfer learning (TL) [40, 52].

Obviously, (1) NoT will not work when D_T has no training dataset (i.e., D_T^L = ∅), or when D_T^L is very small, thus producing a classifier that will be biased and likely overfit, especially for high-capacity ML classifiers such as deep learning, random forests, and SVMs. (2) NvT works well only if the two datasets are very similar in terms of both features and data distribution, which is often violated in practice, for example when we have different schemas and distributions of duplicates between the source and the target. Although intuitive, directly applying prior TL methods (3) is doomed to fail since, as we shall elaborate in Section 4 and in the experiments, for ER purposes the datasets D_S and D_T might have a number of differences that make knowledge transfer challenging. These can be major, such as different schemata, or subtle, such as different vocabulary, different distributions of duplicates, different duplicate to non-duplicate ratios, and divergence in the most discriminative features.

[Figure 1: Illustration of Designing ER Classifiers]

Research Challenges. Let X be the similarity vector between two tuples and y be the label (1 for duplicate, 0 for not). The Fellegi-Sunter model, a central model for solving the ER problem [21], can be considered as a probabilistic decision rule that declares a given X as a duplicate if the likelihood P(y|X) exceeds a given threshold. Any ML classifier can be used to learn this probability from the data. Given this setting, the fundamental challenges of TL for ER are:

(I) Prior Probability Shift occurs when the probability distribution of the labels varies between D_S and D_T. For example, D_S could be a relatively clean dataset with very few duplicates while D_T could be a noisy dataset where a much larger fraction of tuple pairs are duplicates, resulting in very different prior probabilities P_S(y) and P_T(y).

(II) Covariate Shift occurs when D_S and D_T have very different distributions of similarity vectors P_S(X) and P_T(X). For instance, if D_S is mostly clean, then most X from P_S(X) might have low similarity values.

(III) Class Conditional Probability Distribution Shift occurs when the quantities P_S(y|X) and P_T(y|X) differ, i.e., given the same similarity vector X, the likelihood that it is a duplicate might vary between D_S and D_T.

(IV) Sample Selection Bias occurs when the process by which datasets D_S and D_T are obtained is very different. One common culprit is blocking, which can be considered a non-random process that selects a subset of all possible tuple pairs (i.e., the candidate set) using a label-dependent process where (potential) duplicates are preferentially chosen over (potential) non-duplicates. If D_S and D_T use different blocking processes (such as varying aggressiveness for pruning), the generated candidate sets could have selection bias. Covariate shift is a special case of sample selection bias.

(V) Imbalanced Data. ER is usually an imbalanced classification problem as non-duplicates outnumber duplicates by orders of magnitude. This issue is compounded by the fact that the sizes of D_S and D_T (or their corresponding labeled subsets D_S^L, D_T^L) could be imbalanced as well. For example, D_S might have millions of tuple pairs while D_T could have only thousands. If we blindly apply TL, any signal from D_T will be overwhelmed by D_S.

Contributions.

(1) Identification of Common ER Reuse Scenarios. We identify a set of real-world scenarios that might benefit from the reuse of ER classifiers. Each of these scenarios constitutes an equivalence class requiring a specific reuse technique. (Section 2)

(2) Feature Space Standardization. A fundamental roadblock in applying TL for ER is the heterogeneity of the schemata and thereby of the features for the classifiers. We propose an effective mechanism such that tuple pairs from each dataset are encoded into the same feature space, which facilitates the reuse of classifiers. Our method is based on distributed representations (DRs), a fundamental concept in deep learning. This has a number of advantages, such as allowing semantic similarity, making our approach as close to off-the-shelf methods as possible. (Section 3)

(3) Reuse Algorithms Spectrum. We propose five algorithms that can handle most of the common labeling data scenarios in applying TL to ER and the aforementioned challenges. These include no transfer, naive transfer, instance weighting, and feature augmentation with/without labeled data. The algorithms are classifier-agnostic and can be readily adapted to all of the popular ER classifiers such as random forests, logistic regression, and SVMs. (Section 4)

(4) Practical Issues. We describe a number of practical issues that one might encounter when trying to reuse ML classifiers for ER and provide empirical rule-of-thumb advice that is backed by extensive experiments. Some of the important issues that we study include: (a) how to reuse when there are multiple possible sources, (b) whether it is possible to reuse a dataset from an unrelated domain, (c) when and when not to reuse, and (d) how to select an appropriate source dataset to reuse from a set of candidates. (Section 5)

(5) Experiments. We conduct comprehensive experiments on 12 datasets from 5 domains (publications, movies, songs, restaurants, and books). The results show that TL can provide significant benefits such as providing superior performance for a fixed training dataset size. (Section 6)

2. BACKGROUND AND SCENARIOS

In this section, we first introduce key concepts related to ER (Section 2.1) and then categorize the different data scenarios for applying TL to ER (Section 2.2).

2.1 Entity Resolution

Let T and T′ be two relations with aligned schema {A_1, A_2, ..., A_m}. We denote by t[A_j] the value of attribute A_j on tuple t. The problem of entity resolution (ER) is, given all distinct tuple pairs (t, t′) ∈ T × T′, to determine which pairs of tuples refer to the same real-world entity. A pair of tuples is said to match (resp. mismatch) when they refer to the same (resp. different) real-world entity.

Blocking. For efficiency reasons, ER solutions typically first run blocking methods, which generate a candidate set C ⊆ T × T′ that includes tuple pairs that are likely to match.

Training Data. Most, if not all, ER solutions need training data, which is formalized as follows. A labeled training dataset is L ⊆ T × T′ × {0, 1}, where the triplet (t, t′, 1) (resp. (t, t′, 0)) denotes that tuples t and t′ are (resp. are not) duplicates.

Fellegi-Sunter Model [21] is a formal framework for probabilistic ER, and most prior ML works are simple variants of this approach. Basically, they train an ML classifier M on the labeled dataset L such that M can accurately distinguish tuple pairs in the candidate set C as either match or non-match, by using vectors of similarity scores between aligned attributes as features.
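To make this setup concrete, the following is a minimal Python sketch of a Fellegi-Sunter style matcher; the attribute names, the string-similarity measure, and the choice of a random forest are illustrative assumptions rather than prescriptions from the model.

```python
# Minimal sketch of a Fellegi-Sunter style matcher: each candidate tuple pair
# becomes a vector of per-attribute similarity scores, and a classifier learns
# P(match | similarity vector). Attribute names are hypothetical.
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

ATTRS = ["title", "authors", "venue", "year"]   # assumed aligned schema

def sim_vector(t1, t2):
    """Per-attribute string similarity between two tuples given as dicts."""
    return [SequenceMatcher(None, str(t1[a]), str(t2[a])).ratio() for a in ATTRS]

def train_matcher(labeled_pairs):
    """labeled_pairs: list of (tuple_a, tuple_b, label) with label in {0, 1}."""
    X = [sim_vector(a, b) for a, b, _ in labeled_pairs]
    y = [label for _, _, label in labeled_pairs]
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def is_duplicate(clf, t1, t2, threshold=0.5):
    # Declare a duplicate when the estimated P(y = 1 | X) exceeds the threshold.
    return clf.predict_proba([sim_vector(t1, t2)])[0][1] >= threshold
```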

Monotonicity of Precision was introduced in [2]; it states that if the similarity score of (t_1, t_2) is higher than that of (t′_1, t′_2), then (t_1, t_2) is more likely to be a duplicate. Although the monotonicity assumption could be violated for particular tuple pairs, it usually holds in an aggregate sense.

2.2 ER Scenarios for TL

Transfer Learning (TL) seeks to utilize the knowledge from one or more related sources, usually in the form of features, classifiers, and training data, in order to design an ML classifier for a target dataset with limited or no training data (please refer to the survey [40] and Section 7 for more details).

[Figure 2: Feature Truncation and Standardization]

In this work, we primarily seek to transfer both features (by encoding tuples in a standard feature space) and training data, based on (a) encoding tuple pairs as similarity vectors as per the Fellegi-Sunter model, and (b) assuming the monotonicity of precision. Directly transferring ML classifiers is often much trickier and requires non-trivial changes. Our approach, which does not transfer ML classifiers, has a number of advantages such as requiring minimal changes to existing ER pipelines and providing the flexibility to use different classifiers between the source and target datasets.

ER Scenarios for TL. Let the target dataset be D_T = (T, T′) and the source dataset be D_S = (T_S, T′_S). We will postpone the discussion of n sources {D_S^1 = (T_1, T′_1), ..., D_S^n = (T_n, T′_n)} to Section 5.

We categorize ER scenarios based on the amount of training data available for building an ML classifier. We use the terms "limited" and "adequate" to loosely denote that the amount of training data required for building an ML classifier is not sufficient and sufficient, respectively. Of course, this can vary for each dataset and ML classifier, and is often determined by a domain expert. For example, deep learning based methods may require much larger training data than simpler models such as decision trees.

We consider four major scenarios for (Source, Target) that commonly occur in practice: (1) (Adequate, Nothing), (2) (Adequate, Limited), (3) (Limited, Limited), and (4) (Adequate, Adequate).

Scenario (Adequate, Nothing) is hard as one has to build a classifier using the training data from D_S that yet performs satisfactorily on D_T. For the scenario (Adequate, Limited), we wish to "augment" the limited training data from D_T with the training data from D_S. This augmentation must be done in a careful manner so that the signals from the source dataset do not swamp the ones from D_T. (Limited, Limited) is a tricky scenario where we do not have adequate labeled data for both the source and target datasets. Instead of throwing up our hands in the air, we can try to "pool" these limited resources so that we can get an effective classifier that performs well on each of these datasets. One can apply a traditional ML classifier and achieve satisfactory results for the final scenario (Adequate, Adequate). However, it is still possible to improve the performance of a classifier by transferring some additional relevant knowledge and making it more robust.

3. FEATURE SPACE STANDARDIZATION

In this section, we first describe the rationale for encoding all the datasets into a standard feature space (Section 3.1). We then propose such a feature space based on distributed representations – a fundamental concept from deep learning. We also propose an algorithm to encode each tuple into this standard feature space and describe the various advantages conferred by such an approach (Section 3.2).

3.1 Rationale for Feature Standardization

Feature Space Truncation vs. Standardization. Given a "source" D_S and a "target" D_T, it is not necessary – and even unlikely – that they share a common schema. Consider an example where D_T and D_S have 10 and 15 attributes, respectively, with 7 attributes in common. We cannot directly apply the classifier built for D_S to D_T as they will have totally different feature spaces. A naive approach would be to (re)build an ML classifier for D_S using only the 7 common attributes and apply it on D_T. We dub this simplistic and unappealing approach feature space truncation.

We instead advocate a feature space standardization approach where tuples from each relation in D_S and D_T are all encoded into a standard feature space of dimension d. Then we can build an ML classifier based on the Fellegi-Sunter model, where the input vector measures the similarity between two tuples in this standardized (or shared) feature space. Figure 2 illustrates the difference between feature truncation and feature standardization.

Feature Space Standardization in Other Domains. Such standardization has been successfully used in a number of domains. Consider an object recognition task in the computer vision domain on an input image with 1024×1024 pixels. Any image – whether it is that of a bird, an animal, a mountain, or some random object – can be encoded as an image with 1024×1024 pixels. In other words, all images are implicitly encoded into the same feature space. Hence, an ML model for classifying a given image as cat or dog (possibly trained over images photographed indoors) can also be used to classify cats vs. dogs from another domain (images of cats and dogs taken outdoors). A similar phenomenon occurs in information retrieval (IR). Given a fixed vocabulary – say from the English dictionary – it is possible to encode every English document as a term-frequency vector under the bag-of-words semantics. An ML model trained for categorizing an email as spam vs. non-spam in one domain can be applied over another domain as long as the inputs can be encoded in this feature space.

One might be tempted to use the IR approach for feature space standardization in ER. Specifically, construct a dictionary of all unique words from D_S and D_T and then encode each tuple either as a term-frequency or a TF-IDF vector under the bag-of-words semantics. However, this approach has a number of disadvantages. First, this has to be repeated for different D_S and D_T. Second, this simplistic encoding does not allow the use of sophisticated similarity measures such as semantic measures. As we shall see, our proposed approach dramatically improves on this naive scheme.

3.2 Distributed Representation for Tuples

We propose a standard feature space by using a fundamental concept from deep learning, namely distributed representations (DRs) of words. We first describe these DRs (see [24] for more details). We then present how to compose a DR for a tuple from its component words. Note that the rest of the paper is oblivious to the specific feature space and readily admits other approaches that could encode tuples from various datasets as vectors in a fixed feature space.

Distributed Representation for Words. DRs of words (a.k.a. word embeddings), recently introduced in deep learning, are learned from the data such that semantically related words have embeddings that are often close to each other. Typically, these approaches map each word in a dictionary into a high-dimensional vector (e.g., 300 dimensions) where the geometric relation between the vectors of two words – such as vector differencing or cosine similarity – encodes a semantic relationship between them. There exists a cornucopia of DRs for words including word2vec [36], GloVe [43], fastText [11] and, very recently, ELMo [45]. Figure 3 shows some sample word embeddings. More discussion of DRs for ER can be found in [19, 37].

[Figure 3: Sample Word Embeddings]

Distributed Representation for Tuples. We begin by treating each tuple as a document by breaking the attribute boundaries. For example, a tuple with four attributes (title, author, venue, year) representing the foundational paper on databases is represented as the document "A relational model of data for large shared data banks. Edgar Codd. Communications of the ACM. 1970.".

By treating each tuple as a sentence, one can leverage the extensive work from NLP on encoding sentences, paragraphs, and documents into a DR [32]. However, these approaches are time consuming, and recent works such as [56, 3] have shown that such complicated methods are often outperformed in a transfer learning setting by simpler compositional approaches. Hence, we propose a simple, yet efficient and effective algorithm by synthesizing approaches from recent work on deep learning for ER [19, 37] and theoretical underpinnings of word embeddings [3, 26].

DR Composition. Let us first consider a simple scenario where each of the words in the tuple is present in the DR dictionary. A natural approach for composing a DR for a tuple is to simply average the DRs of all words in the sentence:

DR(t) = \frac{1}{|t|} \sum_{w \in t} DR(w)    (1)

DR(t) and DR(w) denote the DRs for tuple t and word w, respectively. However, it is possible to do better by computing a weighted average of the DRs instead of simple averaging. Smooth Inverse Frequency (SIF) [3] is one such weighting scheme that is reminiscent of TF-IDF weighting from IR. Intuitively, the weight of a word is a/(a + p(w)), where a is a parameter (typically set to 0.0001) and p(w) is the frequency of word w. This simple approach is known to outperform both simple averaging and even some complex approaches [3, 56] due to the fact that the salient words often have a higher impact than common words. The projection of the average vectors on their first principal component is then removed to reduce the bias. The frequency of a word w can be computed from the union of D_S and D_T, or even from the union of all relations {D_S^1, ..., D_S^n} from a given domain even if they are not involved in the given ER task.

DR(t) = \frac{1}{|t|} \sum_{w \in t} \frac{a}{a + p(w)} DR(w)    (2)

Algorithm 1 Feature Space Standardization
1: Input: Target dataset D_T
2: Optional Input: Source datasets D_S = {D_S^1, ..., D_S^n}
3: W = set of all distinct words from D_T
4: W_D = set of all words from the DR dictionary
5: W_U = W \ W_D
6: // Compute DRs for unknown words
7: for each word w ∈ W_U do
8:    Collect contexts C_w from D_S ∪ D_T
9:    Compute DR(w) using Equation 3
10: // Compute weights for SIF
11: Estimate the frequency of each word w ∈ W over D_S ∪ D_T
12: for each tuple t ∈ D_T do
13:    Compute DR(t) using Equation 2
14: // Compute distributional similarity vectors
15: for each (t, t′) in the candidate set C of D_T do
16:    DR(t, t′) = |DR(t) − DR(t′)|
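For illustration, here is a minimal Python sketch of the SIF-weighted composition of Equation 2 (the principal-component removal is omitted); the embedding lookup word_vectors, the whitespace tokenization, and the function names are assumptions of the sketch.

```python
# Sketch of SIF-weighted tuple embeddings (Equations 1-2). Assumes `word_vectors`
# maps each known word to a d-dimensional numpy array; the removal of the first
# principal component is omitted for brevity.
import numpy as np
from collections import Counter

def word_frequencies(tuple_texts):
    """Estimate p(w) over the union of the datasets (each tuple as one string)."""
    counts = Counter(w for text in tuple_texts for w in text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def tuple_dr(tuple_text, word_vectors, p, a=1e-4, d=300):
    """Weighted average of word DRs with SIF weight a / (a + p(w))."""
    words = [w for w in tuple_text.lower().split() if w in word_vectors]
    if not words:
        return np.zeros(d)
    weighted = [(a / (a + p.get(w, 0.0))) * word_vectors[w] for w in words]
    return np.mean(weighted, axis=0)
```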

Handling Out-of-Vocabulary Words. A potential wrinkle in tuple composition happens when some of the words in the dataset are not present in the DR dictionary. This is dubbed the out-of-vocabulary (OOV) scenario. Of course, a simplistic and undesirable solution is to simply ignore such words when computing the DR of a tuple or to use a random vector. Prior work such as [19] also proposes expensive solutions such as building a domain-specific DR, using a probabilistic retrofitting approach, or using subword information [11].

In this work, we use a simple technique that approximates the DR of an unknown word from its context. Imagine that you are reading a sentence with a single unknown word w and no access to a dictionary. Usually, you will seek to infer the word's meaning from its context (a list of other known words near w). If you also have access to a list of other sentences where this unknown word w occurs, you can get a better handle on the word's meaning by examining each of its contexts [26]. We empirically found that this approach provides a good-enough approximation of DRs for the unknown words that is comparable with other approaches at a fraction of their time. Formally, we compute the DR of w as

DR(w) = \frac{1}{|C_w|} \sum_{c \in C_w} \frac{1}{|c|} \sum_{w' \in c} DR(w')    (3)

where C_w is the set of contexts in which w occurs. In our paper, we defined the context to be the k = 5 words that occur before and after w.
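A small sketch of this context-based approximation (Equation 3) follows; the tokenized corpus and the word_vectors lookup are assumed to be available from the previous step.

```python
# Sketch of Equation 3: approximate the DR of an out-of-vocabulary word by
# averaging the DRs of known words that appear within k tokens of each of its
# occurrences, and then averaging over all such contexts.
import numpy as np

def oov_dr(word, sentences, word_vectors, k=5, d=300):
    """sentences: list of token lists drawn from D_S and D_T."""
    context_vecs = []
    for sent in sentences:
        for i, w in enumerate(sent):
            if w != word:
                continue
            window = sent[max(0, i - k):i] + sent[i + 1:i + 1 + k]
            known = [word_vectors[c] for c in window if c in word_vectors]
            if known:                          # DR of this single context
                context_vecs.append(np.mean(known, axis=0))
    return np.mean(context_vecs, axis=0) if context_vecs else np.zeros(d)
```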

Distributional Similarity Vectors. The next step is to compute the distributional similarity vector for a pair of tuples t and t′, which is passed as the input to the ML classifier. Popular approaches to compute the distributional similarity vector DR(t, t′) include vector differencing (op = subtraction) and the Hadamard product (op = product). In our paper, we use vector differencing:

DR(t, t′) = ⟨ |DR(t)[1] op DR(t′)[1]|, ..., |DR(t)[d] op DR(t′)[d]| ⟩    (4)

Putting It All Together. Algorithm 1 shows how to encode all the tuples in the same feature space such that their distributional similarity vectors can be passed to an ML classifier for training/prediction under the Fellegi-Sunter model.
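A minimal sketch of this final pairing step is shown below; tuple_dr refers to the composition sketched earlier, and the helper names are illustrative.

```python
# Sketch of the pairing step (Equation 4 with op = subtraction): the classifier
# input for a candidate pair is the element-wise absolute difference of the DRs.
import numpy as np

def pair_feature(dr_t, dr_t2):
    return np.abs(dr_t - dr_t2)               # d-dimensional similarity vector

def candidate_features(candidate_pairs, tuple_drs):
    """candidate_pairs: list of (id_a, id_b); tuple_drs: dict id -> DR vector."""
    return np.stack([pair_feature(tuple_drs[a], tuple_drs[b])
                     for a, b in candidate_pairs])
```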


Advantages of DR based Feature Standardization.

(1) All tuples from all datasets are encoded into the same feature space, enabling reuse of ML classifiers.

(2) The feature space has a number of appealing properties such as encoding semantic similarity and a fine-grained similarity computed holistically over d = 300 dimensions.

(3) This standardization allows us to pool training data from multiple source datasets – whereas under the feature truncation approach, this would have resulted in a small set of features that are present in each of the source datasets.

(4) It minimizes the effort of a domain expert in identifying appropriate features, similarity functions, and so on.

(5) It allows reuse of popular DRs such as word2vec, GloVe, and fastText that are trained on very large corpora.

(6) It allows reuse of ML classifiers even from unrelated domains by leveraging the monotonicity of precision.

4. ALGORITHMS

Our next goal is to study methods of TL for ER. After introducing some notation, we lay down different desiderata for devising TL algorithms for ER (Section 4.1). We then describe our proposed algorithms (Section 4.2).

Notation. For expository convenience, we abuse notation and use D_S and D_T also to denote the sets of distributional similarity vectors of the source and target datasets obtained through Algorithm 1. Let D_S^L ⊆ D_S be the set of similarity vectors for which labeled data is available, and let D_S^U = D_S \ D_S^L be the corresponding set of unlabeled similarity vectors. The sets D_T^L and D_T^U are defined analogously for the target D_T.

4.1 Algorithm Desiderata

There has been extensive prior work on TL algorithms for various domains [40]. However, most of these algorithms cannot be directly applied here as they are vulnerable to ER-specific challenges or ignore key ER properties. Our goal is to select and adapt an effective set of algorithms that perform well for ER. More specifically, they must:

D1: successfully address the ER-specific challenges such as imbalanced data, diverse schemata, and varying vocabulary,

D2: be capable of leveraging key ER properties such as similarity vectors as features and the monotonicity of precision,

D3: be simple, efficient, and readily includable in existing ER pipelines such as Magellan [29],

D4: work with classifiers that are widely used in ER,

D5: be dataset and domain agnostic, and

D6: allow seamless transfer from multiple source datasets.

A key implication of these desiderata is that we would prefer algorithms that cause as little change in the existing pipelines as possible. In fact, all our algorithms have exactly one fundamental requirement – the ability to set a weight for each similarity vector. All the major ML libraries in various languages support this functionality. This is the primary reason we prefer transferring data as opposed to transferring ML classifiers. While equally effective, the latter often requires complex modifications of the objective function of the ML classifiers that are non-trivial for a data cleaning expert and not available in the popular ER toolkits.

4.2 TL Algorithms for ER Scenarios

We now describe our proposed algorithms for the different scenarios, which are based on the available training data in D_S and D_T as introduced in Section 2.2.

Algorithm 2 Scenario (Adequate, Nothing)
1: Input: D_S and D_T
2: // Determine the weights of each x_i ∈ D_S^L
3: D = {}
4: D = D ∪ {⟨x_i, 1⟩} ∀ x_i ∈ D_S
5: D = D ∪ {⟨x_i, 0⟩} ∀ x_i ∈ D_T
6: Train a logistic regression classifier on D
7: Create a weighted dataset D_w where the weight of x_i ∈ D_S^L is computed from Equation 5
8: // Train an ML classifier for ER for D_T
9: M = ML classifier trained on D_w
10: Apply M on D_T^U to detect duplicates

Scenario 1: (Adequate, Nothing). This is an extreme scenario where we do not have any training data D_T^L from the target dataset. However, we have an adequate amount of training data D_S^L that was used to build a good ML classifier for D_S. This scenario is called unsupervised domain adaptation in TL. At first blush, it might seem that we cannot do better than the naive transfer approach. However, this ignores the availability of a key resource – D_T^U. Of course, D_T^U cannot be directly used to train an ML classifier – it does not contain any labeled data after all. However, it might still be possible to indirectly use D_T^U to improve the classifier trained on D_S^L. Note that blindly training a classifier on D_S^L optimizes it for good performance on D_S but not D_T. This might be appropriate if D_S and D_T are extremely related – otherwise a better alternative is to ensure that the classifier is optimized for D_T instead.

The key insight is that the naive transfer approach assigns equal weights to the similarity vectors from D_S^L. However, if we move towards a weighted paradigm where different similarity vectors have different weights based on their fidelity to D_T, then the classifier would be able to achieve better performance on D_T. Suppose we have a mechanism to quantify the selection probability that a similarity vector x_i came from dataset D_T. Then we must assign more weight to x_i if the probability that it came from D_T is high. Meanwhile, we must assign low weights to the data points x_i that have a low probability of coming from D_T.

There has been extensive work on different weighting schemes that ensure that the distance (such as KL-divergence [53] or kernel mean matching [25]) between the weighted D_S^L and D_T^U is minimized. We advocate an alternate weighting scheme, originally proposed in [8], that can be easily incorporated into an existing ER pipeline. The crucial observation is that an ML classifier can be used to generate the probability that x_i came from a given dataset.

Our approach proceeds as follows. We create a training dataset D by combining D_S and D_T. We assign a label of 1 to all similarity vectors in D_S and a label of 0 to all similarity vectors from D_T. Note that this label is different from whether a given similarity vector in D_S^L is a duplicate or not. We then train an ML classifier (we use logistic regression in our paper) to predict whether a given similarity vector x_i came from D_S or D_T. Specifically, we assign the following weight to each similarity vector x_i ∈ D_S:

w(x_i) = \frac{1}{p(y = 1 | x_i)} - 1    (5)


Algorithm 3 Scenario (Adequate, Limited)
1: Input: D_S and D_T
2: // Feature augmentation
3: D = {}
4: D = D ∪ {⟨φ_S(x_i), y_i⟩} ∀ (x_i, y_i) ∈ D_S^L
5: D = D ∪ {⟨φ_T(x_i), y_i⟩} ∀ (x_i, y_i) ∈ D_T^L
6: // Train an ML classifier for ER on D_T
7: M = ML classifier trained on D
8: Apply M on D_T^U to detect duplicates

We can see why this specific weighting scheme is appropriate for our purposes. If the classifier can confidently predict x_i as coming from D_S, then it is likely to be very different from D_T. In such a case, we must assign a weight close to 0. However, if the classifier is not very confident (say p(y = 1|x_i) ≈ 0.5), then it cannot clearly distinguish whether x_i came from D_S or D_T. This is a good candidate and must be assigned a higher weight, close to 1. Algorithm 2 describes the overall approach.
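For concreteness, a minimal sketch of this weighting step (Algorithm 2) is shown below, assuming the similarity vectors have already been computed via Algorithm 1; the function and variable names are illustrative.

```python
# Sketch of Algorithm 2: a logistic regression "domain separator" estimates the
# probability that a vector came from the source, and each labeled source vector
# receives weight 1 / p(y=1|x) - 1 (Equation 5) when training the ER classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

def train_adequate_nothing(X_src, y_src, X_tgt_unlabeled):
    # 1. Train the domain separator: label 1 = source, 0 = target.
    X_dom = np.vstack([X_src, X_tgt_unlabeled])
    y_dom = np.concatenate([np.ones(len(X_src)), np.zeros(len(X_tgt_unlabeled))])
    separator = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)

    # 2. Weight each labeled source vector using Equation 5.
    p_src = separator.predict_proba(X_src)[:, 1]
    weights = 1.0 / np.clip(p_src, 1e-6, 1.0) - 1.0

    # 3. Train the ER classifier on the weighted source data; apply it to the target.
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(
        X_src, y_src, sample_weight=weights)
```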

Scenario 2: (Adequate, Limited). The scenario where we have adequate labeled data D_S^L and limited labeled data D_T^L is called supervised domain adaptation in TL. One could apply any of the three previous approaches (NoT, NvT, and Algorithm 2) for this scenario. Our objective, though, is to design a classifier that achieves better performance by leveraging D_T^L. Intuitively, if D_S and D_T are very related, one could get good performance by simply pooling all data points from D_S^L and D_T^L and training a single classifier. However, if they are not very related, a different approach is needed.

Consider a specific component x_j of similarity vector x that measures the similarity along dimension j. If the datasets are not too related, then the importance of component x_j might be different between D_S and D_T. If we train a classifier after blindly pooling D_S^L and D_T^L, the learned parameters might not be appropriate. However, we cannot train only on D_T^L as it is limited.

The way out of this conundrum is feature augmentation, which enables us to learn parameters jointly when appropriate and individually otherwise. There has been extensive work on feature augmentation such as [33, 10, 31]. We advocate a replication-based feature augmentation [14, 15] that requires minimal changes to the existing ER pipelines. These approaches pre-process/transform the datasets D_S^L and D_T^L, which can then be passed to any ML classifier. Given a similarity vector x of dimension d, the approach in [14, 15] creates a similarity vector of dimension 3 × d by duplicating each feature x_j in a clever manner that is different for D_S and D_T:

φ_S(x) = ⟨x, x, 0⟩    (6)
φ_T(x) = ⟨x, 0, x⟩    (7)

The term 0 corresponds to a d-dimensional vector of all 0s. The first d dimensions correspond to common features while the next two d-dimensional blocks are for source- and target-specific features, respectively. For example, let x = [0.1, 0.9] be a similarity vector: if x ∈ D_S^L, then φ_S(x) = [0.1, 0.9, 0.1, 0.9, 0, 0]; and if x ∈ D_T^L, then φ_T(x) = [0.1, 0.9, 0, 0, 0.1, 0.9].

This particular feature augmentation allows one to strategically share the parameters when necessary and learn them individually otherwise. The first d dimensions allow us to learn shared parameters while the next d-dimensional blocks allow us to learn parameters individually for D_S and D_T, respectively. We apply the appropriate transformation on each x from D_S^L and D_T^L and feed the pooled transformed dataset to any ML classifier. Algorithm 3 provides the pseudocode.

Algorithm 4 Scenario (Limited, Limited)
1: Input: D_S and D_T
2: // Feature augmentation
3: D = {}
4: D = D ∪ {⟨φ_S(x_i), y_i⟩} ∀ (x_i, y_i) ∈ D_S^L
5: D = D ∪ {⟨φ_T(x_i), y_i⟩} ∀ (x_i, y_i) ∈ D_T^L
6: // Data + feature augmentation
7: D = D ∪ {⟨φ_U(x_i), 0⟩} ∀ x_i ∈ D_T^U
8: D = D ∪ {⟨φ_U(x_i), 1⟩} ∀ x_i ∈ D_T^U
9: // Train an ML classifier for ER for D_T
10: M = ML classifier trained on D
11: Apply M on D_T^U to detect duplicates
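A minimal sketch of the replication-based transforms of Equations 6-7 and of the pooling in Algorithm 3 is given below; the variable names are illustrative, and any classifier supported by the pipeline could replace the logistic regression.

```python
# Sketch of Algorithm 3: transform labeled source and target similarity vectors
# with Equations 6-7 and train one classifier on the pooled, transformed data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi_source(X):                  # Equation 6: <x, x, 0>
    return np.hstack([X, X, np.zeros_like(X)])

def phi_target(X):                  # Equation 7: <x, 0, x>
    return np.hstack([X, np.zeros_like(X), X])

def train_adequate_limited(X_src, y_src, X_tgt, y_tgt):
    X = np.vstack([phi_source(X_src), phi_target(X_tgt)])
    y = np.concatenate([y_src, y_tgt])
    return LogisticRegression(max_iter=1000).fit(X, y)

# At prediction time, candidate pairs from the target are transformed with
# phi_target before being passed to the trained classifier.
```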

Scenario 3: (Limited, Limited). This is yet another tricky scenario where only limited amounts of labeled data D_S^L and D_T^L are available, which might not be sufficient for training effective individual classifiers for D_S and D_T. The objective here is to train an effective classifier for both D_S and D_T by pooling their respective labeled data.

Let us first dispense with an easier (sub-)scenario. If D_S and D_T are very related and the size of the pooled data D_P = D_S^L ∪ D_T^L is deemed to be adequate by a domain expert, one can simply train an effective ML classifier by training it on D_P or even by using Algorithm 3.

If D_S and D_T are not too related and/or the pooled labeled data is still limited, then we need a more sophisticated approach that leverages the unlabeled data. Without loss of generality, let us assume that we need to build an ML classifier for D_T. We need to effectively pool not only D_S^L and D_T^L but also D_T^U. Note that if we wish to build an ML classifier for D_S, then we use D_S^L, D_T^L, and D_S^U. This generic approach of effectively utilizing both the labeled and unlabeled datasets is known as semi-supervised domain adaptation in TL.

The key idea is to extend Algorithm 3 to make use of the unlabeled target data D_T^U. Let us first consider what happens if we simply apply Algorithm 3 on D_P. If D_P is not of adequate size, then the classifier will overfit D_P and will not generalize well to D_T^U. So, we need to force the learned classifier to do well on D_T^U without having access to the labels. An elegant approach to this problem was first proposed in [31, 16]. First, we transform each x ∈ D_T^U by

φ_U(x) = ⟨0, x, −x⟩    (8)

We then perform data augmentation by creating two copies of each x ∈ D_T^U. We assign the label duplicate to one of them and non-duplicate to the other. This seemingly bizarre augmentation is known to enforce a strong co-regularization on the classifier by ensuring that the weights learned for the transformed dataset also agree on the unlabeled data. A rigorous analysis in terms of Rademacher complexity is provided in [31]. The pseudocode for this approach is given in Algorithm 4.
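The additional step of Algorithm 4 can be sketched as follows, reusing the transforms from the previous sketch and adding Equation 8 for the unlabeled target vectors; again, the names are illustrative.

```python
# Sketch of Algorithm 4: unlabeled target vectors are transformed with Equation 8
# and added twice -- once labeled duplicate, once non-duplicate -- which acts as a
# co-regularizer on the augmented feature space.
import numpy as np
from sklearn.linear_model import LogisticRegression

def phi_source(X):    return np.hstack([X, X, np.zeros_like(X)])    # Eq. 6
def phi_target(X):    return np.hstack([X, np.zeros_like(X), X])    # Eq. 7
def phi_unlabeled(X): return np.hstack([np.zeros_like(X), X, -X])   # Eq. 8

def train_limited_limited(X_src, y_src, X_tgt, y_tgt, X_tgt_unlabeled):
    X_u = phi_unlabeled(X_tgt_unlabeled)
    X = np.vstack([phi_source(X_src), phi_target(X_tgt), X_u, X_u])
    y = np.concatenate([y_src, y_tgt,
                        np.zeros(len(X_u)),     # one copy labeled non-duplicate
                        np.ones(len(X_u))])     # one copy labeled duplicate
    return LogisticRegression(max_iter=1000).fit(X, y)
```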

Scenario 4: (Adequate, Adequate). This is a relatively straightforward scenario where we have adequate training data for D_T. Any of the algorithms proposed in this section could be used here.

5. PRACTICAL ISSUES IN TL FOR ER

Next, we shift gears to discuss several issues that arise in practice and how to deal with them. Section 5.1 discusses TL from multiple sources, Section 5.2 describes the selection of source data, and Section 5.3 addresses model evaluation with a limited budget.

5.1 TL from Multiple Sources

Extending our algorithms to handle multiple source datasets is relatively straightforward. Let {D_S^1, D_S^2, ..., D_S^n} be the set of n source datasets and D_T be the target dataset.

Algorithm NoT: No Transfer. This algorithm is not impacted by the number of source datasets as it simply trains on the target dataset.

Algorithm NvT: Naive Transfer. We pool the similarity vectors from the labeled subsets of all source datasets to create a dataset D_P and build an ML classifier on it.

Scenario 1: (Adequate, Nothing). Extension to multiple datasets requires two changes to Algorithm 2. In Line 4, we assign a label of 1 to the similarity vectors from all source datasets {D_S^1, ..., D_S^n}. As before, the similarity vectors from D_T are assigned a label of 0 and a logistic regression classifier is trained. In Line 7, we compute the weights of the labeled similarity vectors from each of the source datasets using Equation 5. One can train any ML classifier on this weighted training dataset and apply it on D_T^U.

Scenario 2: (Adequate, Limited). If there are n source datasets with d dimensions each, then the transformed feature space has (n + 2) × d dimensions. Given an input x, the i-th d-dimensional window is set to x for the transformation function φ_S^i of source S_i. As before, the first and last d dimensions are set to x and 0, respectively. For example, if there are n = 3 datasets, the transformation functions are given by:

φ_S^1(x) = ⟨x, x, 0, 0, 0⟩    φ_S^2(x) = ⟨x, 0, x, 0, 0⟩    (9)
φ_S^3(x) = ⟨x, 0, 0, x, 0⟩    φ_T(x) = ⟨x, 0, 0, 0, x⟩    (10)

Scenario 3: (Limited, Limited). The modification for this scenario is identical to that of Scenario 2, i.e., φ_S and φ_T. The only additional change that is needed is the feature transformation function φ_U for the unlabeled data. Continuing the above example for n = 3, we have:

φ_U(x) = ⟨0, 0, 0, x, −x⟩    (11)

Scenario 4: (Adequate, Adequate). As mentioned before, any of the five proposed algorithms can be used for this scenario. Depending on the specific algorithm, one can reuse the appropriate extension to multiple sources.
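The multi-source transforms of Equations 9-11 can be generated programmatically, as in the sketch below; the block layout (one shared block, one block per source, one target block) follows the equations, and the helper name is an assumption.

```python
# Sketch of the multi-source transforms (Equations 9-11): with n sources the
# augmented space has (n + 2) * d dimensions -- a shared block, one block per
# source, and a target block. Unlabeled target vectors get <0,...,0, x, -x>.
import numpy as np

def phi_multi(X, block, n_sources):
    """block: 1..n_sources for source i, 'target', or 'unlabeled'."""
    d = X.shape[1]
    out = np.zeros((X.shape[0], (n_sources + 2) * d))
    if block == "unlabeled":                                  # Equation 11
        out[:, n_sources * d:(n_sources + 1) * d] = X
        out[:, (n_sources + 1) * d:] = -X
        return out
    out[:, :d] = X                                            # shared block
    i = n_sources + 1 if block == "target" else block         # specific block
    out[:, i * d:(i + 1) * d] = X
    return out
```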

5.2 Source Selection and Negative Transfer

The selection of an appropriate source dataset to transfer from is of fundamental importance and has a disproportionate impact on the performance of the classifier on dataset D_T. There has been extensive empirical research (such as [47]) showing that a naive transfer between two dissimilar datasets can degrade the performance of a classifier on the target dataset. Negative transfer [44, 40] is an extreme case where transfer from a source dataset results in reduced performance on the target dataset. However, research on dataset relatedness and negative transfer is still in its infancy, thereby requiring guidance from a domain expert.

The key challenge is that dataset relatedness is inherently multidimensional in nature. Two datasets might be different from each other in terms of the prior probability distribution over labels or covariates, the conditional distribution between labels and similarity vectors, sample selection bias, the ratio/number of duplicates to non-duplicates, and many more. So summarizing the differences from all these dimensions into one quantitative measure is quite challenging. We tackle two simpler sub-problems relevant to ER for which one can come up with effective heuristics.

Algorithm 5 Source Relatedness via Selection Probability
1: Input: D_S and D_T
2: Add an origin column to D_S and D_T
3: Set the origin column of D_S to 1 and of D_T to 0
4: for iter = 1 to 10 do
5:    D = Randomly sample 80% of D_S and D_T
6:    Train classifier M on D
7:    Evaluate the performance of M on (D_S ∪ D_T) \ D
8:    Compute MCC using Equation 12
9: if the average MCC over the 10 runs ≤ 0.2 then
10:   return D_S and D_T are sufficiently related

P1: Are D_S and D_T sufficiently related? The fundamental idea is to reuse the classifier from Scenario 1 of Section 4.2, which we used to distinguish the source and target datasets. Intuitively, if the trained classifier has high accuracy, then the two datasets are very different and can be easily distinguished. The key insight is to reduce the notion of relatedness between two datasets to the selection probability of the aforementioned classifier and use a sophisticated statistical test for distinguishing them. Let TP, TN, FP, FN be the number of true positives, true negatives, false positives, and false negatives of the classifier, respectively. The Matthews correlation coefficient (MCC) [4] is defined as

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}    (12)

Heuristically, a value of 0.2 or lower is considered good [35], i.e., D_S can be used as a source for D_T. Of course, this statistical test has to be repeated many times for confidence. Algorithm 5 shows the pseudocode.
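A minimal sketch of this relatedness check (Algorithm 5) follows; the 80/20 splits, the logistic regression separator, and the threshold mirror the description above, while the function names are illustrative.

```python
# Sketch of Algorithm 5: repeatedly train a domain-separation classifier on 80%
# of the pooled vectors, compute MCC (Equation 12) on the held-out 20%, and
# declare the datasets sufficiently related if the average MCC is at most 0.2.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

def sufficiently_related(X_source, X_target, runs=10, threshold=0.2):
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.ones(len(X_source)), np.zeros(len(X_target))])  # origin
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.8, stratify=y, random_state=seed)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(matthews_corrcoef(y_te, clf.predict(X_te)))
    return float(np.mean(scores)) <= threshold
```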

P2: Source Dataset Selection from a Pool of Sources. Another practical problem is to select one dataset D_S from a set of candidate datasets {D_S^1, ..., D_S^n}. One can use Algorithm 5 and rank the datasets based on MCC. However, this may give an incorrect ordering due to variance if two candidate datasets D_S^i and D_S^j are very similar to D_T. We advocate an alternate distance metric d_A, proposed in [5, 15], that measures the distance between two distributions, D_T and D_S^i, based on how two different classifiers f and f′ can disagree on the labels from a given test set. Note that f and f′ are chosen from the hypothesis class H that contains all classifiers of a given type (such as all SVMs or all logistic regression classifiers). It can be estimated [5] as:

d_A = 2 × (acc − 0.5)    (13)

where acc is the accuracy of the best classifier for domain separation. Algorithm 6 shows how one could use this metric to reliably select one or more sources.
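Correspondingly, a sketch of the d_A-based ranking (Algorithm 6) is shown below; it reuses the same domain-separation setup and simply converts the average held-out accuracy into d_A via Equation 13, with illustrative function names.

```python
# Sketch of Algorithm 6: estimate d_A = 2 * (acc - 0.5) for each candidate
# source via a domain-separation classifier and pick the candidate closest to
# the target (smallest d_A).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def separation_accuracy(X_source, X_target, runs=10):
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.ones(len(X_source)), np.zeros(len(X_target))])
    accs = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=0.8, stratify=y, random_state=seed)
        accs.append(LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.mean(accs))

def select_source(candidates, X_target):
    """candidates: dict mapping source name -> matrix of similarity vectors."""
    d_A = {name: 2 * (separation_accuracy(X_s, X_target) - 0.5)
           for name, X_s in candidates.items()}
    return min(d_A, key=d_A.get)
```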

Algorithm 6 Source Dataset Selection from Candidates
1: Input: Source dataset candidates {D_S^1, D_S^2, ..., D_S^n} and D_T
2: for each dataset D_S ∈ {D_S^1, D_S^2, ..., D_S^n} do
3:    for iter = 1 to 10 do
4:       D = Randomly select 80% of the data from D_S and D_T
5:       Train classifier M on D
6:       Compute the held-out accuracy of M on (D_S ∪ D_T) \ D
7:    Compute the average held-out accuracy for D_S and use it to estimate d_A using Equation 13
8: return the dataset D_S ∈ {D_S^1, D_S^2, ..., D_S^n} with the least d_A

5.3 Model Evaluation with Limited Budget

The performance of a trained ML classifier is evaluated on a separate test set. However, in all but one of the scenarios that we considered, we do not have sufficient labeled data D_T^L for even training the classifier. We need to avail the services of a domain expert, subject to a labeling budget, for estimating the classifier's accuracy. Note that this is distinct from active learning, which uses a limited budget to train a classifier.

Algorithm 7 Estimating Classifier Performance
1: Input: Classifier M, budget B, number of strata W, and D_T^U
2: Stratify each t ∈ D_T^U based on its probability score p(t)
3: Estimate v_i and B_i using Equations 15 and 14
4: for i = 1 to W do
5:    Randomly select B_i samples from P_i
6:    Get labels from the expert
7:    Compute stratum accuracy A_i and recall R_i
8: Estimate dataset accuracy A = \sum_{i=1}^{W} \frac{|P_i|}{n} A_i
9: Estimate dataset recall R = \sum_{i=1}^{W} \frac{|P_i|}{n} R_i
10: Estimate dataset F-score from A and R

We propose an effective heuristic to estimate a classifier's accuracy by leveraging the concept of stratified sampling [55]. We feed a distributional similarity vector t ∈ D_T^U to the classifier M and obtain a probabilistic score p(t). Given the desired number of strata W from the expert (such as 5), we split the probability range [0, 1] into W equal-sized partitions. Each t ∈ D_T^U is then assigned to its stratum based on p(t). For each stratum P_i, we select B_i samples uniformly at random, get the true labels from the domain expert, and compute the classifier performance on P_i. If the strata are sufficiently homogeneous, then this is a good estimate of the per-stratum classifier accuracy.

The next question is how to allocate the budget B across the different strata. Two intuitive options are equal allocation, where B_i = B/W, and proportional allocation, where B_i = B × |P_i|/n. We recommend Neyman allocation [55], where the budget is allocated based on the intra-stratum variability v_i as:

B_i = B \times \frac{v_i \times |P_i|}{\sum_{j=1}^{W} v_j \times |P_j|}    (14)

Each t ∈ P_i is a Bernoulli random variable with success probability p(t), whose variance is p(t) × (1 − p(t)). The sum over all t ∈ P_i follows a Poisson binomial distribution whose variance v_i is approximated as:

v_i = \sum_{t \in P_i} p(t) \times (1 - p(t))    (15)
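A short sketch of this stratification and Neyman allocation (Equations 14-15) is given below; the equal-width strata over [0, 1] and the rounding of the per-stratum budgets are implementation assumptions.

```python
# Sketch of Equations 14-15: strata are equal-width bins over the classifier's
# probability scores, v_i sums p(t)(1 - p(t)) within each stratum, and the
# labeling budget B is split proportionally to v_i * |P_i|.
import numpy as np

def neyman_allocation(scores, budget, n_strata=5):
    """scores: array of p(t) for every pair in the unlabeled target data."""
    scores = np.asarray(scores, dtype=float)
    strata = np.minimum((scores * n_strata).astype(int), n_strata - 1)
    sizes = np.array([np.sum(strata == i) for i in range(n_strata)])        # |P_i|
    v = np.array([np.sum(scores[strata == i] * (1 - scores[strata == i]))
                  for i in range(n_strata)])                                # Eq. 15
    weights = v * sizes                                                     # Eq. 14
    if weights.sum() == 0:
        return np.zeros(n_strata, dtype=int)
    return np.round(budget * weights / weights.sum()).astype(int)
```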

6. EXPERIMENTS

In our evaluation, we answer the following key questions: (i) Feasibility of TL for ER: Can we really improve the performance of popular ML classifiers for ER through TL? (Exp-1); (ii) Effectiveness for Various Scenarios: What is the effectiveness of our proposed algorithms for the four scenarios discussed in Section 2.2? (Exp-2); (iii) Multi-Source TL for ER: Can we leverage multiple sources for TL? (Exp-3); (iv) Cross-Domain Transfer: What happens if we do not have enough training data from the same or a similar domain? (Exp-4); (v) Negative Transfer: Can we get decreased performance when applying TL for ER? (Exp-5).

6.1 Experimental Setup

Hardware and Platform. All our experiments were performed on a quad-core 2.2 GHz machine with 16 GB of RAM. The algorithms were implemented in Python. Scikit-Learn [42] (version 0.19.1) was used to train the ML models. We used Magellan [29] (version 0.3.1), an end-to-end ML-based ER framework, for handling other ER stages such as feature generation, blocking, and evaluation.

Datasets. In order to highlight the wide applicability of our proposed approaches, we conducted extensive experiments over 12 datasets from 5 diverse domains, namely publications, movies, songs, restaurants, and books (see Table 1 for details). The datasets from the publication, movie, and song domains are popular benchmark datasets which have been used in prior ER work with both ML- and non-ML-based approaches. The datasets from the restaurant and book domains are from the Magellan data repository [13]. Their ground truth duplicates were only partially provided. We trained multiple ML models on the partial data and predicted the match status for the remaining tuple pairs using an ensemble ML classifier from Magellan. The output was then manually verified and corrected.

As the name suggests, the Million Song Dataset (MSD) has 1 million songs. Our proposed algorithms were quite effective over this dataset. We therefore chose two subsets, MSD-1 and MSD-2, in an adversarial manner such that they have a large d_A. This reduces the effectiveness of TL, making this a challenging dataset.

Distributed Representations. We used FastText [11] as our default DR; it was trained over 16 billion tokens from Wikipedia 2017 and other related web corpora. Unless otherwise specified, we used the default dimension of d = 300. We leveraged the re-training functionality of FastText to partially avoid the out-of-vocabulary scenario. Specifically, we pooled all the datasets from a given domain into one giant document where each tuple corresponds to a sentence. We trained FastText on this document with the pre-trained DRs from Wikipedia as initial values. This re-training provides better initialization for unknown words from the datasets we used for ER. We initialized the DRs of rare words – defined as those that are not in FastText's dictionary and occur less than 5 times in our domain – using the method described in Section 3.2 [26]. We empirically found that this provides better initialization than the subword-based mechanism of FastText. We used Equation 2 to convert tuples to DRs.

Algorithms. We evaluated five algorithms, two of which are baselines – No Transfer (NoT) and Naive Transfer (NvT). The other three correspond to the scenarios (Adequate, Nothing), (Adequate, Limited), and (Limited, Limited). When the available training data is less than 10%, we consider it to be limited (i.e., Scenario 2 or 3).

ML Classifiers. For our experiments, we primarily focus on four ML classifiers: Logistic Regression (LR), Support Vector Machines (SVM), Decision Trees (DT), and Random Forests (RF). These classifiers are widely used for ER in practice, extensively evaluated in prior work, and supported by most ER frameworks. Note that our approaches are quite generic and can support many popular ML classifiers.

Table 1: Datasets for Experiments

| Domain | Dataset | Notation | #Tuples | #Duplicate Pairs | #Attributes |
| Publications | DBLP-ACM [1] | Pub-DA | 2,616 - 2,294 | 2,224 | 4 |
| Publications | DBLP-Scholar [1] | Pub-DS | 2,616 - 64,263 | 5,347 | 4 |
| Publications | Cora [49] | Pub-C | 993 - 993 | 14,280 | 9 |
| Movies | IMDB-OMDB [13] | Mv-IO | 1,132,262 - 2,300,984 | 521,590 | 11 |
| Movies | DBPedia-IMDB [41] | Movies-DI | 27,615 - 23,182 | 22,405 | 5 |
| Movies | Rotten Tomatoes-IMDB [13] | Movies-RI | 7,390 - 6,407 | 190 | 10 |
| Songs | Million Song Data-1 [7] | MSD-1 | 15,233 - 15,233 | 10,000 | 7 |
| Songs | Million Song Data-2 [7] | MSD-2 | 14,966 - 14,966 | 10,000 | 7 |
| Restaurants | Yellow Pages-Yelp [13] | Rest-YY1 | 11,840 - 5,223 | 130 | 6 |
| Restaurants | Yelp-Yellow Pages [13] | Rest-YY2 | 9,947 - 28,787 | 88 | 12 |
| Books | Amazon-Barnes & Noble [13] | Books-AN | 9,836 - 9,958 | 99 | 7 |
| Books | Barnes & Noble-Half [13] | Books-BH | 3,022 - 3,099 | 327 | 8 |

TL for ER Best Practices. Both TL and ER suffer from imbalance issues, albeit of different kinds. ER has a class imbalance issue, where the number of non-duplicates is many orders of magnitude larger than the number of duplicates. TL often has a data imbalance issue, where the amount of training data for the source, D^L_S, is much larger than that for the target, D^L_T. Individually, these issues are readily addressable. When combined, they require the careful solution that we describe below.

A typical solution for class imbalance in ER is to under-sample the non-duplicates such that the ratio between the number of duplicates and non-duplicates stays below a threshold. We used a ratio of 1:3 whenever feasible. As an example, suppose the training data has 100 duplicates and 10,000 non-duplicates and we wish to train a classifier with 200 tuple pairs. Then we pick 50 duplicates and 150 non-duplicates, uniformly at random, so that the 1:3 ratio is maintained. When we wish to train with 1,000 tuple pairs, we pick all 100 available duplicates and 900 non-duplicates uniformly at random. We then apply the balanced weighting heuristic proposed in [27], which assigns different class weights to duplicates and non-duplicates.
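A minimal sketch of this recipe follows, using scikit-learn's class_weight="balanced" as a stand-in for the weighting heuristic of [27]; the helper and classifier configuration are illustrative assumptions, not the paper's exact code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)

def undersample(X, y, budget, ratio=3):
    """Pick up to `budget` pairs, keeping at most `ratio` non-duplicates per duplicate."""
    dup = np.flatnonzero(y == 1)
    non = np.flatnonzero(y == 0)
    n_dup = min(len(dup), budget // (ratio + 1))   # e.g., budget 200 -> 50 duplicates
    n_non = min(len(non), budget - n_dup)          # remainder filled with non-duplicates
    idx = np.concatenate([rng.choice(dup, n_dup, replace=False),
                          rng.choice(non, n_non, replace=False)])
    return X[idx], y[idx]

# Balanced class weights play the role of the weighting heuristic of [27];
# LR, SVM, and DT can be configured analogously.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced")
```

With 100 duplicates and 10,000 non-duplicates, a budget of 200 yields 50 duplicates and 150 non-duplicates, and a budget of 1,000 yields all 100 duplicates plus 900 non-duplicates, matching the example above.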

The data imbalance problem in TL can be solved analogously. Consider a scenario where |D^L_S| = 1,000 while |D^L_T| = 100. On the one hand, if we blindly use all of this data for training, the source dataset can "swamp" the target dataset. On the other hand, due to data scarcity, we cannot afford to ignore 900 tuple pairs from D^L_S just to achieve dataset parity. One effective approach is to apply different instance weights to the training data: for example, each t in D^L_S gets a weight of 0.1 while each t in D^L_T gets a weight of 1. While very effective, this approach cannot always be used in our paper. Recall that in the algorithm for scenario (Adequate, Nothing), we use the class-separator ML classifier to get the weights for each tuple in D^L_S. In this case, we use the technique of probabilistic replication, whereby we repeatedly sample with replacement from D^L_T until we get 1,000 samples. Of course, one can also use the even simpler replication where each t in D^L_T is replicated 10 times in total.

A related complication arises when there are constraints on the number of tuples one can use for training from D^L_S and D^L_T. For example, in our experiments we vary the sizes of D^L_S and D^L_T to understand their respective impact. This also often occurs in practice, where the source data is much larger than the target data. Suppose we have |D^L_S| = 100,000 and |D^L_T| = 100. Clearly, utilizing all the source data (even with a weight of 0.01) might swamp the target dataset. For our experiments, we fixed the maximum data imbalance ratio at 10; hence, we use at most 1,000 tuples from D^L_S when |D^L_T| = 100. We use the following heuristic: we use Algorithm 2 to obtain a weight for each t in D^L_S and perform importance sampling on D^L_S with these weights. We found that blindly picking the top-1,000 tuples with the largest weights is not very robust. This approach was also used for selecting tuples from D^L_S for Algorithms 3 and 4. In keeping with known TL best practices (e.g., Chapter 8 of [15]), the weights were computed in both cases using Equation 5 before applying the transformations φS, φT, and φU. Once again, we perform importance sampling to pick the necessary number of tuples (such as 1,000 in this example) and then apply the transformations on them.
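The two balancing devices above can be sketched as follows; the per-tuple weights (from Algorithm 2 / Equation 5) are taken as given, and the function names are illustrative rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(7)

def replicate_target(target_idx, source_size):
    """Probabilistic replication: sample with replacement from the target
    training pairs until they match the number of source pairs."""
    return rng.choice(target_idx, size=source_size, replace=True)

def sample_source(source_idx, weights, target_size, max_ratio=10):
    """Importance-sample source pairs proportionally to their weights,
    capped at `max_ratio` source pairs per target pair."""
    budget = min(len(source_idx), max_ratio * target_size)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    return rng.choice(source_idx, size=budget, replace=False, p=p)
```

Sampling proportionally to the weights, rather than greedily taking the top-weighted tuples, matches the observation above that the greedy choice is not very robust.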

6.2 Experimental Results

Exp-1: TL for ER Challenges. Recall from Section 1 that TL for ER is challenging due to a number of issues such as class/data imbalance and shifts in the prior and class conditional probabilities. Figure 4 visually highlights some of these challenges and also the potential for transfer. The figure shows the class-conditioned histogram of distributional similarity for two datasets, Pub-DA and Pub-DS. Consider Figure 4(a), corresponding to dataset Pub-DA. For each distinct tuple pair (t_i, t_j) in the blocked candidate set, we construct their respective DRs and compute their Cosine similarity [34]. The non-duplicates and duplicates are highlighted in different colors. We can see that the histograms of these two datasets are sufficiently similar to plausibly allow transfer, while sufficiently different to make such transfer challenging.
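Such a class-conditioned similarity histogram can be computed directly from the tuple-pair DRs (e.g., as produced by the tuple_dr sketch above); the helper below is a minimal sketch under that assumption.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def similarity_histogram(pairs, labels, bins=10):
    """pairs: list of (DR of t_i, DR of t_j); labels: 1 = duplicate, 0 = non-duplicate.
    Returns per-bucket counts of duplicates and non-duplicates over [0, 1]."""
    sims = np.array([cosine(u, v) for u, v in pairs])
    labels = np.asarray(labels)
    edges = np.linspace(0.0, 1.0, bins + 1)
    hist_dup, _ = np.histogram(sims[labels == 1], bins=edges)
    hist_non, _ = np.histogram(sims[labels == 0], bins=edges)
    return hist_dup, hist_non
```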

Understanding Baseline Algorithms. We first compare the performance of the baseline algorithms No Transfer (NoT) and Naive Transfer (NvT). Even though we refer to them as baselines, they often provide formidable performance on most ER tasks, for three reasons. First, using the Fellegi-Sunter paradigm, the input to all ML classifiers, across all datasets, is a distributional similarity vector of fixed dimension; this allows a straightforward transfer opportunity. Second, the DR-based approach allows one to measure holistic similarity in a high-dimensional (d = 300) vector space. Third, when combined with the monotonicity-of-precision property, a distributional similarity vector with high similarity is very likely to be a duplicate regardless of which dataset it was computed from, so an ML classifier can readily identify high-similarity distributional vectors as duplicates without much (re)training.


Figure 4: Challenges of Duplicate Distributions. (a) Pub-DA, (b) Pub-DS. Each panel plots the percentage of duplicates and non-duplicates falling into each Cosine-similarity bucket (0-0.1 through 0.9-1).

Figure 5: Behavior of Baselines. (a) (Pub-DA, Pub-C), (b) (MSD-1, MSD-2). Each panel plots the F-score (%) of LR, SVM, DT, and RF under NoT and NvT as the training data varies from 1% to 10%.


Let us consider the performance of the algorithms on two dataset pairs: (Pub-DA, Pub-C) and (MSD-1, MSD-2). Given two datasets D and D′, we report the (source, target) ordering that produces the worse result; that is, if (D, D′) yields a better result than (D′, D), we present the latter. We measure the F-measure of the four classifiers while varying the size of the training data used. For NoT, this is the size of D^L_T, while for NvT it is the size of D^L_S.

Figures 5(a) and 5(b) show the results. We can make a few observations that hold for both dataset pairs. First, the baselines do provide strong performance and can be used when the sophisticated TL algorithms cannot be applied. The performance of the baselines improves slightly with larger training data. Both NoT and NvT plateau, with the performance improvement slowing down as more of D^L_S is utilized. This is to be expected, as the two datasets differ in data/class imbalance and exhibit covariate shift as well as shifts in the prior and class conditional probabilities. Second, as the size of the training data increases, the comparatively complex ML models of RF, SVM, and LR outperform DT. Overall, the baselines provide good results for (Pub-DA, Pub-DS). However, the situation is flipped for (MSD-1, MSD-2). Recall that this dataset was constructed in an adversarial manner. Both SVM-NvT and LR-NvT provide F-scores close to 0, while the others perform relatively better. Even then, the F-score peaks at approximately 20% regardless of the size of the training data.

Exp-2: Evaluating TL's Impact on ER. Next we evaluate the improvement achieved by our algorithms over the baselines. We would like to bring attention to three key factors in order to appreciate the increased performance. First, once the feature space of the datasets is standardized using our proposed approach, the baselines already provide substantial improvements; this improvement happens consistently across multiple domains. Second, we intentionally operate in a resource-constrained environment where we use as little as 1% and at most 10% of the relevant training data from D^L_S or D^L_T (as explained in each scenario). In fact, this setting both stress-tests our algorithms and is most relevant to real-world scenarios, where very often only a small amount of labeled data is available. Finally, our improvements in F-score across the various datasets vary between 1% and 12%, with 3-5% being typical. It is important to note that ER is an extremely well studied problem, and achieving such improvements is quite challenging. In fact, even when novel approaches for ER such as crowdsourcing [22, 54] and deep learning [19] are used, the achieved improvements in F-score were often in single digits despite using substantially more training data.

Scenario 1: (Adequate, Nothing). We evaluate our proposed approach for the scenario where we set D^L_T = ∅. We vary the size of D^L_S from 1% to 10%. In order to reduce clutter, we show the improvement in F-score obtained by Algorithm 2 compared to NvT. Figure 6 shows the results. Overall, the performance across all the domains is very similar: our proposed approach consistently outperforms NvT, often by a significant margin. The improvement in F-score becomes close to 0 once more than 10% of the training data is used, since the estimated weights of the additional tuples are close to 1, making our approach similar to NvT. While this scenario cannot be solved using traditional ER approaches, our proposed approach not only makes it feasible but also provides good performance.

Scenario 2: (Adequate, Limited). In our next set of experiments, we evaluate the effectiveness of Algorithm 3 for Scenario 2. We fix the size of D^L_T at 10% and vary the size of D^L_S. To reduce clutter, we show the difference between the F-score of our algorithm and the F-score of the better of the two baselines, i.e., NoT or NvT. Figure 7 shows the results. As expected, even with a very small amount of D^L_T and D^L_S, our TL approach consistently outperforms both baselines.

Scenario 3: (Limited, Limited). We next evaluate the effectiveness of Algorithm 4 by varying the sizes of both D^L_S and D^L_T from 1% to 20%. We fix the size of the unlabeled data D^U_T, which is needed by the algorithm, at 20% of the training data. Once again, the results show that our approach outperforms the baselines. The difference in performance comes from the fact that training ML classifiers on small training data results in biased classifiers that do not generalize well to the test set; because our algorithm can additionally leverage D^U_T, it generalizes better.

Combating Overfitting. One issue that could reduce the performance of our algorithms in Scenarios 2 and 3 is the possibility of overfitting. For consistency, we used the default FastText DR size of d = 300 for all our experiments. This dimensionality can grow to as much as 900 in Scenarios 2 and 3 due to the transformations φS, φT, and φU. Given the limited amount of training data, it might not be sufficient to learn effective weights for all these dimensions; in other words, the classifiers might overfit the training data and not generalize well to the test set. While this possibility is relatively mitigated


Figure 6: Scenario 1. (Source, Target): (Adequate, Nothing). Panels (a) Publications, (b) Restaurant, (c) Books, and (d) Songs plot the F-score improvement over NvT (%) achieved by LR, SVM, DT, and RF as the training data varies from 1% to 10%.

Figure 7: Scenario 2. (Source, Target): (Adequate, Limited). Per-domain panels (e.g., (a) Publications, (b) Restaurant) plot the F-score improvement over the better of the two baselines (%) achieved by LR, SVM, DT, and RF as the training data varies from 1% to 10%.

PublicaDons Restaurants BooksSource Target IWT-NvT Source Target IWT-NvT Source Target IWT-NvTDBLP-ACM DBLP-Scholar Rest-YY1 Rest-YY2 Books-AN Books-BHTrainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF

1 8.272669932 9.71909579 0.201644531 14.71662537 1 8.222632091 5.611364164 3.90032689 8.55615245 1 5.631405096 9.002068087 3.887800773 8.0437967642 12.43158031 5.220317543 0.280986302 11.77398799 2 7.019031161 7.393503184 2.995780405 11.96456914 2 5.388600646 8.85729775 2.764502343 7.4362641383 11.62113565 5.16306753 0.19951605 9.250099281 3 8.169621663 6.407030969 1.346404684 10.61010968 3 4.24970864 8.990687881 1.797816308 5.9385293014 6.33241878 6.163664313 0.333536799 7.716389795 4 9.32762485 5.562085761 0.8081017 7.803062988 4 3.112375522 6.100561938 1.191481319 5.9385293015 8.715046558 7.116881824 0.247162446 7.17325849 5 7.802762933 10.14960354 1.550325663 5.98197931 5 4.33205373 4.304606598 2.201522681 5.0150648636 9.312814689 4.943910984 0.190595112 5.613932486 6 7.403328756 7.393503184 0.81476891 4.398370599 6 3.454861717 6.144487806 1.007733051 3.6522135137 10.59736177 3.809958712 0.224867711 5.141142938 7 8.085841035 10.40202939 0.427225465 3.917175497 7 4.24970864 3.09819043 0.796245231 3.6228779848 8.932251131 4.92849548 0.126429758 3.642477809 8 8.535598673 6.646318074 -0.088833099 3.188035222 8 5.631405096 3.030313288 -0.600507543 1.6416275789 10.07624958 4.928818266 0.17399879 3.637857676 9 8.138719786 10.77609119 -0.079069878 1.799381902 9 10.80836336 4.531218779 0.075962776 3.071230643

10 10.29705604 6.012359777 0.246464665 1.560189446 10 8.787647512 12.04975348 0.278744229 1.033140881 10 8.67628568 5.109786586 -0.022986524 3.652213513

0

4

8

12

16LR SVM DT RF

0

3.5

7

10.5

14LR SVM DT RF

0

3

6

9

12LR SVM DT RF

0

1.25

2.5

3.75

5LR SVM DT RF

PublicaDons Restaurants BooksSource Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT)DBLP-ACM DBLP-Scholar Rest-YY1 Rest-YY2 Books-AN Books-BHTrainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF

1 6.347669725 3.012979924 1.534048176 2.486968697 1 3.705372702 3.220008044 1.427634515 3.005821332 1 5.064535797 2.80224707 1.912252153 2.3667199272 4.999044695 2.630832113 1.329541613 2.038792958 2 3.074656055 3.242745447 2.716894415 3.476234194 2 5.346879393 2.417836099 2.531387467 2.3345541093 7.218260596 2.178454875 1.279319577 2.769600544 3 6.077935319 2.016078489 1.983423562 3.216101692 3 6.279409453 4.185641136 1.110506491 3.4604951524 7.811959298 4.383921186 1.876215646 2.874217931 4 6.283485771 3.631616562 1.097686829 2.984755648 4 4.753801357 4.863761253 2.635280587 3.7390455465 5.870356135 6.905426741 1.329541613 4.021129305 5 6.540566806 3.220008044 2.16323582 2.590034493 5 4.952215796 3.317135882 1.144621162 3.7923678986 6.932456764 6.28898356 1.959520166 3.641063619 6 5.493463978 5.852122097 1.235694886 4.997955407 6 5.381311495 4.145489042 0.91471985 2.9363326977 5.999044695 4.896827007 2.804113161 3.40377216 7 5.385743049 7.104296994 0.579381989 3.758762722 7 3.360256834 5.283614225 0.572144622 2.8278499228 7.682329428 7.87768189 1.288012822 2.707176853 8 7.710952774 5.722958269 0.358548504 3.216101692 8 5.346879393 4.145489042 0.353615412 4.1578045559 5.892939968 7.361380907 0.752617637 2.245261725 9 6.143684667 6.536048073 0.400384951 1.412660119 9 3.77322672 4.956930687 0.275480884 1.902102431

10 4.563506846 5.623173984 0.264158892 1.646265088 10 4.308898412 5.146312088 1.401348785 3.650011956 10 2.300036855 3.479646859 0.561441778 2.56101488

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

0

2

4

6

8LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

2

4

6

8LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

1.75

3.5

5.25

7LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

1.25

2.5

3.75

5LR SVM DT RF

1 2 3 4 5 6 7 8 9 10

PublicaDons Restaurants BooksSource Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT)DBLP-ACM DBLP-Scholar Rest-YY1 Rest-YY2 Books-AN Books-BHTrainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF

1 3.159148051 2.374507077 1.698139726 2.36065773 1 2.458228876 2.556530035 0.959658623 2.128546249 1 2.348136676 1.846803152 1.3824826 2.1382550572 3.453805461 3.012628028 1.561353838 2.44100944 2 2.819476517 2.818718299 0.891214773 2.532336301 2 2.50211043 2.123341809 1.28058616 2.1619798793 3.757277228 3.71099791 1.073068519 2.245581032 3 3.066513956 2.846624556 0.691628781 2.213457177 3 2.808340065 2.429576498 1.295602925 2.1861454144 3.930017541 4.179689382 1.212756101 2.277039031 4 3.203063902 3.222358986 1.172107584 3.257612782 4 2.991383341 2.846803152 1.332417891 2.2131355925 3.952803809 4.391046629 1.03082631 2.44100944 5 3.1738113 3.561167839 1.634668486 3.265948482 5 2.798420526 2.772728373 1.427625598 2.2505318836 3.893764657 3.926151429 1.717409102 2.888589245 6 3.260601459 3.432094215 2.060460659 3.356144325 6 2.807479868 3.230868713 1.084373811 2.3169176957 3.17792013 3.572106688 1.752214424 2.245581032 7 2.956614204 3.021695454 2.033448368 2.739790927 7 3.421413442 3.546877192 1.044134214 2.3782201728 2.885074262 3.116694447 1.232414812 1.77631918 8 2.566146111 2.818718299 1.748060071 2.995340017 8 3.79282528 3.280433095 1.28058616 2.3819652959 2.461758867 3.002662929 0.996835939 1.720829883 9 2.753348984 2.776085388 0.891214773 2.443887877 9 2.941072197 3.043527824 0.980828458 2.403326407

10 2.013881479 3.433771754 1.348462909 1.318050254 10 2.307110992 2.597764607 0.959658623 2.409418955 10 3.108340065 3.063113142 1.233415586 2.693916806

0

1.25

2.5

3.75

5LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

1

2

3

4LR SVM DT RF

0

1

2

3

4LR SVM DT RF

0

1

2

3

4LR SVM DT RF

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

Training Data (%) Training Data (%) Training Data (%) Training Data (%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

Training Data (%) Training Data (%) Training Data (%) Training Data (%)

Training Data (%) Training Data (%) Training Data (%) Training Data (%)

�1

(c) Books

PublicaDons Restaurants BooksSource Target IWT-NvT Source Target IWT-NvT Source Target IWT-NvTDBLP-ACM DBLP-Scholar Rest-YY1 Rest-YY2 Books-AN Books-BHTrainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF

1 8.272669932 9.71909579 0.201644531 14.71662537 1 8.222632091 5.611364164 3.90032689 8.55615245 1 5.631405096 9.002068087 3.887800773 8.0437967642 12.43158031 5.220317543 0.280986302 11.77398799 2 7.019031161 7.393503184 2.995780405 11.96456914 2 5.388600646 8.85729775 2.764502343 7.4362641383 11.62113565 5.16306753 0.19951605 9.250099281 3 8.169621663 6.407030969 1.346404684 10.61010968 3 4.24970864 8.990687881 1.797816308 5.9385293014 6.33241878 6.163664313 0.333536799 7.716389795 4 9.32762485 5.562085761 0.8081017 7.803062988 4 3.112375522 6.100561938 1.191481319 5.9385293015 8.715046558 7.116881824 0.247162446 7.17325849 5 7.802762933 10.14960354 1.550325663 5.98197931 5 4.33205373 4.304606598 2.201522681 5.0150648636 9.312814689 4.943910984 0.190595112 5.613932486 6 7.403328756 7.393503184 0.81476891 4.398370599 6 3.454861717 6.144487806 1.007733051 3.6522135137 10.59736177 3.809958712 0.224867711 5.141142938 7 8.085841035 10.40202939 0.427225465 3.917175497 7 4.24970864 3.09819043 0.796245231 3.6228779848 8.932251131 4.92849548 0.126429758 3.642477809 8 8.535598673 6.646318074 -0.088833099 3.188035222 8 5.631405096 3.030313288 -0.600507543 1.6416275789 10.07624958 4.928818266 0.17399879 3.637857676 9 8.138719786 10.77609119 -0.079069878 1.799381902 9 10.80836336 4.531218779 0.075962776 3.071230643

10 10.29705604 6.012359777 0.246464665 1.560189446 10 8.787647512 12.04975348 0.278744229 1.033140881 10 8.67628568 5.109786586 -0.022986524 3.652213513

0

4

8

12

16LR SVM DT RF

0

3.5

7

10.5

14LR SVM DT RF

0

3

6

9

12LR SVM DT RF

0

1.25

2.5

3.75

5LR SVM DT RF

PublicaDons Restaurants BooksSource Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT)DBLP-ACM DBLP-Scholar Rest-YY1 Rest-YY2 Books-AN Books-BHTrainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF

1 6.347669725 3.012979924 1.534048176 2.486968697 1 3.705372702 3.220008044 1.427634515 3.005821332 1 5.064535797 2.80224707 1.912252153 2.3667199272 4.999044695 2.630832113 1.329541613 2.038792958 2 3.074656055 3.242745447 2.716894415 3.476234194 2 5.346879393 2.417836099 2.531387467 2.3345541093 7.218260596 2.178454875 1.279319577 2.769600544 3 6.077935319 2.016078489 1.983423562 3.216101692 3 6.279409453 4.185641136 1.110506491 3.4604951524 7.811959298 4.383921186 1.876215646 2.874217931 4 6.283485771 3.631616562 1.097686829 2.984755648 4 4.753801357 4.863761253 2.635280587 3.7390455465 5.870356135 6.905426741 1.329541613 4.021129305 5 6.540566806 3.220008044 2.16323582 2.590034493 5 4.952215796 3.317135882 1.144621162 3.7923678986 6.932456764 6.28898356 1.959520166 3.641063619 6 5.493463978 5.852122097 1.235694886 4.997955407 6 5.381311495 4.145489042 0.91471985 2.9363326977 5.999044695 4.896827007 2.804113161 3.40377216 7 5.385743049 7.104296994 0.579381989 3.758762722 7 3.360256834 5.283614225 0.572144622 2.8278499228 7.682329428 7.87768189 1.288012822 2.707176853 8 7.710952774 5.722958269 0.358548504 3.216101692 8 5.346879393 4.145489042 0.353615412 4.1578045559 5.892939968 7.361380907 0.752617637 2.245261725 9 6.143684667 6.536048073 0.400384951 1.412660119 9 3.77322672 4.956930687 0.275480884 1.902102431

10 4.563506846 5.623173984 0.264158892 1.646265088 10 4.308898412 5.146312088 1.401348785 3.650011956 10 2.300036855 3.479646859 0.561441778 2.56101488

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

0

2

4

6

8LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

2

4

6

8LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

1.75

3.5

5.25

7LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

1.25

2.5

3.75

5LR SVM DT RF

1 2 3 4 5 6 7 8 9 10

PublicaDons Restaurants BooksSource Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT) Source Target IWT-max(NvT,NoT)DBLP-ACM DBLP-Scholar Rest-YY1 Rest-YY2 Books-AN Books-BHTrainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF Trainingdatapct LR SVM DT RF

1 3.159148051 2.374507077 1.698139726 2.36065773 1 2.458228876 2.556530035 0.959658623 2.128546249 1 2.348136676 1.846803152 1.3824826 2.1382550572 3.453805461 3.012628028 1.561353838 2.44100944 2 2.819476517 2.818718299 0.891214773 2.532336301 2 2.50211043 2.123341809 1.28058616 2.1619798793 3.757277228 3.71099791 1.073068519 2.245581032 3 3.066513956 2.846624556 0.691628781 2.213457177 3 2.808340065 2.429576498 1.295602925 2.1861454144 3.930017541 4.179689382 1.212756101 2.277039031 4 3.203063902 3.222358986 1.172107584 3.257612782 4 2.991383341 2.846803152 1.332417891 2.2131355925 3.952803809 4.391046629 1.03082631 2.44100944 5 3.1738113 3.561167839 1.634668486 3.265948482 5 2.798420526 2.772728373 1.427625598 2.2505318836 3.893764657 3.926151429 1.717409102 2.888589245 6 3.260601459 3.432094215 2.060460659 3.356144325 6 2.807479868 3.230868713 1.084373811 2.3169176957 3.17792013 3.572106688 1.752214424 2.245581032 7 2.956614204 3.021695454 2.033448368 2.739790927 7 3.421413442 3.546877192 1.044134214 2.3782201728 2.885074262 3.116694447 1.232414812 1.77631918 8 2.566146111 2.818718299 1.748060071 2.995340017 8 3.79282528 3.280433095 1.28058616 2.3819652959 2.461758867 3.002662929 0.996835939 1.720829883 9 2.753348984 2.776085388 0.891214773 2.443887877 9 2.941072197 3.043527824 0.980828458 2.403326407

10 2.013881479 3.433771754 1.348462909 1.318050254 10 2.307110992 2.597764607 0.959658623 2.409418955 10 3.108340065 3.063113142 1.233415586 2.693916806

0

1.25

2.5

3.75

5LR SVM DT RF

1 2 3 4 5 6 7 8 9 100

1

2

3

4LR SVM DT RF

0

1

2

3

4LR SVM DT RF

0

1

2

3

4LR SVM DT RF

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

Training Data (%) Training Data (%) Training Data (%) Training Data (%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

NvT

(%)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

F-Sc

ore

Impr

ovem

ent

over

Bas

elin

es (%

)

Training Data (%) Training Data (%) Training Data (%) Training Data (%)

Training Data (%) Training Data (%) Training Data (%) Training Data (%)

�1

(d) Songs

Figure 7: Scenario 2. (Source, Target): (Adequate, Limited)

[Plot data for Figure 8 omitted. Panels: (a) Publications (DBLP-ACM, DBLP-Scholar), (b) Restaurant (Rest-YY1, Rest-YY2), (c) Books (Books-AN, Books-BH), (d) Songs; y-axes: F-Score improvement over NvT / over baselines (%); x-axis: Training Data (%); classifiers: LR, SVM, DT, RF.]

Figure 8: Scenario 3. (Source, Target): (Limited, Limited)


[Plot data for Figure 9 omitted. Panels: (a) Multiple Sources (Pub-DA, Pub-DS -> Pub-C vs. single-source transfers), (b) Cross Domain (MSD1 -> MSD2 vs. Mv-IO -> MSD2); y-axis: F-Score improvement (%); x-axis: Training Data (%); classifiers: SVM, RF.]

Figure 9: Multiple Sources and Cross Domain

due to the nature of ER (specifically, monotonicity of precision), it could still be a potential issue.

One approach that we found effective was to use DRs of smaller dimension such as d = 50 or d = 100. For simple datasets with limited training data, this might be preferable. However, FastText does not provide such smaller DRs by default. We instead used the following approach. We begin by taking GloVe's 50-dimensional word embedding dictionary. We dump all the tuples from the datasets DS and DT into a single file such that each tuple corresponds to a sentence. We train FastText on this dataset and require it to produce a model that outputs 50-dimensional DRs. We also instruct FastText to use GloVe's word embeddings as initial values. This yields a DR model that, for any given word, produces a 50-dimensional DR as output.
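As a concrete illustration of this pipeline, the sketch below trains a 50-dimensional FastText model on a combined source/target corpus using gensim; the toy tuples are illustrative assumptions, and the GloVe warm start described above is only noted in a comment, since gensim's FastText does not expose pretrained-vector initialization directly.

```python
from gensim.models import FastText

# Illustrative tuples from a source and a target dataset (values only, schema-free).
source_tuples = [("Efficient Deduplication", "SIGMOD", "2014"),
                 ("Entity Matching at Scale", "VLDB", "2016")]
target_tuples = [("Deduplication made efficient", "ACM SIGMOD", "2014")]

# One "sentence" per tuple: concatenate all attribute values.
sentences = [" ".join(str(v) for v in t).lower().split()
             for t in source_tuples + target_tuples]

# Train a 50-dimensional subword-aware DR model on the combined corpus.
# (The GloVe 50-d warm start described above would replace the random
#  initialization here; that step is not shown in this sketch.)
model = FastText(vector_size=50, window=3, min_count=1, sg=1, epochs=20)
model.build_vocab(corpus_iterable=sentences)
model.train(corpus_iterable=sentences, total_examples=len(sentences), epochs=model.epochs)

vec = model.wv["sigmod"]   # 50-dimensional DR; subwords also cover unseen words
print(vec.shape)           # (50,)
```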

Exp-3: Multi Source Transfer. In this experiment, we evaluate the potential for transferring from multiple source datasets. We choose Pub-DA and Pub-DS as the source datasets, while Pub-C is the target dataset. We choose two representative classifiers, SVM and RF, and evaluate three cases, namely (Pub-DA, Pub-C), (Pub-DS, Pub-C), and ({Pub-DA, Pub-DS}, Pub-C). We allocated tuples from both source datasets equally; for example, when we use 2% of D^L_S overall, 1% comes from Pub-DA and 1% comes from Pub-DS. We measure the improvement in performance of multi-source transfer ({Pub-DA, Pub-DS}, Pub-C) over the single-source transfers (Pub-DA, Pub-C) and (Pub-DS, Pub-C), respectively. As expected, the improvement can be substantial (see Figure 9(a)). Note that Pub-DA is a better source dataset for Pub-C than Pub-DS, so the improvement over (Pub-DS, Pub-C) is as much as 25% in F-score, while the improvement over (Pub-DA, Pub-C) still reaches about 10%. Of course, the potential for improvement is contingent on how the various source datasets are related to the target dataset. One can use our proposed algorithms from Section 5 for choosing one or more source datasets.
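A minimal sketch of this equal-allocation scheme, assuming each source's labeled pairs are held in a pandas DataFrame (names and usage below are hypothetical):

```python
import pandas as pd

def sample_multi_source(sources, frac, seed=42):
    """Draw an equal share of labeled pairs from every source dataset.

    sources: dict mapping a source name to a DataFrame of labeled pairs.
    frac:    total fraction of labeled data to use (e.g., 0.02 for 2%),
             split evenly across the sources as in the experiment above.
    """
    per_source = frac / len(sources)
    parts = [df.sample(frac=per_source, random_state=seed) for df in sources.values()]
    return pd.concat(parts, ignore_index=True)

# Hypothetical usage: 2% overall, i.e., 1% from Pub-DA and 1% from Pub-DS.
# train_df = sample_multi_source({"Pub-DA": pub_da_labeled, "Pub-DS": pub_ds_labeled}, 0.02)
```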

Effectiveness of Source Selection Algorithm. We evaluated our source selection algorithm that ranks a given set of source datasets based on their relatedness to the target dataset. We used a wide variety of domains and many source-destination combinations. Overall, we tested the 5 domains described earlier and 20 datasets (dubbed 784 datasets) from the Magellan data repository [13]. We compare the order provided by our algorithm with the empirical order, where we measure the improvement provided by treating each of the datasets individually as the source. In each case, the ordering was correct.

[Plot data for Figure 10 omitted: F-Score improvement over baselines (%) for SVM and RF vs. Training Data (%), from 1% to 25%.]

Figure 10: Negative Transfer

Exp-4: Cross Domain Transfer. ML classifiers for ER take similarity vectors as input. Once the feature space is standardized, two similarity vectors - even if they are from different datasets - can be compared. Based on this observation, we could transfer from a dataset from a related domain whenever a dataset from the same domain is not available. To test this, we chose two related domains - Movies and Songs. Since Mv-IO is the larger dataset, we chose it as the source. Recall that (MSD-1, MSD-2) is a challenging dataset due to the way the MSD dataset was partitioned adversarially. Even in such a case, (Mv-IO, MSD-2) performed much better although it is from a different domain, as shown in Figure 9(b).
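To make the cross-domain argument concrete, the sketch below shows one simple way of building such schema-independent similarity vectors from tuple-level DRs; the element-wise operations here are illustrative assumptions rather than the exact featurization used in our experiments, which follows the distributed-representation encoding described earlier in the paper.

```python
import numpy as np

def tuple_dr(values, word_vectors, dim=50):
    # Average the word DRs over all attribute values of the tuple (schema-free).
    words = " ".join(str(v) for v in values).lower().split()
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def similarity_vector(t1, t2, word_vectors, dim=50):
    # Fixed-length feature vector for a tuple pair; comparable across datasets
    # because its length depends only on the DR dimension, not on the schema.
    a = tuple_dr(t1, word_vectors, dim)
    b = tuple_dr(t2, word_vectors, dim)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return np.concatenate([np.abs(a - b), a * b, [cos]])  # length 2 * dim + 1
```

Because every dataset, movie or song alike, is mapped into the same (2 * dim + 1)-dimensional space, a classifier trained on one dataset's pairs can be applied to, or fine-tuned on, another's.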

Exp-5: Negative Transfer. It is possible that in certain cases, TL can negatively affect performance. We constructed MSD-1 and MSD-2 adversarially to identify such a behavior. Figure 10 shows the results. We fix D^L_T = ∅ and the experiment corresponds to Scenario 1. We measure the improvement obtained by the instance weighting algorithm over NvT. For small data sizes, our approach is slightly better than NvT. As the size of the training data increases, the performance slightly improves before turning negative.

6.3 Summary of the Experiments

1. We have shown that TL for ER is not only feasible but also quite effective and desirable (Exp-1).

2. Our proposed approaches provide excellent performance in all the scenarios (Exp-2). Furthermore, they enable ER in previously infeasible cases such as Scenario 1 and improve the performance in previously ineffective cases such as Scenario 3.

3. In many cases, multi-source TL can improve the performance over a single source (Exp-3). We also show that our proposed heuristics (Section 5.2) can help identify the most promising sources.

4. Our Fellegi-Sunter model based formulation has an interesting side effect: one could use the similarity vectors from a (even tangentially) related dataset as source DS for a given target dataset DT (Exp-4).

5. We highlight the phenomenon of negative transfer, where TL results in decreased performance (Exp-5). In fact, we had to partition a benchmark dataset in an adversarial manner to demonstrate negative transfer. We found that this occurs rarely in ER and can be mitigated by techniques from Section 5.2.

7. RELATED WORK

(ML-based) Entity Resolution. ER has been widely studied by using (a) declarative rules [49, 9], (b) ML-based models [30, 19, 29], and (c) experts or crowdsourcing [54, 23] (see surveys [20, 38] for more details). ML-based ER systems often provide the state-of-the-art performance [18, 30, 19, 29], by following the Fellegi-Sunter model [21]. Notable prior works use SVM [9], active learning [48], clustering [12], and Markov logic [50]. Recently, deep learning-based methods have been successfully applied for ER [19, 37]. These approaches leverage DRs as the fundamental building block and propose a number of DL architectures for ER.

ER Systems. A comprehensive survey of current EM systems can be found in [29]. Popular tools include Magellan [29], DeepER [19], SERF [6], and Dedoop [28]. One of our key objectives is to adapt prior TL algorithms such that they require minimal changes to these existing ER pipelines.

Transfer Learning is a fundamental ML technique to improve the performance of a classifier on a task with limited to no labeled data by leveraging a related task (see [40, 17] for more details). Chapter 8 of [15] provides a simplified version of some of the theorems that are applied in our paper. It is possible to transfer features, training data, or ML classifiers. We focused on the case of transferring features through distributed representations and transferring training data. Transferring ML classifiers is often complex and requires non-trivial implementations. Furthermore, such transfers are also challenging to integrate into existing ER pipelines.

One prior work that uses TL for ER is [39]. It focused on the specific case of multi-source ER and proposed a transfer algorithm for linear classifiers such as SVM. However, their approach requires complex modifications to the objective function and the use of composite gradient methods.

Distributed Representations. Recently, DRs have been identified as a promising approach to encode tuples for ER [19, 37]. DRs are heavily used in NLP [34]. Popular pre-trained word embeddings include word2vec [36] from Google, GloVe [43] from Stanford, fastText [11] from Facebook, and ELMo [45] from AllenAI. The definition of DRs can also be extended from words to sentences, paragraphs, and documents [32]. Recent theoretical works have shown that such complex methods are often outperformed in a TL setting by simpler compositional approaches [56, 3], such as those used in our paper.

8. CONCLUSION

In this paper, we have investigated the feasibility of performing ER on a dataset DT that has limited to no training data by reusing labeled data and features from a related dataset DS. We proposed an effective approach to convert each tuple into a standard feature space based on distributed representations. We identified 4 common ER scenarios and proposed effective algorithms that can be readily integrated into existing ER pipelines. Our extensive experiments over 12 datasets from 5 domains have shown that TL does provide meaningful benefits in improving the performance of an ML classifier on the target dataset. Promising research directions to further explore include the seamless integration of TL for ER with active learning, weak supervision, data/feature augmentation, and generative adversarial approaches from deep learning.

9. REFERENCES[1] Benchmark datasets for entity resolution.

https://dbs.uni-leipzig.de/en/research/

projects/object_matching/fever/benchmark_

datasets_for_entity_resolution.

[2] A. Arasu, M. Gotz, and R. Kaushik. On activelearning of record matching packages. In Proceedingsof the 2010 ACM SIGMOD International Conferenceon Management of data, pages 783–794. ACM, 2010.

[3] S. Arora, Y. Liang, and T. Ma. A simple buttough-to-beat baseline for sentence embeddings.International Conference on Learning Representations,2017.

[4] P. Baldi, S. Brunak, Y. Chauvin, C. A. Andersen, andH. Nielsen. Assessing the accuracy of predictionalgorithms for classification: an overview.Bioinformatics, 16(5):412–424, 2000.

[5] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira.Analysis of representations for domain adaptation. InAdvances in neural information processing systems,pages 137–144, 2007.

[6] O. Benjelloun, H. Garcia-molina, H. Kawai, T. E.Larson, D. Menestrina, Q. Su, S. Thavisomboon, andJ. Widom. Generic entity resolution in the serfproject. In IEEE Data Engineering Bulletin, 2006.

[7] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, andP. Lamere. The million song dataset. In Ismir,volume 2, page 10, 2011.

[8] S. Bickel, M. Bruckner, and T. Scheffer. Discriminativelearning for differing training and test distributions. InProceedings of the 24th international conference onMachine learning, pages 81–88. ACM, 2007.

[9] M. Bilenko and R. J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In KDD, 2003.

[10] J. Blitzer, R. McDonald, and F. Pereira. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing, pages 120–128. Association for Computational Linguistics, 2006.

[11] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.

[12] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In KDD, 2002.

[13] S. Das, A. Doan, P. S. G. C., C. Gokhale, and P. Konda. The Magellan data repository. https://sites.google.com/site/anhaidgroup/projects/data.

[14] H. Daume III. Frustratingly easy domain adaptation. ACL 2007, page 256, 2007.

[15] H. Daume III. A course in machine learning. CIML, 2012.

[16] H. Daume III, A. Kumar, and A. Saha. Frustratingly easy semi-supervised domain adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 53–59. Association for Computational Linguistics, 2010.

[17] H. Daume III and D. Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.

[18] X. L. Dong and T. Rekatsinas. Data integration and machine learning: A natural synergy. In Proceedings of the 2018 International Conference on Management of Data, pages 1645–1650. ACM, 2018.

[19] M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, and N. Tang. Distributed representations of tuples for entity resolution. In PVLDB, 2018.

[20] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1), 2007.

[21] I. Fellegi and A. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328), 1969.

[22] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, 2014.

[23] C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. W. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, pages 601–612, 2014.

[24] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[25] J. Huang, A. Gretton, K. M. Borgwardt, B. Scholkopf, and A. J. Smola. Correcting sample selection bias by unlabeled data. In Advances in neural information processing systems, pages 601–608, 2007.

[26] M. Khodak, N. Saunshi, Y. Liang, T. Ma, B. Stewart, and S. Arora. A la carte embedding: Cheap but effective induction of semantic feature vectors. Proceedings of the Association for Computational Linguistics, 2018.

[27] G. King and L. Zeng. Logistic regression in rare events data. Political analysis, 9(2):137–163, 2001.

[28] L. Kolb, A. Thor, and E. Rahm. Dedoop: Efficient deduplication with Hadoop. PVLDB, 2012.

[29] P. Konda, S. Das, P. Suganthan GC, A. Doan, A. Ardalan, J. R. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, et al. Magellan: Toward building entity matching management systems. PVLDB, 2016.

[30] H. Kopcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. PVLDB, 2010.

[31] A. Kumar, A. Saha, and H. Daume. Co-regularization based semi-supervised domain adaptation. In Advances in neural information processing systems, pages 478–486, 2010.

[32] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, 2014.

[33] W. Li, L. Duan, D. Xu, and I. W. Tsang. Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 36(6):1134–1148, 2014.

[34] C. Manning. Representations for language: From word embeddings to sentence meanings. https://simons.berkeley.edu/talks/christopher-manning-2017-3-27, 2017.

[35] F. J. Martin. A simple machine learning method to detect covariate shift. https://blog.bigml.com/2014/01/03/simple-machine-learning-to-detect-covariate-shift/.

[36] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

[37] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra. Deep learning for entity matching: A design space exploration. In SIGMOD, 2018.

[38] F. Naumann and M. Herschel. An introduction to duplicate detection. Synthesis Lectures on Data Management, 2(1):1–87, 2010.

[39] S. N. Negahban, B. I. Rubinstein, and J. G. Gemmell. Scaling multiple-source entity resolution using statistically efficient transfer learning. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 2224–2228. ACM, 2012.

[40] S. J. Pan, Q. Yang, et al. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.

[41] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederee, and W. Nejdl. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Transactions on Knowledge and Data Engineering, 25(12):2665–2682, 2013.

[42] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. JMLR, 12(Oct):2825–2830, 2011.

[43] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[44] D. N. Perkins, G. Salomon, et al. Transfer of learning. International encyclopedia of education, 2:6452–6457, 1992.

[45] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. In NAACL, volume 1, pages 2227–2237, 2018.

[46] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Re. Data programming: Creating large training sets, quickly. In Advances in neural information processing systems, pages 3567–3575, 2016.

[47] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, volume 898, pages 1–4, 2005.

[48] S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, 2002.

[49] R. Singh, V. Meduri, A. K. Elmagarmid, S. Madden, P. Papotti, J. Quiane-Ruiz, A. Solar-Lezama, and N. Tang. Generating concise entity matching rules. In PVLDB, 2017.

[50] P. Singla and P. Domingos. Entity resolution with Markov logic. In ICDM, 2006.

[51] M. Stonebraker and I. F. Ilyas. Data integration: The current status and the way forward. Data Engineering, page 3, 2018.

[52] M. Sugiyama, N. D. Lawrence, A. Schwaighofer, et al. Dataset shift in machine learning. The MIT Press, 2017.


[53] M. Sugiyama, S. Nakajima, H. Kashima, P. V. Buenau, and M. Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in neural information processing systems, pages 1433–1440, 2008.

[54] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. CrowdER: Crowdsourcing entity resolution. PVLDB, 5(11):1483–1494, 2012.

[55] L. Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.

[56] J. Wieting, M. Bansal, K. Gimpel, and K. Livescu. Towards universal paraphrastic sentence embeddings. International Conference on Learning Representations, 2016.
