
[Communications in Computer and Information Science] Model-Based Software and Data Integration Volume 8 || Combining Effectiveness and Efficiency for Schema Matching Evaluation


Combining Effectiveness and Efficiency for Schema Matching Evaluation

Alsayed Algergawy, Eike Schallehn, and Gunter Saake

Department of Computer Science, Otto-von-Guericke University, 39106 Magdeburg, Germany

{alshahat,eike,saake}@iti.cs.uni-mgdeburg.de

Abstract. Schema matching plays a central role in many applications that require interoperability among heterogeneous data sources. A good evaluation of the different capabilities of schema matching systems has become vital as the complexity of such systems increases. The capabilities of matching systems incorporate different (possibly conflicting) aspects, among them match quality and match efficiency. The analysis of the efficiency of a schema matching system, if it is done at all, tends to be done separately from the analysis of effectiveness. In this paper, we present the trade-off between schema matching effectiveness and efficiency as a multi-objective optimization problem. This representation enables us to obtain a combined measure as a compromise between them. We combine both performance aspects in a weighted-average function to determine the cost-effectiveness of a schema matching system. We apply our proposed approach to evaluate two existing mainstream schema matching systems, namely COMA++ and BTreeMatch. Experimental results on both small-scale and large-scale schemas show that it is necessary to take the response time of the matching process into account, especially for large-scale schemas.

Keywords: Schema matching, Schema matching performance, Effectiveness, Efficiency, Cost-effectiveness.

1 Introduction

Schema matching is the task of identifying semantic correspondences between the elements of two or more schemas. It plays a central role in many data application scenarios [6,13,10]: in data integration, to identify and characterize inter-schema relationships between multiple (heterogeneous) schemas; in e-business, to help exchange messages between different XML formats; in semantic query processing, to map user-specified concepts in the query to schema elements; in the Semantic Web, to establish semantic correspondences between concepts of different websites' ontologies; and in data migration, to migrate legacy data from multiple sources into a new one [7].

To identify a solution for a particular match problem, it is important to understand which of the proposed techniques performs best. The performance of a schema matching system comprises two equally important factors, namely effectiveness and efficiency [15,8]. Effectiveness is concerned with the accuracy and correctness of the match result, while efficiency is concerned with the resources (time, memory, ...) consumed by the match system. Most existing system evaluations focus on analyzing the effectiveness of a schema matching system [2,16]. The analysis of the efficiency of a schema matching system, if it is done at all, tends to be done separately from the analysis of effectiveness [8]. In other words, effectiveness and efficiency are usually considered two very different dimensions, and the trade-off between these two dimensions has not been investigated in the schema matching community.

R.-D. Kutsche and N. Milanovic (Eds.): MBSDI 2008, CCIS 8, pp. 19–30, 2008. © Springer-Verlag Berlin Heidelberg 2008

Many real-world problems, such as the schema matching problem, involve multiple measures of performance that should be optimized simultaneously. Optimal performance according to one objective, if such an optimum exists, often implies unacceptably low performance in one or more of the other objective dimensions, creating the need for a compromise. In the schema matching problem, the performance of a matching system involves multiple aspects, among them effectiveness and efficiency. Optimizing one aspect, for example effectiveness, will affect the other aspects, such as efficiency. Hence, we need a compromise between them, and we can consider the trade-off between the effectiveness and the efficiency of matching as a multi-objective problem. In practice, multi-objective problems have to be reformulated as single-objective problems.

To this end, in this paper, we propose a method for computing the cost-effectiveness of a schema matching system. The method is intended to be used in a combined evaluation of schema matching systems. This evaluation concentrates on the cost-effectiveness of schema matching approaches, i.e. the trade-off between effectiveness and efficiency. The motivation is as follows. Suppose we want to compare two schema matching systems, A and B, for solving a specific matching problem P. System A is more effective than system B, while system B is more efficient than system A. The question that arises is: which system should be used to solve the given problem?

So far, most existing matching systems [3,10,12] evaluate their performance according to effectiveness only; hence they would all choose system A (the more effective one). This paper introduces a combined approach to evaluate the cost-effectiveness of matching systems based on the multi-objective optimization problem (MOOP). We apply the proposed approach to evaluate and compare two well-known systems, namely COMA++ [3] and BTreeMatch [9].

The rest of this paper is organized as follows: the next section gives an overview of schema matching. In the following section, we present schema matching performance, focusing on effectiveness and efficiency evaluations. Section 4 presents a combined measure for schema matching performance, the cost-effectiveness measure. In Section 5, experiments and results are discussed. Section 6 gives concluding remarks and our proposed future work.


2 Schema Matching: An Overview

In this section, we present the main definitions used in this paper.

Definition 1. (Schema) A schema is a description of the structure and the content of a model and consists of a set of related elements such as tables, columns, classes, or XML elements and attributes. □

By schema structure and schema content, we mean its schema-based properties and its instance-based properties, respectively.

Definition 2. (Match) Match is a function that takes two or more schemas as input and produces a mapping as output. □

Definition 3. (Mapping) A mapping is a set of mapping elements specifying the correspondences between schema elements. Each mapping element is a 5-tuple $\langle ID, S.s_i, T.t_j, R, semrel \rangle$ where: $ID$ is an identifier for the mapping element, $s_i$ is an element of the first schema, $t_j$ is an element of the second one, and $R$ indicates the similarity value between 0 and 1. The value 0 means strong dissimilarity, while the value 1 means strong similarity. $semrel$ is the semantic relationship between two (or more) elements, such as equivalence, synonymy, etc. □
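The 5-tuple of Definition 3 maps naturally onto a small record type. The following is a minimal sketch, not taken from any of the evaluated systems; the class name `MappingElement`, its field names, and the example element names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingElement:
    """One mapping element <ID, S.s_i, T.t_j, R, semrel> (Definition 3)."""
    id: str            # ID: identifier of the mapping element
    source: str        # s_i: element of the first schema S
    target: str        # t_j: element of the second schema T
    similarity: float  # R: similarity value between 0 and 1
    semrel: str        # semantic relationship, e.g. "equivalence"

    def __post_init__(self):
        # Definition 3 restricts R to the interval [0, 1].
        if not 0.0 <= self.similarity <= 1.0:
            raise ValueError("similarity R must lie between 0 and 1")


# A mapping is a collection of mapping elements (hypothetical element names).
mapping = [
    MappingElement("m1", "S.name", "T.fullName", 0.9, "equivalence"),
    MappingElement("m2", "S.addr", "T.address", 0.8, "synonym"),
]
```

Making the record immutable and validating R on construction keeps every mapping element consistent with the definition.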

To identify the correspondences among schema elements, various matching algorithms have been proposed and numerous schema matching systems have been developed. Current matching algorithms can be classified either by the information they exploit or by their methodologies. According to the information they exploit, matchers can be [13]: individual matchers, which exploit only one type of element property in a single algorithm, or combining matchers, which can be one of two types: hybrid matchers (which integrate multiple criteria [10]) and composite matchers (which combine the results of independently executed matchers [5,3]). The information exploited by a matcher is called element properties. These properties can be classified as atomic or structural properties; schema-based or instance-based properties; and auxiliary properties.

According to the methodology of the matching algorithm, matchers can be classified as either rule-based or learner-based [6,15]. Table 1 summarizes the advantages and disadvantages of both kinds of systems.

Table 1. Comparison of Rule-based and Learner-based Systems

| Criteria                  | Rule-based   | Learner-based  |
|---------------------------|--------------|----------------|
| exploited information     | schema-based | instance-based |
| nature of schema elements | more static  | more dynamic   |
| schema size               | small size   | large size     |
| training phase            | not needed   | needed         |
| response time             | less time    | more time      |


3 Schema Matching Performance

In order to motivate the importance of trading off performance aspects during schema matching evaluation, we present the schema matching problem as follows: consider two schemas S and T having n and m elements, respectively. We can distinguish between two types of matching, simple and complex:

– Simple matching: for each element s of S, find the most semantically similar element t of T. This problem is referred to as one-to-one matching.

– Complex matching: for each element s (or a set of elements) of S, find the most semantically similar set of elements $t_1, t_2, \ldots, t_k$ of T.

The solution to the above problems is not unique; this is due to inherent difficulties in schema matching.

The authors of [17] introduce the concepts of matching state and matching space. A matching state represents a possible matching result, with complexity of $n \times m$, and the matching space is the set of all possible matching states, of order $2^{n \times m}$. For example, for n = 5 and m = 4, to identify a suitable match result for a given schema matching problem, a schema matching system has to search a matching space of $2^{5 \times 4} = 1048576$ states (a very large matching space, although the schemas are very small).
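The growth of the matching space can be checked with a one-line computation. This is only an illustrative sketch; the function name `matching_space_size` is an assumption and does not come from [17].

```python
def matching_space_size(n: int, m: int) -> int:
    """Number of matching states for schemas with n and m elements: 2^(n*m)."""
    return 2 ** (n * m)


# The example from the text: n = 5, m = 4.
print(matching_space_size(5, 4))  # 1048576
```

Because the exponent is the product of the two schema sizes, adding a single element to either schema multiplies the size of the space by $2^m$ or $2^n$, respectively.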

Therefore, the schema matching problem is an optimization problem that searches for the best solution (matching state) among a vast number of available solutions (the matching space). However, the problem is not only to identify the best solution but also to obtain this solution in a reasonable time. Unfortunately, most previous evaluations ignore the response time of the match system and often assume that matching is an off-line process. Hence, in order to cope with large-scale schemas, we should take the processing time into account. In the following subsection, we consider both performance aspects: effectiveness and efficiency.

For a better overview of the current state of the art in evaluating schema matching approaches, good reviews can be found in [2,16,8].

3.1 Performance Evaluation

To unify the performance evaluation of schema matching systems, the following criteria should be taken into account:

– Input: what kind of input information has been used (schema-based, instance-based, auxiliary information)?

– Output: what information has been included in the mappings, i.e. the type of output, the output formats, and the complexity of the mappings?

– Performance measures: what metrics have been chosen to quantify the match result?

– Effort: what kind of manual effort has been measured? To assess the manual effort, one should consider both the pre-match effort required before an automatic matcher can run (such as training of learner-based matchers or specifying auxiliary information) and the post-match effort needed to add the false negatives to, and remove the false positives from, the final match result.


Effectiveness Measures: First, the match task should be solved manually to get the real mappings $R_m$. Then, the matching system solves the same problem to obtain the automatic mappings $A_m$. After identifying both real and automatic mappings, it is possible to define some terms that are used in computing match effectiveness. False negatives, $A = R_m - A_m$, are the matches that are needed but not identified by the system; true positives, $B = R_m \cap A_m$, are the correct matches that are correctly identified by the system; false positives, $C = A_m - R_m$, are the false matches that are nevertheless identified by the system; and true negatives, $D$, are the false matches that are correctly discarded by the system.
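The three sets A, B, and C can be obtained with plain set operations once the real and automatic mappings are available. The sketch below assumes, for illustration only, that each correspondence is represented as a (source element, target element) pair; the function name and the example element names are hypothetical.

```python
def confusion_sets(real, automatic):
    """Return (A, B, C): false negatives, true positives, false positives."""
    real, automatic = set(real), set(automatic)
    false_negatives = real - automatic   # A = Rm - Am
    true_positives = real & automatic    # B = Rm ∩ Am
    false_positives = automatic - real   # C = Am - Rm
    return false_negatives, true_positives, false_positives


# Hypothetical real (manual) and automatic mappings.
real_mappings = {("name", "fullName"), ("addr", "address"), ("tel", "phone")}
auto_mappings = {("name", "fullName"), ("addr", "address"), ("id", "phone")}

A, B, C = confusion_sets(real_mappings, auto_mappings)
print(len(A), len(B), len(C))  # 1 2 1
```

The true negatives D are not computed here, since they do not enter the precision and recall measures.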

Precision and Recall: Based on the real and automatic mappings, two measures can be computed: precision and recall, which originate from the information retrieval (IR) field [14]. Precision P is computed as $P = \frac{|B|}{|B| + |C|}$ and recall R is computed as $R = \frac{|B|}{|B| + |A|}$. However, neither precision nor recall alone can accurately assess the match quality. Hence, it is necessary to consider a trade-off between them. There are several methods to handle such a trade-off; one of them is to combine both measures. The most used combined measures are:

– F-Measure: the weighted harmonic mean of precision and recall. The traditional F-measure, or balanced F-score, is:

$$F = \frac{2 \times |B|}{(|B| + |A|) + (|B| + |C|)} = \frac{2 \times P \times R}{P + R} \qquad (1)$$

– Overall: developed specifically in the schema matching context, it embodies the idea of quantifying the effort needed to add false negatives and to remove false positives. It was introduced in the Similarity Flooding (SF) system [12] and is given by

$$OV = 1 - \frac{|A| + |C|}{|A| + |B|} = R \times \left(2 - \frac{1}{P}\right) \qquad (2)$$
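The equality of the two forms of Overall in equation (2) follows directly from the definitions of precision and recall. A short derivation, assuming $|B| > 0$ so that P is defined and non-zero:

```latex
\begin{align*}
R \times \left(2 - \frac{1}{P}\right)
  &= \frac{|B|}{|A| + |B|} \left(2 - \frac{|B| + |C|}{|B|}\right)
   = \frac{|B|}{|A| + |B|} \cdot \frac{|B| - |C|}{|B|}
   = \frac{|B| - |C|}{|A| + |B|} \\
1 - \frac{|A| + |C|}{|A| + |B|}
  &= \frac{|A| + |B| - |A| - |C|}{|A| + |B|}
   = \frac{|B| - |C|}{|A| + |B|}
\end{align*}
```

Both expressions reduce to the same fraction, which becomes negative whenever $|C| > |B|$, i.e. whenever $P < 0.5$.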

To determine which one to use as a measure of the effectiveness of a schema matching system, we compare the two combined measures. Figure 1 as well as equations (1) and (2) provide a good basis for this comparison. Since the overall measure is more sensitive to precision than to recall, in this paper we consider the overall (OV) measure as the indicator for schema matching effectiveness.
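The different sensitivity of the two combined measures can be seen on a small numerical example. The following sketch implements equations (1) and (2) on set cardinalities; the function names and the two sets of counts are assumptions chosen for illustration, and the code assumes at least one true positive so that no division by zero occurs.

```python
def precision(tp: int, fp: int) -> float:
    """P = |B| / (|B| + |C|)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """R = |B| / (|B| + |A|)."""
    return tp / (tp + fn)


def f_measure(p: float, r: float) -> float:
    """Balanced F-measure, equation (1)."""
    return 2 * p * r / (p + r)


def overall(p: float, r: float) -> float:
    """Overall, equation (2); negative when precision drops below 0.5."""
    return r * (2 - 1 / p)


# Two hypothetical match results with the same recall but different precision.
for tp, fp, fn in [(8, 2, 2), (4, 6, 1)]:
    p, r = precision(tp, fp), recall(tp, fn)
    print(f"P={p:.2f} R={r:.2f} F={f_measure(p, r):.2f} OV={overall(p, r):.2f}")
```

In the second case recall is unchanged at 0.8, yet Overall falls from 0.6 to -0.4 while the F-measure only falls from 0.8 to about 0.53, which illustrates why Overall penalizes low precision more strongly.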

Efficiency Evaluation: Efficiency is mainly captured by two properties: speed (the time it takes for an operation to complete) and space (the memory or non-volatile storage used). In order to obtain a good efficiency measure for schema matching systems, we should consider the following factors:

– A first factor to consider when evaluating schema matching efficiency is the critical phase of a matching process. A matching process consists of many phases and each phase contains multiple steps.

– A second factor is its automatability. Human effort is very expensive; matching approaches that require excessive human interaction are impractical.

– A third factor that impacts schema matching efficiency is the type of methodology used to compute similarity values.

[Figure: 3-D plot of F-measure and Overall (vertical axis, from −1 to 1) over Precision and Recall (each from 0 to 1).]

Fig. 1. F-Measure and Overall against Precision and Recall

Recent schema matching systems [16,8] introduce the time measure as a criterion of performance evaluation. In this paper, we take the time measure (T) as the indicator for schema matching efficiency.

To sum up, with the emergence of applications that need fast matching systems, such as incremental schema matching systems [1], and of applications that match large schemas [4], the need to improve schema matching efficiency increases. For instance, in a crisis management information system, it is not sufficient to provide users with the best possible mappings; it is also necessary to obtain these mappings within a reasonable time.

4 Combining Effectiveness and Efficiency

From the above criteria, we conclude that the trade-off between the effectiveness and the efficiency of a schema matching system can be considered a multi-objective optimization problem (MOOP). In this section, we present a definition of the MOOP and the approaches used to solve it [18,11]. In the following definitions we assume minimization (without loss of generality).

Definition 4. (Multi-objective Optimization Problem) An MOOP is defined as "Find $x$ that minimizes $F(x) = (f_1(x), f_2(x), \ldots, f_k(x))^T$ s.t. $x \in S$ and $x = (x_1, x_2, \ldots, x_n)^T$", where $f_1(x), f_2(x), \ldots, f_k(x)$ are the $k$ objective functions, $(x_1, x_2, \ldots, x_n)$ are the $n$ optimization parameters, and $S \subseteq \mathbb{R}^n$ is the solution space. □

In our approach, we have two objective functions: overall as a measure of effectiveness and time as a measure of efficiency. Therefore, we can rewrite the multi-objective function as $CE = (f_1(OV), f_2(T))$, where $CE$ is the cost-effectiveness, which is to be maximized here. In a multi-objective problem, the optimum consists of a set of solutions, rather than a single solution as in global optimization. This optimal set is known as the Pareto optimal set and is defined as follows: $P := \{x \in S \mid \nexists\, x' \in S : F(x') \prec F(x)\}$, where $\prec$ denotes Pareto dominance. Pareto optimal solutions are known as the non-dominated or efficient solutions.
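A Pareto optimal set can be computed for a finite list of candidate solutions by discarding every candidate that is dominated by another one. The sketch below assumes minimization, as in Definition 4, and uses hypothetical objective vectors of the form (1 − OV, T); the function names are illustrative and not part of the paper's method.

```python
def dominates(u, v):
    """True if objective vector u dominates v under minimization:
    u is no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(u, v))
            and any(a < b for a, b in zip(u, v)))


def pareto_set(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical systems described by (1 - OV, response time in seconds).
candidates = [(0.2, 4.0), (0.6, 2.0), (0.7, 3.0), (0.2, 5.0)]
print(pareto_set(candidates))  # [(0.2, 4.0), (0.6, 2.0)]
```

Neither of the two remaining candidates is better than the other in both objectives, which is exactly the situation in which a preference between effectiveness and efficiency has to be stated.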

There are many methods available to tackle multi-objective optimization problems. Among them, we choose a priori articulation of preference information. This means that, before the actual optimization is conducted, the different objectives are aggregated into one single figure of merit. This can be done in many ways; we choose the weighted-sum approach.

Weighted-Sum Approaches: The easiest and perhaps most widely used method is the weighted-sum approach. The objective function is formulated as a weighted function, given as min (or max) $\sum_{i=1}^{k} w_i \times f_i(x)$ s.t. $x \in S$ and $w_i \in \mathbb{R}$, $w_i > 0$, $\sum w_i = 1$. By choosing different weightings for the different objectives, the preference of the application domain is taken into account. As the objective functions are generally of different magnitudes and units, they have to be normalized first.
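The weighted-sum formulation can be sketched in a few lines. The function below is an illustration under the constraints stated above (strictly positive weights that sum to 1) and assumes the objective values passed in have already been normalized; the function name is hypothetical.

```python
import math


def weighted_sum(objectives, weights):
    """Aggregate normalized objective values f_i(x) with weights w_i."""
    if len(objectives) != len(weights):
        raise ValueError("need exactly one weight per objective")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * f for w, f in zip(weights, objectives))


# Two normalized objective values combined with weights 0.8 and 0.2.
print(round(weighted_sum([0.8, 1.0], [0.8, 0.2]), 2))  # 0.84
```

Checking the weight constraints up front makes an inconsistent preference specification fail early instead of silently producing an incomparable score.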

4.1 The Cost-Effectiveness of Schema Matching

Suppose we have two schema matching systems, A and B, to solve the same matching problem. Let $OV_A$ and $T_A$ represent the overall and time measures of system A, respectively, while $OV_B$ and $T_B$ denote the same measures for system B.

To analyze the cost-effectiveness of a schema matching system, we make use of the MOOP and one of the methods for solving it, namely the weighted-sum approach. Here, we have two objectives, namely effectiveness (measured by overall, OV) and efficiency (measured by response time, T). Obviously, we cannot directly add an overall value to a response time value, since the resulting sum would be meaningless due to the difference in units. The overall value of a schema matching system is a normalized value, i.e. its range is between 0 and 1, while the processing time is measured in seconds. Therefore, before combining the two quantities (e.g. in a weighted average), we should normalize the processing time.

To normalize the response time, the response time of the slower system (here $T_A$) is normalized to the value 1, while the response time of the faster system ($T_B$) is normalized to a value in the range [0,1] by dividing $T_B$ by $T_A$, i.e. $\frac{T_B}{T_A}$.

We call the objective function of a schema matching system its cost-effectiveness (CE); it is to be maximized. The cost-effectiveness is given by

$$CE = \sum_{i=1}^{2} w_i \times f_i(x) = w_1 \times OV_n + w_2 \times \frac{1}{T_n} \qquad (3)$$

where $w_1$ is the weighting for the overall objective, denoted by $w_{ov}$, and $w_2$ is the weighting for the time objective, denoted by $w_t$. In the case of comparing two schema matching systems, we have the following normalized quantities: $OV_{An}$, $OV_{Bn}$, $T_{An}$, and $T_{Bn}$, where $OV_{An} = OV_A$, $OV_{Bn} = OV_B$, $T_{An} = 1$, and $T_{Bn} = \frac{T_B}{T_A}$. We now endeavor to come up with a single formula involving two quantities, namely normalized overall $OV_n$ and normalized response time $T_n$, where each quantity is associated with a numerical weight to indicate its importance in the evaluation of the overall performance and to enrich the flexibility of the method. We write the equations that describe the cost-effectiveness (CE) of each system as follows:

$$CE_A = w_{ov,A} \times OV_{An} + w_{t,A} \times \frac{1}{T_{An}} \qquad (4)$$

$$CE_B = w_{ov,B} \times OV_{Bn} + w_{t,B} \times \frac{1}{T_{Bn}} \qquad (5)$$

where $w_{ov}$ and $w_t$ are the numerical weights for the overall and response time quantities, respectively. If we set the time weight to zero, i.e. $w_t = 0$, then the cost-effectiveness reduces to the usual evaluation, which considers only the effectiveness aspect ($w_{ov} = 1$).
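Equations (4) and (5), together with the time normalization described above, can be sketched as follows. The two systems and their measurements are hypothetical values chosen only to illustrate the computation, and the function names are assumptions.

```python
def normalize_times(times):
    """Divide each response time by the slowest one, so the slowest maps to 1."""
    slowest = max(times.values())
    return {name: t / slowest for name, t in times.items()}


def cost_effectiveness(ov_n, t_n, w_ov, w_t):
    """CE = w_ov * OV_n + w_t * (1 / T_n), as in equations (4) and (5)."""
    return w_ov * ov_n + w_t * (1.0 / t_n)


# Hypothetical measurements: overall values and response times in seconds.
overall_values = {"A": 0.7, "B": 0.5}
response_times = {"A": 10.0, "B": 4.0}

t_norm = normalize_times(response_times)
for name in ("A", "B"):
    ce = cost_effectiveness(overall_values[name], t_norm[name], 0.6, 0.4)
    print(name, round(ce, 2))  # A 0.82, then B 1.3
```

With $w_t = 0$ and $w_{ov} = 1$ the function simply returns the overall value, matching the special case described in the text. Note that, because the time term is $1/T_n$ with $T_n \le 1$, CE is not bounded by 1.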

The most cost-effective schema matching system is the one with the larger CE as measured by the above formulas. Equations 4 and 5 present a simple but valuable method to combine the effectiveness and the efficiency of a schema matching system. Moreover, this method is based on an established technique, the weighted-sum approach to the multi-objective optimization problem. Although the method is effective, it still has an inherent problem: it is difficult to determine good values for the numerical weights, since the relative importance of overall and response time is highly domain-dependent and, of course, very subjective. For example, when we are dealing with small-scale schemas, the overall measure is more dominant than the response time. Hence, we may select $w_{ov} = 0.8$ and $w_t = 0.2$. For time-critical systems, the response time may have the same importance as the overall measure; then we may choose $w_{ov} = w_t = 0.5$.

To address this problem, we need an optimization technique that enables us to determine the optimal (or close to optimal) numerical weights. In this paper, we set these values manually in the selected case studies. Automatic determination of the numerical weight values is left for future work.
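How strongly the outcome depends on the weights can be illustrated by sweeping $w_{ov}$ and recording which system obtains the larger CE. The sketch below is not the automatic weight determination left for future work; it only re-evaluates the CE formula for several weights. The overall values and response times are the ones reported for the small-scale experiment in Section 5.3, and the function names are assumptions.

```python
def cost_effectiveness(ov, t_norm, w_ov):
    """CE with w_t = 1 - w_ov, so that the two weights sum to 1."""
    w_t = 1.0 - w_ov
    return w_ov * ov + w_t / t_norm


# (overall, normalized response time); the slower system is normalized to 1.
systems = {
    "COMA++": (0.8, 0.9 / 0.9),
    "BTreeMatch": (0.3, 0.6 / 0.9),
}


def best_system(w_ov):
    """Name of the system with the larger cost-effectiveness for this weight."""
    return max(systems, key=lambda name: cost_effectiveness(*systems[name], w_ov))


for w_ov in (0.2, 0.4, 0.6, 0.8):
    print(w_ov, best_system(w_ov))
```

For these values the preferred system changes between $w_{ov} = 0.4$ and $w_{ov} = 0.6$: the faster system wins when response time is weighted heavily, and the more effective system wins otherwise.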

5 Experimental Evaluation

We evaluate our approach by comparing two recent, well-known schema matching systems, namely COMA++ and BTreeMatch. We have obtained both systems from their open source distributions¹. All the experiments were performed using the XBenchMatch tool developed in [8]. Our proposed approach can easily be applied to other schema matching systems; the problem is that it is hard to find available matching prototypes to test. We first briefly describe the two evaluated systems according to the performance criteria described above. Then we describe the data sets used. We then present the results of applying the proposed approach to evaluate the different schema matching systems.

¹ http://dbs.uni-leipzig.de/Research/coma.html and http://www.lirmm.fr/~duchatea/XBenchMatch

5.1 Evaluated Systems

Motivated by the fact that the two matching systems COMA++ and BTreeMatch are available through their open distributions, we use them to validate our approach. The two systems share some features and differ in others. The shared features are the following: both are schema-based approaches; they utilize rule-based algorithms; they accept XML schemas as input and produce element-level (one-to-one) mappings; they need pre-match effort, e.g. tuning match parameters and defining the match strategy; and they evaluate matching effectiveness using precision, recall, and F-measure. The two systems differ in the following points: COMA++ exploits an external dictionary as auxiliary information; it uses a rich library of matchers including simple, hybrid, fragment, and context matchers; and it did not consider matching efficiency in its evaluation. On the other hand, BTreeMatch does not utilize any auxiliary information sources; it uses a hybrid matcher based on a B-tree index; and it deals with large-scale schemas, measuring response time as the measure of matching efficiency.

5.2 Data Set

We used the same data sets described in [8]. To make this paper self-contained, we summarize the properties of the data sets in Table 2. The first one describes a person, the second is related to business orders, the third one represents university courses, and the last one comes from the biology domain. These data sets are tested on the COMA++ and BTreeMatch systems to determine their cost-effectiveness and to compare them.

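Table 2. Data set details from [8]

|                     | Person | University | Order  | Biology |
|---------------------|--------|------------|--------|---------|
| No. nodes (S1/S2)   | 11/10  | 18/18      | 20/844 | 719/80  |
| Avg. no. nodes      | 11     | 18         | 432    | 400     |
| Max. depth (S1/S2)  | 4/4    | 5/3        | 3/3    | 7/3     |
| No. mappings        | 5      | 15         | 10     | 57      |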

5.3 Experimental Results

In this section, we show the experimental results of applying our approach to COMA++ and BTreeMatch using the XBenchMatch tool developed in [8].

Small-scale Schemas: The cost-effectiveness of the tested matchers on small-scale schemas, such as the university and person schemas, can be computed by the following equations:

$$CE_{\mathrm{COMA++},s} = w_{OV} \times OV_{\mathrm{COMA++}} + w_t \times \frac{1}{T_{\mathrm{COMA++},n}} \qquad (6)$$

$$CE_{\mathrm{BTM},s} = w_{OV} \times OV_{\mathrm{BTM}} + w_t \times \frac{1}{T_{\mathrm{BTM},n}} \qquad (7)$$

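where $OV_{\mathrm{COMA++}} = 0.8$, $OV_{\mathrm{BTM}} = 0.3$, $T_{\mathrm{COMA++}} = 0.9$ s, $T_{\mathrm{BTM}} = 0.6$ s, $w_{OV} = 0.8$ (for small-scale schemas), and $w_t = 0.2$. Then

$$CE_{\mathrm{COMA++},s} = 0.64 + \frac{0.2}{1} = 0.84, \quad \text{and} \quad CE_{\mathrm{BTM},s} = 0.24 + \frac{0.2}{0.6/0.9} = 0.54 \approx 0.5$$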

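Large-scale Schemas: The cost-effectiveness of the tested matchers on large-scale schemas, such as the biology schema, can be computed by the following equations:

$$CE_{\mathrm{COMA++},l} = w_{OV} \times OV_{\mathrm{COMA++}} + w_t \times \frac{1}{T_{\mathrm{COMA++},n}} \qquad (8)$$

$$CE_{\mathrm{BTM},l} = w_{OV} \times OV_{\mathrm{BTM}} + w_t \times \frac{1}{T_{\mathrm{BTM},n}} \qquad (9)$$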

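where $OV_{\mathrm{COMA++}} = 0.4$, $OV_{\mathrm{BTM}} = 0.8$, $T_{\mathrm{COMA++}} = 4$ s, $T_{\mathrm{BTM}} = 2$ s, $w_{OV} = 0.6$ (for large-scale schemas), and $w_t = 0.4$. Then

$$CE_{\mathrm{COMA++},l} = 0.24 + \frac{0.4}{1} = 0.64, \quad \text{and} \quad CE_{\mathrm{BTM},l} = 0.48 + \frac{0.4}{2/4} = 1.28$$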
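The calculations for both schema sizes can be reproduced with a short script. This is a sketch that re-implements equations (6) to (9) on the values stated above; the function and variable names are assumptions and the script is not part of XBenchMatch.

```python
def cost_effectiveness(ov, t, t_slowest, w_ov, w_t):
    """CE = w_ov * OV + w_t * (1 / T_n), with T_n = T / T_slowest."""
    t_n = t / t_slowest
    return w_ov * ov + w_t / t_n


def evaluate(case, w_ov, w_t):
    """Return the CE of every system in a case, rounded to two decimals."""
    slowest = max(t for _, t in case.values())
    return {name: round(cost_effectiveness(ov, t, slowest, w_ov, w_t), 2)
            for name, (ov, t) in case.items()}


# (overall, response time in seconds) as stated in the text.
small_scale = {"COMA++": (0.8, 0.9), "BTreeMatch": (0.3, 0.6)}
large_scale = {"COMA++": (0.4, 4.0), "BTreeMatch": (0.8, 2.0)}

print(evaluate(small_scale, 0.8, 0.2))  # {'COMA++': 0.84, 'BTreeMatch': 0.54}
print(evaluate(large_scale, 0.6, 0.4))  # {'COMA++': 0.64, 'BTreeMatch': 1.28}
```

The small-scale value for BTreeMatch evaluates to 0.54, which is reported to one decimal place as 0.5.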

5.4 Discussion

The experiments show that each schema matching prototype is best suited to certain situations. For example (see Table 3), the cost-effectiveness of COMA++ is acceptable for small-scale schemas, while that of BTreeMatch is not. However, for large-scale schemas, the cost-effectiveness increases for BTreeMatch and decreases for COMA++.

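Table 3. Summary of results

| Evaluated System | OV (small-scale) | OV (large-scale) | T (small-scale) | T (large-scale) | CE (small-scale) | CE (large-scale) |
|------------------|------------------|------------------|-----------------|-----------------|------------------|------------------|
| COMA++           | 0.8              | 0.4              | 0.9 s           | 4 s             | 0.84             | 0.64             |
| BTreeMatch       | 0.3              | 0.8              | 0.6 s           | 2 s             | 0.5              | 1.28             |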

We study the relationship between cost-effectiveness and both performance aspects (overall and response time). Figure 2 illustrates this relationship, where the squared line represents the overall only, the dashed line represents the response time, and the solid line represents both. Figure 2(a) is drawn for the small-scale case (i.e. $w_{OV} = 0.8$ and $w_t = 0.2$), while Fig. 2(b) is drawn for the large-scale schemas ($w_{OV} = 0.5$ and $w_t = 0.5$). In the case of small-scale schemas, the cost-effectiveness is biased more toward the overall measure than toward the response time of the system, while in the case of large-scale schemas, the cost-effectiveness is influenced by both performance aspects.

[Figure: two 3-D plots of cost-effectiveness (vertical axis, from 0 to 3) over response time and overall (each from 0 to 1), with curves for overall, response time, and both: (a) small-scale schemas, (b) large-scale schemas.]

Fig. 2. Performance Aspects with Cost-Effectiveness

6 Summary and Future Work

In this paper, we presented an approach in which both effectiveness and efficiency are taken into account. We introduced the trade-off between aspects of schema matching performance as a multi-objective optimization problem. We then made use of the weighted-sum approach as an a priori articulation of preference information. The cost-effectiveness is taken as a measure of the overall performance. This measure combines the two performance aspects, as a weighted sum, in a single formula. The formula contains the overall measure as an indicator for effectiveness and the normalized response time as an indicator for efficiency. To enrich the flexibility of the method, each quantity is associated with a numerical weight to indicate its importance in the evaluation of the overall performance. In this paper, we set the numerical weights manually, based on our experience.

We applied our proposed method to fairly evaluate and compare two well-known schema matching systems (COMA++ and BTreeMatch). We have discussed the effect of schema size on match performance. For small-scale schemas, match performance is affected more by match effectiveness, while for large-scale schemas the two performance aspects have equal effect. Our proposed approach can be integrated with recent schema matching benchmarks. Moreover, our ongoing work is to build a unified evaluation process for deciding on schema matching performance. Studying the impact of the numerical values of the weights and identifying their optimal values automatically is part of our future work.

References

1. Bernstein, P.A., Melnik, S., Churchill, J.E.: Incremental schema matching. In: VLDB, Korea (2006)

2. Do, H.H., Melnik, S., Rahm, E.: Comparison of schema matching evaluations. In: the 2nd Int. Workshop on Web Databases (2002)

3. Do, H.H., Rahm, E.: COMA - a system for flexible combination of schema matching approaches. In: VLDB, pp. 610–621 (2002)

4. Do, H.-H., Rahm, E.: Matching large schemas: Approaches and evaluation. Information Systems 32(6), 857–885 (2007)

5. Doan, A., Domingos, P., Halevy, A.: Reconciling schemas of disparate data sources: A machine-learning approach. SIGMOD, 509–520 (2001)

6. Doan, A., Halevy, A.: Semantic integration research in the database community: A brief survey. AAAI AI Magazine 25(1), 83–94 (2005)

7. Drumm, C., Schmitt, M., Do, H.-H., Rahm, E.: Quickmig - automatic schema matching for data migration projects. In: Proc. ACM CIKM 2007, Portugal (2007)

8. Duchateau, F., Bellahsene, Z., Hunt, E.: Xbenchmatch: a benchmark for XML schema matching tools. In: VLDB 2007, Austria, pp. 1318–1321 (2007)

9. Duchateau, F., Bellahsene, Z., Roche, M.: An indexing structure for automatic schema matching. In: SMDB Workshop, Turkey (2007)

10. Madhavan, J., Bernstein, P.A., Rahm, E.: Generic schema matching with cupid. In: VLDB, Italy, pp. 49–58 (2001)

11. Marler, R., Arora, J.: Survey of multi-objective optimization methods for engineering. Struct. Multidisc Optim. 26, 369–395 (2004)

12. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In: ICDE 2002 (2002)

13. Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB Journal 10(4), 334–350 (2001)

14. Rijsbergen, C.J.: Information Retrieval, 2nd edn., London (1979)

15. Smiljanic, M.: XML Schema Matching Balancing Efficiency and Effectiveness by means of Clustering. PhD thesis, Twente University (2006)

16. Yatskevich, M.: Preliminary evaluation of schema matching systems. Technical Report #DIT-03-028, University of Trento (2003)

17. Zhang, Z., Che, H., Shi, P., Sun, Y., Gu, J.: Formulation schema matching problem for combinatorial optimization problem. IBIS 1(1), 33–60 (2006)

18. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: A comparative case study and the strength pareto approach. IEEE Tran. on EC 3, 257–271 (1999)