
Evaluating the Metadata Quality of the IPL

Authors

Shanshan Ma, Drexel University, College of Information Science and Technology, 3141 Chestnut Street, Philadelphia, PA 19104. Email: [email protected]

Caimei Lu, Drexel University, College of Information Science and Technology, 3141 Chestnut Street, Philadelphia, PA 19104. Email: [email protected]

Xia Lin, Drexel University, College of Information Science and Technology, 3141 Chestnut Street, Philadelphia, PA 19104. Email: [email protected]

Mike Galloway, Drexel University, College of Information Science and Technology, 3141 Chestnut Street, Philadelphia, PA 19104. Email: [email protected]

Metadata quality is directly related to the quality of services provided by digital libraries. This paper presents the results of evaluating the metadata quality of the IPL (www.ipl.org). Two evaluation methods were used: a preliminary automatic evaluation and a human evaluation using a survey. The automatic evaluation focused on the completeness of the major IPL metadata fields. The human evaluation asked evaluators to judge the accuracy, completeness, consistency, functionality, and usefulness of the IPL metadata fields. Qualitative feedback from the evaluators gave us an in-depth picture of the IPL metadata quality. We also compared the results of the automatic and human evaluations.

Introduction

Page 2: Evaluating the metadata quality of the IPL

The Internet Public Library website (www.ipl.org) holds a collection of large amount of

authoritative websites on various subjects. The IPL primarily has three collections: the general

collection containing about 10,000 links; the youth collection (KidSpace) with about 3500 links;

and the teen collection (TeenSpace) with about 2000 links. Each of these links point to a website

that contains valuable content to the IPL collection. Each website is described by a metadata

record stored by an open source software known as Hypatia. The IPL metadata records are

created by non-professionals, mainly library science majored college students who acted as

volunteers.

We investigated the metadata records currently in the collection. The metadata evaluation

project was carried out in two stages: a preliminary automatic evaluation without human

evaluators and a human evaluation using a survey. We are interested in whether there were any

differences between the results from these two methods. Advantages and disadvantages of using

an automatic approach and a human involved approach for metadata evaluation will be

discussed in the paper.

Literature Review

Why Metadata Quality?

Metadata quality is closely related to the quality of services provided by digital libraries. Low-quality metadata can impede users' resource discovery. For instance, incomplete or inaccurate data entry in metadata records may prevent the described digital objects from ever being found by users. Metadata quality is also essential for the interoperability and aggregation of distributed repositories. For example, in projects built on OAI, it is important that data providers submit standard, high-quality metadata; otherwise, users would not be able to search across different metadata repositories and find all the relevant resources distributed among them. Given the significance of metadata quality, digital libraries need to evaluate their metadata collections frequently; evaluation results can help them identify defects in those collections.

When evaluating metadata quality, the first question one may ask is what high-quality metadata is, or what properties high-quality metadata should possess. Researchers have answered this question from different perspectives (Cole, 2002; Lei et al., 2006; Guy et al., 2004). In general, high-quality metadata should facilitate the process of identifying, describing, managing, and searching data.

Metadata Evaluation Criteria and Methods

With different evaluation goals and different types of data objects and collections, researchers apply different criteria and principles in the evaluation process. Bruce and Hillmann (2004) proposed seven criteria for evaluating metadata quality: completeness, accuracy, provenance, conformance to expectations, logical consistency and coherence, timeliness, and accessibility. Moen and Stewart (1998) evaluated GILS metadata records according to four criteria: completeness, profile, accuracy, and serviceability. Shreeves et al. (2005) evaluated the quality of Dublin Core metadata records of collections harvested through the Open Archives Initiative (OAI) protocol; their evaluation focused on four dimensions: completeness, structural consistency, semantic consistency, and ambiguity. Zeng et al. (2004) applied four criteria to evaluate NSDL's metadata repository: completeness, correctness, consistency, and duplication. Tolosana-Calasanz et al. (2006) developed a quantitative method for assessing the quality of geographic metadata; they first formulated a list of quality criteria by consulting domain experts, and the identified criteria indicated two tendencies, structural and semantic. Rather than evaluating metadata quality directly based on the structure and content of metadata records, some researchers have tried to evaluate metadata with regard to its function and use. Sokvitne (2000) evaluated the effectiveness of Dublin Core metadata embedded in web pages in facilitating the retrieval of web resources.

In the research above, the evaluated metadata was created by cataloging or metadata professionals. Some researchers have been interested in assessing the quality of metadata generated by authors or contributors. Greenberg et al. (2001) conducted a study to evaluate author-generated Dublin Core metadata. Two cataloging professionals were asked to determine whether the quality of each metadata element was acceptable; they also assessed the intelligibility and general correctness of the metadata, and the subject keywords were evaluated in terms of specificity and exhaustiveness. The results indicated that authors can create good-quality Dublin Core metadata. Wilson (2007) evaluated contributor-supplied metadata at three levels: syntax, structure, and semantic content. At the syntax level, the overall completeness of the metadata record submitted by the contributor was evaluated; at the structure level, the consistency and correctness of the format of each metadata element was assessed; and at the semantic content level, the correctness of each element's content was checked.

While most existing metadata quality evaluation is based on human review, some researchers have explored how to evaluate metadata quality automatically, without human judgment. Hughes (2004) and Bui and Park (2006) collected statistical information at the metadata repository level to get an overall estimate of the quality of the metadata records in a repository. Dushay and Hillmann (2003) applied a visualization tool called Spotfire to facilitate automatic metadata evaluation; the tool presents a graphical overview of the structure of a collection's metadata.

Although statistical methods and visualization tools may be efficient during the evaluation process, they can only assess the structural aspects of metadata quality, such as completeness and consistency. Ochoa and Duval (2006) proposed an approach to automatic evaluation based on the framework of Bruce and Hillmann (2004). For each evaluation criterion, they developed quality metrics that can be calculated mathematically. For instance, the completeness of a metadata record is measured as the number of filled fields divided by the total number of fields in the record. Even quality criteria at the semantic level can be measured quantitatively. For example, the semantic accuracy of fields such as title, abstract, and keywords is measured through the vector cosine distance between the content of these fields and the content of the original resource.
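
To make these metrics concrete, the following minimal Python sketch computes the two examples above: record completeness and a bag-of-words cosine similarity between a metadata field and the resource text. It is our illustration of the idea, not Ochoa and Duval's implementation; the sample record and the simple whitespace tokenization are assumptions made for brevity.

import math
from collections import Counter

def completeness(record):
    # Completeness metric: number of filled (non-empty) fields
    # divided by the total number of fields in the record.
    filled = sum(1 for value in record.values() if value)
    return filled / len(record)

def cosine_similarity(text_a, text_b):
    # Bag-of-words cosine similarity, a stand-in for the semantic
    # accuracy metric (field content vs. original resource content).
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical record: two of three fields filled -> completeness of about 0.67
record = {"title": "Internet Public Library",
          "abstract": "A collection of authoritative web resources.",
          "keywords": ""}
print(completeness(record))
print(cosine_similarity(record["abstract"],
                        "The IPL holds a collection of authoritative web sites."))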

In our research, we decided to use both automatic evaluation and human review to gain a thorough understanding of the metadata quality in the IPL. We focus on the accuracy, completeness, and consistency of the IPL metadata; these criteria can be measured both quantitatively and qualitatively. We also want to look into the functional roles of metadata, so evaluations of searchability, browsability, and management were also included in our survey.

Method

The IPL Metadata Schema

The metadata records in the IPL were developed by student volunteers. Before they start working on any IPL metadata records, student volunteers are required to go through an IPL collection development manual step by step. They first familiarize themselves with the IPL's collection policy and practice with assignments, and they are then evaluated in order to be considered qualified volunteers. When a link to a new or existing external resource is provided, a volunteer first evaluates the quality of the resource and decides whether to add or keep it in the collection, depending on whether it meets the IPL's collection policy standards. Once the decision is made, the volunteer creates and edits the metadata record for the resource, following the IPL metadata development guidelines.

Collection development in the IPL is conducted through a web interface to the Hypatia database. To give a full picture of the IPL metadata structure, the IPL metadata schema is mapped to the simple Dublin Core metadata standard in Table 1. The IPL metadata schema contains nine elements from Dublin Core; it also contains some elements that are not present in Dublin Core, such as Youth Level, Body, and Comments.

Table 1. Dublin Core and the IPL metadata elements


Evaluation Procedure

The metadata evaluation project was carried out in two stages: a preliminary automatic evaluation without human evaluators and a human evaluation using a survey. The automatic evaluation stage mainly aimed at finding out how complete the metadata fields are. For each metadata element, a SQL query was run against the Hypatia database to count how many records have a value in that field. A sample query is shown here.

SELECT COUNT(ic.item)
FROM Item_coll ic
JOIN Textfield tf ON tf.entry = ic.item
JOIN Value_d vd ON vd.idx = tf.type
WHERE ic.coll = 1 AND vd.attr = 2 AND vd.idx = 1 AND vd.variant = 0
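
To obtain the completeness percentages reported below, a count of this kind is divided by the total number of records in the collection, once per field. The short Python sketch below shows the shape of that computation; the flat table and field names are hypothetical simplifications of the actual Hypatia schema, which requires joins like the sample query above.

import sqlite3

conn = sqlite3.connect("hypatia.db")  # hypothetical local copy of the database

def completeness_percentage(field, collection_id):
    # Total number of records in the collection.
    total = conn.execute("SELECT COUNT(*) FROM records WHERE coll = ?",
                         (collection_id,)).fetchone()[0]
    # Records with a non-empty value in the given field.
    filled = conn.execute(
        "SELECT COUNT(*) FROM records WHERE coll = ? "
        "AND %s IS NOT NULL AND %s != ''" % (field, field),
        (collection_id,)).fetchone()[0]
    return 100.0 * filled / total

# Hypothetical field names; collection 1 stands in for the general collection.
for field in ("main_title", "main_url", "abstract", "author", "keywords"):
    print(field, completeness_percentage(field, 1))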

In the second evaluation stage, human evaluators' assessments of the metadata quality were collected. An online evaluation form was developed using Google Docs. The questionnaire contained 21 questions, designed to evaluate the metadata quality from four aspects: accuracy, completeness, consistency, and functionality & usefulness. The accuracy section included six questions asking how accurately each of the IPL metadata fields describes the target web resource; the completeness section included eight questions asking how complete the IPL metadata fields are; the consistency section included four questions asking how well the metadata record adheres to the current IPL metadata development guidelines; and the functionality & usefulness section included three questions asking how functional and useful the overall metadata record is.

We used a series of computer-generated random numbers to draw a random sample of 467 records from the IPL collection. Eighty-three college students who were taking digital reference and digital library courses were recruited as human evaluators. They evaluated the records using the evaluation form and submitted their evaluations online. The evaluation task was given as part of their course assignments.
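
The sampling step itself amounts to drawing record identifiers without replacement, along the lines of the Python sketch below; the identifier range of 1 to 10,000 is a hypothetical stand-in for the actual record keys.

import random

record_ids = range(1, 10001)             # hypothetical record identifiers
sample = random.sample(record_ids, 467)  # 467 records drawn without replacement
print(sorted(sample)[:10])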

Quantitative Results

Automatic Evaluation Results

A completeness percentage was calculated for each of the IPL metadata fields. The three largest collections in the IPL repository, the general collection, the youth collection (KidSpace), and the teen collection (TeenSpace), were investigated. Since the results for the three collections were similar, we report only the completeness percentages for the general collection, in Figure 1. The results show that while Main Title, Main URL, and Abstract were 100% complete, many other fields had very low completion rates.

Figure 1. Automatic evaluation results for the completeness of the IPL metadata

Human Evaluation Results: Accuracy of tested fields

Student evaluators judged the accuracy of the keywords, subject headings, and abstract by giving a rating from 0 to 5, where 0 means the value in the field was not at all accurate and 5 means it was very accurate. The mean accuracy rating for keywords was 2.06 (standard deviation 2.01); for the abstract, 3.91 (standard deviation 1.25); and for subject headings, 4.4 (standard deviation 1.11).

We calculated the percentage of ratings that were above three (from somewhat accurate to very accurate). While 44% of the keyword fields were rated above three, 91% of the subject headings fields and 83% of the abstract fields were rated above three. Figure 2 shows the percentage of ratings above three for the three tested fields.

Figure 2. Ratings of the accuracy of the keyword, subject headings, and abstract fields

Evaluators were also asked to test and judge the accuracy of the URL field. The results showed that 93% of the main URL links were working, and 88% of the URLs opened the page that the metadata record described.

Human Evaluation Results: Completeness of tested fields

Student evaluators judged the completeness of the keywords, author, and title by giving a rating from 1 to 5, where 1 means the value in the field was not at all complete and 5 means it was very complete. The mean completeness rating for keywords was the lowest at 2.52 (standard deviation 1.50); for the author field, 3.79 (standard deviation 1.51); and for the title, 4.32 (standard deviation 0.98).

Similarly, we calculated the percentage of ratings that were above three (from somewhat complete to very complete). While 48% of the keywords fields were rated above three, 74% of the author fields and 93% of the title fields were rated above three, as shown in Figure 3.

Figure 3. Ratings of the completeness of the keyword, author, and title fields

A question about the presence of an alternative URL revealed that 91% of the records did not have one. Questions about the presence of an email address revealed that 78% of the records had an email, but only 38% of the provided emails matched what was given on the web page. The sources of the emails fell into four categories: author of the page (29%), contact email (27%), webmaster (12%), and undeterminable (31%).

Human Evaluation Results: Consistency of tested fields

The consistency section of the questionnaire focused mainly on the author field and the keywords field, aiming to evaluate their adherence to the IPL metadata development guidelines. Evaluators judged the consistency of the fields by answering "yes" or "no" to a consistency statement.

For the author field, good consistency was described as "the field is filled with a person's name and the name is given in Last Name, First Name format". The results showed that only 12% of the records were considered consistent, 28% were considered not consistent, and the rest were undecided. For the keywords field, good consistency was described as "the field includes at least 2 keywords or phrases not found in the Main Title, Abstract and Subject Heading fields". Only 27% of the records were considered consistent, with 72% considered not consistent.

Human Evaluation Results: Overall evaluation


We asked the evaluators to judge the overall accuracy, completeness, and consistency of the metadata records. As in the previous rating scales, a rating of 1 means that the metadata record was not at all accurate, complete, or consistent, and a rating of 5 means that it was very accurate, complete, and consistent. Figure 4 shows the ratings of overall consistency, completeness, and accuracy. The mean overall consistency rating was 3.46 (standard deviation 0.99), the mean overall completeness rating was 3.58 (standard deviation 1.07), and the mean overall accuracy rating was 3.90 (standard deviation 1.02).

We also calculated the percentages of overall consistency, completeness, and accuracy ratings that were above three: 86% of the metadata records were considered at least somewhat consistent overall, 86% at least somewhat complete overall, and 92% at least somewhat accurate overall.

Figure 4. Ratings of consistency, completeness, and accuracy

We also asked evaluators to judge the searchability, browsability, and management of the metadata records, giving a rating from 1 to 5 in three separate questions. The mean searchability rating was 3.52 (standard deviation 1.20), the mean browsability rating was 3.77 (standard deviation 1.10), and the mean management rating was 3.70 (standard deviation 1.09). We then calculated the percentages of ratings that were above three: 87% of the records were rated above three for management, 88% for browsability, and 81% for searchability.

Figure 5. Ratings of management, browsability, and searchability

Qualitative Results

We gathered qualitative evaluations for three fields: abstract, subject headings, and keywords. The evaluators gave detailed descriptions of the rationales behind their ratings. This qualitative feedback helped us understand the quantitative results.

Abstract

The automatic evaluation found that the presence of the abstract is close to 100%. However, in the human evaluation, the mean accuracy rating for the abstract was 3.91 (a perfectly accurate rating is 5). The reasons the abstract field was considered not so accurate fall into the following aspects.

First of all, a large number of abstracts no longer describe the targeted web resources. The web sites have been continually updated, and the metadata has not kept pace. One evaluator sarcastically noted that the abstract omitted some of the web site's content because "maybe [this content] didn't exist 11 years ago when the metadata was first created". The IPL holds a large collection, and metadata development was done manually; some metadata records were created when the resources were first added to the database and were never updated again.


Second, the abstracts were supposed to be summaries of the web site's content rather than quotations of its text. Evaluators pointed out that a large number of abstracts were simply copied and pasted from the web site's "about" page. The reason might be that summarizing the content of a web site with a large volume of data was difficult for student volunteers; with no quality control, volunteers might simply choose a copy-and-paste approach.

Third, the wording of the abstracts seemed to be an important factor affecting evaluators' ratings. They described abstracts as "vague", "not comprehensive", "too technical", "not so descriptive", or sometimes "too casual". Some abstracts were too short to cover all the important resources on the web site; others were so long that they exceeded the length limit. The tone of the abstract was also a concern: evaluators were not sure whether an abstract should sound like it comes from the web site owner or from a neutral observer. Such confusion may stem from the IPL metadata development guidelines themselves, which contain no coherent standard of quality; different evaluators might have completely different opinions on whether an abstract was good enough.

The fourth reason was that the abstracts contained incorrect or wrong information. The last reason was the presence of typos, misspellings, and grammar mistakes in the body of the abstracts.

Subject Headings

Student volunteers decided which subject headings to use by choosing from a drop-down list of predefined categories. The evaluators commented that the subject heading structure sometimes did not contain a category specific enough to pinpoint the resource; some subject headings were close but not quite right.

Second, the evaluators found that some records needed more detailed sub-subject headings or simply more subject headings. Sometimes there was just one subject heading where a few more could have been added. More subject headings can give patrons new angles from which to access the information.

Keywords

The completeness percentage for keywords from the automatic evaluation was below 60%. The human evaluation of keywords was consistent with the automatic evaluation: 44% of the keyword fields were rated above three for accuracy, and 48% were rated above three for completeness. The mean accuracy rating for keywords was 2.06, and the mean completeness rating was 2.52.

Qualitative feedback revealed that the major reason for the low ratings was the absence of keywords altogether: almost half of the reviewed records did not contain any keywords. Of the records that had keywords, some did not have enough of them.

Second, some keywords were words already present in the abstract, title, or subject headings. Such keywords did not add any new value to the metadata, which was against the IPL development guidelines.

Third, the evaluators were not satisfied with the quality of the keywords: they were either too broad or too narrow, and there was no coherent standard for how general or how specific keywords should be.

Fourth, some keywords were unlikely to be the ones patrons would use for searching. The evaluators believed that keyword development should keep users in mind and produce lists of keywords that would be useful for later searching.

The fifth reason came from the currency of the information: the information in the target resources was updated, but the keywords were not updated accordingly. The last reason came from misspellings, capitalization issues, and typos. Some evaluators also questioned whether the plural and singular forms of the same keyword should both be present in the keyword field.

Overall Consistency Problems

The currency problem seems to be the most common reason for the poor quality of the metadata records. It often happened that the web site the metadata described was not the one the URL pointed to: either the web site had changed its address, or the web site had been updated, or sometimes the link simply did not work.

Second, the metadata developers were confused about the IPL metadata schema, which resulted in inconsistencies in the metadata records. For example: "The record treats the site's editor as the site's author"; "The creator listed is for the parent site, but the author linked at the bottom of the game theory page for that content is an entirely different person, parts of whose cybernetic dictionary have been represented with his permission"; "Full contact info for the real author has been provided, and the creator should actually be the publisher or editor as he is part of an editorial board that maintains the parent site"; "The current URL was placed in the 'Former URL section' and the former URL was placed in the 'Main URL section'".

The third reason was that some of the resources described no longer met the IPL's collection inclusion criteria. In some cases, the web site had begun requiring payment for access to the information, or the site was no longer active, or it had become too commercial and contained too many ads. For example, one evaluator pointed out that the overall information on a site was valuable, but that there was an area for jokes that might not be appropriate.

Fourth, adherence to the IPL development guidelines seems to be a common problem. Violations of the guidelines were observed quite often in the metadata records; for instance, evaluators mentioned many times that names in the author field were not in Last Name, First Name format.

Last but not least, incorrect information appeared in the metadata fields, stemming from human mistakes or typos.

Discussion and Conclusion

In order to assess the current quality of the IPL metadata, we planned and conducted a two-stage evaluation. In the first stage, the automatic evaluation gave us an overview of the completeness of the metadata records in the collection. We found that the Main Title, Main URL, and Abstract fields seemed to be fine, while the Author and Keywords fields were far from satisfactory, as their completeness percentages were low. We then used a survey to collect human evaluators' feedback on the quality of the metadata, focusing on accuracy, completeness, consistency, and overall functionality. Human evaluators were asked not only to rate the quality of the tested fields but also to give qualitative comments on why they gave those ratings. The results from the human evaluation were much more informative than those from the automatic evaluation.

The evaluation revealed the following findings about the IPL metadata quality. The Main Title did not have any quality problems. However, the title field contained a list of sub-fields: former title, sort title, acronym, alternate title, alternate spelling, real title, and authority title. Some of these sub-fields were required and some optional, which made it a complicated field for student volunteers; as the results show, only 60% of the title fields were considered very complete. The Main URL was 100% complete according to the automatic evaluation, but for about 12% of the records, the page opened by the URL was not the page described in the record. The abstract field was close to 100% complete according to the automatic evaluation, but only 45% were considered accurate by the evaluators. The automatic evaluation suggested that the author and keywords fields might have the most problems; indeed, the human evaluators found that only half of the author fields were very complete, and for the keywords field, less than one fifth of the records were considered very complete or very accurate.

From the evaluators' qualitative feedback, we identified five major causes of the problems with the IPL metadata quality: a large amount of the metadata was outdated; student volunteers did not fully understand the IPL metadata schema and had trouble filling out some of the fields; some resources included in the IPL should actually be removed from the collection; the volunteers' understanding of the IPL metadata development guidelines was not as good as it should be; and there were problems with typos, misspellings, and grammar mistakes.

We believe that some changes can be made to the IPL metadata development process. First of all, student volunteers should receive more thorough training on both the IPL schema itself and the development guidelines, and there should be more quality control in the IPL metadata development process. Second, manual metadata development may seem an easy way to start, but it becomes more difficult as the IPL keeps growing in both collection size and personnel; some automatic or semi-automatic metadata creation method is needed for the IPL.

Acknowledgements

We would like to thank all the researchers who have been working on the IPL and IPL-related projects. We would also like to thank all the students who participated in the metadata evaluation project and provided their valuable feedback.

References

Bruce, T. R. & Hillmann, D. (2004). The continuum of metadata quality: Defining, expressing, exploiting. Chicago: American Library Association.

Bui, A. & Park, J. (2006). An assessment of metadata quality: A case study of the National Science Digital Library metadata repository. 2006 Annual Conference of the Canadian Association for Information Science.

Cole, T. W. (2002). Creating a framework of guidance for building good digital collections. First Monday, vol. 7.

Dushay, N. & Hillmann, D. (2003). Analyzing metadata for effective use and re-use. DCMI Metadata Conference and Workshop, Seattle, WA, USA.

Greenberg, J., Pattuelli, M. C., Parsia, B. & Robertson, W. D. (2001). Author-generated Dublin Core metadata for web resources: A baseline study in an organization. Journal of Digital Information, vol. 2, pp. 1-10.

Guy, M., Powell, A. & Day, M. (2004). Improving the quality of metadata in eprint archives. Ariadne, vol. 38.

Hughes, B. (2004). Metadata quality evaluation: Experience from the Open Language Archives Community. Digital Libraries: International Collaboration and Cross-Fertilization, vol. 3334, pp. 320-329.

Lei, Y. G., Sabou, M., Lopez, V., Zhu, J. H., Uren, V. & Motta, E. (2006). An infrastructure for acquiring high quality semantic metadata. The Semantic Web: Research and Applications, vol. 4011, pp. 230-244.

Moen, W. E. & Stewart, E. L. (1998). Assessing metadata quality: Findings and methodological considerations from an evaluation of the U.S. Government Information Locator Service (GILS). IEEE International Forum on Research and Technology Advances in Digital Libraries, Santa Barbara, CA, USA, pp. 246-255.

Ochoa, X. & Duval, E. (2006). Towards automatic evaluation of learning object metadata quality. Advances in Conceptual Modeling - Theory and Practice, pp. 372-381.

Shreeves, S. L., Knutson, E. M., Stvilia, B., Palmer, C. L., Twidale, M. B. & Cole, T. W. (2005). Is "quality" metadata "shareable" metadata? The implications of local metadata practices for federated collections. Proceedings of the Twelfth National Conference of the Association of College and Research Libraries, Chicago, pp. 223-237.

Sokvitne, L. (2000). An evaluation of the effectiveness of current Dublin Core metadata for retrieval. ALA Biennial Conference and Exhibition.

Stvilia, B., Gasser, L., Twidale, M. B. & Smith, L. C. (2007). A framework for information quality assessment. Journal of the American Society for Information Science and Technology, vol. 58, pp. 1720-1733.

Tolosana-Calasanz, R., Alvarez-Robles, J. A., Lacasta, J., Nogueras-Iso, J., Muro-Medrano, P. R. & Zarazaga-Soria, F. J. (2006). On the problem of identifying the quality of geographic metadata. Research and Advanced Technology for Digital Libraries, vol. 4172, pp. 232-243.

Wilson, A. J. (2007). Toward releasing the metadata bottleneck: A baseline evaluation of contributor-supplied metadata. Library Resources & Technical Services, vol. 51, pp. 16-28.

Zeng, M. L., Subrahmanyam, L. & Shreve, G. M. (2004). Metadata quality study for the National Science Digital Library (NSDL) metadata repository. Digital Libraries: International Collaboration and Cross-Fertilization, vol. 3334, pp. 339-340.