
2011 Evergreen International Conference presentation on our MARC de-duplication project.


10% Wrong, 90% Done

Rogan Hamby, South Carolina State Library, rhamby@statelibrary.sc.gov

Shasta Brewer, York County Library, Shasta.brewer@yclibrary.net

A practical approach to bibliographic de-duplication.

Made Up Words

When I say ‘deduping’ I mean ‘MARC record de-duplication.’

The Melting Pot

We were ten library systems with no standard source of MARC records.

We came from five ILSes.

Each had its own needs and workflow.

The MARC records reflected that.

Over 2,000,000 Records

Ten library systems joined in three waves.

[Chart: cumulative bib record counts added by each of the three waves, scale 0 to 2,500,000]

Early Effort

During each wave we ran a deduping script.

The script functioned as designed; however, it produced too few matches for our needs.

100% Accurate

It had a very high standard for creating matches.

No bad merges were created.

Service Issue

When a patron searched the catalog, the results were messy.

This caused problems with searching and placing holds.

It’s All About the TCNs

Why was this happening?

Because identical items were split across multiple similar bib records that had distinct fingerprints, since the records came from multiple sources.

Time for the Cleaning Gloves

In March 2009 we began discussing the issue with ESI. The low merging rate was due to the very precise and conservative fingerprinting of the deduping process. In true open source spirit we decided to roll our own solution and start cleaning up the database.

Finger Printing

Fingerprinting is identifying a unique MARC record by its properties.

Because fingerprinting identifies unique records, it was of limited use to us, since our records came from many sources.
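To illustrate why strict fingerprinting yields so few merges, here is a minimal sketch of a grouping query, assuming Evergreen's stock biblio.record_entry table and its fingerprint column; treat the exact schema as an assumption, not the project's actual code.

    -- Records only group when their fingerprints are identical, so near-duplicate
    -- records from different sources rarely collapse into one group.
    SELECT fingerprint, COUNT(*) AS candidate_count
      FROM biblio.record_entry
     WHERE NOT deleted
     GROUP BY fingerprint
    HAVING COUNT(*) > 1;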

A Disclaimer

The initial deduping, as designed, was very accurate. It emphasized avoiding imprecise matches.

We decided that we had different priorities and were willing to make compromises.

MARC Crimes Unit

We decided to go past fingerprinting and build profiles based on broad MARC attributes.

Project Goals

Improve Searching

Faster Holds Filling

The Team

Shasta Brewer – York County

Lynn Floyd – Anderson County

Rogan Hamby – Florence County / State Library

The Mess

2,048,936 bib records

On Changes

During the development process a lot changed from early discussion to implementation.

We weighed decisions heavily on the side of needing to have a significant and practical impact on the catalog.

I watch the ripples change their size / But never leave the stream - David Bowie, Changes

Modeling the Data

Choosing match points determines the scope of the record set you create merges from.

Due to the lack of uniformity in our records, matching became extremely important. Adding a single extra limiting match point caused a large percentage drop in possible matches, reducing the effectiveness of the project.
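As a sketch of the kind of modeling report we mean (the dedupe_staging table and its normalized columns are hypothetical names for illustration), comparing how many merge groups survive when one more match point is required:

    -- Candidate merge groups matching on normalized title + ISBN only.
    SELECT COUNT(*) AS groups_title_isbn FROM (
        SELECT 1 FROM dedupe_staging
         GROUP BY norm_title, norm_isbn
        HAVING COUNT(*) > 1
    ) g;

    -- The same, but also requiring the publisher (260$b) to agree; each added
    -- match point sharply cuts the number of possible merges.
    SELECT COUNT(*) AS groups_title_isbn_pub FROM (
        SELECT 1 FROM dedupe_staging
         GROUP BY norm_title, norm_isbn, norm_publisher
        HAVING COUNT(*) > 1
    ) g;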

Tilting at Windmills

We refused to believe that the highest priority for deduping should be avoiding bad matches. The highest priority is creating the maximum positive impact on the catalog.

Many said we were a bit mad. Fortunately, we took it as a compliment.

We ran extensive reports to model the bib data.

A risky and non-conventional model was proposed.

Although we kept trying other models, the benefit of the large number of matches from the risky model made it too compelling to discard.

Why not just title and ISBN?

We did socialize this idea. And everyone did think we were nuts.

Method to the Madness

Title and ISBN are the most commonly populated fields for identifying unique items.

Records with ISBNs and titles accounted for over 60% of the bib records in the system. The remainder included SUDOCs, ISSNs, pre-ISBN items and some that were just plain garbage.
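A minimal sketch of how that share can be checked, again against a hypothetical dedupe_staging copy of the bib data rather than the project's actual code:

    -- Percentage of bib records carrying both a title and at least one ISBN.
    SELECT ROUND(100.0 * SUM(CASE WHEN norm_title IS NOT NULL
                                   AND norm_isbn  IS NOT NULL THEN 1 ELSE 0 END)
                 / COUNT(*), 1) AS pct_title_and_isbn
      FROM dedupe_staging;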

Geronimo

We decided to do it!

What Was Left Behind

Records without a valid ISBN. Records without any ISBN (serials, etc.).

Pre-cat and stub records. Pure junk records.

And other things that would require such extraordinarily convoluted matching that the risk exceeded even our pain threshold for a first run.

Based on our modeling, we conservatively estimated ~300,000 merges, or about 25% of our ISBN records.

The Wisdom of Crowds

Conventional wisdom said that MARC could not be generalized because of unique information in the records. We were taking risks and were very aware of it, but the need to create a large impact on our database drove us to disregard the friendly warnings.

An Imperfect World

We knew that we would miss things that could potentially be merged.

We knew that we would create some bad merges.

10% wrong to get it 90% done.

Next Step … Normalization

With matching decided, we needed to normalize the data. This was done to copies of the production MARC records, which were then used to make the match lists.

Normalization is needed because of variability in how data was entered. It allows us to get the most possible matches out of the data.

Normalization Details

We normalized case, punctuation, numbers, non-Roman characters, trailing and leading spaces, some GMDs entered as parts of titles, redacted fields, 10-digit ISBNs as 13-digit ISBNs, and lots, lots more.

This was not done to the permanent records but to copies used to make the lists.
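As one concrete example, here is a minimal sketch of the 10-to-13-digit ISBN normalization as a PL/pgSQL function; the function name is ours for illustration, and the real project code may differ.

    -- Convert an ISBN-10 to ISBN-13: prefix 978, drop the old check digit,
    -- and recompute the check digit with alternating weights of 1 and 3.
    CREATE OR REPLACE FUNCTION isbn10_to_isbn13(isbn10 TEXT) RETURNS TEXT AS $$
    DECLARE
        core  TEXT;
        total INT := 0;
    BEGIN
        core := '978' || substr(regexp_replace(isbn10, '[^0-9Xx]', '', 'g'), 1, 9);
        FOR i IN 1..12 LOOP
            total := total + substr(core, i, 1)::INT
                     * CASE WHEN i % 2 = 1 THEN 1 ELSE 3 END;
        END LOOP;
        RETURN core || ((10 - (total % 10)) % 10)::TEXT;
    END;
    $$ LANGUAGE plpgsql;

For example, isbn10_to_isbn13('0306406152') yields 9780306406157. As noted above, this kind of conversion was applied only to the working copies, never to the production MARC.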

Weighting

Finally, we had to weight the records that had been matched to determine which should be the record to keep.

To do this each bib record was given a score to profile its quality.

The Weighting Criteria

We looked at the presence, length, and number of entries in the 003, 02X, 24X, 300, 260$b, 100, 010, 500s, 440, 490, 830s, 7XX, 9XX and 59X fields to manipulate, add to, subtract from, bludgeon, poke and eventually determine a 24-digit number that would profile the quality of a bib record.
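A minimal sketch of the profiling idea, assuming a hypothetical staging table that already holds per-record field counts; the real algorithm weighs many more criteria and produces a 24-digit profile. Each criterion contributes a fixed-width, zero-padded piece so the profiles compare correctly as strings.

    -- Build a fixed-width quality profile per record; a higher string is "better".
    SELECT record_id,
           lpad(LEAST(num_02x,  99)::TEXT, 2, '0') ||
           lpad(LEAST(num_7xx,  99)::TEXT, 2, '0') ||
           lpad(LEAST(num_5xx,  99)::TEXT, 2, '0') ||
           lpad(LEAST(len_300, 999)::TEXT, 3, '0') ||
           CASE WHEN has_010 THEN '1' ELSE '0' END AS quality_profile
      FROM dedupe_staging;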

The Merging

Once the weighting is done, the highest scored record in each group is made the master record, the copies and holds from the other records are moved to it, and those bibs are marked deleted.
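A minimal sketch of that step, assuming a hypothetical merge_map table (dupe_record to master_record) and Evergreen's stock asset.call_number and biblio.record_entry tables; hold retargeting is left out for brevity, and the real SQL differs.

    -- Move call numbers (and the copies attached to them) to the master record.
    UPDATE asset.call_number cn
       SET record = mm.master_record
      FROM merge_map mm
     WHERE cn.record = mm.dupe_record;

    -- Mark the now-empty duplicate bib records as deleted.
    UPDATE biblio.record_entry bre
       SET deleted = TRUE
      FROM merge_map mm
     WHERE bre.id = mm.dupe_record;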

Checking the Weight

We ran a report of the items that would group based on our criteria and had staff do sample manual checks to see if they could live with the dominant record.

We collectively checked ~1,000 merges.

90% of the time we felt the highest quality record was selected as the dominant one. More than 9% of the time an acceptable record was selected.

In a very few instances human errors in the record caused the system to create a bad profile, but never an actual bad dominant record.

The Coding

We proceeded to contract with Equinox to have them develop the code and run it against our test environment (and eventually production).

Galen Charlton was our primary contact. In addition to coding the algorithm, he also provided input on additional criteria to include in the weighting and normalization.

Test Server

Once it was run on the test server, we took our new batches of records and broke them into 50,000-record chunks. We then gave those chunks to member libraries and had them do random sample checks for five days.

Fixed As We Went

Non-standard cataloging (ongoing). 13-digit ISBNs normalizing as 10-digit ISBNs. Identified many parts of item sets as issues.

Shared title publications with different formats. The order of the ISBNs.

Kits.

In Conclusion

We don’t know how many bad matches were formed.

Total discovered after a year is less than 200.

We were able to purge 326,098 bib records, or about 27% of our ISBN-based collection.

Evaluation

The catalog is visibly cleaner.

The cost per bib record was 1.5 cents.

Absolutely successful!

Future

We want to continue to refine it (e.g., the 020 subfield z).

There are still problems that need to be cleaned up in the catalog, some manually and some by automation.

Raising Standards.

New libraries that have joined SCLENDs use our deduping algorithm, not the old one.

It has continued to be successful.

Open Sourcing the Solution

We are releasing the algorithm under the Creative Commons Attribution Non-Commercial license.

We are releasing the SQL code under the GPL.

Questions?