Word Occurrence Based Extraction of Work Contributors from Statements of Responsibility


Nuno Freire, The European Library

TPDL 2013, Valletta, September 2013

Overview

Statements of responsibility from library bibliographic data:

“French Canadian freely arranged by Katherine K. Davis”.

“ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by Coop Himmelblau.”

“W. Lange, A.C. Zeven and N.G. Hogenboom, editors”

“Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de Luis”

Extracting work contributors for use in a rights infrastructure: ARROW

http://arrow-net.eu

Outline

• The context
  • The ARROW rights infrastructure
  • The use of national bibliographies in ARROW
• The problem
• The approach
• Evaluation
• Conclusion and future work

The ARROW rights infrastructure

ARROW aims to support mass digitisation projects with automated ways to clear the rights of the books to be digitised.

To identify and clear the rights associated with a book, a complex process needs to be undertaken:
• Determine the work(s) contained within the book
• Identify all the other expressions of the same work(s)
• Identify the publisher(s) and contributor(s) involved
• Determine the dates of publication at work level
• Determine whether the work(s), and not the book itself, are still in commerce
• If necessary, obtain any licenses from the rights holders or collective rights organizations

What is ARROW

A rights infrastructure and system for the identification of:
• Rights status
  • In or out of copyright
  • In or out of print / commercialised or not
• Rights
  • Which rights are involved
• Right holders
  • Authors
  • Publishers
• How and where to clear the rights
• Orphan Works and their registration

Sources of Information in ARROW

ARROW makes information available from several sources:
• The European Library:
  • National bibliographies - to identify the book and to cluster it with all other books containing the same intellectual work
  • Virtual International Authority File - to better identify the authors and support the identification of in copyright works
• Books in Print database - to know if any of the books concerned are actively commercialised by any publisher
• Reproduction Rights Organisation – to see if they know or can trace the rightholders

The ARROW Workflow

The Role of Libraries
• National Libraries as Metadata Providers
  • Provide the National Bibliographies to The European Library


The Role of The European Library (TEL)
• To match library requests with national bibliographies
• To identify all other manifestations that potentially share intellectual work with a manifestation
• To create a Work record: work metadata, manifestations, contributors, etc.


The Role of Books-in-Print (BIP)
• To provide data about in print/out of print status
• To provide data about publishers
• To add new manifestation records of the work


The Role of Reproduction Rights Organisation (RRO)
• RROs as Metadata Provider
  • To provide data about authors and publishers
  • To provide data about available licenses

…


Statements of responsibility

These statements usually contain information about authorship, editors, photographers, translators, and others involved in creating the work

In printed books, the statement of responsibility is typically present on the title page
• The statement of responsibility is transcribed by the cataloguer exactly as it appears in the book (according to the Anglo-American Cataloguing Rules)

Examples of statements of responsibility

“French Canadian freely arranged by Katherine K. Davis”.

“ed. by Peter Noever ; with a forew. by Frank O. Gehry; and contrib. by Coop Himmelblau.”

“W. Lange, A.C. Zeven and N.G. Hogenboom, editors”

“by Pamela and Neal Priestland”

“Vicente Aleixandre ; estudio previo, selección y notas de Leopoldo de Luis”

The problem

National bibliographies reliably represent the first author of a work in structured form

But secondary contributors are often not represented in structured form

Secondary contributors may reside only within the statements of responsibility

The approach

The problem is approached as a Named Entity Recognition task on text that may not be grammatically correct and therefore offers little lexical evidence

Some requirements from the ARROW context
• Easily applicable to several languages
• The outcomes of the recognition task must be explainable

Design decisions
• Exploring the structured data within national bibliographies
  • By analysing the frequency of word occurrences in names of persons and in other textual data
• Using word occurrence frequency makes it possible to
  • bypass the need for building training sets
  • provide simpler explanations of the name recognition results

The process – pre-processing

Each national bibliography is pre-processed:
• Word frequency is calculated
• The frequency values are normalized, so that they are independent of the size of the national bibliography
• The pre-processing results in four dictionaries:
  • Words in titles
  • Words in persons' surnames
  • Words in the parts of persons' names other than the surname
  • Words that appear in lowercase in person names (such as “von” in German names, or “de” in Portuguese names)
• The dictionaries contain the normalized frequency associated with each word
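The pre-processing step could be sketched roughly as follows. This is a minimal illustration, not the implementation used in ARROW: the record layout (`title`, `persons`, `surname`, `forenames`) is hypothetical, and dividing each count by the number of records is only one assumed way of normalizing for the size of the bibliography.

```python
import re
from collections import Counter

# Runs of letters only (Unicode-aware), so digits and punctuation are ignored.
WORD_RE = re.compile(r"[^\W\d_]+")


def words(text):
    """Split a text field into words."""
    return WORD_RE.findall(text)


def build_dictionaries(records):
    """Build the four word-frequency dictionaries from structured records.

    Assumption: each record is a dict with a "title" string and a "persons"
    list of dicts holding "surname" and "forenames" strings.
    """
    counts = {
        "title": Counter(),
        "surname": Counter(),
        "other_name": Counter(),
        "lowercase_name": Counter(),
    }
    for record in records:
        for word in words(record.get("title", "")):
            counts["title"][word.lower()] += 1
        for person in record.get("persons", []):
            for field, key in (("surname", "surname"), ("forenames", "other_name")):
                for word in words(person.get(field, "")):
                    if word.islower():
                        # e.g. "von" or "de" written in lowercase inside a name
                        counts["lowercase_name"][word] += 1
                    else:
                        counts[key][word.lower()] += 1
    # Assumed normalization: occurrences per record in the bibliography.
    total = max(len(records), 1)
    return {
        name: {word: count / total for word, count in counter.items()}
        for name, counter in counts.items()
    }
```

Because the values are ratios rather than raw counts, dictionaries built from bibliographies of very different sizes remain comparable.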

The process – bibliographic record processing

The named entity recognition is performed for a record as follows:
• The statement of responsibility is tokenized
• The person names are recognized by comparing the tokens with the dictionaries
• The recognized names are compared against the names of the contributors present in the structured fields of the record
• If no similar name exists in the record, the contributor is added to the record in a structured data field
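The last two steps, comparing recognized names with the structured contributors and adding the missing ones, might look like the sketch below. The slides do not say how name similarity is measured, so the token-sorted `difflib` comparison, the 0.85 threshold and the `contributor_names` field are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lowercase, drop punctuation and sort the tokens, so that
    "Noever, Peter" and "Peter Noever" normalise to the same string."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def is_similar(a, b, threshold=0.85):
    """Assumed similarity test: character-level ratio on normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio() >= threshold


def add_missing_contributors(record, recognised_names):
    """Add recognised names that have no similar structured contributor.

    Assumption: record["contributor_names"] is a list of name strings taken
    from the structured fields of the bibliographic record.
    """
    existing = list(record.get("contributor_names", []))
    added = []
    for name in recognised_names:
        if not any(is_similar(name, known) for known in existing + added):
            added.append(name)
    record["contributor_names"] = existing + added
    return added


record = {"contributor_names": ["Noever, Peter"]}
new_names = add_missing_contributors(record, ["Peter Noever", "Frank O. Gehry"])
print(new_names)
```

In this example only "Frank O. Gehry" is added, since "Peter Noever" is already present in inverted form.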

The process – named entity recognition

Possible token sequences used to locate person names (in Augmented Backus–Naur Form):

non-ambiguous-surname / ( initial / non-ambiguous-first-name / non-ambiguous-surname / non-ambiguous-non-capitalized-name ) *(initial / first-name / surname / non-capitalized-name) surname

(more details on the definition of these tokens are included in the paper)
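A matcher for the token sequences above can be sketched as follows. It assumes that every token has already been labelled with a set of token classes; here the labels are assigned by hand, whereas the actual classification rules based on the dictionaries are defined in the paper. It is also assumed that a non-ambiguous token carries the corresponding general label as well (for example both `surname` and `non-ambiguous-surname`).

```python
# Classes allowed for the first token of a multi-token name.
START = {
    "initial",
    "non-ambiguous-first-name",
    "non-ambiguous-surname",
    "non-ambiguous-non-capitalized-name",
}
# Classes allowed for the tokens between the first token and the final surname.
MIDDLE = {"initial", "first-name", "surname", "non-capitalized-name"}


def match_name(labels, i):
    """Return the end index (exclusive) of the longest name starting at
    position i, or None if no name starts there."""
    best = None
    if "non-ambiguous-surname" in labels[i]:
        best = i + 1  # first alternative: a single non-ambiguous surname
    if labels[i] & START:
        j = i + 1
        while j < len(labels):
            if "surname" in labels[j]:
                best = j + 1  # sequence may end here; keep the longest match
            if not labels[j] & MIDDLE:
                break
            j += 1
    return best


def find_names(tokens, labels):
    """Scan left to right and collect all non-overlapping names."""
    names, i = [], 0
    while i < len(tokens):
        end = match_name(labels, i)
        if end is None:
            i += 1
        else:
            names.append(" ".join(tokens[i:end]))
            i = end
    return names


# Hand-labelled example (hypothetical labels, for illustration only).
first = {"first-name", "non-ambiguous-first-name"}
last = {"surname", "non-ambiguous-surname"}
tokens = ["by", "Pamela", "and", "Neal", "Priestland"]
labels = [set(), first, set(), first, last]
print(find_names(tokens, labels))
```

With these labels the sketch finds only "Neal Priestland": "Pamela" is not followed by a surname, so it is not recognized as a person name.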

Evaluation data set (size of bibliographies and evaluation samples)

National bibliography | Total records | Main language | Evaluation sample: statements of responsibility | Evaluation sample: referred persons
British Library | 13.4 million | English | 205 | 328
German National Library | 9.4 million | German | 200 | 378
National Library of the Netherlands | 3.2 million | Dutch | 200 | 335
National Library of Greece | 0.4 million | Greek | 297 | 379
Central Institute for the Union Catalogue of Italian Libraries | 12.4 million | Italian | 224 | 297
Royal Library of Belgium | 1 million | French and Dutch | 203 | 387
Total | | | 1329 | 2104

Evaluation results

Dataset | Exact match: precision | Exact match: recall | Partial match: precision | Partial match: recall
British Library | 0.981 | 0.979 | 0.991 | 0.991
German National Library | 0.975 | 0.934 | 0.992 | 0.992
National Library of the Netherlands | 0.973 | 0.875 | 0.977 | 0.979
National Library of Greece | 0.656 | 0.414 | 0.758 | 0.868
Central Institute for the Union Catalogue of Italian Libraries | 0.97 | 0.896 | 0.971 | 0.973
Royal Library of Belgium | 0.981 | 0.959 | 0.981 | 0.982
Overall | 0.948 | 0.837 | 0.958 | 0.963
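For readers unfamiliar with the two measures, the sketch below shows how precision and recall can be computed for one statement of responsibility. The exact match criterion is plain string equality; the partial match criterion used here (at least one shared word) is only an assumption, since the slides do not define it.

```python
def precision_recall(predicted, gold, matches=lambda p, g: p == g):
    """Precision and recall of predicted names against the gold names.

    Each gold name can be matched by at most one predicted name.
    """
    unmatched_gold = list(gold)
    true_positives = 0
    for p in predicted:
        for g in unmatched_gold:
            if matches(p, g):
                unmatched_gold.remove(g)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def shares_a_word(p, g):
    """Assumed partial match criterion: the names have a word in common."""
    return bool(set(p.split()) & set(g.split()))


gold = ["Peter Noever", "Frank O. Gehry"]
predicted = ["Peter", "Frank O. Gehry"]
print(precision_recall(predicted, gold))                 # exact match
print(precision_recall(predicted, gold, shares_a_word))  # partial match
```

In this example the truncated name "Peter" counts as an error under exact matching but as a hit under the assumed partial matching.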

Evaluation results analysis

The main causes of recognition errors:
• Foreign person names negatively affected recall
• Names of persons used in names of organizations negatively affected precision
• Two persons with the same surname mentioned together negatively affected recall. For example:
  • “hrsg. von Volker und Michael Kriegeskorte”
  • “by Pamela and Neal Priestland”

Conclusions

The approach performed reliably in most languages and bibliographic datasets
• Datasets of at least one million records
• Precision and recall above 0.97 with the partial match metric on all but one dataset

The results obtained on the Greek national bibliography were not satisfactory
• This dataset has distinct characteristics from the others:
  • smaller size
  • a different alphabet
  • a different language
• Further investigation of the Greek national bibliography is necessary

Future work

Evaluation of the impact of this solution on the final results of the rights clearance process of ARROW

Building the dictionaries from comprehensive sources of names of persons
• Virtual International Authority File (VIAF)
• International Standard Name Identifier (ISNI)

Further functionality:
• recognition of organization names
• recognition of the role of the recognized contributors (illustrator, editor, etc.)

Other application scenarios
• Functional Requirements for Bibliographic Records
• Resource Description and Access

Co-funded by the Community programme eContentplus

Acknowledgments

The European Library
• Marcela Strelcova, Chiara Latronico and Eva Kralt-Yap

Associazione Italiana Editori

University of Innsbruck

This work was partially supported by the ARROWplus project, with co-funding by the European Commission programme eContentplus

Thank you

Questions or comments?

Contact: Nuno Freire – [email protected]
