(Big) bibliographic data @ ScaDS project meeting - 2015-06-12

(Big) Bibliographic Data

UB Leipzig & SLUB Dresden

ScaDS project meeting, 12.6.2015

Leander Seige, Felix Lohmeier, Ralf Talkenberger

“The library of the 21st century is a data hub.”

quoted from an internal strategic paper of Leipzig University Library, 2015

simple bibliographic metadata

<metadata> title, author, isbn, publisher, year, …

<resource> books, serials, newspapers, articles

<resource> book

● printed books on the library’s shelves

● bought ebooks

● licensed ebooks

● pay-per-use ebooks

● free content

● ebooks to be bought by the library (patron driven acquisition = pda)

● even printed books to be bought by the library (pda too)

<resource> journals

● printed journals on the library’s shelves

● many more licensed electronic journals

○ full text accessible via web interfaces

● do we have article metadata?

● yes: licensed journal articles: 10s of millions per library

<metadata> accessibility information

● where is a resource? (physical or on the net)

● who is allowed to access this content? (students? faculty? everyone?)

● is it available off-campus?

● did we buy it or is it just licensed?

● may the user copy or print it?

● is the library allowed to store the electronic file?

● may we grant access from wifi connections?

● ...or any combination of these...

<metadata> knowledge bases

● librarians built large knowledge bases to describe resources

● in German-speaking countries: GND (Gemeinsame Normdatei) of the German National Library (Deutsche Nationalbibliothek) http://www.dnb.de/EN/gnd

● international: http://viaf.org

● provide dbpedia links to explore the linked data cloud and to enrich library data

<metadata> knowledge bases

● GND (and other national authority files via VIAF)

○ describe Persons, Corporate bodies, Conferences and Events, Geographic Information, Topics, Works and relationships between them

○ form a generic knowledge base, independent from any specific domain

○ provide links to other knowledge bases (dbpedia, geonames...)

resource discovery

● traditional “OPACs” provided access to traditional library resources like printed books; users had to use proprietary vendor-driven portals to access electronic resources

● today, printed materials represent only a small part of library resources

● in contrast: resource discovery systems aim to integrate all resources of a library and present them in one single search interface

Cooperation

● UBL and SLUB joined forces in March 2015

● Goals:

a. Exchange of metadata after processing

b. Develop common workflows to avoid “double work”

→ integrate existing tools finc & d:swarm

finc Community

● maintains a large search engine infrastructure

● developed and hosted at Leipzig University Library

● based on Apache Solr and VuFind

● robust metadata management system, processing millions of data records each day

● integrates more than 50 data sources

https://finc.info

finc Community

● provides more than 15 university libraries with resource discovery systems

● offers great potential to design and implement user oriented functions on real world systems, serving thousands of library users in Saxony and beyond, every day

● employs the aggregated index at Leipzig University Library

https://finc.info

10% physical items

90% electronic content on the net

aggregated index at Leipzig University Library

● 12 million traditional data records (growing)

● 80 million electronic article data records (growing)

● each record contains 20 data fields

1.8 billion triples (if you triplify it)

(without any enrichment data)
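The triple estimate can be checked with a few lines of Python. This is only a sketch: it assumes that "triplifying" means emitting one (record, field, value) triple per data field, and the example record and its identifier are made up for illustration.

```python
# Figures taken from the slide; the example record below is hypothetical.
TRADITIONAL_RECORDS = 12_000_000
ARTICLE_RECORDS = 80_000_000
FIELDS_PER_RECORD = 20


def estimate_triples(records: int, fields_per_record: int) -> int:
    """Assume one triple per data field: (record, field, value)."""
    return records * fields_per_record


def triplify(record_id: str, record: dict) -> list[tuple[str, str, str]]:
    """Turn a flat record into (subject, predicate, object) triples."""
    return [(record_id, field, str(value)) for field, value in record.items()]


total = estimate_triples(TRADITIONAL_RECORDS + ARTICLE_RECORDS, FIELDS_PER_RECORD)
print(f"{total:,} triples")  # 1,840,000,000 triples

example = {"title": "An Example Book", "author": "Doe, Jane", "year": 2015}
for triple in triplify("record:1", example):
    print(triple)
```

92 million records with 20 fields each give about 1.84 billion triples, which matches the rounded figure on the slide.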

Data processing today

● distributed data storage

○ 2 Solr in Leipzig (~12 million + ~80 million records)

○ 2 Solr in Dresden (~2 million + ~2 million records)

● constraint: each data source is handled separately → difficult to build up relations and deep data integration

d:swarm

● yet another tool…?

a. property graph database

b. gui for library staff

Tools: finc and d:swarm

● focus

○ finc: data normalization

○ d:swarm: data integration and enrichment

● technology

○ finc: script-based transformations (Python, Go, Elasticsearch)

○ d:swarm: encapsulates Metafacture (open source toolchain for metadata transformation); property graph (Neo4j)

● status

○ finc: works fine with ~100 million records (less than one day)

○ d:swarm: scalability issues (~4 million records in less than one day)

integrating finc with d:swarm

● enhance data processing regarding

○ authority data linking (NLP)

○ fuzzy deduplication

○ classification

○ relate bibliographic data to places, topics, abstract terms

○ publish machine readable data (linked data)

● create user interfaces to enable system librarians to control metadata processing
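To make "fuzzy deduplication" concrete, here is a minimal sketch using only Python's standard library. It is not the method used in finc or d:swarm: the field names, the sample records and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial differences do not count."""
    kept = (ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())


def similarity(a: dict, b: dict) -> float:
    """Similarity in [0, 1] of the normalized 'title author' strings."""
    key_a = normalize(f"{a['title']} {a['author']}")
    key_b = normalize(f"{b['title']} {b['author']}")
    return SequenceMatcher(None, key_a, key_b).ratio()


def find_duplicates(records: list[dict], threshold: float = 0.9) -> list[tuple[int, int]]:
    """Return index pairs whose similarity reaches the threshold.

    Pairwise comparison is O(n^2); with millions of records a real
    pipeline would first need to narrow down the candidate pairs.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample records.
records = [
    {"title": "Linked Data for Libraries", "author": "Doe, Jane"},
    {"title": "Linked data for libraries.", "author": "Doe, Jane"},
    {"title": "Graph Databases", "author": "Roe, Richard"},
]
print(find_duplicates(records))  # [(0, 1)]
```

The first two records differ only in capitalization and punctuation, so they are reported as a duplicate pair; the third is left alone.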

Tomorrow: common workflows

● All data flows through both tools (finc + d:swarm)

● Deduplication (duplicates are easier to recognize in a graph database)

● FRBRization (aggregate different physical and formal versions of a work)

● Knowledge graph makes enrichment (authorities, altmetrics data, usage data, …) and analytics easier
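FRBRization groups records that describe different versions of the same content. The sketch below shows the idea on plain Python dictionaries rather than in a graph database; the grouping key (case-folded author and title) and the sample records are assumptions for illustration, not the matching rules of finc or d:swarm.

```python
from collections import defaultdict


def work_key(record: dict) -> tuple[str, str]:
    """Hypothetical work key: case-folded author and title."""
    return (record["author"].casefold().strip(), record["title"].casefold().strip())


def frbrize(records: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group version-level records under the work they belong to."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record)
    return dict(works)


# Hypothetical sample records: two versions of one work, plus another work.
records = [
    {"id": "r1", "author": "Doe, Jane", "title": "Linked Data", "format": "print"},
    {"id": "r2", "author": "Doe, Jane", "title": "Linked Data", "format": "ebook"},
    {"id": "r3", "author": "Roe, Richard", "title": "Graph Databases", "format": "print"},
]
for key, versions in frbrize(records).items():
    print(key, [(v["id"], v["format"]) for v in versions])
```

The print and ebook versions end up under one work entry, so a search interface could show them as a single hit with two formats.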

Scalability issues

● current implementation of property graph is too slow

● test results with 64GB RAM, SSD, 16 cores

○ 1.2 million records (flat format): 10 hours for complete workflow (ingest, transformation, export)

○ more complex formats (MARC21): up to 5x statements

● single Neo4j instance, storage and memory issues
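A back-of-the-envelope calculation shows what the test result means for the full Leipzig index. This sketch assumes that processing time scales linearly with the number of records and that all records are in the flat format; neither assumption is stated in the slides.

```python
TEST_RECORDS = 1_200_000     # flat-format records in the test run
TEST_HOURS = 10              # complete workflow: ingest, transformation, export
INDEX_RECORDS = 92_000_000   # 12 million + 80 million records in the Leipzig index


def records_per_second(records: int, hours: float) -> float:
    """Average throughput of a run."""
    return records / (hours * 3600)


def hours_needed(records: int, rate_per_second: float) -> float:
    """Time for a given number of records at a constant rate (linear scaling assumed)."""
    return records / rate_per_second / 3600


rate = records_per_second(TEST_RECORDS, TEST_HOURS)
hours = hours_needed(INDEX_RECORDS, rate)
print(f"{rate:.1f} records/s")
print(f"{hours:.0f} hours (~{hours / 24:.0f} days) for the full index")
```

At roughly 33 records per second, the full index would take about a month on this setup, before accounting for the additional statements produced by more complex formats.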

d:swarm architecture

Possible solutions?

● throw hardware at it (“mit Hardware erschlagen”)

● Another graphDB, parallelization?

○ ArangoDB: https://www.arangodb.com

○ Apache Giraph: http://giraph.apache.org

○ Blazegraph: http://blazegraph.com (Wikidata’s choice)

● Gradoop?!
