(Big) Bibliographic Data
UB Leipzig & SLUB Dresden
ScaDS project meeting, 12.6.2015
Leander Seige, Felix Lohmeier, Ralf Talkenberger
“The library of the 21st century is a data hub.”
quoted from an internal strategic paper of Leipzig University Library, 2015
simple bibliographic metadata
<metadata> title, author, isbn, publisher, year, …
<resource> books, serials, newspapers, articles, ...
<resource> book
● printed books on the library’s shelves
● bought ebooks
● licensed ebooks
● pay-per-use ebooks
● free content
● ebooks to be bought by the library (patron driven acquisition = pda)
● even printed books to be bought by the library (pda too)
<resource> journals
● printed journals on the library’s shelves
● many more licensed electronic journals
○ full text accessible via web interfaces
● do we have article metadata?
● yes: licensed journal articles: 10s of millions per library
<metadata> accessibility information
● where is a resource? (physical or on the net)
● who is allowed to access this content? (students? faculty? everyone?)
● is it available off-campus?
● did we buy it or is it just licensed?
● may the user copy or print it?
● is the library allowed to store the electronic file?
● may we grant access from wifi connections?
● ...or any combination of these...
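The questions above could be modeled as one small record per resource. The following is only a sketch: the class name `AccessInfo`, its field names, and the rules in `may_access` are assumptions for illustration and do not come from any library system.

```python
from dataclasses import dataclass, field


@dataclass
class AccessInfo:
    """Hypothetical accessibility metadata for a single resource."""
    location: str                       # shelf mark or URL
    allowed_groups: set = field(default_factory=lambda: {"students", "faculty"})
    off_campus: bool = False            # available outside the campus network?
    owned: bool = False                 # bought (True) or only licensed (False)
    may_copy_or_print: bool = False
    may_store_file: bool = False        # may the library keep the electronic file?
    wifi_allowed: bool = True


def may_access(info, group, on_campus, via_wifi=False):
    """Combine the individual rules into one yes/no decision."""
    if "everyone" not in info.allowed_groups and group not in info.allowed_groups:
        return False
    if not on_campus and not info.off_campus:
        return False
    if via_wifi and not info.wifi_allowed:
        return False
    return True


ebook = AccessInfo(location="https://example.org/ebook/1")  # placeholder URL
print(may_access(ebook, "students", on_campus=True))
print(may_access(ebook, "students", on_campus=False))
```

Real license terms are usually richer than a handful of booleans, which is why "any combination of these" is hard to manage at scale.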
<metadata> knowledge bases
● librarians built large knowledge bases to describe resources
● in German-speaking countries: GND (Gemeinsame Normdatei, Integrated Authority File) of the
German National Library (Deutsche Nationalbibliothek) http://www.dnb.de/EN/gnd
● international: http://viaf.org
● provide dbpedia-links to explore the linked data cloud and to enrich
library data
<metadata> knowledge bases
● GND (and other national authority files via VIAF)
○ describe Persons, Corporate bodies, Conferences and Events,
Geographic Information, Topics, Works and relationships
between them
○ form a generic knowledge base, independent from any specific
domain
○ provide links to other knowledge bases (dbpedia, geonames...)
resource discovery
● traditional “OPACs” provided access to traditional library resources like
printed books; users had to use proprietary vendor-driven portals to
access electronic resources
● today, printed materials represent only a small part of library resources
● in contrast: resource discovery systems aim to integrate all
resources of a library and present them in one single search
interface
Cooperation
● UBL and SLUB joined forces in March 2015
● Goals:
a. Exchange of metadata after processing
b. Develop common workflows to avoid “double work”
→ integrate existing tools finc & d:swarm
finc Community
● maintains a large search engine infrastructure
● developed and hosted at Leipzig University Library
● based on Apache Solr and VuFind
● robust metadata management system,
processing millions of data records each day
● integrates more than 50 data sources
https://finc.info
finc Community
● provides more than 15 university libraries with
resource discovery systems
● offers great potential to design and implement user-oriented
functions on real-world systems, serving thousands of library
users in Saxony and beyond, every day
● employs the aggregated index at Leipzig University Library
10% physical items
90% electronic content
on the net
aggregated index at Leipzig University Library
● 12 million traditional data records (growing)
● 80 million electronic article data records (growing)
● each record contains 20 data fields
1.8 billion triples (if you triplify it)
(without any enrichment data)
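The 1.8 billion figure follows from the numbers on this slide. As a rough sketch, assuming each of the 20 fields yields exactly one triple (multi-valued fields would yield more); the `triplify` helper and the sample record are hypothetical:

```python
def triplify(record_id, record):
    """Turn one flat record into (subject, predicate, object) triples."""
    return [(record_id, field_name, value) for field_name, value in record.items()]


# Figures from the slide
TRADITIONAL_RECORDS = 12_000_000
ARTICLE_RECORDS = 80_000_000
FIELDS_PER_RECORD = 20

estimated_triples = (TRADITIONAL_RECORDS + ARTICLE_RECORDS) * FIELDS_PER_RECORD
print(f"{estimated_triples:,}")  # 1,840,000,000

# Hypothetical two-field record, for illustration only
sample = triplify("record:1", {"title": "Faust", "year": 1808})
print(sample)
```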
Data processing today
● distributed data storage
○ 2× Solr in Leipzig (~12 million + ~80 million records)
○ 2× Solr in Dresden (~2 million + ~2 million records)
● constraint: each data source is handled separately
→ difficult to build up relations and deep data integration
d:swarm
● yet another tool…?
a. property graph database
b. gui for library staff
Tools: finc vs. d:swarm
● focus
○ finc: data normalization
○ d:swarm: data integration and enrichment
● technology
○ finc: script-based transformations (python, go, ElasticSearch)
○ d:swarm: encapsulates metafacture (open source toolchain for metadata transformation), Property Graph (Neo4j)
● status
○ finc: works fine with ~100 million records (less than one day)
○ d:swarm: scalability issues (~4 million records in less than one day)
integrating finc with d:swarm
● enhance data processing regarding
○ authority data linking (NLP)
○ fuzzy deduplication
○ classification
○ relate bibliographic data to places, topics, abstract terms
○ publish machine readable data (linked data)
● create user interfaces to enable system librarians to control metadata
processing
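To make "fuzzy deduplication" concrete, here is a minimal sketch using only Python's standard library. The record fields (`title`, `year`, `isbn`), the 0.9 threshold, and the function names are assumptions for illustration; the slides do not say how finc or d:swarm would implement this step.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return " ".join(cleaned.split())


def is_probable_duplicate(a, b, threshold=0.9):
    """Heuristic: same ISBN, or same year plus near-identical title."""
    if a.get("isbn") and a.get("isbn") == b.get("isbn"):
        return True
    if a.get("year") != b.get("year"):
        return False
    ratio = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
    return ratio >= threshold


first = {"title": "The Library of the 21st Century", "year": 2015}
second = {"title": "the library of the 21st century.", "year": 2015}
print(is_probable_duplicate(first, second))
```

Comparing every pair of records does not scale to tens of millions of records, so a real pipeline would first group candidates by a cheap key and only then compare within each group.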
Tomorrow: common workflows
● All data flows through both tools (finc + d:swarm)
● Deduplication (duplicates are easier to recognize in a graph DB)
● FRBRization (aggregate different physical and formal versions of a
work)
● Knowledge graph makes enrichment (authorities, altmetrics data,
usage data, …) and analytics easier
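FRBRization, as described above, amounts to grouping the different versions of a work under one key. A very simplified sketch, assuming a hypothetical work key built from normalized author and title (real FRBRization needs far more careful matching):

```python
from collections import defaultdict


def work_key(record):
    """Hypothetical work key: normalized author plus normalized title."""
    def norm(value):
        return " ".join(value.lower().split())
    return (norm(record["author"]), norm(record["title"]))


def frbrize(records):
    """Group records (print, ebook, ...) that share the same work key."""
    works = defaultdict(list)
    for record in records:
        works[work_key(record)].append(record)
    return dict(works)


records = [
    {"author": "Goethe, Johann Wolfgang von", "title": "Faust", "format": "print"},
    {"author": "goethe, johann wolfgang von", "title": "Faust ", "format": "ebook"},
    {"author": "Kafka, Franz", "title": "Der Prozess", "format": "print"},
]
works = frbrize(records)
for key, versions in works.items():
    print(key, [v["format"] for v in versions])
```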
Scalability issues
● current implementation of property graph is too slow
● test results with 64 GB RAM, SSD, 16 cores
○ 1.2 million records (flat format): 10 hours for complete workflow
(ingest, transformation, export)
○ more complex formats (MARC21) produce up to 5x as many statements
● single Neo4j instance, storage and memory issues
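A back-of-the-envelope extrapolation shows why this matters. It assumes, optimistically, that throughput stays constant as the graph grows, and it takes the index size (12 + 80 million records) from the aggregated index slide:

```python
# Test result from the slide: flat-format records, complete workflow
RECORDS_TESTED = 1_200_000
HOURS_TAKEN = 10

# Aggregated index at Leipzig University Library (12 + 80 million records)
INDEX_SIZE = 92_000_000

records_per_hour = RECORDS_TESTED / HOURS_TAKEN   # 120,000 records per hour
hours_for_index = INDEX_SIZE / records_per_hour   # assumes linear scaling
days_for_index = hours_for_index / 24

print(f"{records_per_hour:,.0f} records/hour")
print(f"{days_for_index:.1f} days for the full index")
```

Under these assumptions a single instance would need about a month for one full pass over flat records, before accounting for the up to 5x statement count of MARC21.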
d:swarm architecture
Possible solutions?
● “throw hardware at it”
● Another graphDB, parallelization?
○ ArangoDB: https://www.arangodb.com
○ Apache Giraph: http://giraph.apache.org
○ Blazegraph: http://blazegraph.com (Wikidata’s choice)
● Gradoop?!