HathiTrust Research Center Architecture Data subsystem

HathiTrust Research Center Architecture

[Figure: HTRC architecture diagram, with the data subsystem highlighted. Recoverable component labels:]

• Data subsystem: page/volume tree (file system), authoritative volume store (Cassandra), replicated volume stores, Solr indexes
• Data source: HathiTrust corpus (University of Michigan), transferred via rsync
• Agent framework: agent instances, task deployment
• Analytics: SEASR analytics service, Meandre orchestration
• Clients: web portal, desktop SEASR client, programmatic access (e.g., Bamboo)
• Security: CI logon (NCSA), access control (e.g., Grouper)
• WSO2 registry: services, collections, data, capsule images
• WSO2 Enterprise Service Bus
• Compute resources: FutureGrid, NCSA local resources, NCSA HPC resources, Penguin on Demand
• Non-consumptive data capsules


Solr quick introduction

• Lucene is a high-performance, full-featured text search engine library.
• Solr is a web service front end to Lucene.
• An index consists of documents, and a document consists of fields, which are name/value pairs.

HTRC Solr

• Holds both bibliographic information and the full-text OCR scan
– 29 fields
– Volume ID, title, author, several reference IDs (ISBN, ISSN, call number, etc.), and full text
• Supports basic searches: term, wildcard, fuzzy, phrase, and range queries
– Example: “OCR: war” finds documents containing the word “war” in the text
• Term vectors are enabled to return, for each word:
– Occurrences (word frequency)
– Position and offset
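The term query from the example above can be sketched as an HTTP request against Solr's standard select handler. This is a minimal illustration only: the host, port, and core name are hypothetical, and the field name `OCR` is taken from the slide's example rather than from a confirmed schema.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real HTRC Solr host and core name are not given here.
SOLR_SELECT_URL = "http://localhost:8983/solr/htrc/select"


def build_query_url(field, term, rows=10):
    """Return a Solr select URL for a single-term query on one field."""
    params = {"q": f"{field}:{term}", "rows": rows, "wt": "json"}
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


# Field name "OCR" follows the slide's example; the actual schema may differ.
url = build_query_url("OCR", "war")
print(url)
```

Wildcard, fuzzy, phrase, and range queries use the same `q` parameter with different query syntax, so only the query string would change.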

Filtered Term Vector

• The default term vector is massive: O(5 MB) per volume
– Extremely slow response for requests spanning multiple volumes
• We extended Solr to filter out unwanted words, which speeds up responses significantly
– Reduces term vector size to O(80 KB) per volume
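The idea of filtering a term vector can be illustrated outside Solr with a few lines of Python. This is not the HTRC Solr extension itself: the term vector layout and the choice of a stop-word list as the filter are assumptions made for the sketch.

```python
# Toy term vector for one volume: term -> frequency and character offsets.
# The layout is illustrative, not Solr's actual response format.
term_vector = {
    "the": {"tf": 3, "offsets": [(0, 3), (20, 23), (41, 44)]},
    "war": {"tf": 2, "offsets": [(4, 7), (45, 48)]},
    "of": {"tf": 1, "offsets": [(8, 10)]},
    "peace": {"tf": 1, "offsets": [(30, 35)]},
}

# Assumed definition of "unwanted words": a small stop-word list.
UNWANTED = {"the", "of", "and", "a"}


def filter_term_vector(tv, unwanted):
    """Return a copy of the term vector without the unwanted terms."""
    return {term: info for term, info in tv.items() if term.lower() not in unwanted}


filtered = filter_term_vector(term_vector, UNWANTED)
print(sorted(filtered))
```

Because the most frequent words carry the longest offset lists, dropping them removes a disproportionate share of the response payload.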


Ingest Procedure

• Use rsync to pull file system data from the main HathiTrust collection.
• Too many small text files...
• Parse structural metadata (METS): page ordering and page checksums (with verification); some metadata is stored in NoSQL.
• Analyze delta logs to push incremental changes to the NoSQL store.
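The checksum verification step can be sketched as follows. The sketch assumes the METS file has already been parsed into an ordered list of (page file name, expected MD5) pairs; the file names and the `find_bad_pages` helper are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a page's bytes."""
    return hashlib.md5(data).hexdigest()


def find_bad_pages(page_bytes, manifest):
    """Return page IDs, in manifest order, that are missing or fail the checksum.

    manifest: ordered (page_id, expected_md5) pairs, assumed to come from METS.
    """
    bad = []
    for page_id, expected_md5 in manifest:
        data = page_bytes.get(page_id)
        if data is None or md5_hex(data) != expected_md5:
            bad.append(page_id)
    return bad


# Hypothetical page files for one volume.
pages = {"00000001.txt": b"What's up doc?", "00000002.txt": b"Rabbits"}
manifest = [(pid, md5_hex(data)) for pid, data in pages.items()]

pages["00000002.txt"] = b"Rabbit"  # simulate a corrupted transfer
print(find_bad_pages(pages, manifest))
```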

[Figure: HTRC Text Corpora Ingest Workflow. Recoverable labels:]

• HathiTrust (remote): rsync root containing bib metadata and collection namespaces 1, 2, …, each with a pairtree_root and pairtree
• HathiTrust Research Center (local): rsync root with the same layout, plus split pairtree lists and delta logs
• Steps shown: rsync the split pairtree list; parallel rsync of the rest using the split tree list; push modified volume contents from pairtree to the Cassandra NoSQL repository; update collections list
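The "split pairtree list, then parallel rsync" step can be sketched by partitioning a list of paths and building one rsync command per part. The remote source, local destination, and list file names below are hypothetical; the commands are only constructed, not executed. Note that `-r` is given explicitly because `-a` does not imply recursion when `--files-from` is used.

```python
def split_list(paths, n_parts):
    """Split paths round-robin into n_parts lists of nearly equal size."""
    return [paths[i::n_parts] for i in range(n_parts)]


def rsync_command(list_file, source, dest):
    """Build an rsync command that transfers only the paths named in list_file."""
    return ["rsync", "-a", "-r", f"--files-from={list_file}", source, dest]


# Hypothetical pairtree paths, remote source, and local destination.
paths = ["a", "b", "c", "d", "e"]
parts = split_list(paths, 2)
commands = [
    rsync_command(f"split_{i}.txt", "ht.example.org::corpus/", "/data/htrc/")
    for i in range(len(parts))
]
for cmd in commands:
    print(" ".join(cmd))
```

Each command could then be handed to its own worker process so the transfers run concurrently.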


NoSQL Repository

• Uses Cassandra as the storage layer for our text collections and related metadata
– Aggregates small texts
• Allows us to manage flexible schemas
• Key-value based column store
• Offers good scalability, redundancy, and performance

Cassandra Schema

• Each row represents a volume
– Row key is the volume ID
– Each row contains many columns
– First column contains metadata attributes about the volume
– Each subsequent column family is a page; its key is the page ID
– Page-specific columns contain page contents and metadata about the page

Key (volume ID)   Column            Attribute    Value
Inu.320001        metadata          copyright    public
                                    Page count   16
                  Inu.320001/001    content      What’s up doc?
                                    size         12
                                    MD5          12345f
                  Inu.320001/xxx    content      Rabbits
                                    size         7
                                    MD5          aabbcc
Inu.320002        metadata          copyright    In-copyright
                                    Page count   2406
                  Inu.320002/001    content      2b|!2b
                                    size         6
                                    MD5          7effdd
                  Inu.320002/xxx    content      A question
                                    size         10
                                    MD5          deadbeef
…
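The row layout above can be modeled with nested dictionaries to make the access pattern concrete. This is an in-memory illustration only, not Cassandra client code; the helper names are hypothetical and the values are the sample values from the schema example.

```python
# One entry per volume; each row holds a metadata column plus one column per page.
store = {
    "Inu.320001": {
        "metadata": {"copyright": "public", "page_count": 16},
        "Inu.320001/001": {"content": "What's up doc?", "size": 12, "MD5": "12345f"},
        "Inu.320001/xxx": {"content": "Rabbits", "size": 7, "MD5": "aabbcc"},
    },
}


def page_ids(row):
    """Return the page keys of a volume row, skipping the metadata column."""
    return [key for key in row if key != "metadata"]


def page_content(store, volume_id, page_id):
    """Look up one page's text by volume ID (row key) and page ID (column key)."""
    return store[volume_id][page_id]["content"]


print(page_ids(store["Inu.320001"]))
print(page_content(store, "Inu.320001", "Inu.320001/xxx"))
```

Reading a page this way also touches its `size` and `MD5` attributes, which mirrors the cost of fetching metadata alongside content.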

Cassandra Schema

• Pros
– Works well for all access primitives
– Well-organized metadata, no repetition
– Volume-level versioning could follow a similar schema, but the version number needs to be concatenated to the volume ID for historical versions
• Cons
– Subcolumn families cannot be indexed
– Extra metadata is picked up even when only page contents are needed
– Historical versions of volumes must be stored as deltas; a naïve translation of the above format to historical versioning would have a high cost in space
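The versioning idea can be sketched as follows: the current version is stored in full under the plain volume ID, and each historical version is stored under a key with the version number appended, holding only a delta. The key separator and the delta layout are assumptions made for illustration, not a documented HTRC format.

```python
def versioned_key(volume_id, version):
    """Row key for a historical version (the '@v' separator is an assumption)."""
    return f"{volume_id}@v{version}"


def reverse_delta(old_pages, new_pages):
    """Return what is needed to rebuild the old version from the new one."""
    restore = {pid: text for pid, text in old_pages.items() if new_pages.get(pid) != text}
    drop = [pid for pid in new_pages if pid not in old_pages]
    return {"restore": restore, "drop": drop}


def rebuild(new_pages, delta):
    """Apply a reverse delta to the current pages to recover the old version."""
    pages = {pid: text for pid, text in new_pages.items() if pid not in delta["drop"]}
    pages.update(delta["restore"])
    return pages


v1 = {"001": "What's up doc?", "002": "Rabbits"}
v2 = {"001": "What's up doc?", "002": "Rabbit season", "003": "A question"}

delta = reverse_delta(v1, v2)
store = {"Inu.320001": v2, versioned_key("Inu.320001", 1): delta}
print(rebuild(store["Inu.320001"], store["Inu.320001@v1"]) == v1)
```

Only the changed page is kept for version 1, so unchanged pages are not duplicated across versions.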

