preston-hutchinson
[Figure: HTRC architecture diagram. Labeled components: agent framework with agent instances; page/volume tree (file system); authoritative volume store (Cassandra) with replicated volume stores; Solr indexes; SEASR analytics service; Meandre orchestration; web portal; desktop SEASR client; task deployment; WSO2 registry (services, collections, data capsule images); WSO2 Enterprise Service Bus; HathiTrust corpus at the University of Michigan, transferred by rsync; compute resources (FutureGrid, NCSA local resources, NCSA HPC resources, Penguin on Demand); programmatic access (e.g., Bamboo); CI logon (NCSA); access control (e.g., Grouper); non-consumptive data capsules.]
Solr quick introduction
• Lucene is a high-performance, full-featured text search engine library.
• Solr is a web service front end to Lucene.
• An index consists of documents, and each document consists of fields, which are name/value pairs.
HTRC Solr
• Holds both bibliographic information and the full-text OCR scan
  – 29 fields
  – Volume ID, title, author, several reference IDs (ISBN, ISSN, call number, etc.), and full text
• Supports basic searches: term, wildcard, fuzzy, phrase, and range queries
  – Example: “OCR: war” finds documents containing the word “war” in the text
• Term vectors are enabled to get the frequency and offset of each word
  – Occurrences
  – Position and offset
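The query types listed above can be sketched as request URLs against Solr's standard /select handler. This is only an illustration: the base URL, the core name `htrc`, and the field names `ocr` and `publishDate` are assumptions, since the slides do not give the real endpoint or schema.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the slides do not state the real host or core name.
SOLR_BASE = "http://localhost:8983/solr/htrc"

def build_select_url(query, rows=10, base=SOLR_BASE):
    """Return a URL for Solr's /select handler with a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{base}/select?{urlencode(params)}"

# One Lucene-syntax query per search type named in the slide.
# Field names "ocr" and "publishDate" are assumed for illustration.
EXAMPLE_QUERIES = {
    "term": "ocr:war",
    "wildcard": "ocr:wa*",
    "fuzzy": "ocr:war~",
    "phrase": 'ocr:"civil war"',
    "range": "publishDate:[1800 TO 1900]",
}

urls = {kind: build_select_url(q) for kind, q in EXAMPLE_QUERIES.items()}
for kind, url in urls.items():
    print(kind, url)
```

The URLs are only built, not fetched, so the sketch runs without a Solr server.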
Filtered Term Vector
• The default term vector is massive
  – On the order of 5 MB per volume
  – Extremely slow response for multiple volumes
• We extended Solr to filter out unwanted words, which speeds up responses significantly.
  – Term vector size reduced to the order of 80 KB per volume
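The idea of filtering a term vector can be illustrated outside Solr with a small sketch. It is not the HTRC Solr extension itself: the in-memory shape of the term vector and the list of unwanted words are assumptions made for the example.

```python
# Assumed shape for one volume's term vector:
# term -> {"tf": occurrences, "positions": [...], "offsets": [(start, end), ...]}
UNWANTED = {"the", "of", "and", "a", "in"}  # hypothetical list of unwanted words

def filter_term_vector(term_vector, unwanted=UNWANTED):
    """Drop entries whose term is in the unwanted set (case-insensitive)."""
    return {term: stats for term, stats in term_vector.items()
            if term.lower() not in unwanted}

# Term vector for the text "the war of the roses".
sample = {
    "the": {"tf": 2, "positions": [0, 3], "offsets": [(0, 3), (11, 14)]},
    "war": {"tf": 1, "positions": [1], "offsets": [(4, 7)]},
    "of": {"tf": 1, "positions": [2], "offsets": [(8, 10)]},
    "roses": {"tf": 1, "positions": [4], "offsets": [(15, 20)]},
}

filtered = filter_term_vector(sample)
print(sorted(filtered))
```

Because very frequent words carry the longest position and offset lists, removing them is where most of the size reduction would come from.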
Ingest Procedure
• Use rsync to pull filesystem data from the HT main collection.
• Too many small text files...
• Parse structural metadata (METS): page ordering, page checksums (and verification); some metadata is stored in the NoSQL store.
• Analyze delta logs to push incremental changes to the NoSQL store.
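The checksum verification step can be sketched as follows. The sketch assumes the METS file has already been parsed into an ordered list of (page ID, MD5) pairs; the function name, the page IDs, and the data shapes are hypothetical, and MD5 is assumed because it is the checksum shown in the schema example later in the deck.

```python
import hashlib

def verify_pages(pages, manifest):
    """Return the page IDs that are missing or fail their checksum.

    pages:    dict mapping page ID -> page bytes read from the pairtree
    manifest: list of (page ID, expected MD5 hex digest) in page order,
              assumed to come from the parsed METS file
    """
    failed = []
    for page_id, expected_md5 in manifest:
        data = pages.get(page_id)
        if data is None or hashlib.md5(data).hexdigest() != expected_md5.lower():
            failed.append(page_id)
    return failed

# Hypothetical two-page volume; the second page is corrupted on disk.
pages = {"00000001": b"first page", "00000002": b"corrupted"}
manifest = [
    ("00000001", hashlib.md5(b"first page").hexdigest()),
    ("00000002", hashlib.md5(b"second page").hexdigest()),
]

print(verify_pages(pages, manifest))
```

Only volumes whose pages all verify would then be pushed to the NoSQL store.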
[Figure: HTRC Text Corpora Ingest Workflow. HathiTrust (remote): an rsync root holding bibliographic metadata and collection namespaces (1, 2, …), each with a pairtree_root and pairtree. HathiTrust Research Center (local): rsync the root; rsync the split pairtree list; rsync the rest in parallel using the split pairtree list; use delta logs to push modified volume contents from the pairtree to the Cassandra NoSQL repository; update the collections list.]
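The parallel rsync step in the workflow can be sketched by splitting the pairtree list into one file list per worker and building one rsync command for each. The round-robin split, the paths, and the source and destination locations are assumptions for illustration; `-a` and `--files-from` are standard rsync options.

```python
def split_round_robin(paths, n_workers):
    """Distribute paths over n_workers lists, one path at a time."""
    chunks = [[] for _ in range(n_workers)]
    for i, path in enumerate(paths):
        chunks[i % n_workers].append(path)
    return chunks

def rsync_command(list_file, source, dest):
    """Build (but do not run) an rsync command for one split list."""
    return ["rsync", "-a", f"--files-from={list_file}", source, dest]

# Hypothetical pairtree paths, source, and destination.
pairtree_paths = [f"inu/pairtree_root/32/00/{n:02d}" for n in range(1, 8)]
chunks = split_round_robin(pairtree_paths, 3)
commands = [
    rsync_command(f"split_{i}.txt", "remote.example.org::ht_corpus/", "/data/htrc/")
    for i in range(len(chunks))
]

for chunk, cmd in zip(chunks, commands):
    print(len(chunk), " ".join(cmd))
```

Each command could then be launched in its own process, after writing the matching chunk to its `split_N.txt` file.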
NoSQL Repository
• Uses Cassandra as the storage space for our text collections and related metadata
  – Aggregates small texts
• Allows us to manage flexible schemas
• Key-value based column store
• Offers good scalability, redundancy, and performance
Cassandra Schema
• Each row represents a volume
  – Row key is the volume ID
  – Each row contains many columns
  – First column contains metadata attributes about the volume
  – Each subsequent column family is a page; its key is the page ID
  – Page-specific columns contain the page contents and metadata about the page
Key: (volume ID)

Inu.320001
    metadata:        copyright = public; Page count = 16
    Inu.320001/001:  content = What’s up doc?; size = 12; MD5 = 12345f
    Inu.320001/xxx:  content = Rabbits; size = 7; MD5 = aabbcc
Inu.320002
    metadata:        copyright = In-copyright; Page count = 2406
    Inu.320002/001:  content = 2b|!2b; size = 6; MD5 = 7effdd
    Inu.320002/xxx:  content = A question; size = 10; MD5 = deadbeef
…
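The row layout above can be modeled as nested dictionaries to show how the basic access primitives would work against it. This is an in-memory stand-in, not Cassandra client code; the function names are hypothetical, and the values are copied from the slide's example.

```python
# Row key (volume ID) -> {"metadata": {...}, page ID: {page columns}}
volumes = {
    "Inu.320001": {
        "metadata": {"copyright": "public", "page_count": 16},
        "Inu.320001/001": {"content": "What’s up doc?", "size": 12, "MD5": "12345f"},
        "Inu.320001/xxx": {"content": "Rabbits", "size": 7, "MD5": "aabbcc"},
    },
    "Inu.320002": {
        "metadata": {"copyright": "In-copyright", "page_count": 2406},
        "Inu.320002/001": {"content": "2b|!2b", "size": 6, "MD5": "7effdd"},
        "Inu.320002/xxx": {"content": "A question", "size": 10, "MD5": "deadbeef"},
    },
}

def get_volume_metadata(store, volume_id):
    """Volume-level attributes live in the first ("metadata") column."""
    return store[volume_id]["metadata"]

def list_page_ids(store, volume_id):
    """Every key in the row other than "metadata" is a page ID."""
    return [key for key in store[volume_id] if key != "metadata"]

def get_page_content(store, volume_id, page_id):
    """Fetch one page's text; the page's size and MD5 sit alongside it."""
    return store[volume_id][page_id]["content"]

print(get_volume_metadata(volumes, "Inu.320002"))
print(list_page_ids(volumes, "Inu.320001"))
print(get_page_content(volumes, "Inu.320001", "Inu.320001/xxx"))
```

Note that fetching a page returns its size and MD5 together with the content, which matches the layout of one page per group of columns.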
Cassandra Schema
• Pros
  – Works well for all access primitives
  – Well-organized metadata, no repetitions
  – Volume-level versioning could follow a similar schema, but the version number needs to be concatenated to the volume ID for historical versions
• Cons
  – Subcolumn families cannot be indexed
  – Extra metadata are picked up even when only page contents are needed
  – Must store historical versions of volumes as deltas; a naïve translation of the above format to historical versioning would have a high cost in space
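The versioning points can be illustrated with a small sketch: a row key formed by concatenating the version number to the volume ID, and a historical version stored as a delta against the current one. The `@` separator, the delta format, and the function names are assumptions; the slides do not specify them.

```python
def versioned_key(volume_id, version):
    """Row key for a historical version; the "@" separator is assumed."""
    return f"{volume_id}@{version}"

def page_delta(base, target):
    """Describe how `target` differs from `base` (both: page ID -> content)."""
    changed = {pid: text for pid, text in target.items() if base.get(pid) != text}
    removed = sorted(pid for pid in base if pid not in target)
    return {"changed": changed, "removed": removed}

def apply_delta(base, delta):
    """Rebuild the target version from the base version plus a delta."""
    pages = {pid: text for pid, text in base.items() if pid not in delta["removed"]}
    pages.update(delta["changed"])
    return pages

# Hypothetical page contents for two versions of one volume.
current = {"001": "What’s up doc?", "002": "Rabbits", "003": "New page"}
previous = {"001": "What’s up doc?", "002": "Rabits"}

# Store only the difference under the historical row key.
history = {versioned_key("Inu.320001", 1): page_delta(current, previous)}
print(history)
print(apply_delta(current, history["Inu.320001@1"]) == previous)
```

Unchanged pages are not copied into the historical row, which is the space saving the delta approach is meant to give over a full copy per version.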