Technology Choices for the JSTOR Online Archive
Presented by
Chang Feng
Department of Computer Engineering and Computer Science, University of Missouri-Columbia, Columbia, MO 65211
Reference
Technology Choices for the JSTOR Online Archive, S. W. Thomas, K. Alexander, and K. Guthrie, Computer (February 1999), 60-65.
JSTOR Overview
Goals: To increase access to older scholarly materials by converting them to digital media and providing a full-text search capability.
Benefits: Preserves the original documents and conserves library shelf space.
Development phases:
– Phase I (scheduled for completion by the end of 1999): a minimum of 100 journal titles, primarily in the humanities and social sciences.
– As of December 1998: 67 journal titles, totaling 450,000 articles and 2.7 million pages.
Implementing JSTOR
Principles:
– Let the mission guide technical choices.
– Put the user first.
Issues to be addressed when building the digital library:
– Formats (e.g., image vs. formatted text)
– Storage, display, and distribution technologies (e.g., CD-ROM vs. Internet)
Implementing JSTOR
Mission: A reliable and faithful electronic archive
Choice of technology: A scanned image of each page at 600 dpi.
Mission: Searchable
Choice of technology: Use OCR software to create text files that let the user search the journals’ full text.
Mission: Reduce long-term library costs
Choice of technology: Centralized database storage, with distribution over the Internet.
Delivering JSTOR Pages
Pages are delivered in GIF format: ~30 Kbytes/page.
The system converts each page to screen resolution as needed.
The system caches converted pages for 3-4 days.
Pages are delivered one at a time, with the next page pre-loading.
Printing an entire article (at 600 or 150 dpi resolution):
– JPrint as a separate application (faster)
– Adobe Acrobat files
– PostScript files
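The convert-on-demand and caching behavior described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not JSTOR's implementation: the `PageCache` class, the `convert` callback, and the injectable `clock` are hypothetical names, and the 3-day lifetime is simply the low end of the 3-4 days quoted on the slide.

```python
import time
from typing import Callable, Dict, Tuple

# Assumed lifetime: the slide says 3-4 days; 3 days is used here.
CACHE_TTL_SECONDS = 3 * 24 * 60 * 60


class PageCache:
    """Cache of screen-resolution pages, converted on first request."""

    def __init__(self, convert: Callable[[str], bytes],
                 ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.time) -> None:
        self._convert = convert  # hypothetical 600 dpi -> GIF converter
        self._ttl = ttl
        self._clock = clock
        # page id -> (time of conversion, converted bytes)
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, page_id: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(page_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: reuse the converted page
        data = self._convert(page_id)  # missing or expired: convert again
        self._entries[page_id] = (now, data)
        return data
```

Under this sketch, pre-loading the next page amounts to calling `get` for page n+1 while the user is still reading page n, so the conversion cost is paid before the page is requested.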
Searching JSTOR
Graphical search interface.
Stores the full text in one file per page.
Each article also has a citation file.
Text files have embedded tags that specify which parts of the text belong to which article.
Separate index for each journal title.
Articles are indexed using Full-Text Lexicographer (U. of Michigan):
– Allows dynamic updating (no index downtime).
– Periodically optimizes the index with no downtime.
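To make the per-journal indexing idea concrete, here is a small sketch of an inverted index that accepts new pages at any time. It is not the Full-Text Lexicographer, whose internals the slides do not describe; a plain in-memory inverted index is swapped in for illustration, and `JournalIndex`, `add_page`, `search`, and `indexes` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, Set, Tuple

# A posting identifies where a word occurs: (article id, page number).
Posting = Tuple[str, int]


class JournalIndex:
    """In-memory inverted index for one journal title (illustrative only)."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[Posting]] = defaultdict(set)

    def add_page(self, article_id: str, page_no: int, text: str) -> None:
        # Pages can be added at any time; no rebuild is required.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self._postings[word].add((article_id, page_no))

    def search(self, query: str) -> Set[Posting]:
        # Return the pages that contain every word of the query.
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result


# One separate index per journal title, as on the slide.
indexes: Dict[str, JournalIndex] = defaultdict(JournalIndex)
```

Keeping one index per title means a search restricted to a single journal touches only that journal's postings, and adding a new issue updates one index rather than a global one.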
Browser Interoperability
Major issue: Backward compatibility.
– Support the HTML 3.2 standard.
– The JSTOR interface uses frames, but can adjust itself automatically to an unframed interface.
– Use new technology to enhance functionality, but not to provide basic functionality.
– Plug-ins are not encouraged.
JSTOR Server Infrastructure
Storage:
– Online: 600 dpi TIFF page images compressed with Cartesian Perceptual Compression (1:4, CPI Inc.).
– Offline: multiple copies of the original TIFF images for archival purposes.
Performance:
– Replacing CGI programs with FastCGI or Java servlets.
– Server mirroring.
Issues of Server Mirroring
Mirror server load balancing: Currently using a round-robin method.
Mirror server synchronization: Currently, new releases (> 1 GB/month) are shipped overnight on magnetic tape to the mirror sites.
User state synchronization: Currently,
– Regenerate the data at the current server if possible, or
– The current server requests the information from the server that originally created it and caches that copy for future use.
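The round-robin balancing and the two user-state strategies listed in this section might look roughly like the sketch below. Everything here is an assumption made for illustration: the `MirrorServer` class, its `peers` table, and the `regenerate` callback are hypothetical, and direct attribute access stands in for a real network request between mirrors.

```python
from itertools import cycle
from typing import Callable, Dict, Iterator, List, Optional


def round_robin(mirrors: List[str]) -> Iterator[str]:
    """Yield mirror names in turn, forever."""
    return cycle(mirrors)


class MirrorServer:
    """Hypothetical mirror holding per-user state (e.g. search results)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, str] = {}  # state id -> data
        self.peers: Dict[str, "MirrorServer"] = {}

    def get_state(self, state_id: str, origin: str,
                  regenerate: Optional[Callable[[], str]] = None) -> str:
        if state_id in self.state:
            return self.state[state_id]
        if regenerate is not None:
            # Strategy 1: rebuild the data on the current server.
            data = regenerate()
        else:
            # Strategy 2: ask the server that originally created it.
            data = self.peers[origin].state[state_id]
        # Keep a local copy so later requests are served directly.
        self.state[state_id] = data
        return data
```

Because round-robin can send consecutive requests from one user to different mirrors, some fallback of this kind is needed whenever a request refers to state created elsewhere.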
Authentication
Cross-organization access management
JSTOR currently relies on participating institutions to supply it with authenticated IP addresses.
Under evaluation:
– Digital certificates issued by the participating institutions.
– Password-based access control.
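IP-address-based access control of the kind described here can be illustrated with Python's standard `ipaddress` module. This is only a sketch: the institution names, the `INSTITUTION_NETWORKS` table, and the `institution_for` function are hypothetical, and the address ranges are the blocks reserved for documentation rather than real registrations.

```python
import ipaddress
from typing import Dict, List, Optional

# Hypothetical registrations, using documentation-only address ranges.
INSTITUTION_NETWORKS: Dict[str, List[str]] = {
    "Example University": ["192.0.2.0/24"],
    "Sample College": ["198.51.100.0/25", "203.0.113.16/28"],
}


def institution_for(client_ip: str) -> Optional[str]:
    """Return the institution whose registered range contains client_ip."""
    addr = ipaddress.ip_address(client_ip)
    for name, networks in INSTITUTION_NETWORKS.items():
        for net in networks:
            if addr in ipaddress.ip_network(net):
                return name
    return None  # unknown address: no institutional access
```

A request is granted access only when `institution_for` returns a name, which also shows the limitation that motivates the alternatives under evaluation: a user working from outside the registered ranges is not recognized.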
Conclusions
The choice of technology is based on the mission of the project and user feedback.
Technology choices must remain flexible.