Three-hour workshop on crowdsourced transcription software for the University of Toronto's Roots and Routes seminar in 2012.
Crowdsourced Manuscript Transcription
Ben Brumfield
Roots and Routes 2012
Not just crowdsourcing...
● Collaborative work
● Off-site solo work
● Private work
Not just manuscripts...
● Maps
● Textiles
● Music
● Flawed OCR
Not just transcription...
● Indexing
● Editing
● Identification
Counting seals on Arctic ice caps.
What it isn't
We'll concentrate on web-based tools for extracting text from images, not addressing:
● Oral History
● Video
● Audio Transcription
● Image Manipulation
● Transcription/Facsimile Display
Tools exist for these tasks, nevertheless.
Break
What materials are you working with outside of modern, printed books and websites?
Origins (Approaches)
Two Approaches and One Dead End
● Indexing
● Editing
● Tagging
Indexing
● Structured Data
● Extracts from Text vs. Representing Text
● Databases for Search and Analysis
● Granular Quality Control
● Gamification
Editing
● Books, Diaries, Letters, Articles
● Representing Text
● Traditional Editorial Workflow
● Digital or Print Editions
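The difference between the two approaches can be sketched as data structures. This is only an illustration: the class names, fields, file names, and sample values below are invented for this example and do not come from any of the tools discussed in the workshop. Indexing stores extracts from the text as structured fields, while editing stores a representation of the text itself.

```python
from dataclasses import dataclass


@dataclass
class IndexedRecord:
    """Indexing: one structured extract from a page image (hypothetical fields)."""
    page_image: str
    dish: str
    price_cents: int


@dataclass
class TranscribedPage:
    """Editing: the running text of a page, kept in reading order."""
    page_image: str
    text: str


def search_index(records, term):
    """Database-style lookup over one extracted field."""
    return [r for r in records if term.lower() in r.dish.lower()]


def search_transcripts(pages, term):
    """Full-text lookup over the represented text."""
    return [p for p in pages if term.lower() in p.text.lower()]


# Invented sample data, for illustration only.
records = [
    IndexedRecord("menu_001.jpg", "Oyster Stew", 25),
    IndexedRecord("menu_001.jpg", "Roast Beef", 40),
]
pages = [
    TranscribedPage("diary_014.jpg", "Rained all day. Mended the fence and wrote to Mother."),
]

oyster_hits = search_index(records, "oyster")
fence_hits = search_transcripts(pages, "fence")
```

Structured records lend themselves to sorting, counting, and field-level quality control; a transcript preserves context that a record discards.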
Tagging
● Too small
● Too imprecise
Origins (Traditions)
● OCR Correction
● Documentary Editing
● Genealogy
● Natural Science
● Astronomy
(Note: split this into 5 slides.)
Online Tools
● Recent (none older than 2005)
● Influenced by origin
● Still pretty raw
● Most require tech expertise for set-up and customization
● All require making trade-offs
Lab Session 1: Breadth
NYPL What's on the Menu: Indexing
Wikisource: Editing
Selection Factors
● Source Material
● Transcript Purpose
● Organizational/Project Management Fit
● Financial and Technical Resources
Source Material
Evaluating your source material:
● Is it of interest to anyone else?
● Is it under copyright?
● Does it need restricted access?
● Is it composed of documents or records?
● Is it non-textual?
● How complex is the layout? How important is that layout?
Purpose
How will you be using the transcribed data?
● Traditional print editions
● Searchable online editions
● Do you want to use the system to analyze the text?
● How do you want to analyze the text?
● Is public engagement a goal?
● Should the transcripts be open?
Organizational/Project Management Fit
● How important is traditional editorial workflow?
● Will you rely on volunteers? How will you motivate them?
● What is the duration of the project?
● Is there a "final version"?
● Is TEI a mandate?
Financial and Technical Resources
Do you have or need:
● System administrators to install non-hosted software?
● Money to pay hosting costs?
● Programming skills to customize a tool?
● Money to pay programmers for customization?
● Support for on-going costs to keep the site running, however small?
Lab Session 2: Markup Options
FromThePage
TranscribeBentham
Technical Questions to Answer
● Where are the images now?
● How do images get into the system?
● How do transcripts get out of the system?
● How mature is the underlying technology?
● How configurable is the technology?
● How does the system work with the public face of your project?
● Where does the metadata live?
● Who will maintain this? How long?
● How many sites are using this system?
Wikisource
Pro:
● MediaWiki plus its add-on modules (e.g. print-on-demand, export).
● Wikimedia community.
● Incredibly mature.
Con:
● Wikimedia policy.
● Public editing.
● Limited mark-up.
Bentham Transcription Desk
Pro:
● MediaWiki is very mature.
● TEI Toolbar (can also be used on other systems)
● Deployed outside original project.
Con:
● Development efforts halted.
Scripto
Pro:
● Team at CHNM has a great track record.
● Your CMS is your public face.
● MediaWiki is very mature.
● Deployed and under active development.
Con:
● Your CMS handles all metadata.
● Mark-up is extremely limited.
FromThePage
Pro:
● Designed for intensive editing and indexing.
● Semantic mark-up and analysis.
● Hosting available.
Con:
● Single developer (me).
● No TEI mark-up.
Islandora TEI Editor
Caveat: I don't know much about this tool or this team.
● Based on Drupal and Fedora
● Supports TEI via friendly interface
● Many Drupal-based projects considering it.
T-PEN
Caveat: I don't know much about this tool.
● Designed for medieval manuscripts.
● Supports TEI natively.
● Line-by-line interface.
● Hosted version available.
Scribe
Pro:
● Excellent for complex layout or non-documentary transcription.
● Zooniverse team is large, well-funded, experienced.
● Configurable.
Con:
● No automated tool for loading images or viewing transcript database (yet!)
● No concept of image-as-a-text.
Pybossa
Caveat: I don't know much about this tool or this team.
● Open Knowledge Foundation's crowdsourcing task management tool.
● Designed for tabular data.
● Google Spreadsheet data entry.
● Extremely young.
TextLab
Caveat: I don't know much about this tool or this team.
● Melville Electronic Library.
● Direct addition of TEI tags to image.
Lab Session 3: Configuration
Scribe
Old Weather, What's the Score, Development deployments