Mining Newspaper Archives
Tara Carlisle, Kathleen Murray
National Digital Newspaper Program
University of North Texas
The Portal to Texas History
Texas Digital Newspaper Program
(30 sec) The University of North Texas was selected to be the lead institution for the National Digital Newspaper Program; the two-year grant requires that 100,000 pages of content be digitized. The newspapers are hosted on UNT's The Portal to Texas History. To provide some background, the Portal to Texas History, created in 2002, is a digital gateway to over 170,000 historical items, including photographs, letters, documents, maps, artifacts, books, and newspapers. Many types of institutions, such as museums, public libraries, historical societies, government agencies, and universities large and small, contribute their collections, making the Portal a vital resource for scholars, K-12 teachers, students, and genealogists.

Topics
Introduction
Types of Information
Technology & Standards
Searching Historical Newspapers
Using Search Results
Chronicling America
US Newspaper Directory
Database: 1690 to present
US Newspaper Program
Funded by NEH: 1980-2007
140,000 bibliographic title entries
900,000 separate library holdings records
Directory Listing
Missouri Republican (St. Louis, Mo.) 1822-1838
National Digital Newspaper Program (NDNP)
NEH Grants: 28 states
Goal: All US states & territories
Coverage: 1836-1922
Newspaper Selection Process
Primarily technically-suitable microfilm holdings
Emphasis on bibliographic completeness
Diversity and "orphaned" newspapers
Searchable Database: 4,580,151 pages
The Newspaper Title Directory is derived from the library catalog records created by state institutions during the NEH-sponsored United States Newspaper Program (http://www.neh.gov/projects/usnp.html), 1980-2007. This program funded state-level projects to locate, describe (catalog), and selectively preserve (via treatment and microfilm) historic newspaper collections in that state, published from 1690 to the present. Under this program, each institution created machine-readable cataloging (MARC) via the Cooperative ONline SERials Program (CONSER) for its state collections, contributing bibliographic descriptions and library holdings information to the Newspaper Union List, hosted by the Online Computer Library Center (OCLC). This data, approximately 140,000 bibliographic title entries and 900,000 separate library holdings records, was acquired and converted to MARCXML format for use in the Chronicling America Newspaper Title Directory. Contact a CONSER member for updates and corrections to bibliographic records (see http://www.loc.gov/acq/conser/conmembs.html) through CONSER. The Chronicling America Directory bibliographic records are updated annually from the CONSER dataset hosted by OCLC.

Texas Digital Newspaper Program
(30 sec) To meet the needs of the large state of Texas, in addition to digitizing newspapers through the National Digital Newspaper Program, UNT created the Texas Digital Newspaper Program and applied for state and foundation funding to digitize more Texas newspapers. Altogether, through national, state, and private funding, UNT has digitized more than 600,000 pages of content. Put another way:
86,000+ newspaper issues spanning 200 years, dating 1810-2010
Representing 50 counties
Six languages: primarily English, Spanish, and German; a few Czech, French, Swedish, and Chinese newspapers

Digitization Standards
(30 sec) Beyond its primary mission of providing online access to Texas historical newspapers, the Texas Digital Newspaper Program promotes best-practice standards for preservation, including NDNP's digitization standards for newspapers.
Anecdote: Rio Grande City Public Library microfilm
The most successful initiative has been the Tocker Foundation grant, which provides funding to rural public libraries (serving a population of less than 14,000) to digitize their historic newspaper collections. UNT Libraries partners with the libraries by providing the digitization services and hosting the newspapers on the Portal. For some rural areas, historical newspapers are the only source of documentation of their local history.
Types of information
(1-2 min) Interactivity. Start by asking the audience:
When you think of newspapers for genealogical research, what comes to your mind?
What types of information or data are you seeking or hoping to find?

Types of Information
Births and deaths
Marriage announcements
Military service
Land purchases
Promotions
Advertisements: Family businesses
Travel announcements
Social activities
(15 sec) Note to presenters: Animate to present content on click.
Present some interesting examples!
Christmas decoration fight!
TO DO: NEED examples
Find a family and have examples for 3 or more of the information types
Osterhout, John P.: Tara will look this up as a possibility
Chronicling America: Family of Hugh Murray
Travel: http://chroniclingamerica.loc.gov/lccn/sn84020274/1902-12-21/ed-1/seq-33/
Travel (family visit): http://chroniclingamerica.loc.gov/lccn/sn84020274/1902-11-30/ed-1/seq-36/
Marriage: http://chroniclingamerica.loc.gov/lccn/sn87052181/1893-09-16/ed-1/seq-3/
Social event: http://chroniclingamerica.loc.gov/lccn/sn84020274/1902-09-21/ed-1/seq-3/
Social event: http://chroniclingamerica.loc.gov/lccn/sn84020274/1900-01-07/ed-1/seq-26/
Real estate: http://chroniclingamerica.loc.gov/lccn/sn84020274/1903-11-04/ed-1/seq-7/
Death notices
(15 sec) Types of Information: Death Notices
At times we find the unfortunate truth about our ancestors.
Anecdote: a Ms. Lanham was researching her family and finally solved the mystery about the lack of information about her great-great-grandfather's death. He was killed by her great-uncle.
J.P. Osterhout children
Bellville Countryman, 1861
Texas Countryman, 1868
(10 sec)

J.P. Osterhout (1826-1903)
Fort Worth Gazette, 1891
Fort Worth Gazette, 1889
(10 sec)

J.P. Osterhout children
Belton Evening News, 1918
Sherman Democrat, 1903
(10 sec)

Technology & Standards
The next part of the presentation provides an overview of the basic technology and the standards used for Chronicling America and The Portal to Texas History, because knowing how a system is organized and how the parts connect gives one the tools to conduct a more effective search. Many of the basic concepts that we're covering today can be applied to other digital library systems.
Technology & Standards
Page Formats: JPEG, JP2, PDF, OCR Text
Metadata: Title, Issue Date, Geographic Coverage
Application Programming Interface: Directory searching; Links to title, issues, pages; Linked data
Optical Character Recognition: Scanning, OCR
I will discuss metadata and the part it plays in the digital library system.
UNT DPU
Model: Mekel Mach V Microfilm Scanner
Manufacturer: Mekel Technology, San Dimas, California
Features: 8,192 pixel CCD array; automatic gain control; fiber optic lighting; accepts 16mm/35mm, simplex/duplex, positive/negative, silver, diazo, and vesicular formats in 100', 215', and 1000' rolls; 100-600 true optical dpi
How we use it: scanning microfilm for the Texas Digital Newspaper Program and other initiatives
Example scans:
Breckenridge American, Vol. 41, No. 48
The Hereford Brand, Vol. 8, No. 27
Metadata
Metadata enhances information retrieval within the system and between other systems.
Descriptive metadata is used to describe an individual item and provides such information as creator, publisher, contents, size, relationship to other resources, and more.
Metadata may also contain "preservation" components that help us to maintain the integrity of digital files over time.
Metadata set in a Resource Description Framework (RDF) supports open access and linked data.
(30 sec) Interactivity: To start off, tell me what you know about metadata. What comes to mind when I say metadata schema? Any examples?
Digital library systems such as Chronicling America and The Portal to Texas History use metadata to describe and organize the digital content.
Descriptive metadata: the creator, date, subject, coverage
Administrative metadata: digital file type, size, and unique identifier for the system
Resource Description Framework: supports interoperability
Dublin Core Elements for descriptive metadata
1. Title
2. Subject
3. Description
4. Type
5. Source
6. Relation
7. Coverage
8. Creator
9. Publisher
10. Contributor
11. Rights
12. Date
13. Format
14. Identifier
15. Language
Title
Subject
Description
Resource Type: article, book, photograph
Source: reference to where the resource was derived, for example a scanned page within a book
Relation: reference to a related source, such as an image item to a text item
Coverage: time period and/or location
Creator: one who is responsible for making the content: artist, author, photographer
Publisher
Contributor: one who made contributions to the content or played a secondary role
Rights: rights held over the resource
Date
Format: sound file, text, image
Identifier: unique number assigned to an individual item
Language
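For readers who want to work with these elements programmatically, here is a minimal sketch in Python that separates a record's Dublin Core fields from any other fields it carries. The function name, the sample record, and the "PageSequence" field are illustrative assumptions, not part of the Portal's or Chronicling America's software; the sample values come from The Bourbon news example used elsewhere in this presentation.

```python
# The 15 simple Dublin Core element names, as listed on the slide.
DUBLIN_CORE_ELEMENTS = (
    "title", "subject", "description", "type", "source",
    "relation", "coverage", "creator", "publisher", "contributor",
    "rights", "date", "format", "identifier", "language",
)


def split_record(record):
    """Split a record (a dict) into Dublin Core fields and everything else.

    Field names are compared case-insensitively; Dublin Core names are
    returned in lower case.
    """
    core, other = {}, {}
    for name, value in record.items():
        if name.lower() in DUBLIN_CORE_ELEMENTS:
            core[name.lower()] = value
        else:
            other[name] = value
    return core, other


# Hypothetical record, for illustration only.
sample = {
    "Title": "The Bourbon news",
    "Coverage": "Paris, Ky.",
    "Identifier": "sn86069873",
    "Date": "1900-01-05",
    "PageSequence": 3,  # not one of the 15 Dublin Core elements
}

core, other = split_record(sample)
print(sorted(core))  # names of the Dublin Core fields that were found
print(other)         # fields outside simple Dublin Core
```

A locally qualified schema would extend the tuple of names rather than replace it, which is one way to keep the core elements interoperable.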
(10 sec) On this newspaper record we see some of the core metadata elements; on the left, one has the option to view the metadata files in several formats.
(10 sec) UNT based its metadata schema on the Dublin Core Metadata Initiative.

Qualified Dublin Core
Dublin Core elements
Qualified Dublin Core
(30 sec) UNT took simple Dublin Core and locally qualified it. UNT Libraries uses qualified Dublin Core, so while promoting interoperability with widely accepted standards, UNT Libraries' metadata elements allow flexibility at the local level to integrate with existing and anticipated changes. This also allows us to implement faceted searching.
Digitization Process
Optical Character Recognition
Scanning
OCR
Metadata is one of the key components of a digital library because it creates relationships between digital objects and optimizes searching.
But the most vital part of a digital library is the actual digital object and its accompanying files. How is the digital object made and organized?
Hand over to Kathleen to discuss the digitization process.

Digitization Process
Paper, Microfilm
Scan Image
Digital Master
Derivative Production: JPEG2000, PDF, JPEG
Quality: Original, Complete, Clean
Quality: 1990s or later; Master negative (first generation); Original copies; Density; Reduction ratio
Original Sources
Quality: 300-400 ppi; Lossless (TIFF); Grayscale; Bi-tonal
Most large-scale projects scan from microfilm, not paper.
Quality of microfilm improved in the 1990s with the establishment and use of imaging standards.

OCR in the Process
Paper, Microfilm
Scan Image
Digital Master
Optimization for OCR: High B&W contrast; Grayscale to bi-tonal; De-skew pages; Smooth, round, sharpened character edges
OCR Software: Analyze & break down page layout; Analyze stroke edges of characters; Match edges to pattern images; Character decision; Word matching in dictionary; Confidence decision
OCR Text
Objective of de-skewing is to align the words horizontally.
Fonts & layouts: Generally OCR software is quite capable. Difficulties: handwriting, script, Gothic fonts.
OCR post-processing
Dynamic thresholding: Creating temporary bi-tonal images. Goal: Produce better OCR results.
Analyze page structure
Blocks of text (columns), tables, images
Lines: Words, then characters
Comparison of characters with pattern images in database
Analysis of stroke edges, discontinuities, background
Best-guess character decision made
Confidence rating encoded: ALTO standard for newspaper ranges from 0-9, 9 being very confident
Analysis at word level
Built-in dictionaries consulted for word match
Human-mediated training to improve character recognition
Expensive & not generally done in large-scale projects
Citation: Holley, R. (2009, March/April). How good can it get? Analyzing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine. Retrieved January 21, 2012 from http://www.dlib.org/dlib/march09/holley/03holley.html
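To make the "grayscale to bi-tonal" and "dynamic thresholding" notes above more concrete, here is a minimal sketch in Python of one common approach, local-mean thresholding: each pixel is compared with the average of its neighbourhood rather than with a single page-wide cutoff. This is an assumption for illustration only; the presentation does not say which algorithm the OCR software actually uses, and the function name, window size, and offset are hypothetical.

```python
def local_mean_threshold(gray, window=3, offset=10):
    """Convert a grayscale image to bi-tonal using a local-mean threshold.

    gray   : list of rows, each a list of pixel values (0 = black, 255 = white)
    window : side length of the square neighbourhood (odd number)
    offset : how much darker than the local mean a pixel must be to count as ink

    Returns a list of rows of 0 (ink) and 1 (background).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            neighbours = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if gray[y][x] < local_mean - offset else 1)
        result.append(row)
    return result


# A tiny 3x3 "scan": one dark pixel (ink) on a light background.
scan = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
for row in local_mean_threshold(scan):
    print(row)
```

Because the cutoff follows the local background, uneven film density affects the result less than it would with one fixed threshold for the whole page.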
OCR & Quality
What affects microfilm quality?
Quality of printed newspaper
Reduction ratio: Lower is better (20x)
Variation in density: Narrow range is better (.2; .90-1.20)
Measurement of light able to pass through film
Technically suitable film: Can produce a 300-400 ppi digital image
Example: 400 ppi image
Optical resolution of scanner: 8,000 ppi
Microfilm reduction ratio needs to be 20x
8,000 ppi / 400 ppi = 20:1

Source: http://finereader.abbyy.com/professional/ (ABBYY website, ABBYY FineReader)
What is OCR?
Optical Character Recognition (OCR) is a technology that enables conversion of images, received from scanner or digital camera, and PDFs to editable and searchable text documents ready for editing, quoting, search, and archiving.
Reference: Microfilm Selection for Digitization - NDNP
http://www.loc.gov/ndnp/guidelines/NEH_MicrofilmSelectionNDNP.pdf
Example: 12X or 12:1 means the size of the image on film will be 1/12th the actual size of the original document. Reduction should always be identified on preservation quality microfilm. In the event it is not, you can use the following formula to determine the reduction. Size of the original divided by the size of the frame = reduction.
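The two calculations in this section, the reduction formula and the 8,000 ppi / 400 ppi = 20:1 example, can be sketched in a few lines of Python. The function names and the 24-inch and 2-inch sample measurements are assumptions chosen for illustration; only the formulas and the scanner figures come from the slides.

```python
def reduction_ratio(original_size, frame_size):
    """Reduction = size of the original divided by the size of the frame.

    Both sizes must be measured along the same dimension and in the same unit.
    """
    return original_size / frame_size


def effective_ppi(scanner_optical_ppi, ratio):
    """Resolution achieved relative to the original page.

    A scanner resolving 8,000 ppi on film filmed at 20x yields 400 ppi
    with respect to the original newspaper page.
    """
    return scanner_optical_ppi / ratio


def max_reduction_ratio(scanner_optical_ppi, target_ppi):
    """Highest reduction ratio that still reaches the target resolution."""
    return scanner_optical_ppi / target_ppi


# Hypothetical measurements: a 24-inch page edge filmed onto a 2-inch frame.
print(reduction_ratio(24, 2))          # 12.0, i.e. 12X or 12:1
print(effective_ppi(8000, 20))         # 400.0 ppi on the original page
print(max_reduction_ratio(8000, 400))  # 20.0, i.e. 20:1
```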
Digitization parameters
Bit depth: 1-bit, 8-bit, 24-bit
Resolution: pixels per inch (ppi)
KRM: work this up
Include Interactivity

OCR Text: Cost v. Quality
Layout irregularities
If inconsistent, cannot automate parameters
Training the OCR software
Human mediation to confirm or correct best guesses of software
Segmenting articles (including cont. articles)
Requires additional resources
Offered by fee-based archives: The British Newspaper Archive, The New York Times Archive
Layout irregularities affect the amount of human intervention needed; if the layout is not standard, parameters cannot be automated.
Training the OCR software (an ABBYY feature) means confirming or correcting its best guesses; human mediation is expensive.
Segmenting by article, including continued articles, requires additional resources (fee archives: BL & NYTimes).

Search: Metadata & OCR Text
OCR Text
Metadata
chroniclingamerica.loc.gov/lccn/sn86071264/1853-01-03/ed-1/seq-3
Application Programming Interface (API)

API: OpenSearch - Newspaper Pages
All searches start with protocol & server name: http://chroniclingamerica.loc.gov/
Search query example: Frederick Gardner, a Missouri governor
http://chroniclingamerica.loc.gov/search/pages/results/?andtext=frederick+gardner+missouri
Page Search parameters
andtext: the search query
format: 'html' (default), or 'json', or 'atom' (optional)
page: for paging results (optional)
Examples:
/search/pages/results/?andtext=thomas    search for "thomas", HTML response
/search/pages/results/?andtext=thomas&format=atom    search for "thomas", Atom response
/search/pages/results/?andtext=thomas&format=atom&page=11    search for "thomas", Atom response, starting at page 11
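As a hedged illustration of how these parameters combine, the following Python sketch builds page-search URLs from the three parameters listed above (andtext, format, page). It only constructs the URL and does not contact the server; the function name is an assumption, and the site may accept other parameters not covered in this presentation.

```python
from urllib.parse import urlencode

BASE_URL = "http://chroniclingamerica.loc.gov"
ALLOWED_FORMATS = ("html", "json", "atom")


def page_search_url(query, fmt=None, page=None):
    """Build a Chronicling America page-search URL.

    query : search words, passed as the 'andtext' parameter
    fmt   : optional response format: 'html' (the default), 'json', or 'atom'
    page  : optional results page number
    """
    params = {"andtext": query}
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
        params["format"] = fmt
    if page is not None:
        params["page"] = page
    # urlencode turns spaces into '+', matching the example on the slide.
    return f"{BASE_URL}/search/pages/results/?{urlencode(params)}"


print(page_search_url("frederick gardner missouri"))
print(page_search_url("thomas", fmt="atom", page=11))
```

Requesting 'json' or 'atom' instead of the default HTML is what makes the results convenient to process with a script.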
OpenSearch description: Chronicling America Pages Search. Search digital newspaper content from the Library of Congress. UTF-8. http://loc.gov/favicon.ico

API: OpenSearch - Newspaper Pages
http://chroniclingamerica.loc.gov/search/pages/results/?andtext=frederick+gardner+missouri
API: Link to Titles, Issues, Edition, & Pages
http://chroniclingamerica.loc.gov/
The Bourbon news. (Paris, Ky.)
/lccn/sn86069873/    Title Information
/lccn/sn86069873/issues/    Calendar View of Issues
/lccn/sn86069873/issues/first_pages/    Browse First Pages
/lccn/sn86069873/1900-01-05/ed-1/    1st Available Edition from 05JAN1900
/lccn/sn86069873/1900-01-05/ed-1/seq-3/    3rd Available Page from 05JAN1900
Applications: Bookmarks; Share on other sites
Example: St. Louis Republic, 16SEP1893, page 3
http://chroniclingamerica.loc.gov/lccn/sn87052181/1893-09-16/ed-1/seq-3
Link pattern: LC long-term commitment
Link to titles, issues, editions, and pages
The Chronicling America Web site uses links that follow a straightforward pattern. You can use this pattern to construct links into specific newspaper titles, to any of its available issues and their editions, and even to specific pages. These links can be readily bookmarked and shared on other sites.
We are committed to supporting this link pattern over time, so even if we change how the site works, we will redirect any requests to the system using this specific pattern into the new site. We established redirect rules for links into the previous version of the site when we released a new version in early 2009, and we intend to sustain those rules.
The link pattern uses LCCNs, dates, issue numbers, edition numbers, and page sequence numbers.
Examples: The Bourbon news. (Paris, Ky.) 1895-19??
/lccn/sn86069873/    title information for LCCN sn 86069873
/lccn/sn86069873/issues/    calendar view of available issues
/lccn/sn86069873/issues/first_pages/    browse first pages of available issues
/lccn/sn86069873/1900-01-05/ed-1/    first available edition from January 5, 1900
/lccn/sn86069873/1900-01-05/ed-1/seq-3/    third available page from first edition, January 5, 1900
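The link pattern above lends itself to a small helper. The Python sketch below assembles title, edition, and page links from an LCCN, an issue date, an edition number, and a page sequence number, following the examples listed. The function name and argument names are assumptions for illustration; the sketch builds the link text only and does not check that the issue or page actually exists.

```python
from datetime import date

BASE_URL = "http://chroniclingamerica.loc.gov"


def chronicling_america_link(lccn, issue_date=None, edition=1, sequence=None):
    """Build a link to a title, an edition of an issue, or a single page.

    lccn       : Library of Congress Control Number, e.g. 'sn86069873'
                 (spaces, as in 'sn 86069873', are removed)
    issue_date : datetime.date of the issue; omit for the title page
    edition    : edition number within that date (defaults to 1)
    sequence   : page sequence number; omit to link to the whole edition
    """
    path = f"/lccn/{lccn.replace(' ', '')}/"
    if issue_date is not None:
        # isoformat() gives the YYYY-MM-DD form used in the link pattern.
        path += f"{issue_date.isoformat()}/ed-{edition}/"
        if sequence is not None:
            path += f"seq-{sequence}/"
    elif sequence is not None:
        raise ValueError("a page sequence number needs an issue date")
    return BASE_URL + path


# The Bourbon news (Paris, Ky.), January 5, 1900, first edition, third page.
print(chronicling_america_link("sn 86069873"))
print(chronicling_america_link("sn86069873", date(1900, 1, 5)))
print(chronicling_america_link("sn86069873", date(1900, 1, 5), sequence=3))
```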
File Formats
TIFF: Uncompressed format; standard for scanned archival images
JPEG: Lossy compression; used for non-archival purposes
PDF: Created from TIFF or JPEG; Adobe Portable Document Format
JPEG2000: File extension .jp2; superior compression performance; used for archival purposes
JPEG Page Images
Formats: NDNP Guidelines
TIFF 6.0, 8-bit grayscale, 400 dpi
PDF derivative, 150 dpi
JPEG 2000, Part 1 (derivative for Web access)
ALTO-encoded, machine readable text, XML files
In column-reading order
Created with OCR software
METS XML data objects describing newspaper issues, pages, and microfilm reels
Formats
Page Images
KRM: Merge with next page

Searching historical newspapers
Knowing the basic structure and organization of a digital library like Chronicling America and the Portal to Texas History, for example how metadata and OCR work, will help one conduct a more effective search.
Searching
Basic Search: Maximum flexibility; Targeted search
Advanced Search: More control
Exploring or Browsing: Overview of collections
Basic Search provides a quick, no-fuss way to search by keywords or phrases for items in the Portal, but you can also use it to create more sophisticated searches by combining terms, targeting your search, and/or limiting your search to particular types of items:
for maximum flexibility
to target your search to item records (metadata)
to target your search to specific fields in item records
Advanced Search: more control
Exploring: A great way to get an overview of what's in the collection overall
Which states or counties are represented
Which newspaper titles are included
Total date range

Basic Search
No surname field
And is implicit
Phrase searching and quotation marks
Diacritics are romanized
Often returns too many results

Advanced Search
Advanced Search provides easy-to-use fields that allow you to specify keywords and phrases and combine them without the need to remember special symbols.
Advanced Search also provides a number of options for limiting your search.

Advanced Search
Basic Search
All states or select a state
Range of years or one year
Search words
Exploring

Explore a Collection

Browse: Serial Title

Browse Newspaper Issues
http://chroniclingamerica.loc.gov/lccn/sn83045555/
From the title page (About page):
Calendar view
All front pages
First issue
Last issue

Browse Newspaper Issues

Browse by Topic
Topics in Chronicling America
http://www.loc.gov/rr/news/topics/topics.html
Chronicling America provides free access to millions of historic American newspaper pages. Listed here are topics widely covered in the American press of the time. We will be adding more topics on a regular basis. To find out what's new, sign up for Chronicling America's weekly notification service, which highlights interesting content on the site and lets you know when new newspapers and topics are added. Users can use the icons at the lower-left side of the Chronicling America Web page to subscribe. If you would like to suggest other topics, use the Ask a Librarian contact form available on the Newspaper and Current Periodical Reading Room site. Dates show the approximate range of sample articles.
Using search results

Bowles-Perry Family Tree
http://trees.ancestry.com/tree/14333492/family
Gallery View: Results
Dealing with a lot of results: Limit by State
Viewing pages: Open page in separate tab; use persistent link to get page without highlighted text
List View: Results
Options
Sort: Relevance, State, Title, Date
Results per page: 20 or 50
Also discuss navigation options for the search results: by visible page numbers, using arrow(s), jumping to a specific page

Print Search Results

Newspaper Pages: Print, Share, & Save
Use persistent link to get page without highlighted text
If slow to resolve clearly, try the PDF version
Print a Page

Share & Email
Identical features: Search results, Newspaper pages

Save
Pages
Search Results
Newspaper Pages: View & Download

View OCR Text

View PDF

Download JPEG2000

Clip JPEG Image
Zoom in before you clip!

Search Results: List View

Search Results: Grid View

Search Results: Brief View

Limiting Search Results
Share a Page
http://texashistory.unt.edu/ark:/67531/metapth47935/?q=Bowles
Share Options: Email, Print, Social Media

Download a Page: JPEG

Snip & Save an Article
Snipping Tool: PNG, GIF, JPEG, HTML

Snip & Save an Article
Historical Newspapers Source Lists
University of Pennsylvania: Penn Libraries - Historical Newspapers Online
http://gethelp.library.upenn.edu/guides/hist/onlinenewspapers.html
Library of Congress: Newspaper Archives/Indexes/Morgues
http://www.loc.gov/rr/news/oltitles.html
ICON: International Coalition on Newspapers
http://icon.crl.edu/digitization.htm
Cyndi's List
http://www.cyndislist.com/newspapers
United States Online Historical Newspapers
http://sites.google.com/site/onlinenewspapersite/Home/usa

Universities: Example - University of Pennsylvania: Penn Libraries - Historical Newspapers Online http://gethelp.library.upenn.edu/guides/hist/onlinenewspapers.html
This table provides a list of historical U.S. newspapers that are available online at no cost. Newspapers available for free through Google News Historical Archives and Newspaperarchives.com are listed individually as I identify them. Newspapers available through Chronicling America and state digitization projects are usually listed as a group. For instance, under "Wyoming" I have not listed every newspaper digitized in the project but simply described what is available.
Libraries: Example - Library of Congress: Newspaper Archives/Indexes/Morgues. Lists newspapers in four categories: (a) Archive sources on the Web, (b) US newspapers, (c) Morgues (US), and (d) International. http://www.loc.gov/rr/news/oltitles.html
Example - ICON: International Coalition on Newspapers - http://icon.crl.edu/digitization.htm
The International Coalition on Newspapers project develops strategies to preserve and improve access to newspapers from around the globe, working on issues including bibliographic access, copyright, and information dissemination. ICON was officially established in 1999 by 13 charter members and is based at the Center for Research Libraries. This page highlights and links to past, present, and prospective digitization projects of historic newspapers. The focus is primarily on digital conversion efforts, not full-text collections of current news sources.
Genealogy Sites: Example - Cyndi's List http://www.cyndislist.com/newspapers
Example - United States Online Historical Newspapers http://sites.google.com/site/onlinenewspapersite/Home/usa
Digital Content Providers: Example - Google News Archives (as of May 2011, no longer adding newspapers or enhancing access to archive) http://news.google.com/newspapers

Thanks!
Presentation and resources: http://goo.gl/6rt7D
Missouri Republican
Libraries: National, State, Academic, Public, Private
Historical Societies: National, State, Local