Mining Newspaper Archives

Tara Carlisle, Kathleen Murray

National Digital Newspaper Program
University of North Texas

The Portal to Texas History
Texas Digital Newspaper Program

(30 sec) The University of North Texas was selected to be the lead institution for the National Digital Newspaper Program. The two-year grant requires that 100,000 pages of content be digitized. The newspapers are hosted on UNT's Portal to Texas History. To provide some background, the Portal to Texas History, which was created in 2002, is a digital gateway to over 170,000 historical items that include photographs, letters, documents, maps, artifacts, books and newspapers. Many types of institutions, such as museums, public libraries, historical societies, government agencies, and universities large and small, contribute their collections, making the Portal a vital resource for scholars, K-12 teachers, students and genealogists.

Topics
Introduction
Types of Information
Technology & Standards
Searching Historical Newspapers
Using Search Results

Chronicling America
US Newspaper Directory
  Database: 1690 to present
US Newspaper Program
  Funded by NEH: 1980-2007
  140,000 bibliographic title entries
  900,000 separate library holdings records
Directory listing: Missouri Republican (St. Louis, Mo.), 1822-1838

National Digital Newspaper Program (NDNP)
NEH grants: 28 states
Goal: all US states and territories
Coverage: 1836-1922
Newspaper selection process
  Primarily technically suitable microfilm holdings
  Emphasis on bibliographic completeness
  Diversity and "orphaned" newspapers
Searchable database: 4,580,151 pages

The Newspaper Title Directory is derived from the library catalog records created by state institutions during the NEH-sponsored United States Newspaper Program (http://www.neh.gov/projects/usnp.html), 1980-2007. This program funded state-level projects to locate, describe (catalog), and selectively preserve (via treatment and microfilm) historic newspaper collections in that state, published from 1690 to the present. Under this program, each institution created machine-readable cataloging (MARC) via the Cooperative ONline SERials Program (CONSER) for its state collections, contributing bibliographic descriptions and library holdings information to the Newspaper Union List, hosted by the Online Computer Library Center (OCLC). This data, approximately 140,000 bibliographic title entries and 900,000 separate library holdings records, was acquired and converted to MARCXML format for use in the Chronicling America Newspaper Title Directory. For updates and corrections to bibliographic records, contact a CONSER member (see http://www.loc.gov/acq/conser/conmembs.html). The Chronicling America Directory bibliographic records are updated annually from the CONSER dataset hosted by OCLC.

Texas Digital Newspaper Program

(30 sec) To meet the needs of the large state of Texas, in addition to digitizing newspapers through the National Digital Newspaper Program, UNT created the Texas Digital Newspaper Program and applied for state and foundation funding to digitize more Texas newspapers. Altogether, through national, state and private funding, UNT has digitized more than 600,000 pages of content. Put another way:

86,000+ newspaper issues spanning 200 years, dating from 1810 to 2010
Representing 50 counties
Six languages: primarily English, Spanish, and German, plus a few Czech, French, Swedish and Chinese newspapers

Digitization Standards

(30 sec) Beyond its primary mission of providing online access to Texas historical newspapers, the Texas Digital Newspaper Program promotes best-practice standards for preservation, including NDNP's digitization standards for newspapers.

Anecdote: Rio Grande City Public Library microfilm

The most successful initiative has been the Tocker Foundation grant, which provides funding to rural public libraries (those serving a population of less than 14,000) to digitize their historic newspaper collections. UNT Libraries partners with the libraries by providing the digitization services and hosting the newspapers on the Portal. For some rural areas, historical newspapers are the only source of documentation of their local history.

Types of information

(1-2 min) Interactivity. Start by asking the audience:
When you think of newspapers for genealogical research, what comes to your mind?
What types of information or data are you seeking or hoping to find?

Types of Information
Births and deaths
Marriage announcements
Military service
Land purchases
Promotions
Advertisements: family businesses
Travel announcements
Social activities

(15 sec) Note to presenters: animate to present content on click.

Present some interesting examples!
Christmas decoration fight!
TO DO: Need examples. Find a family and have examples for 3 or more of the information types.
Osterhout, John P.: Tara will look this up as a possibility.

Chronicling America: Family of Hugh Murray
Travel: http://chroniclingamerica.loc.gov/lccn/sn84020274/1902-12-21/ed-1/seq-33/
Travel (family visit): http://chroniclingamerica.loc.gov/lccn/sn84020274/1902-11-30/ed-1/seq-36/
Marriage: http://chroniclingamerica.loc.gov/lccn/sn87052181/1893-09-16/ed-1/seq-3/
Social event: http://chroniclingamerica.loc.gov/lccn/sn84020274/1902-09-21/ed-1/seq-3/
Social event: http://chroniclingamerica.loc.gov/lccn/sn84020274/1900-01-07/ed-1/seq-26/
Real estate: http://chroniclingamerica.loc.gov/lccn/sn84020274/1903-11-04/ed-1/seq-7/

Death notices

(15 sec) Types of Information: Death Notices
At times we find the unfortunate truth about our ancestors.
Anecdote: a Ms. Lanham was researching her family and finally solved the mystery of why there was so little information about her great-great-grandfather's death. He was killed by her great-uncle.

J.P. Osterhout children

Bellville Countryman, 1861
Texas Countryman, 1868
(10 sec)

J.P. Osterhout (1826-1903)

Fort Worth Gazette, 1891
Fort Worth Gazette, 1889
(10 sec)

J.P. Osterhout children

Belton Evening News, 1918
Sherman Democrat, 1903
(10 sec)

Technology & Standards

The next part of the presentation provides an overview of the basic technology and the standards used for Chronicling America and The Portal to Texas History, because knowing how a system is organized and how its parts connect gives one the tools to conduct a more effective search. Many of the basic concepts that we're covering today can be applied to other digital library systems.

Technology & Standards

Page Formats
  JPEG
  JP2
  PDF
  OCR Text
Metadata
  Title
  Issue Date
  Geographic Coverage
Application Programming Interface
  Directory searching
  Links to title, issues, pages
  Linked data
Optical Character Recognition
  Scanning
  OCR

I will discuss metadata and the part it plays in the digital library system.

UNT DPU
Model: Mekel Mach V Microfilm Scanner

Manufacturer: Mekel Technology, San Dimas, California
Features: 8,192 pixel CCD array; automatic gain control; fiber optic lighting; accepts 16mm/35mm, simplex/duplex, positive/negative, silver, diazo, and vesicular formats in 100', 215', and 1000' rolls; 100-600 true optical dpi

How we use it: scanning microfilm for the Texas Digital Newspaper Program and other initiatives
Example scans:
Breckenridge American, Vol. 41, No. 48
The Hereford Brand, Vol. 8, No. 27

Metadata

Metadata enhances information retrieval within a system and between systems.

Descriptive metadata is used to describe an individual item and provides such information as creator, publisher, contents, size, relationship to other resources, and more.

Metadata may also contain "preservation" components that help us to maintain the integrity of digital files over time.

Metadata set in a Resource Description Framework (RDF) supports open access and linked data.

(30 sec) Interactivity: To start off, tell me what you know about metadata. What comes to mind when I say "metadata schema"? Any examples?

Digital library systems such as Chronicling America and The Portal to Texas History use metadata to describe and organize the digital content.

Descriptive metadata: the creator, date, subject, coverage
Administrative metadata: digital file type, size, and unique identifier for the system
Resource Description Framework (RDF): supports interoperability

Dublin Core Elements for descriptive metadata
1. Title
2. Subject
3. Description
4. Type
5. Source
6. Relation
7. Coverage
8. Creator
9. Publisher
10. Contributor
11. Rights
12. Date
13. Format
14. Identifier
15. Language

Title
Subject
Description
Resource Type: article, book, photograph
Source: reference to where the resource was derived, for example a scanned page within a book
Relation: reference to a related source, such as an image item to a text item
Coverage: time period and/or location
Creator: one who is responsible for making the content (artist, author, photographer)
Publisher
Contributor: one who made contributions to the content or played a secondary role
Rights: statement of rights held over the resource
Date
Format: sound file, text, image
Identifier: unique number assigned to an individual item
Language
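To make the fifteen elements concrete, here is a minimal Python sketch of a descriptive record restricted to Dublin Core element names. The helper name `make_record` and the dictionary representation are assumptions for illustration, not how the Portal or Chronicling America store metadata. The title, place, date, and LCCN are taken from the Bourbon News example used later in this deck; the `type` and `language` values are illustrative guesses.

```python
# The 15 Dublin Core elements, in the order listed on the slide.
DUBLIN_CORE_ELEMENTS = (
    "title", "subject", "description", "type", "source",
    "relation", "coverage", "creator", "publisher", "contributor",
    "rights", "date", "format", "identifier", "language",
)


def make_record(**fields):
    """Build a descriptive record, rejecting names that are not Dublin Core elements.

    Hypothetical helper: every element is optional, and the result keeps
    the elements in the canonical order above.
    """
    unknown = sorted(set(fields) - set(DUBLIN_CORE_ELEMENTS))
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    return {name: fields[name] for name in DUBLIN_CORE_ELEMENTS if name in fields}


record = make_record(
    title="The Bourbon News",
    coverage="Paris, Ky.",
    date="1900-01-05",
    identifier="sn86069873",   # LCCN from the link-pattern example
    type="newspaper",          # illustrative value
    language="English",        # illustrative value
)

for element, value in record.items():
    print(f"{element}: {value}")
```

Restricting the keys to a fixed element list is what makes records from different institutions comparable, which is the interoperability point the notes make.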

(10 sec) On this newspaper record we see some of the core metadata elements, and to the left one has the option to view the metadata files in several formats.

(10 sec) UNT based its metadata schema on the Dublin Core Metadata Initiative.

Qualified Dublin Core

Dublin Core elements
Qualified Dublin Core

(30 sec) UNT took simple Dublin Core and locally qualified it. UNT Libraries uses qualified Dublin Core, so while promoting interoperability with widely accepted standards, UNT Libraries' metadata elements allow flexibility at the local level to integrate with existing and anticipated changes. This allows us to implement faceted searching.

Digitization Process
Optical Character Recognition
  Scanning
  OCR

Metadata is one of the key components of a digital library because it creates relationships between digital objects and optimizes searching.

But the most vital part of a digital library is the actual digital object and its accompanying files. How is the digital object made and organized?

Hand over to Kathleen to discuss the digitization process.

Digitization Process
Original sources
  Paper
    Quality: original, complete, clean
  Microfilm
    Quality: 1990s or later; master negative (first generation); original copies; density; reduction ratio
Scan Image
Digital Master
  Quality: 300-400 ppi; lossless (TIFF); grayscale; bi-tonal
Derivative Production
  JPEG2000
  PDF
  JPEG

Most large-scale projects scan from microfilm, not paper.
The quality of microfilm improved in the 1990s with the establishment and use of imaging standards.

OCR in the Process
Paper, Microfilm
Scan Image
Digital Master
Optimization for OCR
  High B&W contrast
  Grayscale to bi-tonal
  De-skew pages
  Smooth, round, sharpened character edges
OCR Software
  Analyze and break down page layout
  Analyze stroke edges of characters
  Match edges to pattern images
  Character decision
  Word matching in dictionary
  Confidence decision
OCR Text

The objective of de-skewing is to align the words horizontally.
Fonts and layouts
  Generally OCR software is quite capable
  Difficulties: handwriting, script, Gothic fonts
OCR post-processing
Dynamic thresholding: creating temporary bi-tonal images. Goal: produce better OCR results.
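The "grayscale to bi-tonal" step can be illustrated with a fixed threshold, which is the simplest form of the idea. This is a sketch only: OCR packages use more sophisticated dynamic thresholding, as the notes mention, and the function name, the 0-255 pixel range, and the default threshold of 128 are assumptions for illustration.

```python
def to_bitonal(gray_rows, threshold=128):
    """Convert 8-bit grayscale pixel rows (0-255) to bi-tonal rows.

    Returns 1 for white (pixel >= threshold) and 0 for black.
    A fixed global threshold is an assumption; dynamic thresholding
    would instead choose the threshold per region of the page.
    """
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in gray_rows]


# A tiny hypothetical 2x4 patch: dark ink on a light background.
patch = [
    [250, 240, 30, 245],
    [235, 20, 25, 230],
]

for row in to_bitonal(patch):
    print(row)
```

Lowering the threshold keeps faint strokes white and loses them; raising it darkens background noise. That trade-off is why a single fixed value rarely suits a whole reel of microfilm.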

Analyze page structure
  Blocks of text (columns), tables, images
  Lines: words, then characters
Comparison of characters with pattern images in database
  Analysis of stroke edges, discontinuities, background
  Best-guess character decision made
  Confidence rating encoded: ALTO standard for newspapers ranges from 0-9, 9 being very confident
Analysis at word level
  Built-in dictionaries consulted for word match
Human-mediated training to improve character recognition
  Expensive and not generally done in large-scale projects

Citation: Holley, R. (2009, March/April). How good can it get? Analyzing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine. Retrieved January 21, 2012 from http://www.dlib.org/dlib/march09/holley/03holley.html

OCR & Quality
What affects microfilm quality?
  Quality of printed newspaper
  Reduction ratio: lower is better (20x)
  Variation in density: narrow range is better (.2; .90-1.20)
    Measurement of light able to pass through film
Technically suitable film: can produce a 300-400 ppi digital image
Example: 400 ppi image
  Optical resolution of scanner: 8,000 ppi
  Microfilm reduction ratio needs to be 20x
  8,000 ppi / 400 ppi = 20:1

Source: http://finereader.abbyy.com/professional/ (ABBYY website, ABBYY FineReader)
What is OCR?
Optical Character Recognition (OCR) is a technology that enables conversion of images, received from a scanner or digital camera, and PDFs to editable and searchable text documents ready for editing, quoting, search, and archiving.
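The resolution arithmetic on the "OCR & Quality" slide (8,000 ppi / 400 ppi = 20:1) can be written out as a short Python sketch. The function names are hypothetical; the relationship assumed is simply that a scanner's optical resolution on the film, divided by the film's reduction ratio, gives the effective resolution relative to the original page.

```python
def effective_ppi(scanner_ppi, reduction_ratio):
    """Resolution relative to the original page when scanning reduced film."""
    return scanner_ppi / reduction_ratio


def max_reduction_ratio(scanner_ppi, target_ppi):
    """Largest reduction ratio that still yields target_ppi on the original."""
    return scanner_ppi / target_ppi


# Figures from the slide: an 8,000 ppi scanner and a 400 ppi target image.
print(max_reduction_ratio(8000, 400))   # 20.0, i.e. 20:1
print(effective_ppi(8000, 20))          # 400.0

# Film shot at a higher reduction ratio falls below the 300-400 ppi range.
print(effective_ppi(8000, 32))          # 250.0
```

This is why the slide says a lower reduction ratio is better: with the scanner fixed, every increase in reduction lowers the resolution recoverable from the film.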

Reference: Microfilm Selection for Digitization - NDNP
http://www.loc.gov/ndnp/guidelines/NEH_MicrofilmSelectionNDNP.pdf

Example: 12X or 12:1 means the size of the image on film will be 1/12th the actual size of the original document. Reduction should always be identified on preservation quality microfilm. In the event it is not, you can use the following formula to determine the reduction. Size of the original divided by the size of the frame = reduction.
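The formula above (size of the original divided by the size of the frame) is a one-line calculation. In this Python sketch the function name and the sample measurements are hypothetical; both sizes must be measured along the same dimension and in the same unit.

```python
def reduction_ratio(original_size, frame_size):
    """Reduction = size of the original / size of the image on film.

    Both arguments are the same dimension (for example, page height)
    in the same unit (for example, inches).
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    return original_size / frame_size


# Hypothetical measurements: a 24-inch-tall page imaged 2 inches tall on film.
ratio = reduction_ratio(24, 2)
print(f"{ratio:g}X")   # 12X, i.e. the image is 1/12th the original size
```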

Digitization parameters
  Bit depth: 1-bit, 8-bit, 24-bit
  Resolution: pixels per inch (ppi)
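Bit depth and resolution together determine how large an uncompressed page image is. The Python sketch below shows the arithmetic; the function name and the 18 x 24 inch page size are assumptions for illustration, and real files differ somewhat because of headers, row padding, and compression.

```python
def uncompressed_bytes(width_in, height_in, ppi, bit_depth):
    """Approximate uncompressed image size in bytes.

    pixels = (width * ppi) * (height * ppi); each pixel takes bit_depth bits.
    Ignores file headers and per-row padding.
    """
    pixels = round(width_in * ppi) * round(height_in * ppi)
    return (pixels * bit_depth + 7) // 8   # round up to whole bytes


# Hypothetical 18 x 24 inch newspaper page scanned at 400 ppi.
for depth in (1, 8, 24):
    size_mb = uncompressed_bytes(18, 24, 400, depth) / 1_000_000
    print(f"{depth:>2}-bit: {size_mb:.2f} MB")
```

Under these assumptions the 8-bit grayscale image is eight times the size of the 1-bit bi-tonal image, and 24-bit color is three times larger again.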

KRM: work this up

Include Interactivity

OCR Text: Cost v. Quality
Layout irregularities
  If inconsistent, cannot automate parameters
Training the OCR software
  Human mediation to confirm or correct best guesses of software
Segmenting articles (including continued articles)
  Requires additional resources
  Offered by fee-based archives
    The British Newspaper Archive
    The New York Times Archive

Layout irregularities affect the amount of human intervention needed; if the layout is not standard, parameters cannot be automated.
Training the OCR software (an ABBYY feature) means confirming or correcting its best guesses; human mediation is expensive.
Segmenting by article, including continued articles, requires additional resources (fee archives: BL and NYTimes).

Search: Metadata & OCR Text
OCR Text
Metadata

chroniclingamerica.loc.gov/lccn/sn86071264/1853-01-03/ed-1/seq-3

Application Programming Interface (API)

API: OpenSearch - Newspaper Pages
All searches start with protocol and server name: http://chroniclingamerica.loc.gov/
Search query example: Frederick Gardner, a Missouri governor
http://chroniclingamerica.loc.gov/search/pages/results/?andtext=frederick+gardner+missouri

Page search parameters
  andtext: the search query
  format: 'html' (default), 'json', or 'atom' (optional)
  page: for paging results (optional)

Examples:
/search/pages/results/?andtext=thomas
  search for "thomas", HTML response
/search/pages/results/?andtext=thomas&format=atom
  search for "thomas", Atom response
/search/pages/results/?andtext=thomas&format=atom&page=11
  search for "thomas", Atom response, starting at page 11
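The page search URLs above can be assembled programmatically. The following Python sketch only builds the URL strings from the three parameters listed on the slide (`andtext`, `format`, `page`); it makes no network request, the function name is hypothetical, and the live service may have changed since this presentation was given.

```python
from urllib.parse import urlencode

BASE_URL = "http://chroniclingamerica.loc.gov"
ALLOWED_FORMATS = {"html", "json", "atom"}   # as listed on the slide


def page_search_url(andtext, fmt=None, page=None):
    """Build a Chronicling America page-search URL (string only, no request)."""
    params = {"andtext": andtext}
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
        params["format"] = fmt
    if page is not None:
        params["page"] = page
    # urlencode turns spaces into '+', matching the slide's example.
    return f"{BASE_URL}/search/pages/results/?{urlencode(params)}"


print(page_search_url("frederick gardner missouri"))
print(page_search_url("thomas", fmt="atom", page=11))
```

Requesting `json` or `atom` instead of the default HTML is what makes the results usable by a script rather than a browser.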

OpenSearch description: Chronicling America Pages Search; "Search digital newspaper content from the Library of Congress"; UTF-8; http://loc.gov/favicon.ico

API: OpenSearch - Newspaper Pages

http://chroniclingamerica.loc.gov/search/pages/results/?andtext=frederick+gardner+missouri

API: Link to Titles, Issues, Edition, & Pages
http://chroniclingamerica.loc.gov/
The Bourbon news. (Paris, Ky.)
/lccn/sn86069873/    Title information
/lccn/sn86069873/issues/    Calendar view of issues
/lccn/sn86069873/issues/first_pages/    Browse first pages
/lccn/sn86069873/1900-01-05/ed-1/    1st available edition from 05JAN1900
/lccn/sn86069873/1900-01-05/ed-1/seq-3/    3rd available page from 05JAN1900
Applications:
  Bookmarks
  Share on other sites

Example: St. Louis Republic, 16SEP1893, page 3
http://chroniclingamerica.loc.gov/lccn/sn87052181/1893-09-16/ed-1/seq-3

Link pattern: LC long-term commitment

Link to titles, issues, editions, and pages

The Chronicling America Web site uses links that follow a straightforward pattern. You can use this pattern to construct links into specific newspaper titles, to any of its available issues and their editions, and even to specific pages. These links can be readily bookmarked and shared on other sites.

We are committed to supporting this link pattern over time, so even if we change how the site works, we will redirect any requests to the system using this specific pattern into the new site. We established redirect rules for links into the previous version of the site when we released a new version in early 2009, and we intend to sustain those rules.

The link pattern uses LCCNs, dates, issue numbers, edition numbers, and page sequence numbers.

Examples: The Bourbon news. (Paris, Ky.) 1895-19??

/lccn/sn86069873/
  title information for LCCN sn 86069873
/lccn/sn86069873/issues/
  calendar view of available issues
/lccn/sn86069873/issues/first_pages/
  browse first pages of available issues
/lccn/sn86069873/1900-01-05/ed-1/
  first available edition from January 5, 1900
/lccn/sn86069873/1900-01-05/ed-1/seq-3/
  third available page from first edition, January 5, 1900
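Because the link pattern is regular, the paths above can be generated from an LCCN, an issue date, an edition number, and a page sequence number. This Python sketch uses hypothetical function names and only formats strings; it does not check that a given issue or page actually exists on the site.

```python
from datetime import date

BASE_URL = "http://chroniclingamerica.loc.gov"


def title_path(lccn):
    return f"/lccn/{lccn}/"


def issues_path(lccn):
    return f"/lccn/{lccn}/issues/"


def first_pages_path(lccn):
    return f"/lccn/{lccn}/issues/first_pages/"


def edition_path(lccn, issue_date, edition=1):
    # Dates appear in the pattern as YYYY-MM-DD.
    return f"/lccn/{lccn}/{issue_date.isoformat()}/ed-{edition}/"


def page_path(lccn, issue_date, edition=1, seq=1):
    return f"{edition_path(lccn, issue_date, edition)}seq-{seq}/"


# The Bourbon news. (Paris, Ky.), third page of the first edition, January 5, 1900.
bourbon_news = "sn86069873"
print(BASE_URL + page_path(bourbon_news, date(1900, 1, 5), edition=1, seq=3))
```

A researcher who knows a title's LCCN and an approximate date can therefore jump straight to candidate issues without running a search first.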

File Formats
TIFF: uncompressed format; standard for scanned archival images
JPEG: lossy compression; used for non-archival purposes
PDF: created from TIFF or JPEG; Adobe Portable Document Format
JPEG2000: file extension .jp2; superior compression performance; used for archival purposes

JPEG Page Images

Formats: NDNP Guidelines
TIFF 6.0, 8-bit grayscale, 400 dpi
PDF derivative, 150 dpi
JPEG 2000, Part 1 (derivative for Web access)
ALTO-encoded, machine-readable text, XML files
  In column-reading order
  Created with OCR software
METS XML data objects describing newspaper issues, pages, and microfilm reels

Formats
Page Images
KRM: Merge with next page

Searching historical newspapers

Knowing the basic structure and organization of a digital library like Chronicling America and the Portal to Texas History, for example how metadata and OCR work, will help one conduct a more effective search.

Searching
Basic Search
  Maximum flexibility
  Targeted search
Advanced Search
  More control
Exploring or Browsing
  Overview of collections

Basic Search provides a quick, no-fuss way to search by keywords or phrases for items in the Portal, but you can also use it to create more sophisticated searches by combining terms, targeting your search, and/or limiting your search to particular types of items:
  for maximum flexibility
  to target your search to item records (metadata)
  to target your search to specific fields in item records

Advanced Search: more control

Exploring: a great way to get an overview of what's in the collection overall.
  Which states or counties are represented
  Which newspaper titles are included
  Total date range

Basic Search

No surname field

And is implicit

Phrase searching and quotation marks

Diacritics are romanized
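Since diacritics are romanized, a surname typed with accents and the same surname typed without them should be treated alike. How the search systems do this internally is not described here; the Python sketch below shows one common approach, Unicode decomposition followed by dropping the combining marks. The function name is hypothetical.

```python
import unicodedata


def romanize(text):
    """Strip diacritical marks by decomposing characters and dropping combining marks.

    This is one common technique, assumed here for illustration. Letters
    that have no decomposition (for example 'ø' or 'ß') pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Surnames as they might appear in German- or Czech-language Texas newspapers.
for name in ("Müller", "Dvořák", "Smith"):
    print(name, "->", romanize(name))
```

When searching OCR text it is still worth trying both spellings, because the OCR software may have misread an accented character in the first place.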

Often returns too many results

Advanced Search

Advanced Search provides easy-to-use fields that allow you to specify keywords and phrases and combine them without the need to remember special symbols.

Advanced Search also provides a number of options for limiting your search.

Advanced Search

Basic Search
  All states or select a state
  Range of years or one year
  Search words

Exploring

Explore a Collection

Browse: Serial Title

Browse Newspaper Issues

http://chroniclingamerica.loc.gov/lccn/sn83045555/

From the title page (About page):
  Calendar view
  All front pages
  First issue
  Last issue

Browse Newspaper Issues

Browse by Topic

http://www.loc.gov/rr/news/topics/topics.htm

Topics in Chronicling America
http://www.loc.gov/rr/news/topics/topics.html

Chronicling America provides free access to millions of historic American newspaper pages. Listed here are topics widely covered in the American press of the time. We will be adding more topics on a regular basis. To find out what's new, sign up for Chronicling America's weekly notification service, which highlights interesting content on the site and lets you know when new newspapers and topics are added. Users can use the icons at the lower-left side of the Chronicling America Web page to subscribe. If you would like to suggest other topics, use the Ask a Librarian contact form available on the Newspaper and Current Periodical Reading Room site. Dates show the approximate range of sample articles.

Using search results

Bowles-Perry Family Tree

http://trees.ancestry.com/tree/14333492/family

Gallery View: Results

Dealing with a lot of results:
  Limit by state

Viewing pages:
  Open page in separate tab
  Use persistent link to get page without highlighted text

List View: Results

Options
  Sort: Relevance, State, Title, Date
  Results per page: 20 or 50

Also discuss navigation options for the search results:
  By visible page numbers
  Using arrow(s)
  Jumping to a specific page

Print Search Results

Newspaper Pages: Print, Share, & Save

Use persistent link to get page without highlighted text
If slow to resolve clearly, try the PDF version

Print a Page

Share & Email

Identical features:
  Search results
  Newspaper pages

Save
  Pages
  Search Results

Newspaper Pages: View & Download

View OCR Text

View PDF

Download JPEG2000

Clip JPEG Image

Zoom in before you clip!

Search Results: List View

Search Results: Grid View

Search Results: Brief View

Limiting Search Results

Share a Page
http://texashistory.unt.edu/ark:/67531/metapth47935/?q=Bowles

Share Options
  Email
  Print
  Social Media

Download a Page: JPEG

Snip & Save an Article

Snipping Tool:
  PNG
  GIF
  JPEG
  HTML

Snip & Save an Article

Historical Newspapers: Source Lists
University of Pennsylvania: Penn Libraries - Historical Newspapers Online
http://gethelp.library.upenn.edu/guides/hist/onlinenewspapers.html
Library of Congress: Newspaper Archives/Indexes/Morgues
http://www.loc.gov/rr/news/oltitles.html
ICON: International Coalition on Newspapers
http://icon.crl.edu/digitization.htm
Cyndi's List
http://www.cyndislist.com/newspapers
United States Online Historical Newspapers
http://sites.google.com/site/onlinenewspapersite/Home/usa

Universities:
Example - University of Pennsylvania: Penn Libraries - Historical Newspapers Online http://gethelp.library.upenn.edu/guides/hist/onlinenewspapers.html
This table provides a list of historical U.S. newspapers that are available online at no cost. Newspapers available for free through Google News Historical Archives and Newspaperarchives.com are listed individually as I identify them. Newspapers available through Chronicling America and state digitization projects are usually listed as a group. For instance, under "Wyoming" I have not listed every newspaper digitized in the project but simply described what is available.

Libraries:
Example - Library of Congress: Newspaper Archives/Indexes/Morgues. Lists newspapers in four categories: (a) Archive sources on the Web, (b) US newspapers, (c) Morgues (US), and (d) International. http://www.loc.gov/rr/news/oltitles.html
Example - ICON: International Coalition on Newspapers - http://icon.crl.edu/digitization.htm
The International Coalition on Newspapers project develops strategies to preserve and improve access to newspapers from around the globe, working on issues including bibliographic access, copyright, and information dissemination. ICON was officially established in 1999 by 13 charter members and is based at the Center for Research Libraries. This page highlights and links to past, present, and prospective digitization projects of historic newspapers. The focus is primarily on digital conversion efforts, not full-text collections of current news sources.

Genealogy Sites:
Example - Cyndi's List http://www.cyndislist.com/newspapers
Example - United States Online Historical Newspapers http://sites.google.com/site/onlinenewspapersite/Home/usa

Digital Content Providers:
Example - Google News Archives (as of May 2011, no longer adding newspapers or enhancing access to the archive) http://news.google.com/newspapers

Thanks!

Presentation and resources: http://goo.gl/6rt7D

Missouri Republican

Libraries
  National
  State
  Academic
  Public
  Private
Historical Societies
  National
  State
  Local