The Future of The Past
The New York Times and the Challenge of Archives
Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler
The New York Times
@nytarchives (us)
A Problem of Archives
“How do you faithfully represent information created with one technology using another?”
A Problem We Know Well
• Migrating The Index to The Times Information Bank
• Migrating The Microfilm Archive to TimesMachine
• Migrating Legacy Web Content to Modern Online Presentation (or the challenge of multiple legacy formats)
The Problem By The Numbers
Almost 60,000 Issues Published Since September 18, 1851
3,500,000+ Unique Pages Printed Since September 18, 1851
15,000,000+ Articles Published Since September 18, 1851
Digital Archives
[Figure: archive coverage timeline divided into the periods 1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, and 1996-2016; labels: Full Text NYT5, Full Text NYT4, Abstracts NYT4, Abstracts NYT5]
The New York Times Information Bank
The Index
Evan Sandhaus
The New York Times Company Archives
TimesMachine
The Deep Archive
[Chart: x-axis years 1851 to 2012, y-axis 0 to 180,000; series: Scanned Articles, Digital Articles, Blogs; annotated ≈75% and ≈25%]
The Numbers
46,592 Issues Published Since September 18, 1851
2,335,446 Unique Pages Printed Since September 18, 1851
11,298,320 Articles Published Since September 18, 1851
The Scanned Archive
Headline
CROWD ROARS THUNDEROUS WELCOME; Breaks Through Lines of Soldiers and Police and Surging to Plane Lifts Weary Flier from His Cockpit AVIATORS SAVE HIM FROM FRENZIED MOB OF 100,000 Paris Boulevards Ring With Celebration After Day and Night Watch -- American Flag Is Called For and Wildly Acclaimed
Lede Paragraph
PARIS, May 21. -- Lindbergh did it. Twenty minutes after 10 o'clock tonight suddenly and softly there slipped out of the darkness a gray-white airplane as 25,000 pairs of eyes strained toward it. At 10:24 the Spirit of St. Louis landed and lines of soldiers, ranks of policemen and stout steel fences went down before a mad rush as irresistible as the tides of the ocean.
“Dirty” ASCII
…Lifte Fro'm His Cockpit. As he was lifted to the ground Lindbergh w as l,-:k:, :::. - hair unkempt, he looked completely worn out. lle h-:: strength enough, however, to smile, and waved his hand to t? ' crowd. Soldiers with fixed bayonets were unable to keep bach the crowd. United States Ambassador Herrick was among the first to welcome and congratulate the hero.s…
Indexing Metadata
• Headings: People, Places, Organizations, Subject
• Abstracts: Concise summary of the facts in the article
Demo
TimesMachine Version 2.0
Archive Transcription
The Problem
• As a subscriber exclusive, TimesMachine does not appear in Google Search results.
• Lack of full text before 1980 makes it difficult for archive articles to rank, or even appear, in Google results.
• For example, in 1945 The Times published 161,961 articles, and only a tiny fraction appear in Google results.
The Solution
• Transcribe articles from archival scans and publish these assets as searchable pages on nytimes.com.
• Transcribe and publish 1964 as a pilot.
• If that works, transcribe and publish all remaining articles between 1960-1980.
Progress & Results
• All articles between 1960-1980 transcribed.
• All articles between 1970-1979 available on nytimes.com with more to come.
• Google now indexing 672,500 new assets published between 1970-1979!
• Plans to publish 1960-1969, and to monitor performance of new pages.
Online Archive Modernization
Archival Content on NYTimes.com
The Initial Solution
print data (XML) → new format for CMS (JSON)
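A minimal sketch of what a print-XML-to-JSON conversion could look like. The XML element names (`article`, `headline`, `byline`, `body/p`) and the JSON keys are hypothetical placeholders chosen for illustration; the actual archive schema and CMS format are not shown in these slides.

```python
import json
import xml.etree.ElementTree as ET


def article_xml_to_json(xml_text: str) -> str:
    """Convert one article in a hypothetical print-archive XML layout to JSON."""
    root = ET.fromstring(xml_text)
    record = {
        # findtext returns None when an element is absent, so fall back to "".
        "headline": (root.findtext("headline") or "").strip(),
        "byline": (root.findtext("byline") or "").strip(),
        # One list entry per <p> under <body>; empty paragraphs become "".
        "paragraphs": [(p.text or "").strip() for p in root.iterfind("body/p")],
    }
    return json.dumps(record, ensure_ascii=False)


sample = """
<article>
  <headline>Example Headline</headline>
  <byline>By A Reporter</byline>
  <body><p>First paragraph.</p><p>Second paragraph.</p></body>
</article>
"""
converted = json.loads(article_xml_to_json(sample))
```

Keeping the body as a list of paragraphs rather than one string is a design choice of this sketch, not something the slides specify.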
The Case Of The Missing Articles
web data (HTML) + print data (XML) → new format for CMS (JSON)
1. What is the complete list of article URLs from 1996-2006?
2. How do we identify which of the missing web articles correspond to existing print articles so that we can combine them and avoid duplicate content?
3. Which articles are web-only and not in our print archive at all, and how do we scrape that page for content & metadata?
4. Can we build a system that will process all the data for each year easily & efficiently?
The Definitive List of Articles
4 different sources:
1. Print archive
2. Site analytics (from the past 6 months)
3. Movie, theater, and restaurant reviews
4. Sitemaps
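Combining several sources into one list is essentially a normalized set union. The sketch below assumes each source is already available as an iterable of URL strings; the normalization rules (dropping the scheme, a leading `www.`, the query string, and a trailing slash) are assumptions for illustration, not the rules the team actually used.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a host + path key so the same article compares equal."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Query string and fragment are dropped (assumption: they never
    # distinguish two different articles).
    return host + parts.path.rstrip("/")


def definitive_list(*sources):
    """Union any number of URL sources into one sorted, de-duplicated list."""
    seen = set()
    for source in sources:
        for url in source:
            seen.add(normalize(url))
    return sorted(seen)


# Hypothetical example inputs standing in for two of the four sources.
print_archive = ["http://www.nytimes.com/2004/01/01/a.html"]
analytics = [
    "https://nytimes.com/2004/01/01/a.html?ref=home",
    "http://www.nytimes.com/2004/01/02/b.html",
]
urls = definitive_list(print_archive, analytics)
```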
The Archive Migration Pipeline For A Given Year
[Diagram: pipeline stages and intermediate outputs]
• archive XML
• definitive list of URLs
• extracted URLs
• missing URLs
• missing HTML
• URLs with no article body
• XML to HTML matches
• unmatched HTML
• JSON from XML and HTML
• JSON from unmatched HTML
• skipped files
• JSON with no duplicate
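The first few stage labels suggest a set difference followed by a fetch step. The sketch below is one possible reading, assuming "extracted URLs" means the URLs found in the archive XML and "missing URLs" means the definitive list minus those. The `fetch_body` callable is a hypothetical stand-in for whatever retrieves and parses a page; here it is injected so the example runs without network access.

```python
def find_missing(definitive, extracted):
    """URLs on the definitive list that the archive XML does not cover."""
    return set(definitive) - set(extracted)


def partition_missing(missing, fetch_body):
    """Split missing URLs into pages with an article body and pages without.

    fetch_body(url) is assumed to return the article body text, or an
    empty string / None when the page has no usable article body.
    """
    missing_html = {}
    no_article_body = []
    for url in sorted(missing):
        body = fetch_body(url)
        if body:
            missing_html[url] = body
        else:
            no_article_body.append(url)
    return missing_html, no_article_body


# Hypothetical data: a dict lookup plays the role of the page fetcher.
pages = {"/2004/b.html": "Body of article B.", "/2004/c.html": ""}
missing = find_missing(
    definitive={"/2004/a.html", "/2004/b.html", "/2004/c.html"},
    extracted={"/2004/a.html"},
)
missing_html, no_article_body = partition_missing(missing, pages.get)
```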
The Archive Migration Pipeline
2004 Articles (116K total)
• Print Archive (56K): 48.3%
• Print Archive and Web (42K): 36.2%
• Web-only (15K): 12.9%
• Bad URLs (3K): 3%
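As a quick arithmetic check, the percentages follow from the rounded counts on this slide. The snippet only uses the numbers shown; the variable names are illustrative.

```python
# Counts in thousands, as printed on the slide.
counts_k = {
    "Print Archive": 56,
    "Print Archive and Web": 42,
    "Web-only": 15,
    "Bad urls": 3,
}
total_k = sum(counts_k.values())  # 116, matching "116K total"

# Share of each category, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total_k, 1) for name, n in counts_k.items()}
```

The first three shares come out to 48.3, 36.2, and 12.9; the last is 2.6 from the rounded counts, which the slide shows as 3%.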
All The Little Things…
• 1996
• Article Matching
• Better URLs
• Quality Assurance
• Next Steps
Article Matching: Fusion
Fusion Explained
web data (HTML)
print data (XML)
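The slides do not spell out how matching and fusion work, so the following is only a sketch of one common approach: compare the two article bodies by Jaccard similarity over word shingles, and merge the records when the score clears a threshold. The record keys, the threshold of 0.8, and the rule that print fields win on conflict are all assumptions made for illustration.

```python
from __future__ import annotations

import re
from typing import Optional


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word shingles from lowercased text with punctuation dropped."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def fuse(print_rec: dict, web_rec: dict, threshold: float = 0.8) -> Optional[dict]:
    """Merge a print record and a web record if their bodies look alike."""
    score = jaccard(shingles(print_rec["body"]), shingles(web_rec["body"]))
    if score < threshold:
        return None
    # Web fields fill gaps; print fields win on conflict (assumption).
    return {**web_rec, **print_rec}


print_rec = {"headline": "Fox Jumps", "body": "The quick brown fox jumps over the lazy dog"}
web_rec = {"url": "/2004/fox.html", "body": "the quick brown fox jumps over the lazy dog."}
fused = fuse(print_rec, web_rec)
```

Shingling tolerates small differences in punctuation and casing between the print and web versions, which is why it is used here instead of exact string comparison.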
Search Engine Optimization
27iht-scoutus.t.html
curb-violates-free-speech-supreme-court-rules-72-justices-void-internet.html
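The longer filename reads like words taken from a headline, with punctuation removed and spaces turned into hyphens. A minimal sketch of such a slug function follows; the `slugify` name, the 80-character cap, and the exact cleanup rules are assumptions, not the scheme actually used on nytimes.com.

```python
import re
import unicodedata


def slugify(headline: str, max_len: int = 80) -> str:
    """Build a hyphenated, lowercase filename from a headline."""
    # Fold accented characters to their closest ASCII form.
    ascii_text = (
        unicodedata.normalize("NFKD", headline)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Delete punctuation outright (so "7-2" becomes "72"), then hyphenate words.
    cleaned = re.sub(r"[^a-z0-9\s]", "", ascii_text.lower())
    slug = "-".join(cleaned.split())
    if len(slug) > max_len:
        # Cut at the limit, then back up to the last hyphen so no word is split.
        slug = slug[:max_len].rsplit("-", 1)[0]
    return slug + ".html"


example = slugify("Supreme Court Rules, 7-2")
```

Deleting punctuation instead of replacing it with hyphens is what turns a score such as "7-2" into "72", consistent with the filename shown on the slide.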
The Case Of The Missing Sections
Next Steps
[Figure: archive coverage timeline divided into the periods 1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, and 1996-2016; legend: Full Text, No Full Text]
Next Steps
Photos
Next Steps
Digital preservation
To Conclude…
Thank You!
Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler
The New York Times