Sandhaus, Van Valkenburg, Cotler; NYT Technical Team: The Future of the Past

The Future of The Past

The New York Times and the Challenge of Archives

Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler

The New York Times, @nytarchives

(us)

A Problem of Archives

“How do you faithfully represent information created with one technology using another?”

A Problem We Know Well

• Migrating The Index to The Times Information Bank
• Migrating The Microfilm Archive to TimesMachine
• Migrating Legacy Web Content to Modern Online Presentation (or the challenge of multiple legacy formats)

The Problem By The Numbers

• Almost 60,000 issues published since September 18, 1851
• 3,500,000+ unique pages printed since September 18, 1851
• 15,000,000+ articles published since September 18, 1851

Digital Archives

Timeline figure: archive coverage by period (1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016). Legend: Full Text NYT5, Full Text NYT4, Abstracts NYT4, Abstracts NYT5.

The New York Times Information Bank

The Index

Evan Sandhaus

The New York Times Company Archives





TimesMachine

The Deep Archive

Chart: vertical axis from 0 to 180,000; horizontal axis years 1851 to 2012. Series: Scanned Articles, Digital Articles, Blogs. Annotations: ≈75%, ≈25%.

The Numbers

• 46,592 issues published since September 18, 1851
• 2,335,446 unique pages printed since September 18, 1851
• 11,298,320 articles published since September 18, 1851

The Scanned Archive

Headline

CROWD ROARS THUNDEROUS WELCOME; Breaks Through Lines of Soldiers and Police and Surging to Plane Lifts Weary Flier from His Cockpit AVIATORS SAVE HIM FROM FRENZIED MOB OF 100,000 Paris Boulevards Ring With Celebration After Day and Night Watch -- American Flag Is Called For and Wildly Acclaimed

Lede Paragraph

PARIS, May 21. -- Lindbergh did it. Twenty minutes after 10 o'clock tonight suddenly and softly there slipped out of the darkness a gray-white airplane as 25,000 pairs of eyes strained toward it. At 10:24 the Spirit of St. Louis landed and lines of soldiers, ranks of policemen and stout steel fences went down before a mad rush as irresistible as the tides of the ocean.

“Dirty” ASCII

…Lifte Fro'm His Cockpit. As he was lifted to the ground Lindbergh w as l,-:k:, :::. - hair unkempt, he looked completely worn out. lle h-:: strength enough, however, to smile, and waved his hand to t? ' crowd. Soldiers with fixed bayonets were unable to keep bach the crowd. United States Ambassador Herrick was among the first to welcome and congratulate the hero.s…

Indexing Metadata

• Headings: People, Places, Organizations, Subject
• Abstracts: Concise summary of the facts in the article

Demo

TimesMachine Version 2.0

Archive Transcription

The Problem

• As a subscriber exclusive, TimesMachine does not appear in Google Search results.

• Lack of full text before 1980 makes it difficult for articles to rank, or even appear, in Google results.

• For example, in 1945 The Times published 161,961 articles, and only a tiny fraction appear in Google results.

The Solution

• Transcribe articles from archival scans and publish these assets as searchable pages on nytimes.com.

• Transcribe and publish 1964 as a pilot.

• If that works, transcribe and publish all remaining articles between 1960 and 1980.

http://nytimes.com

Progress & Results

• All articles between 1960 and 1980 transcribed.

• All articles between 1970 and 1979 available on nytimes.com, with more to come.

• Google now indexing 672,500 new assets published between 1970 and 1979!

• Plans to publish 1960-1969 and to monitor performance of new pages.

Online Archive Modernization

Archival Content on NYTimes.com


The Initial Solution

print data (XML) → new format for CMS (JSON)

The Case Of The Missing Articles

web data (HTML) + print data (XML) → new format for CMS (JSON)

1. What is the complete list of article URLs from 1996-2006?

2. How do we identify which of the missing web articles correspond to existing print articles so that we can combine them and avoid duplicate content?

3. Which articles are web-only and not in our print archive at all, and how do we scrape those pages for content & metadata?

4. Can we build a system that will process all the data for each year easily & efficiently?

The Definitive List of Articles

4 different sources:

1. Print archive
2. Site analytics (from the past 6 months)
3. Movie, theater, and restaurant reviews
4. Sitemaps
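
Combining the four sources amounts to a set union over normalized URLs, after which the URLs not covered by the print archive fall out as a set difference. A minimal Python sketch, assuming each source has already been reduced to a list of URL strings; the `normalize_url` rules and all sample URLs are illustrative assumptions, not the rules the team used.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so variants of one page compare equal.

    Assumed rules (hypothetical): drop scheme, query, and fragment,
    lowercase the host, strip a leading "www." and any trailing slash.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def build_definitive_list(sources: dict[str, list[str]]) -> set[str]:
    """Union the normalized URLs from every source."""
    definitive: set[str] = set()
    for urls in sources.values():
        definitive.update(normalize_url(u) for u in urls)
    return definitive


# Made-up sample data, one entry per source named on the slide.
sources = {
    "print_archive": ["http://www.nytimes.com/2004/01/05/world/a.html"],
    "site_analytics": [
        "http://nytimes.com/2004/01/05/world/a.html?ref=home",
        "http://www.nytimes.com/2004/01/06/movies/b.html",
    ],
    "reviews": ["http://www.nytimes.com/2004/01/06/movies/b.html"],
    "sitemaps": ["http://www.nytimes.com/2004/01/07/arts/c.html/"],
}

definitive = build_definitive_list(sources)
extracted = {normalize_url(u) for u in sources["print_archive"]}
missing = definitive - extracted  # URLs not accounted for by the print archive
print(sorted(missing))
```

Normalizing before the union matters because the same page can appear in several sources under slightly different URLs.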

The Archive Migration Pipeline For A Given Year

Pipeline diagram (stages and intermediate outputs):

• archive XML
• definitive list of URLs
• extracted URLs
• missing URLs
• missing HTML
• URLs with no article body
• XML to HTML matches
• unmatched HTML
• JSON from XML and HTML
• JSON from unmatched HTML
• skipped files
• JSON with no duplicate
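
The labels in the diagram suggest a per-year flow: pull URLs out of the archive XML, subtract them from the definitive list, fetch HTML for what is missing, and set aside pages with no article body. The Python sketch below shows only that bookkeeping. The stage order is inferred from the labels rather than stated in the slides, and `run_year`, `fetch_html`, `has_article_body`, and the `unfetched` bucket are hypothetical names; fetching is stubbed out with a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class YearResult:
    missing_urls: set = field(default_factory=set)     # definitive minus extracted
    missing_html: dict = field(default_factory=dict)   # url -> fetched article HTML
    no_article_body: set = field(default_factory=set)  # fetched, but not an article
    unfetched: set = field(default_factory=set)        # nothing came back (assumed bucket)


def has_article_body(html: str) -> bool:
    """Hypothetical check: a page counts as an article if it has an <article> tag."""
    return "<article" in html


def run_year(extracted_urls, definitive_urls, fetch_html) -> YearResult:
    """Classify every URL that the archive XML does not already cover."""
    result = YearResult(missing_urls=set(definitive_urls) - set(extracted_urls))
    for url in sorted(result.missing_urls):
        html = fetch_html(url)  # returns the page HTML, or None on failure
        if html is None:
            result.unfetched.add(url)
        elif has_article_body(html):
            result.missing_html[url] = html
        else:
            result.no_article_body.add(url)
    return result


# Made-up pages standing in for real HTTP fetches.
pages = {
    "/2004/b.html": "<html><article>text</article></html>",
    "/2004/c.html": "<html>index page</html>",
}
extracted = {"/2004/a.html"}
definitive = {"/2004/a.html", "/2004/b.html", "/2004/c.html", "/2004/d.html"}

result = run_year(extracted, definitive, pages.get)
print(sorted(result.missing_html), sorted(result.no_article_body), sorted(result.unfetched))
```

Keeping each intermediate set around, rather than only the final output, makes it possible to audit a year's run stage by stage.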



The Archive Migration Pipeline

2004 Articles (116K total)

• Print Archive (56K): 48.3%
• Print Archive and Web (42K): 36.2%
• Web-only (15K): 12.9%
• Bad URLs (3K): 3%
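
The percentages follow directly from the category counts. A short Python check, using only the rounded figures from the slide (in thousands):

```python
# Category counts for 2004, in thousands, as given on the slide.
counts_k = {
    "Print Archive": 56,
    "Print Archive and Web": 42,
    "Web-only": 15,
    "Bad URLs": 3,
}

total_k = sum(counts_k.values())  # 116, matching "116K total"

# Share of each category as a percentage, rounded to one decimal place.
shares = {label: round(100 * n / total_k, 1) for label, n in counts_k.items()}

for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

Because the counts are rounded to the nearest thousand, the bad-URL share comes out at about 2.6% here, which the slide shows as 3%.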

All The Little Things…

• 1996
• Article Matching
• Better URLs
• Quality Assurance
• Next Steps

Article Matching: Fusion


Fusion Explained

web data (HTML)

print data (XML)
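
The slides do not spell out how a web page is paired with its print record. One plausible approach, sketched here in Python purely as an assumption, is to compare normalized headlines among articles with the same publication date and accept the best match above a threshold. The record fields, the `match_web_to_print` helper, the 0.9 threshold, and the sample headlines are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(headline: str) -> str:
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", headline.lower()).strip()


def match_web_to_print(web, print_records, threshold=0.9):
    """Return (matches, unmatched): web id -> print id, plus web ids with no match."""
    matches, unmatched = {}, []
    for w in web:
        best_id, best_score = None, 0.0
        for p in print_records:
            if p["date"] != w["date"]:
                continue  # only compare articles published on the same day
            score = SequenceMatcher(
                None, normalize(w["headline"]), normalize(p["headline"])
            ).ratio()
            if score > best_score:
                best_id, best_score = p["id"], score
        if best_id is not None and best_score >= threshold:
            matches[w["id"]] = best_id
        else:
            unmatched.append(w["id"])
    return matches, unmatched


# Made-up records for illustration.
web = [
    {"id": "w1", "date": "2004-01-05", "headline": "Mars Rover Sends Back First Photos"},
    {"id": "w2", "date": "2004-01-05", "headline": "Web Special: Interactive Map"},
]
print_records = [
    {"id": "p1", "date": "2004-01-05", "headline": "MARS ROVER SENDS BACK FIRST PHOTOS."},
    {"id": "p2", "date": "2004-01-06", "headline": "Web Special: Interactive Map"},
]

matches, unmatched = match_web_to_print(web, print_records)
print(matches, unmatched)
```

In pipeline terms, matched pairs would feed "JSON from XML and HTML", while unmatched pages would feed "JSON from unmatched HTML".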

Search Engine Optimization

27iht-scoutus.t.html

curb-violates-free-speech-supreme-court-rules-72-justices-void-internet.html
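
A descriptive file name like the `curb-violates-free-speech-…` example can be derived from an article's headline. The Python sketch below is one simple way to do that, not necessarily the rule The Times used: `slugify` and its `max_words` limit are hypothetical, and the sample headline is made up.

```python
import re


def slugify(headline: str, max_words: int = 12) -> str:
    """Turn a headline into a lowercase, hyphen-separated file name."""
    text = headline.lower()
    # Drop punctuation entirely, so "7-2" becomes "72".
    text = re.sub(r"[^a-z0-9\s]", "", text)
    words = text.split()[:max_words]
    return "-".join(words) + ".html"


print(slugify("Court Rules, 7-2; Justices Void Law"))
```

Dropping punctuation instead of replacing it with hyphens is consistent with the "72" in the example above, though the slides do not say how that slug was produced.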

The Case Of The Missing Sections

Next Steps

Timeline figure: archive coverage by period (1851-1859, 1860-1865, 1866-1949, 1950-1959, 1960-1969, 1970-1980, 1981-1995, 1996-2016), each marked as Full Text or No Full Text.

• Photos
• Digital preservation

To Conclude…

Thank You!

Evan Sandhaus, Sophia Van Valkenburg, Jane Cotler

The New York Times