PASIG 2019 LOCKSS Seminar: Technical Overview

Thib Guicherd-Callin – Technical Manager, LOCKSS Program
thib@cs.stanford.edu – github.com/thibgc

1. Origins of LOCKSS Technology in Research Libraries

2. LOCKSS Approach to Digital Preservation Threat Models

3. LOCKSS Polling and Repair Primer

4. LOCKSS Software in Motion

Research Libraries in the Paper Era

● Ownership model
● Many independent replicas
● Features
  ○ Disaster resistance
  ○ Disaster recovery
  ○ Tamper evident
  ○ Permanent access

Research Libraries in the Web Era

● Leasing model
● One master copy
● Misfeatures
  ○ Disaster resistance?
  ○ Disaster recovery?
  ○ Tamper evident?
  ○ Permanent access?

LOCKSS Technology in Response

● Re-establish ownership
● Inter-library collaboration
● Diversity
  ○ Geography
  ○ Hardware
  ○ Software
  ○ Organizational structure
  ○ Jurisdiction

Digital Preservation Threat Models

David S.H. Rosenthal, Thomas S. Robertson, Tom Lipkis, Vicky Reich, Seth Morabito. Requirements for Digital Preservation Systems: A Bottom-Up Approach. D-Lib Magazine, vol. 11, iss. 11, November 2005. DOI: 10.1045/november2005-rosenthal

Digital Preservation: Definition and Key Properties

The goal of a digital preservation system is that the information it contains remains accessible to users over a period of time much longer than the lifetime of individual storage media, hardware and software components.

● No single point of failure
● Media, hardware and software flow through as they fail or are replaced
● Regular audits frequent enough to keep probability of irrecoverable failure acceptable
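To illustrate why audit frequency matters, here is a rough sketch under a deliberately simplified model that is not from the talk. It assumes replicas are damaged independently, that each audit repairs all damage found, and that content is lost only when every replica is damaged within the same audit interval. The function name and parameters are hypothetical.

```python
def annual_loss_probability(annual_damage_p: float, replicas: int,
                            audits_per_year: int) -> float:
    """Chance of irrecoverable loss in one year under the toy model above.

    annual_damage_p: assumed probability that one replica is damaged in a year.
    replicas:        number of independent replicas.
    audits_per_year: how often damage is detected and repaired.
    """
    # Probability that a single replica is damaged within one audit interval.
    p_interval = 1 - (1 - annual_damage_p) ** (1 / audits_per_year)
    # Loss in an interval requires every replica to be damaged in that interval.
    loss_per_interval = p_interval ** replicas
    # Loss in the year means loss in at least one interval.
    return 1 - (1 - loss_per_interval) ** audits_per_year


if __name__ == "__main__":
    for audits in (1, 4, 12):
        print(audits, annual_loss_probability(0.1, 3, audits))
```

In this model, auditing more often or adding replicas both lower the loss probability. Real threats such as operator error or attack are correlated across replicas, so the numbers are illustrative only.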

Threat Taxonomy

● Media failure
● Hardware failure
● Software failure
● Communication errors
● Failure of network services
● Natural disaster

● Media and hardware obsolescence
● Software obsolescence

● Operator error
● Economic failure
● Organizational failure

● External attack
● Internal attack

LOCKSS Polling and Repair

● Landslide agreement: take no action (high confidence in outcome)
● Inconclusive agreement: take no action and raise alarm (low confidence in outcome)
● Landslide disagreement: seek repair and notify (high confidence in outcome)

Attacker's goal (stealth modification gap)
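The three outcomes above can be sketched as a small classifier. This is a minimal illustration, not the LOCKSS implementation: the 0.8 landslide threshold and the names `Outcome` and `classify_poll` are assumptions made for the example.

```python
from enum import Enum


class Outcome(Enum):
    LANDSLIDE_AGREEMENT = "take no action"
    INCONCLUSIVE = "take no action and raise alarm"
    LANDSLIDE_DISAGREEMENT = "seek repair and notify"


def classify_poll(agree: int, disagree: int, landslide: float = 0.8) -> Outcome:
    """Map a vote tally to one of the three outcomes.

    `landslide` is an assumed fraction of votes needed for a landslide;
    the talk does not give the real threshold.
    """
    total = agree + disagree
    if total == 0:
        raise ValueError("a poll needs at least one vote")
    if agree / total >= landslide:
        return Outcome.LANDSLIDE_AGREEMENT
    if disagree / total >= landslide:
        return Outcome.LANDSLIDE_DISAGREEMENT
    # Neither side has a landslide: low confidence, so alert an operator.
    return Outcome.INCONCLUSIVE


if __name__ == "__main__":
    for tally in [(5, 0), (3, 2), (0, 5)]:
        print(tally, classify_poll(*tally).name)
```

Only landslide results trigger an automatic response, so a tally that falls between the two thresholds raises an alarm rather than being silently accepted.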

[Diagram: six peers, P1 through P6, each holding a replica of content X]

The peers hold identical replicas of X. Peer P1 calls a poll on content X.

P1 asks P2, P3, P4, P5 and P6: "What is hash(X)?"

P2, P3, P4, P5 and P6 each reply: hash(X) = h1

P1: "P2, P3, P4, P5, P6 agreed with me on X"

Landslide agreement
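The poll just described can be sketched in a few lines. This is a toy model under stated assumptions: replicas are in-memory byte strings, SHA-256 stands in for whatever hash the real protocol uses, and `content_hash` and `call_poll` are hypothetical names. The actual LCAP protocol is considerably more involved.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash a replica; SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(data).hexdigest()


def call_poll(caller: str, replicas: dict[str, bytes]) -> tuple[list[str], list[str]]:
    """Compare the caller's hash of X with every other peer's hash of X."""
    own = content_hash(replicas[caller])
    agreed, disagreed = [], []
    for peer, data in replicas.items():
        if peer == caller:
            continue
        # Each peer answers "What is hash(X)?" from its own replica.
        if content_hash(data) == own:
            agreed.append(peer)
        else:
            disagreed.append(peer)
    return agreed, disagreed


# Six peers holding identical replicas of X, as in the diagram.
replicas = {f"P{i}": b"content X" for i in range(1, 7)}
agreed, disagreed = call_poll("P1", replicas)
print("agreed:", agreed, "disagreed:", disagreed)
```

With identical replicas every voter lands in `agreed`, which corresponds to the landslide agreement case.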

Peer P2 calls a poll on content X.

P2 asks P1, P3, P4, P5 and P6: "What is hash(X)?"

P1, P3, P4, P5 and P6 each reply: hash(X) = h1

P2: "P1, P3, P4, P5, P6 agreed with me on X"

Landslide agreement

Peer P1 incurs damage on content X. Peer P1 later calls a poll on content X.

P1 asks P2, P3, P4, P5 and P6: "What is hash(X)?"

P2, P3, P4, P5 and P6 each reply: hash(X) = h1

P1's damaged replica gives: hash(X) = h2

Landslide disagreement

Repair request

P1: "Help me repair X"

Repair

A peer that recorded "P1 agreed with me on X" sends P1 its copy of X.

The peers hold identical replicas of X.
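The damage and repair sequence can be sketched end to end. The rule that a peer only serves a repair to a requester it has previously seen agree is an assumption drawn from the "P1 agreed with me on X" caption. The `Peer` class, its methods, and the use of SHA-256 are hypothetical choices for the example, not the LOCKSS API.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class Peer:
    def __init__(self, name: str, data: bytes):
        self.name = name
        self.data = data
        self.agreed_with: set[str] = set()  # peers that matched us in our polls

    def call_poll(self, others):
        """Return (agreeing, disagreeing) peers and remember who agreed."""
        own = content_hash(self.data)
        agree = [p for p in others if content_hash(p.data) == own]
        disagree = [p for p in others if content_hash(p.data) != own]
        self.agreed_with.update(p.name for p in agree)
        return agree, disagree

    def serve_repair(self, requester):
        """Hand over our copy only to a peer that once agreed with us (assumed rule)."""
        return self.data if requester.name in self.agreed_with else None


def repair(damaged, candidates):
    """Ask candidates in turn; return the name of the peer that repaired us."""
    for peer in candidates:
        copy = peer.serve_repair(damaged)
        if copy is not None:
            damaged.data = copy
            return peer.name
    return None


peers = [Peer(f"P{i}", b"content X") for i in range(1, 7)]
p1, p2 = peers[0], peers[1]
p2.call_poll([p for p in peers if p is not p2])  # P2 records that P1 agreed
p1.data = b"content X (damaged)"                 # P1 incurs damage
_, disagree = p1.call_poll(peers[1:])            # landslide disagreement
helper = repair(p1, disagree)                    # P1: "Help me repair X"
agree_after, _ = p1.call_poll(peers[1:])         # replicas identical again
print("repaired by", helper, "agreeing peers:", len(agree_after))
```

In this run only P2 has called a poll, so P2 is the only peer with a record of P1 agreeing and is the one that supplies the repair.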

Functionality of the LOCKSS System

● Storage layer (POSIX file system → HDFS)
● Web crawler (LOCKSS Crawler)
● LOCKSS polling and repair protocol (LCAP)
● Metadata extraction and metadata database
● Web replay (ServeContent → OpenWayback, Pywb)

Evolution of the LOCKSS System

● Standalone Java software stack performing all functions, controlled via a Web user interface

● Standalone Java software stack performing all functions, controlled via a Web user interface and a limited set of Web services

● Suite of Java software components performing specialized functions, controlled via a Web user interface and REST Web services

Evolution of the LOCKSS Platform

● A dedicated physical machine with a read-only OpenBSD operating system and read-only configuration data, running the standalone LOCKSS software exclusively, with locally attached disk storage

● A physical or virtual machine with a Linux operating system, running the standalone LOCKSS software exclusively or non-exclusively, with locally attached or proximally available storage, with an optional local database

● A physical or virtual machine, running a set of Docker containers, with local or remote HDFS storage and database connections

Software Modernization Initiative

● "A successful 20-year-old open source codebase is still a 20-year-old codebase"
● Spring Framework
● Major refactoring
● Distribution as artifacts on Maven Central, Docker containers on Docker Hub
● Orchestration through Docker (Docker Swarm, Docker Stack)

Thank You

● Resources
  ○ LOCKSS Web site: lockss.org
  ○ LOCKSS Documentation Portal: lockss.github.io
● Software
  ○ LOCKSS at GitHub: github.com/lockss
  ○ LOCKSS at Maven Central: group ID org.lockss
  ○ LOCKSS at Docker Hub: hub.docker.com/u/lockss
● Communication
  ○ Twitter: twitter.com/lockss
  ○ Slack: tinyurl.com/slackjoinlockss

● Q&A