Archive-It Architecture Introduction April 3, 2006 Dan Avery Internet Archive


Page 1: Archive-It  Architecture Introduction

Archive-It Architecture Introduction

April 3, 2006

Dan Avery

Internet Archive

Page 2: Archive-It  Architecture Introduction

Archive-It Components

•Crawling

•User Interface

•Storage

•Playback

•Text Indexing

•Integration

Page 3: Archive-It  Architecture Introduction

Component Integration

Page 4: Archive-It  Architecture Introduction

Crawling

•Heritrix ( http://crawler.archive.org/ )

•Java application

•Open source (LGPL)

•Crawls for completeness/depth

•Highly configurable

Page 5: Archive-It  Architecture Introduction

Crawling - Distributed Crawling

•Heritrix Cluster Controller

•Java component - open source - developed by IA

•http://crawler.archive.org/hcc

•Provides proxy access to pool of Heritrix instances through JMX interface

•Provides crawler control and status

•Currently controlling 33 crawler instances on three commodity dual-Opteron machines; upper bound unknown
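
The idea of a single controller fronting a pool of crawlers can be sketched as follows. This is a minimal illustration only, not HCC's actual API: the `Crawler` and `CrawlerPool` classes and their methods are hypothetical, and the real controller reaches Heritrix instances over JMX rather than through in-process objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Crawler:
    """Hypothetical stand-in for one Heritrix instance reached over JMX."""
    name: str
    job: Optional[str] = None  # name of the running crawl job, if any

    @property
    def busy(self) -> bool:
        return self.job is not None


@dataclass
class CrawlerPool:
    """Hypothetical controller: one entry point in front of many crawlers."""
    crawlers: List[Crawler] = field(default_factory=list)

    def submit(self, job: str) -> Crawler:
        """Hand the job to the first idle crawler; fail if none is idle."""
        for crawler in self.crawlers:
            if not crawler.busy:
                crawler.job = job
                return crawler
        raise RuntimeError("no idle crawler available")

    def status(self) -> Dict[str, str]:
        """Report what each crawler is doing, keyed by crawler name."""
        return {c.name: c.job or "idle" for c in self.crawlers}


pool = CrawlerPool([Crawler(f"crawler-{i}") for i in range(3)])
pool.submit("weekly-crawl-A")
print(pool.status())
```

Callers only ever talk to the pool, so crawler instances can be added or removed without changing the code that submits jobs.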

Page 6: Archive-It  Architecture Introduction

Archive-It Web Application

• User Interface and Crawl Scheduling

• Gets seed URLs and crawl parameters from users

• Schedules new periodic crawls

• Talks to crawler pool through HCC

• Provides access, search, and crawl history UI
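
Scheduling periodic crawls amounts to generating future start times from a first run and a frequency. The sketch below assumes two frequency names, `daily` and `weekly`; these names and the `next_crawls` function are hypothetical, since the slides do not list the options the web application actually offers.

```python
from datetime import datetime, timedelta
from typing import List

# Hypothetical frequency names; the real application's options are not given.
FREQUENCIES = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def next_crawls(first: datetime, frequency: str, count: int) -> List[datetime]:
    """Return `count` start times, beginning at `first`, one period apart."""
    step = FREQUENCIES[frequency]
    return [first + i * step for i in range(count)]


for start in next_crawls(datetime(2006, 4, 3), "weekly", 3):
    print(start.date())
```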

Page 7: Archive-It  Architecture Introduction

Storage

•archive.org ARC repository

•custom Perl system

•simple storage on primary/backup pairs

•monthly MD5 digest verification

•robust, non-proprietary file format

•Alexandria (Egypt)/Amsterdam
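
Digest verification on a primary/backup pair can be sketched as below. The repository itself is described as a custom Perl system, so this Python version is only an illustration of the check: the function names and the `.arc` file names are hypothetical, and the files are created in a temporary directory for the demonstration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pair(primary: str, backup: str, recorded: str) -> bool:
    """True only if both copies still match the digest recorded at ingest."""
    return md5_of_file(primary) == recorded and md5_of_file(backup) == recorded


# Demonstration with throwaway files standing in for a stored ARC pair.
with tempfile.TemporaryDirectory() as workdir:
    primary = os.path.join(workdir, "primary.arc")
    backup = os.path.join(workdir, "backup.arc")
    for path in (primary, backup):
        with open(path, "wb") as handle:
            handle.write(b"example ARC payload")
    recorded = md5_of_file(primary)
    intact = verify_pair(primary, backup, recorded)
    with open(backup, "ab") as handle:
        handle.write(b"!")  # simulate corruption of the backup copy
    damaged = verify_pair(primary, backup, recorded)

print(intact, damaged)
```

Comparing each copy against a recorded digest, rather than only against each other, also catches the case where both copies have changed.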

Page 8: Archive-It  Architecture Introduction

Access

• Internet Archive Wayback Machine

• Replaying archived web pages since 2001

• Current IA version written in Perl and C, with components distributed across various machines

• Not open source, but open source beta (in Java) available now

Page 9: Archive-It  Architecture Introduction

Full-Text Indexing

•Nutch (http://nutch.org)

•NutchWAX (http://archive-access.sf.net) additions create and search indexes of stored ARC files

•Standard text search plus link analysis

•Can search by date instead of relevance, useful for individual archives
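
The difference between relevance ordering and date ordering can be shown with a small sketch. It does not use the Nutch or NutchWAX APIs: the `Hit` record and `rank` function are hypothetical, and sorting newest capture first is an assumption about what "by date" means here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Hit:
    """Hypothetical search hit: one archived capture of a URL."""
    url: str
    score: float      # relevance score from the text index
    captured: date    # date the page was crawled


def rank(hits: List[Hit], by_date: bool = False) -> List[Hit]:
    """Order hits by relevance (default) or by capture date, newest first."""
    if by_date:
        return sorted(hits, key=lambda h: h.captured, reverse=True)
    return sorted(hits, key=lambda h: h.score, reverse=True)


hits = [
    Hit("http://example.org/a", 0.9, date(2005, 11, 1)),
    Hit("http://example.org/b", 0.4, date(2006, 3, 15)),
    Hit("http://example.org/c", 0.7, date(2006, 1, 20)),
]
print([h.url[-1] for h in rank(hits)])
print([h.url[-1] for h in rank(hits, by_date=True)])
```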

Page 10: Archive-It  Architecture Introduction

Text Indexing Challenges

•Some parts are distributable, some are not

•Incremental indexing - goal of new crawls in index within 72 hours

•Working on a map/reduce version usable for Archive-It - July

•In the meantime, a lot of workarounds

Page 11: Archive-It  Architecture Introduction

Integration

•Group of Perl and bash scripts - planning more complex than the execution

•Most components available individually

•Decentralized control, centralized monitoring

•Each component operates almost entirely independently

Page 12: Archive-It  Architecture Introduction

The Big Picture

Page 13: Archive-It  Architecture Introduction

Future Challenges

•Crawler trap detection

•Scalability

•Current setup can accommodate 300 partners at current crawling rates

•During pilot we crawled/indexed/stored just over 100,000,000 documents (~4TB) in eight weeks

•More machines can be easily added to storage and crawling clusters
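
The pilot figures above imply some rough average rates, worked out below. The only inputs are the numbers on the slide; treating "~4TB" as 4 × 10^12 bytes and "eight weeks" as 56 full days of continuous operation are assumptions.

```python
documents = 100_000_000        # "just over 100,000,000 documents"
total_bytes = 4 * 10**12       # "~4TB", assuming decimal terabytes
days = 8 * 7                   # eight weeks, assumed continuous
seconds = days * 86_400

avg_doc_kb = total_bytes / documents / 1000
docs_per_day = documents / days
docs_per_second = documents / seconds
mb_per_second = total_bytes / seconds / 10**6

print(f"average document size: {avg_doc_kb:.0f} KB")
print(f"documents per day:     {docs_per_day:,.0f}")
print(f"documents per second:  {docs_per_second:.1f}")
print(f"sustained throughput:  {mb_per_second:.2f} MB/s")
```

Under those assumptions the pilot averaged about 40 KB per document and roughly 20 documents per second across crawling, indexing, and storage.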

Page 14: Archive-It  Architecture Introduction

Scalability

•Current Nutch is between versions

•Old version has some non-distributable pieces

•New version is much more distributable and scalable (map/reduce - Hadoop), but not ready for incremental indexing
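
Why a map/reduce formulation distributes well can be seen in a toy inverted-index build. This is a single-process sketch, not Hadoop or Nutch code: `map_phase` and `reduce_phase` are hypothetical names, and a real job would run many mappers in parallel and group the emitted pairs by term before reducing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Emit one (term, doc_id) pair per distinct term in a document.

    Each document is processed independently, so this step can be
    spread across any number of machines.
    """
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group pairs by term and build each term's sorted posting list."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


documents = {
    "doc1": "web archive crawl",
    "doc2": "archive index",
}
pairs = [
    pair
    for doc_id, text in documents.items()
    for pair in map_phase(doc_id, text)
]
index = reduce_phase(pairs)
print(index["archive"])
```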

Page 15: Archive-It  Architecture Introduction

Looking ahead

•After basic UI/archiving/indexing...

•Time-based search UI

•Analyzing archives for research and ongoing collection improvement

•Content classification

•Rate of change

•New site suggestions

Page 16: Archive-It  Architecture Introduction

http://www.archive-it.org

Page 17: Archive-It  Architecture Introduction

RLG’s Web Archiving Program

•Collaborative collection development.

•Descriptive metadata for web archives.

•Usability/user studies

•Intellectual property concerns

•Web Archiving 101

•Web archiving services and software