Document Content Analysis for Digital Archives Eric Saund Perceptual Document Analysis Area Intelligent Systems Laboratory Palo Alto Research Center

Page 1: Document Content Analysis for Digital Archives

Document Content Analysis for Digital Archives

Eric Saund

Perceptual Document Analysis Area

Intelligent Systems Laboratory Palo Alto Research Center

Page 2: Document Content Analysis for Digital Archives

Digital Archives

Tasks

- casual browsing
- look up information
- follow trails
- compose narratives
- form and organize collections
- distribute
- assemble timelines

Operations

- browse by topic, type, etc.
- search for known items
- search for items meeting criteria
- find duplicate items
- find similar items
- follow links
- establish links
- apply logical rules
- edit metadata

All enabled by Metadata

Content layer

Metadata layer

Index

Page 3: Document Content Analysis for Digital Archives

Metadata

Two major problems with metadata:

1. Extracting metadata from raw content items.

2. Metadata is always incomplete for some purposes.

Title: Sarix neob
Date: 37-23-55
Media: niobium
Format: jnb
Author: Rsi Liwer
Text: “aliirn xeca sarlia isyb...”
Index ID: 34962s

pointer to item

Metadata as a static record

computeSimilarityTo()
containsEntity?()
fitsSlotInModel?()
extractTextAfterImageCleanup()

Metadata as an interface

functions applied to item content

Automatic Content Analysis
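
The contrast between the two views of metadata can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the class names, the `item_path` field, and the Jaccard word-overlap measure are all hypothetical stand-ins, and the method names simply mirror `computeSimilarityTo()` and `containsEntity?()` from the slide.

```python
import re
from dataclasses import dataclass


@dataclass
class StaticRecord:
    """Metadata as a static record: fixed fields plus a pointer to the item."""
    title: str
    author: str
    index_id: str
    item_path: str  # pointer to item (hypothetical field name)


class ItemInterface:
    """Metadata as an interface: answers are computed from item content."""

    def __init__(self, text: str):
        self.text = text

    def _tokens(self) -> set:
        # Lowercased alphanumeric words found in the item text.
        return set(re.findall(r"[a-z0-9]+", self.text.lower()))

    def compute_similarity_to(self, other: "ItemInterface") -> float:
        # Jaccard overlap of word sets; a stand-in for a real similarity measure.
        a, b = self._tokens(), other._tokens()
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def contains_entity(self, name: str) -> bool:
        # Naive case-insensitive substring match, not real entity recognition.
        return name.lower() in self.text.lower()


invoice = ItemInterface("Sale invoice for GE Capital Corp, Danbury, CT")
memo = ItemInterface("Memo about the GE Capital sale")

similarity = invoice.compute_similarity_to(memo)
mentions_ge = memo.contains_entity("GE Capital")
```

A static record can only answer questions its fields anticipated. The interface version can answer new questions later, at the cost of running analysis over the item content each time.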

Page 4: Document Content Analysis for Digital Archives

State of the Art

• document image analysis

• photographic image analysis

• video/film analysis

• audio analysis

• web site analysis

text

appearance, layout

who, what, where, when

topics, entities

genre, category, functional roles

genre, scenes, who, what, ...

genre, speech/music, speaker ID, transcription

Page 5: Document Content Analysis for Digital Archives

APR 21 2004 17:38 FR ---- 203 749 4519 TO 4264 P.02/06 * 9STCapitalModularSpace SALE INVOICE _ jz5| g'" ni'idspace.com -IPage: 1 FAX TO:_ BILL TO: REMIT TO: ACCOUNT NO.: ;m11 GE Capital Corp10 Riverview Drive Danbury, CT 06810 PO NUMBER: per Chad LOCATION OF UNITS: SAME AS ABOVE UNIT NO.: 076613 SERIAL NO.: SM069A 26,351.00 4,;- UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 DOWN PAYMENT 0. 00 BUILDING DELIVERY0. 00 BUILDING DELIVERY 400.00 BLOCK AND LEVEL 0. 00 BLOCK AND LEVEL 2,100.00ANCHOR/TIE DOWN 780 00 DECKING 950. 00 / ELECTRICAL 1, 350. 00 / PLUMBING3, 025. 00 INSTALLATION SITE MANAGEMENT 1,100 00 SKIRTING- VINYL 1,360. 00TOTAL DUE THIS INVOICE 63,767.00

When OCR Works...
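
Even when OCR works, the result is a flat character stream, and any fields have to be recovered afterwards. The sketch below shows one simple way to pull a few fields out of an excerpt of the OCR text above using regular expressions. The function name `extract_fields` and the choice of patterns are assumptions made for illustration; the excerpt is copied from the OCR output with its noise intact and the middle omitted.

```python
import re

# Excerpt of the OCR output, noise included; the middle of the invoice is omitted.
ocr_text = (
    "UNIT NO.: 076613 SERIAL NO.: SM069A 26,351.00 4,;- "
    "UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 "
    "SKIRTING- VINYL 1,360. 00TOTAL DUE THIS INVOICE 63,767.00"
)


def extract_fields(text: str) -> dict:
    """Pull unit numbers, serial numbers, and the invoice total from OCR text."""
    units = re.findall(r"UNIT NO\.:\s*(\d+)", text)
    serials = re.findall(r"SERIAL NO\.:\s*(\w+)", text)
    total = re.search(r"TOTAL DUE THIS INVOICE\s*([\d,]+\.\d{2})", text)
    return {
        "unit_numbers": units,
        "serial_numbers": serials,
        # Drop thousands separators before converting to a number.
        "total_due": float(total.group(1).replace(",", "")) if total else None,
    }


fields = extract_fields(ocr_text)
```

Pattern matching of this kind only finds what the patterns anticipate. The second serial number, `SM0G9B`, may be an OCR misread given that the first is `SM069A`, and nothing in the text stream alone can confirm or correct that.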

Page 6: Document Content Analysis for Digital Archives

• Header alignment
• Graphical logo
• Font / Layout / Symbol Pattern of Fax ID Line
• Redacting markings
• Address block
• Repeated elements
• Hand-drawn graphical annotation
• Handwritten Textual Annotation
• Textual Field Indicator
• Tabular Layout
• Graphic separator
• Amount Field

How People See a Document

Category Type

Structural Elements and Relations

Relational Context

• Invoice

• Construction project

• Supplier relationship

• Inventory & materials management

• Bill

• Itemized purchase listing

• Annotated document

Page 7: Document Content Analysis for Digital Archives

Technology Ecology

Academia

• Computer Vision
• Document Recognition
• Information Retrieval
• Machine Learning
• Speech Recognition
• Natural Language
• Artificial Intelligence

Paying Customer:
• government
• industry

Characteristics:
• science-based
• toy problems
• fragile

Industry

• Document Imaging
• Transaction Processing
• Workflow Systems
• Database Vendors
• Business Software
• Business Process Outsourcing
• Advertising/Search

Paying Customer:
• businesses
• consumers
• government

Characteristics:
• engineering-based
• robust
• limited capabilities

Hobbyists

• museums
• schools
• local governments
• NGOs
• individuals
• startups
• boutique companies
• shoestring projects in Academia and Industry

Page 8: Document Content Analysis for Digital Archives

Wanted: A Hobby Project

Document Capture Station

+ Collection Comprehension Engine

Page 9: Document Content Analysis for Digital Archives

Collection Comprehension Engine

OCR


Document Structure Modeling

Document Collection Linking

Image Processing

Automatic Cataloging

Genre Tagging

Clustering

Classification

Visualization GUI

Page 10: Document Content Analysis for Digital Archives

Conclusion

The hobby stage brings together kindred spirits.