Document Content Analysis for Digital Archives
Eric Saund
Perceptual Document Analysis Area
Intelligent Systems Laboratory, Palo Alto Research Center
Digital Archives
Tasks
- casual browsing
- look up information
- follow trails
- compose narratives
- form and organize collections
- distribute
- assemble timelines
Operations
- browse by topic, type, etc.
- search for known items
- search for items meeting criteria
- find duplicate items
- find similar items
- follow links
- establish links
- apply logical rules
- edit metadata
All enabled by Metadata
Content layer
Metadata layer
Index
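Several of the operations above (browse by topic, search for items meeting criteria, find duplicate items) can be read as queries over metadata records. The following is a minimal sketch under stated assumptions, not the system described in the talk: the `Record` fields, the sample data, and the function names are all hypothetical, and duplicates are detected only by a normalized title plus type.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical metadata record; the fields are illustrative only."""
    item_id: str
    title: str
    doc_type: str
    topics: frozenset = frozenset()


def build_index(records):
    """Browse by topic: map each topic to the IDs of items tagged with it."""
    index = defaultdict(set)
    for r in records:
        for topic in r.topics:
            index[topic].add(r.item_id)
    return index


def search(records, **criteria):
    """Search for items meeting criteria: every named field must match."""
    return [r for r in records
            if all(getattr(r, k) == v for k, v in criteria.items())]


def find_duplicates(records):
    """Find duplicate items: group IDs sharing a normalized title and type."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.title.strip().lower(), r.doc_type)].append(r.item_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Made-up sample data for illustration.
records = [
    Record("a1", "Sale Invoice", "invoice", frozenset({"construction"})),
    Record("a2", "sale invoice ", "invoice", frozenset({"construction", "billing"})),
    Record("a3", "Site Plan", "drawing", frozenset({"construction"})),
]
```

Each function depends only on the metadata layer and never touches item content, which is the sense in which these operations are "enabled by metadata".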
Metadata
Two major problems with metadata:
1. Extracting metadata from raw content items.
2. Metadata is always incomplete for some purposes.
Title: Sarix neob
Date: 37-23-55
Media: niobium
Format: jnb
Author: Rsi Liwer
Text: “aliirn xeca sarlia isyb...”
Index ID: 34962s
pointer to item
Metadata as a static record
computeSimilarityTo()
containsEntity?()
fitsSlotInModel?()
extractTextAfterImageCleanup()
Metadata as an interface
functions applied to item content
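The contrast between a static record and an interface can be sketched in code. This is an illustrative sketch only, assuming Python: `StaticRecord`, `ContentInterface`, and `TextItem` are hypothetical names, the method names are Python-style renderings of two of the function names on the slide, and word-overlap (Jaccard) similarity is a simple stand-in, not the analysis the talk has in mind.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class StaticRecord:
    """Metadata as a static record: fixed fields filled in once."""
    title: str
    author: str
    text: str
    index_id: str


class ContentInterface(Protocol):
    """Metadata as an interface: questions answered by computing on content."""

    def compute_similarity_to(self, other: "ContentInterface") -> float: ...

    def contains_entity(self, name: str) -> bool: ...


class TextItem:
    """Hypothetical item whose answers are derived from its text on demand."""

    def __init__(self, index_id: str, text: str):
        self.index_id = index_id
        self.text = text

    def _tokens(self) -> set:
        return set(self.text.lower().split())

    def compute_similarity_to(self, other: "TextItem") -> float:
        # Jaccard overlap of word sets; an assumed stand-in measure.
        a, b = self._tokens(), other._tokens()
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def contains_entity(self, name: str) -> bool:
        # Naive case-insensitive substring match, for illustration only.
        return name.lower() in self.text.lower()
```

With the interface, a question the cataloger never anticipated can still be answered later by applying a new function to the item content, which addresses the incompleteness problem of a fixed record.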
Automatic Content Analysis
State of the Art
• document image analysis
• photographic image analysis
• video/film analysis
• audio analysis
• web site analysis
text
appearance, layout
who, what, where, when
topics, entities
genre, category, functional roles
genre, scenes, who, what, ...
genre, speech/music, speaker ID, transcription
APR 21 2004 17:38 FR ---- 203 749 4519 TO 4264 P.02/06 * 9STCapitalModularSpace SALE INVOICE _ jz5| g'" ni'idspace.com -IPage: 1 FAX TO:_ BILL TO: REMIT TO: ACCOUNT NO.: ;m11 GE Capital Corp10 Riverview Drive Danbury, CT 06810 PO NUMBER: per Chad LOCATION OF UNITS: SAME AS ABOVE UNIT NO.: 076613 SERIAL NO.: SM069A 26,351.00 4,;- UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 DOWN PAYMENT 0. 00 BUILDING DELIVERY0. 00 BUILDING DELIVERY 400.00 BLOCK AND LEVEL 0. 00 BLOCK AND LEVEL 2,100.00ANCHOR/TIE DOWN 780 00 DECKING 950. 00 / ELECTRICAL 1, 350. 00 / PLUMBING3, 025. 00 INSTALLATION SITE MANAGEMENT 1,100 00 SKIRTING- VINYL 1,360. 00TOTAL DUE THIS INVOICE 63,767.00
When OCR Works...
Header alignment
Graphical logo
Font / Layout / Symbol Pattern of Fax ID Line
Redacting markings
Address block
Repeated elements
Hand-drawn graphical annotation
Handwritten Textual Annotation
Textual Field Indicator
Tabular Layout
Graphic separator
Amount Field
How People See a Document
Category Type
Structural Elements and Relations
Relational Context
• Invoice
• Construction project
• Supplier relationship
• Inventory & materials management
• Bill
• Itemized purchase listing
• Annotated document
Technology Ecology
Academia
• Computer Vision
• Document Recognition
• Information Retrieval
• Machine Learning
• Speech Recognition
• Natural Language
• Artificial Intelligence
Paying Customer:
• government
• industry
Characteristics:
• science-based
• toy problems
• fragile
Industry
• Document Imaging
• Transaction Processing
• Workflow Systems
• Database Vendors
• Business Software
• Business Process Outsourcing
• Advertising/Search
Paying Customer:
• businesses
• consumers
• government
Characteristics:
• engineering-based
• robust
• limited capabilities
Hobbyists
• museums
• schools
• local governments
• NGOs
• individuals
• startups
• boutique companies
• shoestring projects in Academia and Industry
A Hobby Project
Document Capture Station
+ Collection Comprehension Engine
Wanted:
Collection Comprehension Engine
OCR
Document Structure Modeling
Document Collection Linking
Image Processing
Automatic Cataloging
Genre Tagging Clustering
Classification
Visualization GUI
Conclusion
The hobby stage brings together kindred spirits.