ITTL.ppt- Information Technology & Telecommunications Laboratory Document Type Recognition and Content Summarization William Underwood Persistent Archives Testbed Working Meeting SDSC, La Jolla, CA Feb 17-18, 2005


ITTL.ppt-1

Information Technology & Telecommunications Laboratory

Document Type Recognition and Content Summarization

William Underwood

Persistent Archives Testbed Working Meeting

SDSC, La Jolla, CA

Feb 17-18, 2005

ITTL.ppt-2

Overview

• Information Extraction

• Machine learning and recognition of document types

• Content Extraction

• Summarization (Folder titles and Content Notes)

• FOIA Review

ITTL.ppt-3

Access Restriction Checker

[Figure: Access Restriction Architecture. Diagram labels recovered from the slide:]

• ARCHIVIST

• Interface Agent

• Community of Collaborating Intelligent Agents: Reader, Info Extractor, Document Typer, Record Typer, Profiler, FOIA/PRA Restriction Checker, Summarizer, Learner, Interaction Historian, Advisors

• Domain Knowledge: Office & Staff Names, Family & Friend Names, Lexical Knowledge

• Scenario Templates

• Ontologies: Political, Military, etc.

• Blackboard: Document; Document Context; ASCII version of Document; Marked up Document; Document Profile; Document Type; Archivist’s Annotations; Restrictions, Locations, Rationale; Questions to Archivists; Archivists’ Answers; Conclusions

• Control: Agenda

ITTL.ppt-4

Information Extraction

• Information extraction (IE) is a procedure that selects, extracts and combines data from text in order to produce structured information.

• The Named Entity (NE) task is to identify all named persons, organizations, locations, dates, times, and numeric expressions (monetary amounts and percentages) in text.
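As a rough illustration of the NE task, the sketch below tags a few entity types with hand-written patterns. The patterns, the `tag_entities` name, and the sample sentence are assumptions made for illustration; the slides do not say how the project's recognizer works, and a practical one would also need name lists and context rules.

```python
import re

# Illustrative patterns only (assumed, not taken from the slides).
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
PATTERNS = [
    ("date", re.compile(r"\b" + MONTHS + r" \d{1,2}, \d{4}\b")),
    ("money", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?")),
    ("percent", re.compile(r"\b\d+(?:\.\d+)?(?:%| percent\b)")),
    ("person", re.compile(r"\b(?:Mr\.|Mrs\.|Ms\.|Dr\.|President) [A-Z][a-z]+")),
]


def tag_entities(text):
    """Return (type, matched text, start offset) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((label, match.group(0), match.start()))
    return sorted(found, key=lambda item: item[2])


sample = ("On March 27, 1990, Mr. Allen asked President Bush about "
          "$1.5 million, or 12 percent of the budget.")
for label, value, _ in tag_entities(sample):
    print(label, "->", value)
```

Pattern matching of this kind only finds entities whose surface form is predictable; names without a title, for example, are missed entirely.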

ITTL.ppt-5

Letter From George Bush to Ronald Reagan

ITTL.ppt-6

Named Entity Recognition

ITTL.ppt-7

Content Extraction Tasks

• The Template Element (TE) task is to fill in templates about persons and organizations from an automatic analysis of text.

• The Scenario Template (ST) task is to fill in templates about events and their participants (persons, organizations, etc.) from an automatic analysis of text.
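One way to picture the two template kinds is as simple record types. The sketch below is only an assumed representation: the class names, field names, and the `person_from_tags` helper are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonTemplate:
    """Template element describing one person (field names are assumptions)."""
    name: str
    job_title: Optional[str] = None
    organization: Optional[str] = None


@dataclass
class ScenarioTemplate:
    """Scenario template describing one event and its participants."""
    action: str
    agent: Optional[PersonTemplate] = None
    patient: Optional[PersonTemplate] = None
    obj: Optional[str] = None


def person_from_tags(tags):
    """Build a PersonTemplate from (tag, text) pairs that belong to one person."""
    values = dict(tags)
    return PersonTemplate(
        name=values["person"],
        job_title=values.get("job title"),
        organization=values.get("organization"),
    )


signer = person_from_tags([
    ("person", "Doug Wead"),
    ("job title", "Special Assistant to the President for Public Liaison"),
])
print(signer)
```

A template element collects attributes of a single entity, while a scenario template links several such elements through the roles they play in an event.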

ITTL.ppt-8

Content Extraction Applied to Recognizing Request for Confidential Advice

ITTL.ppt-9

Content Extraction and Access Restriction Rules

Action: Request
Agent: Person
    Job_Title: President
Object: Analysis of the War Powers Resolution
Patient: C Boyden Gray
    Job_Title: Counsel to the President
Presidential_Advisor: C Boyden Gray

If Document(X), and
    Action(X) = Request, and
    Agent(X) = Y, and
    (Job_Title(Y) = President, or Presidential_Advisor(Y)), and
    Patient(X) = Z, and
    Presidential_Advisor(Z), and
    Object(X) = Information
Then Access_Restriction(X) = a(5).
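The rule above can be read as a predicate over a filled scenario template. The following is a minimal sketch under stated assumptions: the `Participant` and `RequestTemplate` classes and the `obj_type` field (the ontology class of the requested object, assumed to be assigned by an earlier step) are hypothetical, and the label `a(5)` is copied from the slide, where it appears to denote the restriction category for confidential requests for advice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    name: str
    job_title: Optional[str] = None
    presidential_advisor: bool = False


@dataclass
class RequestTemplate:
    action: str
    agent: Participant
    patient: Participant
    obj: str
    obj_type: str  # assumed: ontology class of obj, e.g. "Information"


def access_restriction(t):
    """Return "a(5)" when every condition of the slide's rule holds, else None."""
    agent_ok = t.agent.job_title == "President" or t.agent.presidential_advisor
    if (t.action == "Request"
            and agent_ok
            and t.patient.presidential_advisor
            and t.obj_type == "Information"):
        return "a(5)"
    return None


# The extracted template shown on the slide.
memo = RequestTemplate(
    action="Request",
    agent=Participant(name="Person", job_title="President"),
    patient=Participant(name="C Boyden Gray",
                        job_title="Counsel to the President",
                        presidential_advisor=True),
    obj="Analysis of the War Powers Resolution",
    obj_type="Information",
)
print(access_restriction(memo))
```

The rule depends on domain knowledge as much as on extraction: deciding that the recipient is a presidential advisor and that an analysis counts as information both require lookups outside the document text.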

ITTL.ppt-10

Some Document Types in Bush Presidential Electronic Records

• Agenda

• Biographical Information

• Briefing Memo

• Decision Memo

• Executive Order

• Information Memo

• White House Letter

• List of Candidates for Appointment to Federal Office

• Mailing List

• Minutes of Meeting

• Nomination for Appointment to Federal Office

• Press Release

• Resume

• Schedule

• Telephone Call Recommendation

ITTL.ppt-11

Document Type Recognition

• Convert document format to ASCII or HTML

• Use Information Extraction Technology to Mark Up Different Document Types

• Machine Learning of Document Type through Grammatical Inference

• Evaluate Performance

• Use for Recognizing Document Types of other Records

ITTL.ppt-12

Annotated White House Correspondence

<date>March 27, 1990</date>

<greeting>Dear</greeting><person>Mr. Allen</person>

<p>Thank you very much for your letter of <date>March 15, 1990</date> which

stated your concerns and suggestions regarding the Americans with Disabilities Act.</p>

<p>In order to fulfill <person>President Bush's</person> campaign promise of bringing

Americans with handicaps into the mainstream of American life, the

Bush Administration supports the objectives of the A.D.A.</p>

<p>As you may know, the bill is still in <organization>House Committee</organization>

for consideration and change. You can be sure that your thoughts have been

fully noted and are appreciated.</p>

<formula of respect>Sincerely,</formula of respect>

<person>Doug Wead</person>

<job title>Special Assistant to the President for Public Liaison</job title>

<address><person>Ray Allen</person>, <job title>President</job title>

<organization>American Cultural Traditions</organization>

<postal address>P.O. Box 1895</postal address>

<location>Washington, D.C.</location> <zipcode>20013</zipcode></address>
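The layout of a marked-up letter can be summarized by the sequence of its outermost elements. The sketch below shows one assumed way to obtain that sequence; the `top_level_tags` function is hypothetical, and a regular expression is used instead of an XML parser because tag names such as "formula of respect" contain spaces.

```python
import re

# Matches an opening or closing tag and captures its name.
TAG = re.compile(r"<(/?)([^<>/]+)>")


def top_level_tags(marked_up):
    """Return the names of elements that are not nested inside another element."""
    sequence = []
    depth = 0
    for closing, name in TAG.findall(marked_up):
        if closing:
            depth -= 1
        else:
            if depth == 0:
                sequence.append(name.strip())
            depth += 1
    return sequence


letter = (
    "<date>March 27, 1990</date>\n"
    "<greeting>Dear</greeting><person>Mr. Allen</person>\n"
    "<p>Thank you for your letter of <date>March 15, 1990</date>.</p>\n"
    "<formula of respect>Sincerely,</formula of respect>\n"
)
print(top_level_tags(letter))
```

The date inside the paragraph is skipped because it is nested; only the outer layout elements remain, which is the kind of string a layout grammar would describe.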

ITTL.ppt-13

Regular Grammar for the Layout of White House Correspondence

Letter → <date></date> A

A → <greeting></greeting> B

B → <p></p> B

B → <p></p> C

C → <formula of respect></formula of respect> D

D → <person></person> E

E → <job title></job title> F

F → <address></address>
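Because every production has a single terminal followed by at most one nonterminal, the grammar can be run directly as a finite automaton over a sequence of tag names. The sketch below is an illustration under assumptions: the `GRAMMAR` table and `accepts` function are hypothetical names, the input is assumed to be the list of outermost tags already extracted from a document, and a salutation name tagged separately after the greeting is assumed to be folded into the greeting, since the grammar has no production for it.

```python
# Each nonterminal maps to (terminal tag, next nonterminal) pairs.
# None as the next nonterminal marks the end of the letter.
GRAMMAR = {
    "Letter": [("date", "A")],
    "A": [("greeting", "B")],
    "B": [("p", "B"), ("p", "C")],
    "C": [("formula of respect", "D")],
    "D": [("person", "E")],
    "E": [("job title", "F")],
    "F": [("address", None)],
}


def accepts(tags, start="Letter"):
    """Simulate the grammar as a nondeterministic automaton over tag names."""
    states = {start}
    for tag in tags:
        states = {
            nxt
            for state in states if state is not None
            for terminal, nxt in GRAMMAR[state]
            if terminal == tag
        }
        if not states:
            return False
    return None in states


layout = ["date", "greeting", "p", "p", "p",
          "formula of respect", "person", "job title", "address"]
print(accepts(layout))
```

The two productions for B make the automaton nondeterministic: after each paragraph it may either expect another paragraph or move on to the closing, so the simulation tracks a set of possible states.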

ITTL.ppt-14

Scope and Content Note for John Sununu’s Files

These files contain correspondence from senior level staff in the Executive Office of the President, and from every member of the Cabinet. The material covers issues that faced the Bush Administration from 1989 to 1990, including abortion / fetal research, the Exxon Valdez oil spill, the savings and loan industry, the Clean Air Act, the White House Conference on Global Climate Change, relations with China following the student demonstrations in Tiananmen Square, the National Drug Control Strategy, the 1990 Bipartisan Budget Agreement, the spotted owl issue, the Americans with Disabilities Act, and the nomination of Supreme Court Justice David Souter. It includes correspondence, routine reports, press releases, press clippings, papers produced by organizations outside the Administration, and speech drafts.

ITTL.ppt-15

Relationship to Persistent Archives Testbed

• Information extraction, document type learning and recognition, and series summarization will be provided as Archival Services within the NARA Persistent Archives Prototype, and could be provided within the Persistent Archives Testbed (PAT).

ITTL.ppt-16

Additional Information

• http://perpos.gtri.gatech.edu

• Archival Processing Tools: User Manual

• An Analysis of the Knowledge Required to Perform FOIA and PRA Review, PERPOS Technical Report ITTL/CSITD 04-1, Mar 2004.

• PERPOS: Results of Laboratory Experiments and Use by Archivists, Nov 2003

• Recognizing Named Entities in Presidential Electronic Records, PERPOS Technical Report ITTL/CSITD 04-4, June 2004