LREC 2008, Marrakech Morocco - May 30 2008 New Resources for Document Classification, Analysis and Translation Technologies Stephanie Strassel, Lauren

LREC 2008, Marrakech Morocco - May 30 2008

New Resources for Document Classification, Analysis and Translation Technologies

Stephanie Strassel, Lauren Friedman, Safa Ismael, David Lee, Kazuaki Maeda, Linda Brandschain

{strassel, lf, safa, david4, maeda, brndschn}@ldc.upenn.edu

Linguistic Data Consortium

http://projects.ldc.upenn.edu/MADCAT


Presentation Outline

• MADCAT Program Overview
• Technology Challenges
• Roadmap
• Data Creation
  • Phase 1 Data Profile
  • Processing
  • Collection
  • Annotation
• Data Format
• Evaluation
• Conclusions and Future Work


MADCAT Overview

• MADCAT: Multilingual Document Classification, Analysis and Translation
• A 5-year DARPA program
• MADCAT technologies will convert foreign language document images into English text, enabling English speakers to extract, assess, and respond to information in a timely manner
• Multiple input types and domains
  • Hard-copy, PDF, camera-captured
  • Newspapers, letters, signs, graffiti, how-to manuals, memos, postcards, forms, diaries, ledgers, etc.


Technology Challenges

• Extract relevant metadata about the document structure
• Integrate and optimize page segmentation, metadata extraction, OCR and translation technologies
• Create end-to-end system for deployment at program's end with over 90% accuracy
  • Current baseline is ~2%
• Primary evaluation metric is edit distance: HTER
  • Same protocols as used in the GALE program
• Limited focus in Phase 1
  • Arabic > English
  • High resolution (600 dpi) images of handwritten newspaper and web text
  • Topics primarily news, current events and commentary
  • Manual segmentation provided


Roadmap

[Figure: program roadmap by phase. Columns: Pre-MADCAT (state of the art); Phase 1 (add handwriting); Phase 2-3 (new data types); Phase 4-5 (new genres, topics, quality conditions). Rows: Genre, Topic, Medium, Source Data Quality.]

• Genres shown: newswire, broadcast, talk shows, weblogs, letters, forms, ledgers, diaries, calendars, maps, poems, verdicts, personal identification, instructions, books, training manuals
• Topics shown: news, commentary, science, engineering, personal, religious, military, other
• Medium: printed, handwritten
• Source data quality: controlled, uncontrolled


Phase 1 Data Profile

• In Phase 1, data drawn from DARPA GALE program
  • New collection to acquire handwritten versions
  • Genres: Formal text (newswire) and informal text (weblogs)
• Benefits
  • Eliminates domain mismatch between GALE state of the art MT models and MADCAT test sets
  • Allows developers to focus on primary challenge: handwriting
  • Data characteristics well understood, cost and time factors are reasonably well known
  • Training data costs controlled since translations exist
  • Production begins immediately, training data available sooner
  • Provides controlled test sets for evaluation across programs
• Subsequent phases will add new data types, genres and other challenge elements


Training and DevTest

• Training
  • Minimum 2000 unique pages
    • Half formal (newswire), half informal (web text)
    • 100-250 words per page
  • Minimum 100 unique scribes in training pool
  • 5 scribes per page
  • At minimum 10,000 manuscripts (scribe-pages) in Phase 1 training set
• DevTest
  • 320 unique pages
    • Half formal (newswire), half informal (web text)
    • 125 words/page
  • 50 scribes in devtest pool
    • 25 from training, 25 previously unseen
  • 2 scribes per page, ~7 pages per scribe
  • Total of 640 manuscripts; 80,000 words


Evaluation Data

• 320 unique pages from GALE P3 Eval set
  • Half formal (newswire), half informal (web text)
  • 125 words/page
• 50 scribes in eval partition
  • 25 from training, 25 previously unseen
• 6 scribes per page, ~40 pages per scribe
• Total of 1920 manuscripts, 240,000 words
• Subset of eval set designated for pilot evaluation in September 2008
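The manuscript and word totals quoted for the training, devtest and evaluation partitions follow from unique pages × scribes per page × words per page. A quick check using only the figures on the slides; the `Partition` class and its field names are illustrative and not part of any LDC tooling:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Illustrative container for the per-partition figures on the slides."""
    name: str
    unique_pages: int
    scribes_per_page: int
    words_per_page: int  # nominal words per page

    @property
    def manuscripts(self) -> int:
        # One manuscript (scribe-page) per scribe per unique page.
        return self.unique_pages * self.scribes_per_page

    @property
    def handwritten_words(self) -> int:
        return self.manuscripts * self.words_per_page


devtest = Partition("devtest", unique_pages=320, scribes_per_page=2, words_per_page=125)
evaluation = Partition("eval", unique_pages=320, scribes_per_page=6, words_per_page=125)

for p in (devtest, evaluation):
    print(f"{p.name}: {p.manuscripts} manuscripts, {p.handwritten_words:,} words")

# Training: minimum 2000 unique pages with 5 scribes per page.
training_manuscripts = 2000 * 5
print(f"training: at least {training_manuscripts:,} manuscripts")
```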


Data Preparation

• Start with electronic text from GALE
  • Whole documents collected from newswire or web
  • Segmented into SUs (semantic/sentence units)
  • Each segment manually translated
• Pre-processing prior to handwriting
  • Tokenization to words for later stages
  • Segments reordered and formatting added to create optimal pages for handwriting assignment
    • Roughly 5 words/line to avoid line wrapping
    • No more than 25 lines/page to avoid page breaks
• After handwriting, images scanned at high resolution (600 dpi, greyscale)
• Images are ground truth annotated at line, word level
• Major challenge is logical storage of many layers of information across multiple versions of the same data
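The page formatting constraints (roughly 5 words per line, no more than 25 lines per page) can be sketched as a simple chunking step. This is a minimal illustration, not LDC's actual pre-processing: the function name `layout_pages` is hypothetical, the tokens are placeholders, and the sketch ignores segment boundaries and segment reordering, which the real process takes into account.

```python
from typing import List

WORDS_PER_LINE = 5       # "roughly 5 words/line" from the slide
MAX_LINES_PER_PAGE = 25  # "no more than 25 lines/page" from the slide


def layout_pages(tokens: List[str],
                 words_per_line: int = WORDS_PER_LINE,
                 max_lines: int = MAX_LINES_PER_PAGE) -> List[List[str]]:
    """Group tokens into pages; each page is a list of line strings."""
    # Cut the token stream into fixed-size lines.
    lines = [
        " ".join(tokens[i:i + words_per_line])
        for i in range(0, len(tokens), words_per_line)
    ]
    # Cut the lines into pages of at most max_lines lines.
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


tokens = [f"w{i}" for i in range(1, 131)]  # 130 placeholder tokens
pages = layout_pages(tokens)
print(len(pages), [len(page) for page in pages])
```

With these limits a full page holds 25 × 5 = 125 words, which matches the 125 words/page quoted for the devtest and evaluation sets.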


Collection

• New human subjects collection required to produce handwritten versions of existing data
• Pilot collection currently underway at LDC in Philadelphia
  • LDC Arabic staff and recent Iraqi immigrants in Philly
• Additional collections planned with partner sites in Lebanon, Morocco and possibly Egypt
  • Regional variety necessary to capture stylistic writing differences
    • E.g. use of Indic vs. Arabic numbers
• Assignment and tracking of data and scribes controlled through centralized LDC database and assignment protocol
  • Scribe partition (train only, test only, both)
  • Writing conditions
  • Regional variation
  • Genre, topic and source balance


Writing Conditions

• Implement
  • 90% ballpoint pen (I)
  • 10% pencil (P)
• Paper
  • 75% unlined white paper (U)
  • 25% lined paper (L)
• Writing speed
  • 90% normal (N)
  • 5% fast (F)
  • 5% careful (C)
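One way to assign writing conditions in these proportions is weighted random sampling. The sketch below assumes the three factors are drawn independently and concatenates the slide's letter codes into a three-letter string; both choices are assumptions for illustration, and the slides do not say how LDC's assignment protocol actually balances conditions.

```python
import random
from collections import Counter

# Proportions and letter codes taken from the slide.
IMPLEMENT = {"I": 0.90, "P": 0.10}         # ballpoint pen, pencil
PAPER = {"U": 0.75, "L": 0.25}             # unlined white, lined
SPEED = {"N": 0.90, "F": 0.05, "C": 0.05}  # normal, fast, careful


def sample_condition(rng: random.Random) -> str:
    """Draw one implement/paper/speed code, e.g. 'IUN' (hypothetical format)."""
    parts = []
    for table in (IMPLEMENT, PAPER, SPEED):
        codes = list(table)
        weights = list(table.values())
        parts.append(rng.choices(codes, weights=weights, k=1)[0])
    return "".join(parts)


rng = random.Random(2008)  # fixed seed so the draw is repeatable
conditions = [sample_condition(rng) for _ in range(10_000)]
print(Counter(conditions).most_common(3))
```

Independent sampling only approximates the target proportions on small kits; a production protocol would more likely enforce exact quotas per kit or per scribe.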


Collection Workflow

• LDC selects source data
• LDC generates kits (documents + writing conditions)
• LDC delivers data kits to collection sites
• Sites publicize study and recruit participants
• Scribe visits public URL, contacts site coordinator
• Site coordinator schedules appointment
• Scribe comes in, takes writing sample test
• Site coordinator verifies scribe eligibility
• Site coordinator logs in to secure website via login page
• Scribe completes registration via registration page
• Scribe verifies info via confirmation page
• Site coordinator prints out subject ID and instructions for subject via assignment page
• Coordinator pulls kit for this subject ID
• Scribe leaves with kit and instructions
• Scribe returns completed kit to site
• Coordinator verifies kit completeness and arranges payment
• Coordinator files completed kit for scanning/delivery
• Site scans completed kit(s) as safeguard
• Site uploads image file to LDC
• Site ships completed paper kit(s) to LDC for archiving
• LDC processes completed kits for subsequent tasks


Scribe Demographics

• Scribes register in person at collection site and take writing test
  • To assess literacy and ability to follow instructions
• Enter demographic info on LDC's secure server
  • Name, address (for payment purposes only)
  • Age, gender, level of education, occupation
  • Where born, where raised
  • Primary language of educational instruction
  • Handedness
• After registration, scribes receive brief tutorial
  • No line wrapping, no page breaks
  • Copy text exactly: no omissions or insertions, no corrections to source text


Scribe Assignments

• Assignments are in the form of printed "kits"
  • 50 printed pages to be copied plus assignment table
    • Assignment table specifies page order and writing conditions
  • Multiple scribes/kit, so conditions and order vary
• Printed pages labeled with page and kit ID
• Scribes affix label with scribe, page and kit ID to back of completed manuscript
  • To facilitate data tracking during scanning and post-processing
• Scribes supply paper and writing instrument
  • To sample natural variation
• Payment per completed kit
  • Exhaustive check on first assignment (completeness and accuracy)
  • Spot check on remainder of assignments


Ground Truthing

• Zones created at word level only for Phase 1
  • Lines can be extrapolated from annotation
  • Other zone types possible in future phases
    • Structural elements (e.g. signature block)
• Explicit reading order preserved
• Locations are polygons
  • Restricted to upright rectangles in the first phase
• Each zone contains a unique ID, the contents, location (coordinates)
• Status tags to accommodate scribe mistakes
  • extra, missing, typo
• nextZoneID tag to indicate reading order
• In Phase 1, ground truthing primarily by partner site (Applied Media Analysis)
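The zone description above maps naturally onto a small record type with a reading-order link. The sketch below is illustrative only: the `Zone` class, its field names and the vertical-overlap rule for extrapolating lines are assumptions, not the GEDI or MADCAT schema. It shows how lines might be recovered from word-level zones by following the next-zone links.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Zone:
    zone_id: str
    contents: str
    # Upright rectangle in pixels (Phase 1 restricts polygons to this shape).
    left: int
    top: int
    right: int
    bottom: int
    status: Optional[str] = None        # None, "extra", "missing" or "typo"
    next_zone_id: Optional[str] = None  # reading-order link


def reading_order(zones: Dict[str, Zone], first_id: str) -> List[Zone]:
    """Follow next_zone_id links from the first zone, guarding against cycles."""
    ordered, seen = [], set()
    current: Optional[str] = first_id
    while current is not None and current not in seen:
        seen.add(current)
        zone = zones[current]
        ordered.append(zone)
        current = zone.next_zone_id
    return ordered


def group_into_lines(ordered: List[Zone]) -> List[List[Zone]]:
    """Start a new line when a zone stops overlapping the previous one vertically."""
    lines: List[List[Zone]] = []
    for zone in ordered:
        prev = lines[-1][-1] if lines else None
        if prev is not None and zone.top < prev.bottom and zone.bottom > prev.top:
            lines[-1].append(zone)
        else:
            lines.append([zone])
    return lines


# Placeholder zones; Arabic reads right to left, so z1 sits to the right of z2.
zones = {
    "z1": Zone("z1", "w1", 400, 100, 500, 150, next_zone_id="z2"),
    "z2": Zone("z2", "w2", 280, 105, 380, 155, next_zone_id="z3"),
    "z3": Zone("z3", "w3", 400, 200, 500, 250),
}
lines = group_into_lines(reading_order(zones, "z1"))
print([[z.zone_id for z in line] for line in lines])
```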


GEDI Toolkit

• GroundTruth - Editor and Document Interface (GEDI) created by Applied Media Analysis (AMA)


Data Format

• MADCATUnifier Process takes multiple data streams and generates single XML output file which contains all required information

1) Text layer
  * Source Text
  * Tokenization
  * SU Segmentation
  * Translation
2) Image layer
  * zone bounding boxes
3) Scribe demographics
4) Document metadata
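To make the layering concrete, the sketch below builds a single XML document that carries a text layer, an image layer, scribe information and document metadata. Every element name, attribute name and value here is a hypothetical placeholder chosen for illustration; the slides do not show the actual MADCAT schema.

```python
import xml.etree.ElementTree as ET

# All element and attribute names below are hypothetical placeholders.
# Document metadata lives on the root element.
doc = ET.Element("document", {"id": "doc-0001", "genre": "newswire"})

# 1) Text layer: SU segment, tokens, translation.
text_layer = ET.SubElement(doc, "text_layer")
segment = ET.SubElement(text_layer, "segment", {"id": "su-1"})
for i, token in enumerate(["tok1", "tok2"], start=1):
    ET.SubElement(segment, "token", {"id": f"t{i}"}).text = token
ET.SubElement(segment, "translation").text = "placeholder English translation"

# 2) Image layer: one zone per token, linked back to the text layer.
image_layer = ET.SubElement(doc, "image_layer", {"dpi": "600"})
boxes = {"t1": (400, 100, 500, 150), "t2": (280, 105, 380, 155)}
for token_id, (left, top, right, bottom) in boxes.items():
    ET.SubElement(image_layer, "zone", {
        "id": f"z-{token_id}", "token_ref": token_id,
        "left": str(left), "top": str(top),
        "right": str(right), "bottom": str(bottom),
    })

# 3) Scribe demographics.
ET.SubElement(doc, "scribe", {"id": "scribe-001", "handedness": "right"})

ET.indent(doc)  # pretty-print; requires Python 3.9+
print(ET.tostring(doc, encoding="unicode"))
```

Keeping token IDs in the text layer and referencing them from zones lets several handwritten versions of the same page share one copy of the source text and translation.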


Evaluation

• Input: (segmented) Arabic handwritten image
• Output: segmented English text
• HTER is primary evaluation metric (edit distance)
  • Manual post-editing task corrects MT output one segment at a time until it has the same meaning as the reference translation, making as few edits as possible
• NIST-developed MTPostEditor GUI
  • Editors review segment-aligned MT and gold standard translation
  • No access to original Arabic text or handwritten image file
• No official separate evaluation of OCR or processing components
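The idea behind an edit-distance metric such as HTER can be illustrated with a word-level edit rate between the system output and its post-edited version. The sketch below is a simplification: it counts only insertions, deletions and substitutions, whereas TER-style scoring also treats block shifts as single edits, so this version can overcount when words are moved. The function names and example sentences are made up for illustration and are not the NIST scoring tools.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the output
                            curr[j - 1] + 1,      # insert r into the output
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_rate(mt_output: str, post_edited: str) -> float:
    """Edits needed to turn the MT output into the post-edited text, per reference word."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


mt = "the minister said that talks will resumed next week"
post_edited = "the minister said talks will resume next week"
print(f"{edit_rate(mt, post_edited):.3f}")
```

Here two edits (one deletion, one substitution) against an 8-word post-edited segment give a rate of 0.25; lower is better, and 0 means the output needed no correction.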


Conclusions; Future Work

• LDC is creating a set of new linguistic resources for image processing, document classification and translation on a scale not previously available
  • Phase 1: Large collection of Arabic handwritten, translated, segmented, ground truthed text
  • Infrastructure for collection, annotation and data management
    • Including a unified, extensible data format
  • Extended to new data types, domains, languages, annotations in future phases
• Resources will be available through LDC


Acknowledgements

This work was supported in part by the Defense Advanced Research Projects Agency, MADCAT Program Grant No. HR0011-08-1-004. The content of this paper does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Thank you to Audrey Le and Mark Przybocki at NIST for helping to define data and format requirements.
