Data Mining Applied to Document Imaging Jeff Rekoske

Page 1: Data Mining Applied to Document Imaging Jeff Rekoske

Data Mining Applied to Document Imaging

Jeff Rekoske

Page 2: Data Mining Applied to Document Imaging Jeff Rekoske

Agenda

Introduction
Problem Definition
Solution and Methodology
Progress Report
Tools
Techniques Applied from CSC-288
Lessons Learned/Reinforced
Summary

Page 3: Data Mining Applied to Document Imaging Jeff Rekoske

Introduction

Employed as SW Developer and DBA on document imaging project
Access to OCR statistics
Management staff has a few questions that can be answered by analysis of existing data

Page 4: Data Mining Applied to Document Imaging Jeff Rekoske

Problem Definition

Two Parts
  Management questions
  Data mining demonstration

Page 5: Data Mining Applied to Document Imaging Jeff Rekoske

Management Questions

Result of interviews
Fairly basic
What forms are processed the most?
What are the recognition rates for the top forms?
What is the percentage of forms that were presented to an operator for keying?

Page 6: Data Mining Applied to Document Imaging Jeff Rekoske

Data Mining Demonstration

Purpose is to show the usefulness of data mining techniques.
  Prediction of recognition rates for new forms
  Characteristics of highly recognized forms
  Use mined data to develop new forms

Page 7: Data Mining Applied to Document Imaging Jeff Rekoske

Solution

Data mart
  Answer management questions
  Provide data for mining activities

Page 8: Data Mining Applied to Document Imaging Jeff Rekoske

Data Mart Schema (Snowflake)

DOCUMENT_FACT: Job_Id, Julian_Date, Scanner_Id, Field_Key, Field_Recognition_Status
JOB_DIMENSION: Job_Id, Description, Job_Type
FIELD_DIMENSION: Field_Key, Return_Key, Field_Name, Field_Format, Line_Number, OCR_Field_Length, X_Position, Y_Position, Width, Height
FIELD_FORMAT_DIMENSION: Field_Format, Description
TIME_DIMENSION: Julian_Date, Year, Month, Quarter
RETURN_DIMENSION: Return_Key, Description, Machine_Generated_Flag
FIELD_RECOGNITION_DIMENSION: Field_Recognition_Status, Description, Operator_Show_Flag
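
A minimal sketch of the snowflake layout on this slide, using Python's built-in sqlite3 as a stand-in for Oracle 8i. The table and column names come from the slide; the column types, key constraints, and sample rows are assumptions made for illustration. The query at the end walks from the fact table through FIELD_DIMENSION to the two dimensions that hang off it, which is what makes the schema a snowflake and not a plain star.

```python
import sqlite3

# Names are from the slide; every type, key, and sample value is an assumption.
SCHEMA = """
CREATE TABLE JOB_DIMENSION (Job_Id INTEGER PRIMARY KEY, Description TEXT, Job_Type TEXT);
CREATE TABLE TIME_DIMENSION (Julian_Date INTEGER PRIMARY KEY, Year INTEGER, Month INTEGER, Quarter INTEGER);
CREATE TABLE RETURN_DIMENSION (Return_Key INTEGER PRIMARY KEY, Description TEXT, Machine_Generated_Flag TEXT);
CREATE TABLE FIELD_FORMAT_DIMENSION (Field_Format TEXT PRIMARY KEY, Description TEXT);
CREATE TABLE FIELD_RECOGNITION_DIMENSION (
    Field_Recognition_Status TEXT PRIMARY KEY, Description TEXT, Operator_Show_Flag TEXT);
CREATE TABLE FIELD_DIMENSION (
    Field_Key INTEGER PRIMARY KEY,
    Return_Key INTEGER REFERENCES RETURN_DIMENSION(Return_Key),
    Field_Name TEXT,
    Field_Format TEXT REFERENCES FIELD_FORMAT_DIMENSION(Field_Format),
    Line_Number INTEGER, OCR_Field_Length INTEGER,
    X_Position INTEGER, Y_Position INTEGER, Width INTEGER, Height INTEGER);
CREATE TABLE DOCUMENT_FACT (
    Job_Id INTEGER REFERENCES JOB_DIMENSION(Job_Id),
    Julian_Date INTEGER REFERENCES TIME_DIMENSION(Julian_Date),
    Scanner_Id INTEGER,
    Field_Key INTEGER REFERENCES FIELD_DIMENSION(Field_Key),
    Field_Recognition_Status TEXT
        REFERENCES FIELD_RECOGNITION_DIMENSION(Field_Recognition_Status));
"""


def build_mart():
    """Create an empty in-memory copy of the data mart."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def table_names(conn):
    """List the tables that exist in the mart."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    return [r[0] for r in rows]


conn = build_mart()
# Hypothetical sample rows, one per table needed by the join below.
conn.execute("INSERT INTO RETURN_DIMENSION VALUES (1, 'Hypothetical form A', 'N')")
conn.execute("INSERT INTO FIELD_FORMAT_DIMENSION VALUES ('NUM', 'Numeric')")
conn.execute(
    "INSERT INTO FIELD_DIMENSION VALUES (10, 1, 'Amount', 'NUM', 7, 9, 120, 340, 80, 12)")
conn.execute("INSERT INTO DOCUMENT_FACT VALUES (500, 2004250, 3, 10, 'PASS')")

# Fact -> FIELD_DIMENSION -> (RETURN_DIMENSION, FIELD_FORMAT_DIMENSION)
SNOWFLAKE_JOIN = """
SELECT r.Description, ff.Description, f.Field_Recognition_Status
FROM DOCUMENT_FACT f
JOIN FIELD_DIMENSION fd ON fd.Field_Key = f.Field_Key
JOIN RETURN_DIMENSION r ON r.Return_Key = fd.Return_Key
JOIN FIELD_FORMAT_DIMENSION ff ON ff.Field_Format = fd.Field_Format
"""
joined = conn.execute(SNOWFLAKE_JOIN).fetchall()
print(joined)
```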

Page 9: Data Mining Applied to Document Imaging Jeff Rekoske

ETL and Data Mining Dataflow

Production Tables (Oracle 8i) -> ETL Script (Oracle PL/SQL) -> Data Mart (Oracle 8i) -> WEKA File Creation Script (Oracle PL/SQL) -> WEKA

Page 10: Data Mining Applied to Document Imaging Jeff Rekoske

Methodology

Choose a small timeframe to sample data
  September – October 2004
Use ETL to load data
  Relatively “clean” process due to data location
Apply SQL statements to data mart to answer management questions
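
The SQL step above might look like the following sketch, again using sqlite3 in place of Oracle. It is not the author's actual SQL. The status codes 'PASS' and 'FAIL', the form names, and all rows are hypothetical. The schema as listed has no document identifier, so the sketch counts OCR field rows per form as a proxy for "forms processed" and reports the operator percentage at field level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced copy of the data mart: only the columns these queries touch.
# All values below are made up for illustration.
conn.executescript("""
CREATE TABLE RETURN_DIMENSION (Return_Key INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE FIELD_DIMENSION (Field_Key INTEGER PRIMARY KEY, Return_Key INTEGER);
CREATE TABLE FIELD_RECOGNITION_DIMENSION (
    Field_Recognition_Status TEXT PRIMARY KEY, Description TEXT, Operator_Show_Flag TEXT);
CREATE TABLE DOCUMENT_FACT (Field_Key INTEGER, Field_Recognition_Status TEXT);

INSERT INTO RETURN_DIMENSION VALUES (1, 'Form A'), (2, 'Form B');
INSERT INTO FIELD_DIMENSION VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO FIELD_RECOGNITION_DIMENSION VALUES
    ('PASS', 'Recognized', 'N'), ('FAIL', 'Not recognized', 'Y');
INSERT INTO DOCUMENT_FACT VALUES
    (10, 'PASS'), (10, 'PASS'), (11, 'FAIL'), (11, 'PASS'), (20, 'FAIL'), (20, 'PASS');
""")

# Question 1: which forms are processed the most (field rows per form).
most_processed = conn.execute("""
    SELECT r.Description, COUNT(*) AS field_rows
    FROM DOCUMENT_FACT f
    JOIN FIELD_DIMENSION fd ON fd.Field_Key = f.Field_Key
    JOIN RETURN_DIMENSION r ON r.Return_Key = fd.Return_Key
    GROUP BY r.Description
    ORDER BY field_rows DESC
""").fetchall()

# Question 2: recognition rate per form, busiest forms first.
recognition_rates = conn.execute("""
    SELECT r.Description,
           AVG(CASE WHEN f.Field_Recognition_Status = 'PASS' THEN 1.0 ELSE 0.0 END)
    FROM DOCUMENT_FACT f
    JOIN FIELD_DIMENSION fd ON fd.Field_Key = f.Field_Key
    JOIN RETURN_DIMENSION r ON r.Return_Key = fd.Return_Key
    GROUP BY r.Description
    ORDER BY COUNT(*) DESC
""").fetchall()

# Question 3: percentage of field rows shown to an operator for keying.
operator_pct = conn.execute("""
    SELECT 100.0 * SUM(CASE WHEN s.Operator_Show_Flag = 'Y' THEN 1 ELSE 0 END) / COUNT(*)
    FROM DOCUMENT_FACT f
    JOIN FIELD_RECOGNITION_DIMENSION s
      ON s.Field_Recognition_Status = f.Field_Recognition_Status
""").fetchone()[0]

print(most_processed, recognition_rates, round(operator_pct, 1))
```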

Page 11: Data Mining Applied to Document Imaging Jeff Rekoske

Methodology (continued)

Extract data from data mart to create WEKA files
  Attribute-Relation File Format (ARFF)
Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition)
Validate model with 10-fold cross validation
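
To make the ARFF step concrete, here is a small sketch of an ARFF writer. The project generated its files with Oracle PL/SQL; this Python version only illustrates the file layout (an `@relation` line, one `@attribute` line per column, then `@data` rows). The relation name, the choice of attributes, the nominal values, and the two data rows are all assumptions. The pass/fail class is placed last, as WEKA by default treats the last attribute as the class.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
    'numeric' or a list of allowed nominal values.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            spec = "{" + ",".join(kind) + "}"   # nominal attribute
        else:
            spec = kind                          # e.g. numeric
        lines.append("@attribute %s %s" % (name, spec))
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical attribute selection drawn from FIELD_DIMENSION columns.
ATTRIBUTES = [
    ("OCR_Field_Length", "numeric"),
    ("X_Position", "numeric"),
    ("Y_Position", "numeric"),
    ("Width", "numeric"),
    ("Height", "numeric"),
    ("Field_Format", ["NUM", "ALPHA"]),   # assumed format codes
    ("recognized", ["pass", "fail"]),     # class attribute, listed last
]
# Made-up rows, one per OCR field.
ROWS = [
    (9, 120, 340, 80, 12, "NUM", "pass"),
    (30, 40, 100, 200, 12, "ALPHA", "fail"),
]

arff_text = to_arff("field_recognition", ATTRIBUTES, ROWS)
print(arff_text)
```

Values that contain spaces or commas would need quoting, which this sketch does not handle.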

Page 12: Data Mining Applied to Document Imaging Jeff Rekoske

Progress Report

First part (management questions) complete
  14,210 imaged documents
  865,409 OCR fields
View created that joins tables
  Allows for non-technical personnel to create basic queries
Management is pleased with results
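
A sketch of the kind of view described above, with sqlite3 standing in for Oracle. The view name OCR_RESULTS_V, the columns it exposes, the status codes, and the sample rows are hypothetical; the slides do not show the real view definition. Once the joins live in the view, a basic query needs no join syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RETURN_DIMENSION (Return_Key INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE FIELD_DIMENSION (Field_Key INTEGER PRIMARY KEY, Return_Key INTEGER, Field_Name TEXT);
CREATE TABLE DOCUMENT_FACT (Julian_Date INTEGER, Field_Key INTEGER, Field_Recognition_Status TEXT);

-- OCR_RESULTS_V is a hypothetical name; it hides the joins from the user.
CREATE VIEW OCR_RESULTS_V AS
SELECT f.Julian_Date,
       r.Description AS Return_Description,
       fd.Field_Name,
       f.Field_Recognition_Status
FROM DOCUMENT_FACT f
JOIN FIELD_DIMENSION fd ON fd.Field_Key = f.Field_Key
JOIN RETURN_DIMENSION r ON r.Return_Key = fd.Return_Key;

-- Made-up sample data.
INSERT INTO RETURN_DIMENSION VALUES (1, 'Form A');
INSERT INTO FIELD_DIMENSION VALUES (10, 1, 'Amount'), (11, 1, 'Name');
INSERT INTO DOCUMENT_FACT VALUES
    (2004250, 10, 'PASS'), (2004250, 11, 'FAIL'), (2004251, 10, 'PASS');
""")

# A basic single-table query against the view: failed fields by name.
failed = conn.execute(
    "SELECT Field_Name, COUNT(*) FROM OCR_RESULTS_V "
    "WHERE Field_Recognition_Status = 'FAIL' GROUP BY Field_Name"
).fetchall()
print(failed)
```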

Page 13: Data Mining Applied to Document Imaging Jeff Rekoske

Progress Report (continued)

Part Two (WEKA classifier) in progress
  ARFF generation scripts complete
  Need to run ARFF files through WEKA
  Need to cross validate results

Page 14: Data Mining Applied to Document Imaging Jeff Rekoske

Tools

Oracle 8i RDBMS
Oracle PL/SQL scripting language
WEKA implementation of C4.5 classifier
WEKA cross validation
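
For background on the C4.5 classifier listed above: C4.5 ranks candidate splits by gain ratio, which is information gain divided by the split information of the attribute. The sketch below computes it for a single nominal attribute. It is not WEKA's code, it leaves out numeric thresholds, missing values, and pruning, and the attribute values and labels are made up.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy, in bits, of the value distribution in items."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Gain ratio of splitting labels by a nominal attribute's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical data: one attribute value and one pass/fail label per OCR field.
labels = ["pass", "pass", "fail", "fail"]
informative = ["NUM", "NUM", "ALPHA", "ALPHA"]    # separates the classes
uninformative = ["NUM", "ALPHA", "NUM", "ALPHA"]  # tells us nothing

print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```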

Page 15: Data Mining Applied to Document Imaging Jeff Rekoske

Techniques Applied from CSC-288

Data Mart
  Snowflake Schema
  ETL
  OLAP Operations

Page 16: Data Mining Applied to Document Imaging Jeff Rekoske

Techniques Applied (continued)

Classification
  C4.5 Algorithm
  Supervised Learning
Credibility
  Cross-Validation
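
Cross-validation as a credibility check can be sketched in a few lines. The data is cut into k folds, each fold is held out once while a model is trained on the rest, and accuracy is pooled over all held-out rows. The project used WEKA for this; the version below is a plain, unstratified k-fold in Python, and the majority-class "model" is a placeholder for C4.5. The 20 rows and their labels are invented.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle row indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, labels, train, k=10, seed=0):
    """Return pooled accuracy over k held-out folds.

    train(rows, labels) must return a function mapping a row to a label.
    """
    correct = 0
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(rows)) if i not in held]
        model = train([rows[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct += sum(model(rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)


def train_majority(rows, labels):
    """Placeholder learner (not C4.5): always predict the commonest label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority


# Invented data: 20 OCR fields, 15 recognized and 5 not.
rows = [{"Width": w} for w in range(20)]
labels = ["pass"] * 15 + ["fail"] * 5

accuracy = cross_validate(rows, labels, train_majority, k=10)
print(accuracy)
```

Any learner with the same `train(rows, labels)` shape can be swapped in for `train_majority`.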

Page 17: Data Mining Applied to Document Imaging Jeff Rekoske

Lessons Learned/Reinforced

Get firm requirements (if possible)
Data marts can get large quickly
OLAP operations should be performed offline (from the OLTP system)
Demonstrations are useful for explaining concepts

Page 18: Data Mining Applied to Document Imaging Jeff Rekoske

Summary

Application of knowledge from CSC-288 to my work
Data mart can be used to answer multiple questions without affecting OLTP processing
Hopefully demonstrate using the data mart for creating a classification model

Page 19: Data Mining Applied to Document Imaging Jeff Rekoske

References

"Data Mining: Concepts and Techniques," by Jiawei Han and Micheline Kamber, Morgan Kaufmann, San Francisco, 2001.

"Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," by Ian H. Witten and Eibe Frank, Morgan Kaufmann, San Francisco, 2000.

Page 20: Data Mining Applied to Document Imaging Jeff Rekoske

Questions?