Data Mining Applied to Document Imaging
Jeff Rekoske
Agenda
Introduction
Problem Definition
Solution and Methodology
Progress Report
Tools
Techniques Applied from CSC-288
Lessons Learned/Reinforced
Summary
Introduction
Employed as SW Developer and DBA on document imaging project
Access to OCR statistics
Management staff has a few questions that can be answered by analysis of existing data
Problem Definition
Two Parts
  Management questions
  Data mining demonstration
Management Questions
Result of interviews
Fairly basic
  What forms are processed the most?
  What are the recognition rates for the top forms?
  What is the percentage of forms that were presented to an operator for keying?
Data Mining Demonstration
Purpose is to show the usefulness of data mining techniques.
  Prediction of rates for new forms
  Characteristics of highly recognized forms
  Use mined data to develop new forms
Solution
Data mart
  Answer management questions
  Provide data for mining activities
Data Mart Schema (Snowflake)
DOCUMENT_FACT: Job_Id, Julian_Date, Scanner_Id, Field_Key, Field_Recognition_Status
JOB_DIMENSION: Job_Id, Description, Job_Type
FIELD_DIMENSION: Field_Key, Return_Key, Field_Name, Field_Format, Line_Number, OCR_Field_Length, X_Position, Y_Position, Width, Height
FIELD_FORMAT_DIMENSION: Field_Format, Description
TIME_DIMENSION: Julian_Date, Year, Month, Quarter
RETURN_DIMENSION: Return_Key, Description, Machine_Generated_Flag
FIELD_RECOGNITION_DIMENSION: Field_Recognition_Status, Description, Operator_Show_Flag
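A snowflake schema normalizes some dimensions into further sub-dimensions, so a query reaches them through a chain of joins. The sketch below illustrates that chain from the fact table through the field dimension to its format and return sub-dimensions. It is only an illustration: SQLite stands in for Oracle 8i, the column types are guesses, table names are written with underscores, and every sample row is hypothetical.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the Oracle 8i data mart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FIELD_FORMAT_DIMENSION (
    Field_Format TEXT PRIMARY KEY,
    Description  TEXT
);
CREATE TABLE RETURN_DIMENSION (
    Return_Key             INTEGER PRIMARY KEY,
    Description            TEXT,
    Machine_Generated_Flag TEXT
);
-- Subset of the FIELD_DIMENSION columns; it points at two sub-dimensions.
CREATE TABLE FIELD_DIMENSION (
    Field_Key    INTEGER PRIMARY KEY,
    Return_Key   INTEGER REFERENCES RETURN_DIMENSION (Return_Key),
    Field_Name   TEXT,
    Field_Format TEXT REFERENCES FIELD_FORMAT_DIMENSION (Field_Format)
);
CREATE TABLE DOCUMENT_FACT (
    Job_Id                   INTEGER,
    Julian_Date              INTEGER,
    Scanner_Id               INTEGER,
    Field_Key                INTEGER REFERENCES FIELD_DIMENSION (Field_Key),
    Field_Recognition_Status TEXT
);
-- Hypothetical sample rows, for illustration only.
INSERT INTO FIELD_FORMAT_DIMENSION VALUES ('N', 'Numeric');
INSERT INTO RETURN_DIMENSION VALUES (1, 'Form A', 'N');
INSERT INTO FIELD_DIMENSION VALUES (10, 1, 'Total', 'N');
INSERT INTO DOCUMENT_FACT VALUES (100, 2004275, 7, 10, 'PASS');
""")

# Fact -> field dimension -> format and return sub-dimensions.
rows = conn.execute("""
    SELECT d.Field_Name, ff.Description, r.Description, f.Field_Recognition_Status
    FROM DOCUMENT_FACT f
    JOIN FIELD_DIMENSION d ON d.Field_Key = f.Field_Key
    JOIN FIELD_FORMAT_DIMENSION ff ON ff.Field_Format = d.Field_Format
    JOIN RETURN_DIMENSION r ON r.Return_Key = d.Return_Key
""").fetchall()
print(rows)
```

Each fact row carries only keys; descriptive text lives in the dimensions, which is what keeps the fact table narrow as it grows.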
ETL and Data Mining Dataflow
Production Tables (Oracle 8i) -> ETL Script (Oracle PL/SQL) -> Data Mart (Oracle 8i) -> WEKA File Creation Script (Oracle PL/SQL) -> WEKA
Methodology
Choose a small timeframe to sample data
  September – October 2004
Use ETL to load data
  Relatively “clean” process due to data location
Apply SQL statements to data mart to answer management questions
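The management questions map onto simple aggregate queries against the data mart. The sketch below is one possible shape for them, not the queries actually used: SQLite stands in for Oracle, the recognition-status dimension is spelled FIELD_RECOGNITION_DIMENSION, a "return" is assumed to correspond to a form type, the status codes 'PASS' and 'FAIL' and all rows are hypothetical, and only the columns needed here are created. Because the fact table shown in the schema has one row per OCR field and no document identifier, the counts are per field rather than per document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE RETURN_DIMENSION (Return_Key INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE FIELD_DIMENSION (Field_Key INTEGER PRIMARY KEY, Return_Key INTEGER);
CREATE TABLE FIELD_RECOGNITION_DIMENSION (
    Field_Recognition_Status TEXT PRIMARY KEY,
    Operator_Show_Flag       TEXT
);
CREATE TABLE DOCUMENT_FACT (Field_Key INTEGER, Field_Recognition_Status TEXT);

-- Hypothetical sample data.
INSERT INTO RETURN_DIMENSION VALUES (1, 'Form A'), (2, 'Form B');
INSERT INTO FIELD_DIMENSION VALUES (10, 1), (11, 1), (20, 2);
INSERT INTO FIELD_RECOGNITION_DIMENSION VALUES ('PASS', 'N'), ('FAIL', 'Y');
INSERT INTO DOCUMENT_FACT VALUES
    (10, 'PASS'), (10, 'PASS'), (11, 'FAIL'), (11, 'PASS'), (20, 'FAIL');
""")

# Questions 1 and 2: volume and recognition rate per form, busiest form first.
per_form = conn.execute("""
    SELECT r.Description,
           COUNT(*) AS fields_processed,
           ROUND(100.0 * SUM(CASE WHEN f.Field_Recognition_Status = 'PASS'
                                  THEN 1 ELSE 0 END) / COUNT(*), 1) AS recognition_pct
    FROM DOCUMENT_FACT f
    JOIN FIELD_DIMENSION d ON d.Field_Key = f.Field_Key
    JOIN RETURN_DIMENSION r ON r.Return_Key = d.Return_Key
    GROUP BY r.Description
    ORDER BY fields_processed DESC
""").fetchall()

# Question 3 (field-level variant): share of fields shown to an operator.
operator_pct = conn.execute("""
    SELECT ROUND(100.0 * SUM(CASE WHEN s.Operator_Show_Flag = 'Y'
                                  THEN 1 ELSE 0 END) / COUNT(*), 1)
    FROM DOCUMENT_FACT f
    JOIN FIELD_RECOGNITION_DIMENSION s
      ON s.Field_Recognition_Status = f.Field_Recognition_Status
""").fetchone()[0]

print(per_form)
print(operator_pct)
```

With the sample rows, Form A has four fields at 75.0% recognition, Form B has one field at 0.0%, and 40.0% of all fields were shown to an operator.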
Methodology (continued)
Extract data from data mart to create WEKA files
  Attribute-Relation File Format (ARFF)
Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition)
Validate model with 10-fold cross validation
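The two mechanical pieces of this step can be sketched briefly: writing extracted rows as ARFF text, and partitioning the rows into ten folds. The project used PL/SQL scripts for the first and WEKA itself for the second, so the Python below is only an illustration. The helper names `to_arff` and `k_fold_splits`, the relation name, the choice of attributes, and the sample rows are all hypothetical, and the fold split is a plain shuffled partition without the stratification a tool such as WEKA may apply.

```python
import random

def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, spec); spec is the string 'numeric'
    or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if isinstance(spec, str):
            lines.append(f"@attribute {name} {spec}")
        else:
            lines.append(f"@attribute {name} {{{','.join(spec)}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

def k_fold_splits(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

# Hypothetical attributes: two field properties plus the pass/fail class.
attributes = [
    ("OCR_Field_Length", "numeric"),
    ("Field_Format", ["N", "A"]),
    ("recognized", ["pass", "fail"]),
]
arff = to_arff("ocr_fields", attributes, [(9, "N", "pass"), (30, "A", "fail")])
print(arff)

splits = list(k_fold_splits(25, k=10))
print(len(splits), "folds; first test fold:", splits[0][1])
```

Every row lands in exactly one test fold, so each instance is used for testing once and for training nine times.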
Progress Report
First part (management questions) complete
  14,210 imaged documents
  865,409 OCR fields
View created that joins tables
  Allows for non-technical personnel to create basic queries
Management is pleased with results
Progress Report (continued)
Part Two (WEKA classifier) in progress
  ARFF generation scripts complete
  Need to run ARFF files through WEKA
  Need to cross validate results
Tools
Oracle 8i RDBMS
Oracle PL/SQL scripting language
WEKA implementation of C4.5 classifier
WEKA cross validation
Techniques Applied from CSC-288
Data Mart
Snowflake Schema
ETL
OLAP Operations
Techniques Applied (continued)
Classification
  C4.5 Algorithm
  Supervised Learning
Credibility
  Cross-Validation
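C4.5 chooses which attribute to split on by gain ratio: the information gain of the split divided by the entropy of the split itself. The sketch below computes that criterion for a nominal attribute only. It is not WEKA's implementation, which also handles numeric thresholds, missing values, and pruning, and the attribute values and pass/fail labels are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy, in bits, of the value distribution in items."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Gain ratio of splitting labels by a nominal attribute's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalizes splits into many small groups
    return gain / split_info if split_info > 0 else 0.0

labels = ["pass", "pass", "fail", "fail"]
informative = gain_ratio(["N", "N", "A", "A"], labels)    # separates classes perfectly
uninformative = gain_ratio(["N", "A", "N", "A"], labels)  # tells nothing about the class
print(informative, uninformative)
```

An attribute that separates pass from fail perfectly scores 1.0 here, while one unrelated to the outcome scores 0.0, so the tree would split on the former first.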
Lessons Learned/Reinforced
Get firm requirements (if possible)
Data marts can get large quickly
OLAP operations should be performed offline (from the OLTP system)
Demonstrations are useful for explaining concepts
Summary
Application of knowledge from CSC-288 to my work
Data mart can be used to answer multiple questions without affecting OLTP processing
Hope to demonstrate using the data mart to create a classification model
Questions?