Census Data Capture with OCR Technology: Ghana’s Experience

Presented at the UNSD Regional Workshop on Census Data Processing

Dar es Salaam, Tanzania

9 – 13 June, 2008

Presenter: K.B. Danso-Manu, Ghana Statistical Service

Ghana uses scanning technology for 2000 Census

Ghana used Optical Character Recognition (OCR) technology to capture the 2000 census forms.

Three Kodak 9500 document scanners used for 12 months.

About 4.5 million census forms captured.

Total population: 18,912,079 persons. Males = 9,357,382 (49.5%). Females = 9,554,697 (50.5%).
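
The population figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the three counts quoted on this slide; the function name `sex_shares` is illustrative and not part of the original presentation.

```python
def sex_shares(males: int, females: int) -> tuple[int, float, float]:
    """Return (total, male %, female %), with percentages rounded to one decimal."""
    total = males + females
    male_pct = round(100 * males / total, 1)
    female_pct = round(100 * females / total, 1)
    return total, male_pct, female_pct


# Counts as quoted on the slide.
total, male_pct, female_pct = sex_shares(9_357_382, 9_554_697)
print(total, male_pct, female_pct)
```

The two counts sum to the quoted total, and the rounded shares match the percentages on the slide.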

Geographical Coding System

The 2000 Ghana census was conducted at the household level.

A 15-digit reference code was used to uniquely identify each household.

The hierarchical coding system used was as follows:

The hierarchical coding system used

Item                        Position   No. of Digits   Valid values
Region                      1-2        2               01 - 10
District                    3-4        2               01 - 18
Locality                    5-7        3               001 - 999
EA Number                   8-10       3               001 - 999
Structure/Building Number   11-13      3               001 - 999
Household Number            14-15      2               01 - 99
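
As an illustration of how such a code can be split and range-checked, here is a minimal sketch. The positions and valid ranges are copied from the table; the field names, the function `parse_reference_code` and the sample code value are assumptions made for this example, not part of the census system.

```python
# (field name, first position, last position, minimum, maximum)
# Positions are 1-based and inclusive, as in the table. Field names are illustrative.
FIELDS = [
    ("region", 1, 2, 1, 10),
    ("district", 3, 4, 1, 18),
    ("locality", 5, 7, 1, 999),
    ("ea", 8, 10, 1, 999),
    ("structure", 11, 13, 1, 999),
    ("household", 14, 15, 1, 99),
]


def parse_reference_code(code: str) -> dict[str, int]:
    """Split a 15-digit household reference code into its six components."""
    if len(code) != 15 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected 15 digits, got {code!r}")
    parsed = {}
    for name, first, last, low, high in FIELDS:
        value = int(code[first - 1:last])
        if not low <= value <= high:
            raise ValueError(f"{name} value {value} is outside {low}-{high}")
        parsed[name] = value
    return parsed


# Made-up example: region 03, district 07, locality 004, EA 012,
# structure 012, household 01.
print(parse_reference_code("030700401201201"))
```

A code with a region of 11 or a district of 19 is rejected, because those values fall outside the ranges in the table.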

Capturing of census data

Steps:

Office editing

Opening and preparation of forms

Scanning

Validation

Office Editing

The scanners could not recognize light crossed marks, so responses had to be crossed again, deep enough for the scanner to recognize.

In rural scattered EAs, some enumerators gave the same locality (A06) code as that of the base locality. This had to be corrected before scanning.

Some enumerators used wrong EA codes for the questionnaire.

In many instances, after the codes were copied, they were marked wrongly on the questionnaires or not marked at all.

The front page of some questionnaires, especially supplementary forms, was blank.

Opening and Preparation of Questionnaires for Scanning

After editing, questionnaires were opened, separated and prepared for scanning.

Preparation included ensuring that the crosses were dark.

For each household, the 15-digit reference number was on the inner sheet and the household identification on the outer sheet (this was the only link the two forms had).

Continuation forms followed the original.

OCR Scanning

• Type of scanners: Kodak 9500 D

• Optical Resolution: 300 dpi

• Speed: 120 ppm (A4)

• Interface: SCSI-2 (8-bit)

• Software: Readsoft’s Eyes & Hands

OCR Scanning (cont.)

The data capture involved scanning of the questionnaire, interpretation of the scanned marks, transfer of the data and loading of the scanned data into an Oracle database.

Periodic backups of the data and images were made on compact tapes (40 GB DLT tapes).

OCR Scanning (cont.)

Three 8-hour shifts, 7 days a week for 4 months, then 6 days a week for 8 months.

Forms fed manually to avoid paper jams.

Forms scanned in batches of EA within districts and regions.

Scanned batch stamped, bagged and returned to the documents room.
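
The batching order described here (EA within district within region) can be sketched by grouping on the relevant digits of the 15-digit reference code. This is only an illustration: the function `batch_by_ea` and the sample codes are hypothetical, and keying a batch on region, district and EA number (positions 1-2, 3-4 and 8-10) is an assumption about how batches were identified.

```python
from collections import defaultdict


def batch_by_ea(reference_codes):
    """Group 15-digit reference codes into batches keyed by (region, district, EA)."""
    batches = defaultdict(list)
    for code in reference_codes:
        # Positions 1-2, 3-4 and 8-10 of the code, using 0-based slices.
        key = (code[0:2], code[2:4], code[7:10])
        batches[key].append(code)
    # Sorting the keys orders batches by region, then district, then EA.
    return dict(sorted(batches.items()))


# Made-up reference codes for illustration.
codes = ["030700401201201", "030700401201202", "010200100300101"]
for key, forms in batch_by_ea(codes).items():
    print(key, len(forms))
```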

Validation

Validation of scanned data was undertaken to correct structural and inconsistency problems identified in the dataset.

For every household that failed the structural and/or consistency checks, the image and dataset were recalled.

Necessary corrections made to the dataset but not the images.
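
The slides do not list the actual checks that were run. Purely as an illustration of the pattern (flag each household that fails a structural or consistency check so its record can be recalled), the sketch below uses two assumed checks and hypothetical record fields (`ref`, `stated_size`, `persons`).

```python
def failed_households(households):
    """Map each failing household's reference code to the list of problems found."""
    failures = {}
    for hh in households:
        problems = []
        ref = hh["ref"]
        # Structural check (assumed): the reference code is exactly 15 digits.
        if not (len(ref) == 15 and ref.isascii() and ref.isdigit()):
            problems.append("structural: malformed reference code")
        # Consistency check (assumed): stated household size matches the person records.
        if hh["stated_size"] != len(hh["persons"]):
            problems.append("consistency: household size mismatch")
        if problems:
            failures[ref] = problems
    return failures


# Made-up records for illustration.
records = [
    {"ref": "030700401201201", "stated_size": 2, "persons": ["P1", "P2"]},
    {"ref": "0307004012012", "stated_size": 3, "persons": ["P1"]},
]
print(failed_households(records))
```

In this sketch only the flagged households would be pulled up for correction, mirroring the slide's point that corrections were applied to the dataset and not to the images.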

Validation (cont.)

There was no direct mechanism to retrieve images of the questionnaires from tapes.

The validation process became very slow and tedious.

The validation teams worked the same shift as the scanning teams.

Difficulties and Challenges

Paper weight

- Different grammage of paper used to print the questionnaires (80 g/m², 100 g/m² and 120 g/m²).

- Sheets got jammed up in the system.

- Scanners fed manually.

Difficulties and Challenges (cont.)

No barcodes on census forms

The OCR-readable census questionnaire was designed as an 8-page form consisting of two A3 sheets.

The printing company could not print unique barcodes on the forms.

Difficulties and Challenges (cont.)

Number of scanners used

Three out of the six scanners planned for the data capture were purchased.

All three scanners were used to scan but became idle during interpretation and transfer of data.

Difficulties and Challenges (cont.)

Output from scanned questionnaires

The generated ASCII data file was all numeric and left justified.

Fields with 3 or 4 digits had their leading zeros truncated.

The scanners could not pick up the 15-digit reference number correctly.
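
When the intended width of a field is known, truncated leading zeros can be restored by padding. The sketch below is one possible repair, not the procedure actually used; the function `restore_leading_zeros` and the sample values are illustrative.

```python
def restore_leading_zeros(value: str, width: int) -> str:
    """Pad a left-justified numeric field back to its intended width with zeros."""
    digits = value.strip()  # drop the padding left by left justification
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"non-numeric field: {value!r}")
    if len(digits) > width:
        raise ValueError(f"{digits!r} is longer than {width} digits")
    return digits.zfill(width)


print(restore_leading_zeros("7  ", 3))   # a 3-digit field that lost two zeros
print(restore_leading_zeros("45", 4))    # a 4-digit field that lost two zeros
```

Padding recovers the original value only if the field was captured completely. A field in which a digit was missed during recognition would be padded to a plausible but wrong value, so this repair would still need to be followed by validation.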

Difficulties and Challenges (cont.)

Power interruptions

Power fluctuations, power cuts and low voltage disturbed the flow of work to the extent that it sometimes became impossible to scan during the day.

This led to the destruction of two scanner motherboards and damage to a couple of computers and printers.

The problem was resolved when a 100 kVA generator and a stabilizer were installed.

Thank You

… for your time and attention.