Enabling Efficient Chinese Jiapu Information Extraction Stephen W. Liddle BYU Information Systems Department Derek Dobson, David W. Embley, Chuck Liu FamilySearch

Enabling Efficient Chinese Jiapu Information Extraction

Stephen W. Liddle, BYU Information Systems Department

Derek Dobson, David W. Embley, Chuck Liu, FamilySearch

Chinese Jiapu: Clan Family Record

• 12,871,979 Jiapu images
• ~ half a billion fact assertions

COMET: Click-Only (or at least Mostly) Extraction Tool

COMET with Jiapu

Structured Data Stored for Query and Search

Requirements for COMET to work well with Jiapu

• Good alignment

• Good OCR

[Figure: example images labeled “OK” and “not OK”]

[Figure: example images labeled “four characters as one”, “missing OCR”, “bad OCR”, “many incorrect characters”, and “??”]
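
Why both requirements matter for a click-based tool can be illustrated with a small sketch. It is not COMET's code: the `CharBox` record, the `char_at` function, and the sample glyphs and pixel values are all hypothetical. It assumes the OCR layer supplies one bounding box per character in image pixel coordinates, and it shows a click on the page image resolving to the wrong character once the text layer is shifted relative to the image.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CharBox:
    """One OCR'd character and its bounding box in image pixels (hypothetical layout)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Half-open intervals so adjacent boxes never both claim a point.
        return self.left <= x < self.right and self.top <= y < self.bottom


def char_at(boxes: Sequence[CharBox], x: float, y: float) -> Optional[str]:
    """Return the character whose box contains the click, or None if no box does."""
    for box in boxes:
        if box.contains(x, y):
            return box.text
    return None


# A vertical column of three glyphs, each 40 px tall, matching the page image.
aligned = [
    CharBox("王", 100, 0, 140, 40),
    CharBox("大", 100, 40, 140, 80),
    CharBox("明", 100, 80, 140, 120),
]

# The same text layer shifted 30 px down relative to the image (misalignment).
shifted = [CharBox(b.text, b.left, b.top + 30, b.right, b.bottom + 30) for b in aligned]

click = (120, 50)  # the user clicks the middle glyph on the image
print(char_at(aligned, *click))  # 大
print(char_at(shifted, *click))  # 王
```

With good alignment the click selects the glyph the user sees. With the shifted layer the same click silently selects its neighbor, and a missing or incorrect OCR character would fail in the same way even when the boxes line up.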

OCR Resolution

• Currently insufficient:
– Acrobat Professional
– Tesseract
– Abbyy

• Is there a better OCR engine for Chinese?
• Can we do better with what we have?
– Image enhancement
– Resize glyphs
– Use an ensemble of OCR engines
– Train Tesseract for Jiapu peculiarities
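
One way to read the "ensemble of OCR engines" idea is a per-character majority vote. The sketch below is an assumption about how such an ensemble could work, not a description of anything the authors built. The `vote` and `ensemble` functions and the sample readings are hypothetical, and the sketch assumes each engine's output has already been aligned to the same glyph positions, which is itself a nontrivial step.

```python
from collections import Counter
from typing import List, Optional, Sequence


def vote(readings: Sequence[Optional[str]], min_agree: int = 2) -> Optional[str]:
    """Return the character at least `min_agree` engines agree on, else None."""
    counts = Counter(r for r in readings if r is not None)
    if not counts:
        return None
    best, n = counts.most_common(1)[0]
    return best if n >= min_agree else None


def ensemble(per_engine: Sequence[Sequence[Optional[str]]],
             min_agree: int = 2) -> List[Optional[str]]:
    """per_engine[i][j] is engine i's reading of glyph position j (None = no output)."""
    return [vote(position, min_agree) for position in zip(*per_engine)]


# Hypothetical readings of five glyph positions from three engines.
engine_a = ["王", "大", None, "公", "氏"]
engine_b = ["王", "太", "明", "公", "民"]
engine_c = ["玉", "大", "明", None, None]

print(ensemble([engine_a, engine_b, engine_c]))
# ['王', '大', '明', '公', None]
```

Positions where no two engines agree come back as `None`, which marks them for human review rather than guessing.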

Alignment Resolution

• Currently
– Open source PDFBox: needs work for Acrobat Pro
– Tesseract & Abbyy interact incorrectly with PDFBox

• Engineering solution (but lots of work)
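
The slides do not say what causes the misalignment, so the following is only an illustration of one routine piece of alignment engineering: converting a text box from default PDF user space (origin at the lower left, units of 1/72 inch) to image pixels (origin at the upper left). The `PdfRect` and `PixelRect` types and the `pdf_rect_to_pixels` function are hypothetical. The sketch ignores page rotation and crop-box offsets, and it makes no claim about the coordinates PDFBox, Tesseract, or Abbyy actually report.

```python
from typing import NamedTuple


class PdfRect(NamedTuple):
    """Box in default PDF user space: (x, y) is the lower-left corner, in points."""
    x: float
    y: float
    width: float
    height: float


class PixelRect(NamedTuple):
    """Box in image pixels with the origin at the upper-left corner."""
    left: float
    top: float
    right: float
    bottom: float


def pdf_rect_to_pixels(rect: PdfRect, page_height_pt: float, dpi: float) -> PixelRect:
    """Scale points to pixels and flip the y axis (assumes no rotation or crop offset)."""
    scale = dpi / 72.0  # 72 points per inch in default PDF user space
    return PixelRect(
        left=rect.x * scale,
        top=(page_height_pt - (rect.y + rect.height)) * scale,
        right=(rect.x + rect.width) * scale,
        bottom=(page_height_pt - rect.y) * scale,
    )


# A 20 pt square glyph box on a 792 pt tall page, rendered at 144 dpi.
print(pdf_rect_to_pixels(PdfRect(100, 700, 20, 20), page_height_pt=792, dpi=144))
# PixelRect(left=200.0, top=144.0, right=240.0, bottom=184.0)
```

Omitting the y-axis flip or using the wrong scale factor shifts every box on the page, which is one kind of error that would make clicks land on the wrong characters.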

Automating Jiapu Data Extraction (Future Work)

Verify & Correct (Future Work)

Conclusions

• “Failed” experiment?
• A way out
– Alignment engineering
– OCR: find or do R&D

• Potential
– Click-Only (Mostly) Extraction: a win
– Semi-automatic extraction: a big win
