
Approaches to automated metadata extraction : FixRep Project


Presentation given at the Text Mining for Scholarly Communications and Repositories Joint Workshop, 28-29 Oct 2009 (http://www.nactem.ac.uk/tm-ukoln.php)


Page 1: Approaches to automated metadata extraction : FixRep Project

                                                             

A centre of expertise in digital information management

www.ukoln.ac.uk

UKOLN is supported by:

Approaches to automated metadata extraction : FixRep Project

Emma Tonkin

[email protected]

www.bath.ac.uk

Page 2: Approaches to automated metadata extraction : FixRep Project

                                                             


Wouldn't it be nice if...

• ...computers could author our metadata for us, thus saving a lot of hassle?

• Mechanical metadata extraction vs manual metadata input

Page 3: Approaches to automated metadata extraction : FixRep Project

                                                             


But...

• Automated tools are fallible

• There's never quite enough information available

• Templates change, different domains have different standards

• In short, computers are often wrong, and so are people

Page 4: Approaches to automated metadata extraction : FixRep Project

                                                             


The 'half a loaf' hypothesis

• Hybrid approach:

– Get what metadata you can

– Ask the user to check and clean it if necessary

• Philosophy:

– If the computer gets it wrong, we can fix it later

Page 5: Approaches to automated metadata extraction : FixRep Project

                                                             


Wouldn’t it be nice if…

• …computers could fix our metadata for us?

• Or, more realistically, help us do this work for ourselves.

Page 6: Approaches to automated metadata extraction : FixRep Project

                                                             


• All about ‘fixing it later’, doing what we can with what we have

• Automated metadata extraction + metadata consistency assessment

• Metadata generation, evaluation, characterisation: enabling metadata triage

Page 7: Approaches to automated metadata extraction : FixRep Project

                                                             


1) Challenges in automated metadata extraction

2) Manual metadata generation

3) Metadata extraction in brief

4) Practical use as part of a repository deposit workflow

5) A user study comparing manual and hybrid input

6) Towards metadata triage

Page 8: Approaches to automated metadata extraction : FixRep Project

                                                             


Whatever can go wrong...

• PDFs can be:

– Encrypted

– Corrupted

– Oddly encoded

– An image file without embedded text

– Occurrence: ~3-6%

Page 9: Approaches to automated metadata extraction : FixRep Project

                                                             


Character sets

• Ligatures, accents and symbols may not always be extractable from PDFs

Image © Daniel Ullrich

Page 10: Approaches to automated metadata extraction : FixRep Project

                                                             


Document formats/layouts

• Many possible formats

• Some formats not widely supported

• Document layouts vary widely, esp. by discipline

Page 11: Approaches to automated metadata extraction : FixRep Project

                                                             


1) Challenges in metadata extraction

2) Manual metadata generation

3) Metadata extraction in brief

4) Practical use as part of a repository deposit workflow

5) A user study comparing manual and hybrid input

6) Towards metadata triage

Page 12: Approaches to automated metadata extraction : FixRep Project

                                                             


Whatever can go wrong... (II)

• Function following form – interface

• Model adapted to suit unique user needs

• Data model incompletely supported

• Input validation issues

• Systematic error; typos; localisation; encoding; etc.

• Lots of past work in characterising manual input errors

Page 13: Approaches to automated metadata extraction : FixRep Project

                                                             


1) Challenges in metadata extraction

2) Manual metadata generation

3) Metadata extraction in brief

4) Practical use as part of a repository deposit workflow

5) A user study comparing manual and hybrid input

Page 14: Approaches to automated metadata extraction : FixRep Project

                                                             


Image segmentation, templating & OCR

Page 15: Approaches to automated metadata extraction : FixRep Project

                                                             


Working from text

• There are a number of possible states (i.e. title, author, email, affiliation, abstract)

• Directed graph with probabilities

– Markov chain: for example, Title → Author → Email → Affil.
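
As a rough illustration of the directed graph with probabilities, the sketch below scores a sequence of field labels under a first-order Markov chain. The transition values and the `START` state are invented for illustration and are not taken from the FixRep system; `sequence_probability` is a hypothetical helper name.

```python
# Hypothetical transition probabilities between header fields.
# These numbers are illustrative assumptions, not measured values.
TRANSITIONS = {
    "START": {"title": 0.9, "author": 0.1},
    "title": {"title": 0.6, "author": 0.4},
    "author": {"author": 0.5, "email": 0.3, "affiliation": 0.2},
    "email": {"email": 0.2, "affiliation": 0.6, "abstract": 0.2},
    "affiliation": {"affiliation": 0.5, "email": 0.1, "abstract": 0.4},
    "abstract": {"abstract": 1.0},
}


def sequence_probability(states):
    """Multiply the transition probabilities along a sequence of states."""
    prob = 1.0
    previous = "START"
    for state in states:
        # A transition missing from the table is treated as impossible.
        prob *= TRANSITIONS.get(previous, {}).get(state, 0.0)
        previous = state
    return prob


likely = sequence_probability(["title", "author", "email", "affiliation"])
unlikely = sequence_probability(["title", "email", "author", "affiliation"])
print(likely, unlikely)
```

Under these assumed values the conventional ordering scores higher than one in which an email address directly follows the title, which is the kind of preference the chain is meant to encode.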

Page 16: Approaches to automated metadata extraction : FixRep Project

                                                             


Hidden Markov Model

• We cannot directly see these states – only the words

• But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented

• This may be expressed in terms of an HMM

• Bayesian statistics used across term appearance

Page 17: Approaches to automated metadata extraction : FixRep Project

                                                             


Example parse

• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE

• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE

• ...

• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE

• Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection

Page 18: Approaches to automated metadata extraction : FixRep Project

                                                             


1) Challenges in metadata extraction

2) Manual metadata generation

3) Metadata extraction in brief

4) Practical use as part of a repository deposit workflow

5) A user study comparing manual and hybrid input

6) Towards metadata triage

Page 19: Approaches to automated metadata extraction : FixRep Project

                                                             


Aims

• Adaptation of existing interfaces

• Enhancing rather than rewriting

• Cross-platform, accessible interface

• Simple reusable REST API, metadata as DC/XML
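
To give a feel for "metadata as DC/XML", the sketch below serialises an extracted record as Dublin Core elements using the Python standard library. The `metadata` wrapper element and the `to_dc_xml` helper are assumptions made for illustration; the slides do not specify the actual response format of the service. The namespace URI is the standard Dublin Core Metadata Element Set 1.1 namespace.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dc_xml(record):
    """Serialise {field: [values]} as DC elements under a hypothetical wrapper."""
    root = ET.Element("metadata")  # wrapper element name is an assumption
    for field, values in record.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "title": ["Confirmation-Guided Discovery of First-Order Rules"],
    "creator": ["PETER A. FLACH", "NICOLAS LACHICHE"],
}
xml_text = to_dc_xml(record)
print(xml_text)
```

Repeating an element such as `dc:creator` once per value keeps each author separately addressable, which makes later checking and correction by the depositor simpler.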

Page 20: Approaches to automated metadata extraction : FixRep Project

                                                             


Sample interfaces

Page 21: Approaches to automated metadata extraction : FixRep Project

                                                             


Sample interfaces

Page 22: Approaches to automated metadata extraction : FixRep Project

                                                             


Architecture

Page 23: Approaches to automated metadata extraction : FixRep Project

                                                             


Using what we know...

Page 24: Approaches to automated metadata extraction : FixRep Project

                                                             


1) Challenges in metadata extraction

2) Manual metadata generation

3) Metadata extraction in brief

4) Practical use as part of a repository deposit workflow

5) A user study comparing manual and hybrid input

6) Towards metadata triage

Page 25: Approaches to automated metadata extraction : FixRep Project

                                                             


Question:

• “Do people accept ‘hybrid’ interfaces?”

• Here’s one we did earlier…

Page 26: Approaches to automated metadata extraction : FixRep Project

                                                             


Hypotheses

• Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.

• The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.

• User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.

• Measured: speed and accuracy of manual versus hybrid entry and, qualitatively, user satisfaction

Page 27: Approaches to automated metadata extraction : FixRep Project

                                                             


Results: Timing

• Hybrid faster under both conditions

• (Summary of median times)

Page 28: Approaches to automated metadata extraction : FixRep Project

                                                             


Results: Accuracy

• Tested against ground-truth

• Keyword accuracy: the first keyword listed was relevant for 46% of the publications; the top two were relevant for 66%; the top five cover 81% of all desired keywords.

• Manual metadata accuracy:

– Few users use cut and paste

– Capitalisation and punctuation frequently differ

– Synonyms are accidentally substituted

• Hybrid closer to ground-truth, and more complete, but results not clear-cut.
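
Comparisons of this kind can be sketched as follows. The code below is not the evaluation procedure used in the study; it is a minimal illustration, assuming that differences in capitalisation and punctuation should be ignored when comparing an entry with the ground truth, and that keyword quality is measured as the share of desired keywords found among the first k suggestions. All function names and sample keywords are hypothetical.

```python
import string


def normalise(text):
    """Lower-case, strip ASCII punctuation and collapse whitespace."""
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())


def matches_ground_truth(entered, truth):
    """Compare two values while ignoring case and punctuation differences."""
    return normalise(entered) == normalise(truth)


def top_k_coverage(ranked_keywords, desired_keywords, k):
    """Fraction of desired keywords found among the first k suggestions."""
    desired = {normalise(kw) for kw in desired_keywords}
    if not desired:
        return 0.0
    found = {normalise(kw) for kw in ranked_keywords[:k]} & desired
    return len(found) / len(desired)


same = matches_ground_truth(
    "Confirmation-guided discovery of first-order rules.",
    "Confirmation-Guided Discovery of First-Order Rules",
)
suggested = ["machine learning", "logic", "rules"]  # hypothetical suggestions
wanted = ["Rules", "Logic"]                         # hypothetical ground truth
print(same, top_k_coverage(suggested, wanted, 2), top_k_coverage(suggested, wanted, 3))
```

Normalising before comparison separates cosmetic differences from substantive ones such as substituted synonyms, which a string comparison of this kind would still report as mismatches.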

Page 29: Approaches to automated metadata extraction : FixRep Project

                                                             


Qualitative results

• Most users preferred the hybrid mode

• Most perceived it to be faster than manual data entry

• Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches

• Both were of good quality

Page 30: Approaches to automated metadata extraction : FixRep Project

                                                             


Discussion

• Results support hypotheses

• People preferred the hybrid interface, and found it more satisfying to use

• Accessibility issues exist, but can be overcome

• The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty, i.e. it did more than the subject wanted!

Page 31: Approaches to automated metadata extraction : FixRep Project

                                                             


1) Challenges in metadata extraction

2) Manual metadata generation

3) Metadata extraction in brief

4) Practical use as part of a repository deposit workflow

5) A user study comparing manual and hybrid input

6) Towards metadata triage

Page 32: Approaches to automated metadata extraction : FixRep Project

                                                             


MetRe prototype (2008)

• Characteristic classes of individual/systematic error highlighted

• Nb. local and general best practice. Uses: ranking, browsing, correcting systematic error

• Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
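
The idea of ranking occurrences to surface systematic error can be illustrated with a small sketch. This is not the MetRe implementation: it simply assumes that, among values differing only in case or spacing, the most frequent spelling is the local norm and the rest are candidates for review. The function names and sample values are hypothetical.

```python
from collections import Counter, defaultdict


def rank_variants(values):
    """Group values differing only in case/whitespace; rank variants by frequency."""
    groups = defaultdict(Counter)
    for value in values:
        key = " ".join(value.lower().split())
        groups[key][value] += 1
    return {key: counts.most_common() for key, counts in groups.items()}


def suspicious_variants(values):
    """Return every variant other than the most frequent one in its group."""
    flagged = []
    for variants in rank_variants(values).values():
        # The most frequent spelling is treated as the local norm (an assumption).
        flagged.extend(variant for variant, _count in variants[1:])
    return flagged


harvested = [  # hypothetical harvested field values
    "University of Bath",
    "University of Bath",
    "university of bath",
    "UKOLN",
    "University  of Bath",
]
flagged = suspicious_variants(harvested)
print(flagged)
```

Ranking by frequency rather than enforcing a fixed rule allows local practice to differ between repositories while still highlighting the outliers within each one.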

Page 33: Approaches to automated metadata extraction : FixRep Project

                                                             


Page 34: Approaches to automated metadata extraction : FixRep Project

                                                             


Page 35: Approaches to automated metadata extraction : FixRep Project

                                                             


Issues

• Discipline/domain-specific issues

• Lots of information required to do this right (see metadata schema/terminology registry)

• Some application profiles (APs) present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)

Page 36: Approaches to automated metadata extraction : FixRep Project

                                                             


Approach

• Generally dependent on heuristics over available data

• Powered by very specific functions (classifiers, validation, etc…)

• Potentially expensive, not always domain-independent

Page 37: Approaches to automated metadata extraction : FixRep Project

                                                             


Future work

• More!

– Data

– Filters (input/output formats)

– Methods

– Evaluation

– Service availability (mail me for announcements!)

Page 38: Approaches to automated metadata extraction : FixRep Project

                                                             


Conclusion

• Metadata creation can be supported through software

• Specific problem sets in metadata triage

• Work continues in the FixRep project

Page 39: Approaches to automated metadata extraction : FixRep Project

                                                             


Conclusion (II)

• Formal Metadata Extraction/evaluation

• Metadata review process

• Accessibility metadata

• Entity extraction (named entities, geographical, temporal [k-int!])

• Repository integration

Page 40: Approaches to automated metadata extraction : FixRep Project

                                                             


• Thanks!

• Comments/Questions?

• www.ukoln.ac.uk/projects/fixrep