Approaches to automated metadata extraction : FixRep Project

A centre of expertise in digital information management

www.ukoln.ac.uk

UKOLN is supported by:


Emma Tonkin

[email protected]

www.bath.ac.uk



Wouldn't it be nice if...

• ...computers could author our metadata for us, thus saving a lot of hassle?

• Mechanical metadata extraction vs manual metadata input



But...

• Automated tools are fallible

• There's never quite enough information available

• Templates change, different domains have different standards

• In short, computers are often wrong, and so are people



The 'half a loaf' hypothesis

• Hybrid approach:
– Get what metadata you can
– Ask the user to check and clean it if necessary

• Philosophy:
– If the computer gets it wrong, we can fix it later



Wouldn’t it be nice if…

• …computers could fix our metadata for us?

• Or, more realistically, help us do this work for ourselves.



• All about ‘fixing it later’, doing what we can with what we have

• Automated metadata extraction + metadata consistency assessment

• Metadata generation, evaluation, characterisation: enabling metadata triage



1) Challenges in automated metadata extraction

2) Manual metadata generation

3) Metadata extraction in brief

4) Practical use as part of a repository deposit workflow

5) A user study comparing manual and hybrid input

6) Towards metadata triage



Whatever can go wrong...

• PDFs can be:
– Encrypted
– Corrupted
– Oddly encoded
– An image file without embedded text
– Occurrence: ~3-6%
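A rough sketch of how some of these failure cases might be flagged before extraction is attempted. This is an illustrative heuristic only, not the FixRep implementation: the function name `triage_pdf` and the flag names are hypothetical, and the byte-level checks are crude (a real PDF parser would be needed to detect odd encodings or image-only files).

```python
def triage_pdf(data: bytes) -> list[str]:
    """Return crude warning flags for a PDF held in memory (heuristic only)."""
    problems = []
    # A well-formed PDF carries a "%PDF-" header near the top of the file.
    if b"%PDF-" not in data[:1024]:
        problems.append("missing-header")
    # It should also end with an "%%EOF" marker; truncated files often lose it.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing-eof")
    # Encrypted PDFs reference an /Encrypt dictionary from the trailer.
    if b"/Encrypt" in data:
        problems.append("encrypted")
    return problems


# Tiny hand-made byte strings standing in for real files (assumed test data).
good = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n"
locked = good.replace(b"/Root 1 0 R", b"/Root 1 0 R /Encrypt 2 0 R")
truncated = good[:20]

for name, blob in [("good", good), ("locked", locked), ("truncated", truncated)]:
    print(name, triage_pdf(blob))
```

Files that raise any flag could be routed straight to manual input rather than to the extractor.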



Character sets

• Ligatures, accents and symbols may not always be extractable from PDFs

Image © Daniel Ullrich
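Where such characters do survive extraction, they may still need normalising before they can be matched against other text. A minimal sketch using Python's standard `unicodedata` module; the helper names are hypothetical, and stripping accents is lossy, so it is only suitable for building comparison keys, not for stored metadata.

```python
import unicodedata


def normalise_extracted_text(text: str) -> str:
    # NFKC folds compatibility characters, such as the "fi" ligature
    # (U+FB01), into their plain-letter equivalents.
    return unicodedata.normalize("NFKC", text)


def strip_accents(text: str) -> str:
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalise_extracted_text("Con\ufb01rmation-Guided Discovery"))
print(strip_accents("Caf\u00e9"))
```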



Document formats/layouts

• Many possible formats

• Some formats not widely supported

• Document layouts vary widely, esp. by discipline



1) Challenges in metadata extraction








Whatever can go wrong... (II)

• Function following form – interface
• Model adapted to suit unique user needs
• Data model incompletely supported
• Input validation issues
• Systematic error; typos; localisation; encoding; etc.
• Lots of past work in characterising manual input errors



Image segmentation, templating & OCR



Working from text

• There are a number of possible states (i.e. title, author, email, affiliation, abstract)

• Directed graph with probabilities

– Markov chain: for example, Title → Author → Email → Affil.
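A minimal sketch of such a chain, written as a table of transition probabilities between states. All the numbers below are invented for illustration; in practice they would be estimated from a collection of labelled documents.

```python
# Hypothetical transition probabilities: TRANSITIONS[a][b] is the assumed
# probability that a token in state a is followed by a token in state b.
TRANSITIONS = {
    "start": {"title": 1.0},
    "title": {"title": 0.6, "author": 0.4},
    "author": {"author": 0.5, "email": 0.3, "affiliation": 0.2},
    "email": {"email": 0.2, "affiliation": 0.8},
    "affiliation": {"affiliation": 0.7, "abstract": 0.3},
    "abstract": {"abstract": 1.0},
}


def sequence_probability(states):
    """Probability of a state sequence under the chain, starting from 'start'."""
    prob = 1.0
    previous = "start"
    for state in states:
        # Transitions missing from the table are treated as impossible.
        prob *= TRANSITIONS[previous].get(state, 0.0)
        previous = state
    return prob


print(sequence_probability(["title", "author", "email", "affiliation"]))
print(sequence_probability(["email", "title"]))
```

Under these assumed numbers, the ordering title, author, email, affiliation scores well above zero, while a document that opens with an email address is ruled out entirely.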



Hidden Markov Model

• We cannot directly see these states – only the words

• But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented

• This may be expressed in terms of an HMM

• Bayesian statistics used across term appearance
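The idea can be sketched as a small Viterbi decoder: given only the visible tokens, it finds the most probable sequence of hidden states. This is a toy illustration, not the project's model. The three states, the crude token features (`word`, `upper`, `at`) and every probability are assumptions made up for the example, and the email address is a placeholder.

```python
STATES = ["title", "author", "email"]

# Hypothetical start, transition and emission probabilities.
START = {"title": 0.9, "author": 0.1, "email": 0.0}
TRANS = {
    "title": {"title": 0.7, "author": 0.3, "email": 0.0},
    "author": {"title": 0.0, "author": 0.6, "email": 0.4},
    "email": {"title": 0.0, "author": 0.1, "email": 0.9},
}
EMIT = {
    "title": {"word": 0.80, "upper": 0.15, "at": 0.05},
    "author": {"word": 0.30, "upper": 0.65, "at": 0.05},
    "email": {"word": 0.05, "upper": 0.05, "at": 0.90},
}


def feature(token):
    """Map a token to a crude observable feature."""
    if "@" in token:
        return "at"
    if token.isupper():
        return "upper"
    return "word"


def viterbi(tokens):
    """Return the most probable hidden state for each token."""
    if not tokens:
        return []
    obs = [feature(t) for t in tokens]
    # best[i][s] = (probability of the best path ending in s at token i, previous state)
    best = [{s: (START[s] * EMIT[s][obs[0]], None) for s in STATES}]
    for o in obs[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p][0] * TRANS[p][s])
            column[s] = (best[-1][prev][0] * TRANS[prev][s] * EMIT[s][o], prev)
        best.append(column)
    # Trace back from the most probable final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


tokens = "Discovery of Rules PETER FLACH flach@example.org".split()
print(list(zip(tokens, viterbi(tokens))))
```

A real system would use far richer features and would learn the probabilities from the growing knowledge base rather than hard-coding them.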



Example parse

• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE


• ...


• Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection



Aims

• Adaptation of existing interfaces

• Enhancing rather than rewriting

• Cross-platform, accessible interface

• Simple reusable REST API, metadata as DC/XML
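As an illustration of the "metadata as DC/XML" part, extracted fields might be serialised with Dublin Core element names roughly as follows. This is a sketch under assumptions: the `metadata` wrapper element and the `to_dc_xml` helper are hypothetical, and the actual response format of the service is not specified here. Only the Dublin Core namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dc_xml(record: dict) -> str:
    """Serialise {dc_element_name: [values]} as a flat XML document."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for field, values in record.items():
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "title": ["Confirmation-Guided Discovery of First-Order Rules"],
    "creator": ["PETER A. FLACH", "NICOLAS LACHICHE"],
}
xml_text = to_dc_xml(record)
print(xml_text)
```

Returning plain DC/XML keeps the service independent of any one repository platform: an existing deposit form only needs to parse the response and pre-fill its fields.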



Sample interfaces





Architecture



Using what we know...



Question:

• “Do people accept ‘hybrid’ interfaces?”

• Here’s one we did earlier…



Hypotheses

• Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.

• The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.

• User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.

• Measured: speed and accuracy of manual versus hybrid entry and, qualitatively, user satisfaction



Results: Timing

• Hybrid faster under both conditions

• (Summary of median times)



Results: Accuracy

• Tested against ground-truth

• Keyword accuracy: First keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top-5 cover 81% of all desired keywords.

• Manual metadata accuracy:

– Few users use cut and paste

– Capitalisation and punctuation frequently differ

– Synonyms are accidentally substituted

• Hybrid closer to ground-truth, and more complete, but results not clear-cut.
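The keyword figures above mix two kinds of measure: whether a relevant keyword appears among the first k suggestions, and how many of the desired keywords the first k suggestions cover. A sketch of how both might be computed against ground truth is given below. The exact method used in the study is not stated here, so the function names and the sample data are assumptions.

```python
def hit_at_k(suggested, desired, k):
    """True if any of the first k suggested keywords is a desired keyword."""
    return any(keyword in desired for keyword in suggested[:k])


def coverage_at_k(suggested, desired, k):
    """Fraction of the desired keywords found among the first k suggestions."""
    if not desired:
        return 0.0
    return len(set(suggested[:k]) & set(desired)) / len(set(desired))


# Invented sample data: (ranked suggestions, ground-truth keywords) per paper.
papers = [
    (["metadata", "pdf", "ocr"], {"metadata", "ocr"}),
    (["markov", "parsing", "metadata"], {"metadata"}),
]

hit1 = sum(hit_at_k(s, d, 1) for s, d in papers) / len(papers)
cov3 = sum(coverage_at_k(s, d, 3) for s, d in papers) / len(papers)
print(f"hit@1 = {hit1:.2f}, coverage@3 = {cov3:.2f}")
```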



Qualitative results

• Most users preferred the hybrid mode

• Most perceived it to be faster than manual data entry

• Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches

• Both were of good quality



Discussion

• Results support hypotheses

• People prefer the hybrid interface, and found it more satisfying to use

• Accessibility issues exist, but can be overcome

• The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty, i.e. it did more than the subject wanted!



MetRe prototype (2008)

• Characteristic classes of individual/systematic error highlighted

• N.B. local and general best practice. Uses: ranking, browsing, correcting systematic error

• Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
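One simple way to surface systematic variation in harvested values is to group values that become identical after normalisation and rank the surface forms by frequency. The sketch below illustrates the general idea only. It is not the MetRe algorithm, the helper names are hypothetical, and the sample values are invented.

```python
from collections import Counter, defaultdict
import string

PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalise(value):
    # Fold case, strip punctuation and collapse whitespace.
    return " ".join(value.lower().translate(PUNCTUATION).split())


def find_variants(values):
    """Map the most frequent form of each value to its less common variants."""
    groups = defaultdict(Counter)
    for value in values:
        groups[normalise(value)][value] += 1
    report = {}
    for counts in groups.values():
        if len(counts) > 1:  # more than one surface form for the same value
            preferred, _ = counts.most_common(1)[0]
            report[preferred] = sorted(v for v in counts if v != preferred)
    return report


harvested = ["Univ. of Bath", "univ of bath", "Univ. of Bath", "UKOLN", "Univ of Bath."]
print(find_variants(harvested))
```

The most frequent form is treated here as local best practice, and the rarer forms become candidates for correction. That is an assumption: the majority form is not always the right one.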



Issues

• Discipline/domain-specific issues

• Lots of information required to do this right (see metadata schema/terminology registry)

• Some application profiles (APs) present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)



Approach

• Generally dependent on heuristics over available data

• Powered by very specific functions (classifiers, validation, etc…)

• Potentially expensive, not always domain-independent



Future work

• More!
– Data
– Filters (input/output formats)
– Methods
– Evaluation
– Service availability (mail me for announcements!)



Conclusion

• Metadata creation can be supported through software

• Specific problem sets in metadata triage

• Work continues in the FixRep project



Conclusion (II)

• Formal Metadata Extraction/evaluation

• Metadata review process

• Accessibility metadata

• Entity extraction (named entities, geographical, temporal [k-int!])

• Repository integration



• Thanks!

• Comments/Questions?

• www.ukoln.ac.uk/projects/fixrep
