Text Mining: Opportunities and Barriers

John McNaught

Deputy Director

National Centre for Text Mining

John.McNaught@manchester.ac.uk

Topics

• What is text mining? (briefly)
• What can it offer? (selectively)
• What are the obstacles? (mostly)

NaCTeM

• First publicly-funded (JISC) national text mining centre in the world

• Remit: provide services to research community

• Initial focus on biology, then social sciences, medicine, chemistry, …

• Processing on a large scale, e.g. for UKPMC (Wellcome Trust + 17 other funders)

• www.nactem.ac.uk

What is text mining?

• Goal: Discover new knowledge from old
• How:
  – Process very large amounts of text
    • Millions of documents, the more the better
  – Identify and extract information
  – (Link extracted information to already curated knowledge)
  – Mine to discover implicit significant associations
  – Flag (unknown) associations for researcher to investigate further
  – Spin-off on the way: render information explicit
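The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not NaCTeM's pipeline: the lexicons `GENES` and `DISEASES`, the curated set `CURATED`, and the `min_count` threshold are all hypothetical, and simple dictionary matching stands in for real information extraction.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical toy lexicons; a real system would use curated terminologies.
GENES = {"brca1", "tp53"}
DISEASES = {"breast cancer", "leukemia"}

# Hypothetical set of already-curated (gene, disease) associations.
CURATED = {("brca1", "breast cancer")}


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_terms(sentence, lexicon):
    """Identify and extract: lexicon terms found in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return {t for t in lexicon if re.search(r"\b" + re.escape(t) + r"\b", lowered)}


def mine(documents):
    """Mine: count sentence-level gene-disease co-occurrences across all documents."""
    counts = Counter()
    for doc in documents:
        for sentence in split_sentences(doc):
            genes = find_terms(sentence, GENES)
            diseases = find_terms(sentence, DISEASES)
            counts.update(product(genes, diseases))
    return counts


def flag_unknown(counts, min_count=2):
    """Link and flag: keep frequent pairs that are absent from the curated set."""
    return sorted(p for p, n in counts.items() if n >= min_count and p not in CURATED)


docs = [
    "BRCA1 mutations raise the risk of breast cancer. TP53 was altered in leukemia.",
    "We observed TP53 loss in leukemia samples.",
]
counts = mine(docs)
candidates = flag_unknown(counts)
```

Here `candidates` holds only the pair that recurs and is not already curated. Those are the associations handed to a researcher to investigate further. With millions of documents, the raw count would normally be replaced by a statistical association measure.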

From text to new knowledge

What does it offer?

• Finds unsuspected knowledge
  – E.g. Disease-gene associations
• Enables discoveries human effort could not achieve (information overload/overlook)
• Enables better search/navigation of literature
  – Semantic search via extracted semantic metadata
• Reduces time spent searching
  – 15-48% of researcher time spent on classic search, 20-50% of classic searches unsatisfied
• E.g. Systematic reviews: months to weeks
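As a rough illustration of "semantic search via extracted semantic metadata", the sketch below indexes documents by typed facts instead of raw keywords. The `ANNOTATIONS` table is a hypothetical stand-in for the output of an extraction step, and the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical extraction output: doc id -> list of (semantic type, normalised term).
ANNOTATIONS = {
    "doc1": [("gene", "tp53"), ("disease", "leukemia")],
    "doc2": [("gene", "brca1"), ("disease", "breast cancer")],
    "doc3": [("disease", "leukemia")],
}


def build_index(annotations):
    """Map each (semantic type, term) pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, facts in annotations.items():
        for fact in facts:
            index[fact].add(doc_id)
    return index


def search(index, *facts):
    """Return the documents annotated with every requested (type, term) pair."""
    if not facts:
        return []
    hits = [index.get(f, set()) for f in facts]
    return sorted(set.intersection(*hits))


index = build_index(ANNOTATIONS)
leukemia_docs = search(index, ("disease", "leukemia"))
tp53_leukemia_docs = search(index, ("gene", "tp53"), ("disease", "leukemia"))
```

Because the query names a semantic type as well as a term, a search for the disease "leukemia" is not confused with any other use of the same string. That is one way extracted metadata can cut down on unsatisfied classic searches.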

What does it offer?

• Text mining boosts research
  – Makes research possible that would otherwise be impossible or unfeasible
• Research drives growth and innovation
• Research produces more information
• More information is available for text mining
• Text mining boosts research …

Barriers

• Access to the literature
• Format issues (tied to next point…)
  – “PDF is evil” (Lynch)
• Main blocks: copyright and licensing issues
  – <8% of scientific claims found in full article appear in its abstract (Blake)
  – Abstracts deficient on argumentation, discussion, methods, background, …
  – Full texts needed to realise full benefits of TM

Barriers

• Need to copy documents to analyse them
• Licences typically not favourable to TM
• Licences established on per institution basis
  – Prevents community-oriented services
    • Results only for internal use by institutional users
  – Hinders mining over collections of content from different providers
• Inconsistency: human can search and manually analyse, but cannot use machine to do same job on same data already subscribed to

Barriers

• Problem even with liberal OA licences
  – Author attribution required
• Author attribution in a data mining environment is impossible/unfeasible
  – Association finding: cannot track positive, negative, neutral individual author contributions
• Derived works in a TM environment
  – Every author of every text processed to produce new derived knowledge may have a claim…
  – Rights clearance thus an effective barrier

Barriers

• Laudable effort 1: NESLi2 model licence (JISC Collections) allows TM
  – Publisher <> single institution
  – But how many publishers retain TM provisions?
  – But cannot display annotations produced by TM on document itself
• Laudable effort 2: NPG licence for self-archived content allows TM
  – But “content must be destroyed when experiment complete” is vague. So services for community?

Conclusion

• Copyright and licensing restrictions block full realisation of TM benefits
  – Economic savings and potential for growth are stifled
• Japan has introduced an information analysis exception to copyright law
  – National Diet Library (= British Library) has recently changed its motto to: “Through knowledge we prosper”
  – Can we say the same in the UK?

Extras

Info=degree of surprise

Finding unknown associations: reproducing a discovery reported 5 days ago in Nature Medicine
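The slide does not say which measure lies behind "degree of surprise". One standard information-theoretic reading, offered here only as an assumption, is self-information for a single event and pointwise mutual information (PMI) for an association between two terms x and y:

```latex
I(x) = -\log_2 p(x), \qquad
\mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
```

As a worked example with made-up numbers: if a gene and a disease each occur in 1% of sentences, they would co-occur in 0.01% of sentences by chance. If they actually co-occur in 0.1%, then PMI = log2(0.001 / 0.0001) = log2(10) ≈ 3.32 bits, so the pair is about ten times more frequent than independence predicts and is a candidate for closer inspection.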

UKPMC EvidenceFinder by NaCTeM: Questions generated by deep analysis, with known answers

Click on a question to see relevant extracted evidence (from OA subset of the archive)