TEXT MINING 101: WHAT YOU SHOULD KNOW

Ethan Pullman (Carnegie Mellon University)

Denise Novak (Carnegie Mellon University)

Kristen Garlock (Ithaka.org)

Patricia Cleary (Springer US)

NASIG Annual

Saturday, June 11, 2016, 10:30 am

Working with Your Constituents

Ethan Pullman
Humanities Liaison & Library Instruction Coordinator
Carnegie Mellon University

Audience Survey

On a scale from 1-5, novice to experienced, how familiar are you with text mining?

1 -- 2 -- 3 -- 4 -- 5

What is your role in providing Text Mining (TM) Services?

A. We have an expert librarian/service center
B. I work directly with my department(s)
C. Other: (Please describe)

My TM population mainly consists of:
● Faculty
● PhDs/TAs
● Undergrads

How many of you have used TM for your own research/project?

Text Mining Briefly
● What is it?
● What is its purpose?

Photo adapted from Text Mine ‘01

Text mining is the automated processing of large amounts of structured digital texts

Purpose: retrieval, analysis, and interpretation of texts.

Note: Mining non-textual information falls under “data mining”. Although it is often grouped with text mining as “Text & Data Mining”, data mining is different and requires tools and methodologies that are distinct from those of text mining.

http://web.eecs.utk.edu/events/tmw01

Text Mining Examples

Visualization tools build word clouds from words mined from large texts.

SDFB (Six Degrees of Francis Bacon) mines British early modern texts to trace “social connections” between individuals from that period.

A class project used text mining to analyze case documents and briefs submitted by the Authors’ Guild in Authors’ Guild vs. Google. The analysis shed light on the rhetorical strategy used by the Authors’ Guild lawyers and informed outcome prediction.

It Ain’t About the Money, Money, Money… or is it? Authors’ Guild vs. Google Books: A Rhetorical Analysis

http://www.cmu.edu/news/stories/archives/2015/october/francis-bacon-launch.html
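The word-cloud example above rests on a simple step: counting how often each word appears in a text. The following is a minimal sketch of that step, not the method used by any tool named in this presentation. The function name `word_frequencies`, the tiny stop-word list, and the sample sentence are all illustrative assumptions.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real projects use fuller lists).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for", "on", "that"}


def word_frequencies(text, top_n=10):
    """Return the top_n most frequent non-stop-words as (word, count) pairs."""
    # Lowercase the text and keep runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Hypothetical sample text, used only to show the call.
sample = (
    "Text mining is the automated processing of large amounts of text. "
    "Mining text reveals patterns in text."
)
print(word_frequencies(sample, top_n=2))  # [('text', 4), ('mining', 2)]
```

A word-cloud tool would then scale each word's display size by its count; that rendering step is left out here.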

The Role of Library Liaisons

What is new?
Acquiring texts?
Providing access?

Librarians need to understand:
> how texts are used in the digital age
> what tools are available
> issues impacting acquisition and access

How I stay informed ...

Faculty Profiles
● Curriculum Vitae
● Publications
● Syllabi

How I stay informed ...
● Attend departmental lectures
● Visit research showcases
● Read about campus initiatives

How I stay informed ...
● Maintaining our Text-Mining Website:
● Professional participation:
  ● Organizations & Conferences: for example, Text Analytics World
  ● Social Networks/Email lists, blogs
● Seek continuing education opportunities
● Collaborate with our acquisition and data services librarians

http://www.textanalyticsworld.com/

Acquisitions Point of View

Denise Novak
Acquisitions Librarian
Carnegie Mellon University

Supporting Text Mining of the JSTOR Digital Library

Kristen Garlock
Associate Director of Education and Outreach, JSTOR
Ithaka

What is Data for Research?

Data for Research is a self-service website for generating datasets from the content on JSTOR.

http://dfr.jstor.org

How it works

Service is free, permitted under Terms & Conditions.
● Data for Research: Researcher creates free account on site, defines parameters of dataset, submits request, downloads dataset.
● Full-text datasets: Letter agreement (may be established with individuals or libraries). Datasets not limited by licenses or institutional affiliation.

Support for Text Mining

Why?
● Supporting new types of scholarship is part of our mission
● Opportunities to build beneficial partnerships
● Increasing value of publications; corpus in and of itself has value as a scholarly tool

NOTE: For a bibliography of projects and research that incorporated datasets from JSTOR, please contact Kristen Garlock.

Challenges

Biggest challenges:
● Staffing and support
● Keeping up with evolving researcher needs

Trends:
● Increasing numbers of requests
● Requests for larger and more complex datasets
● Interest from non-technologists
● Scholars not anticipating/understanding gaps or data issues in datasets
● Desire to combine datasets from multiple sources

Springer TDM policy

Patricia Cleary
Global eProduct Development Manager
Springer US

Springer TDM Policy Update (June 2016)

• This presentation provides an overview of the current Springer TDM policy.

• Springer is currently working on a new combined TDM policy for Springer Nature.

• The new TDM policy will be announced sometime in the near future.

Springer’s TDM policy was introduced in 2014

• The volume of scientific publications is increasing and TDM software tools continue to improve

• Springer acknowledges the need for a more formalized process to enable TDM

• Strive to make it as simple as possible for researchers

Springer grants text- and data-mining rights to subscribed content, provided the purpose is non-commercial research

For researchers with subscription access

• Individual researchers can download subscription and open access content for TDM purposes directly from the SpringerLink platform

• No registration or API key is required

• Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI)

For researchers with no subscription access

• Researchers who do not have subscription access to SpringerLink can send requests for TDM access to a contact within Springer

• These inquiries will be considered on a case by case basis

Implementation by academic and government institutions

• For subscribers at academic and government institutions, these rights will be included in all new and renewed SpringerLink subscription agreements as an additional TDM clause

• Existing subscribers may also add the TDM clause before their agreement is up for renewal

Use of text and data mining results and research output

• Publications or analyses resulting from TDM of subscribed content may include quotations from the original text of up to 200 characters, or 20 words, or 1 complete sentence

• Such publications should cite the original Springer content in the form of a DOI link

• Permission to reproduce images may be granted on a case-by-case basis

• For Open Access (OA) publications from Springer, BioMed Central and SpringerOpen, TDM is usually allowed without restrictions since the majority of Springer's OA content is licensed under CC-BY
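The quotation limits and DOI citation rule above can be checked mechanically before publishing TDM results. The sketch below is illustrative only and is not a Springer tool. It assumes a quotation passes if it meets any one of the three limits (the slide joins them with "or"), it detects sentences naively, and every function name is hypothetical. The only external fact it relies on is that DOIs resolve through `https://doi.org/`.

```python
import re

MAX_CHARS = 200  # limit stated in the policy slide
MAX_WORDS = 20   # limit stated in the policy slide


def is_single_sentence(quote):
    """Naive check: no sentence-ending mark, or exactly one at the very end.

    Abbreviations such as "e.g." will be misread; this is only a sketch.
    """
    stripped = quote.strip()
    enders = re.findall(r"[.!?]", stripped)
    return len(enders) == 0 or (len(enders) == 1 and stripped[-1] in ".!?")


def quote_within_limit(quote):
    """Assumption: meeting ANY one of the three limits is sufficient."""
    stripped = quote.strip()
    return (
        len(stripped) <= MAX_CHARS
        or len(stripped.split()) <= MAX_WORDS
        or is_single_sentence(stripped)
    )


def doi_link(doi):
    """Turn a bare or prefixed DOI into a link through the doi.org resolver."""
    doi = doi.strip()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    return "https://doi.org/" + doi


# "10.1007/placeholder" is a made-up DOI used only to show the output format.
print(quote_within_limit("Text mining supports new types of scholarship."))  # True
print(doi_link("doi:10.1007/placeholder"))  # https://doi.org/10.1007/placeholder
```

If the policy is read more strictly, so that a quotation must satisfy all three limits, replacing `or` with `and` in `quote_within_limit` gives that behavior.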

Technical guide to downloading content

• For TDM researchers interested in cross-publisher automated downloading, the CrossRef TDM initiative may be useful

• Springer is actively collaborating with CrossRef on this project and we expect Springer content to be fully supported soon

• Guidelines for performing TDM of Springer content are located on Springer’s text- and data-mining policy page on Springer.com

Springer Metadata API

• Springer provides the free Springer Metadata API for searching within Springer content

• Provides rich searching for the vast majority of Springer, BioMed Central and SpringerOpen documents, including all journal content, book chapters and protocols

• The Springer Book Archives will soon be searchable through this API as well

Q&A

[Q] Do publishers prefer to sign agreements directly with researchers, or with the libraries that either have an active subscription or have purchased the corpus to be mined?

[A] So far, Springer has only signed licenses with libraries. We are currently focused on customers who have an active subscription with us. TDM access covers the content that researchers can access through their institutional subscription, as well as OA content.

Q&A (cont’d)

[Q] If libraries do sign agreements on behalf of researchers, does Springer expect libraries to track or monitor researcher activities, either for compliance with the terms of the agreement or for reporting purposes?

[A] Springer doesn't expect libraries to directly monitor researcher TDM activities as separate from regular content access activities. TDM access is subject to the same restrictions as any regular content from a library-researcher relationship point of view.

Q&A (cont’d)

[Q] What drives publisher decisions to host data vs. send the data to libraries for hosting? What types of costs are associated with hosting? How can libraries support an infrastructure for text mining if the data is sent on drives, and do publishers mind if researchers get copies of this data (sort of like a dataset that we buy for them?)

[A] This is different per publisher. Since Springer provides content that is DRM-free, we can host content on our native site SpringerLink, or offline at the library.

The advantage of SpringerLink is that the library does not have to constantly receive updated data from us, and doesn’t have to build a GUI or API to query the dataset. Researchers can perform a search using any tool they prefer, and request content access from SpringerLink, which is the same as downloading individual articles.

Useful links

Springer's Text and Data Mining Policy

https://www.springer.com/gp/rights-permissions/springer-s-text-and-data-mining-policy/29056

Springer / BioMed Central API Portal

https://dev.springer.com/

CrossRef TDM Initiative

http://tdmsupport.crossref.org/

Thank You!