TEXT MINING 101: WHAT YOU SHOULD KNOW Ethan Pullman (Carnegie Mellon University) Denise Novak (Carnegie Mellon University) Kristen Garlock (Ithaka.org) Patricia Cleary (Springer US) NASIG annual Saturday, June 11, 2016 10:30 am


Page 1: Text mining 101 what you should know

TEXT MINING 101: WHAT YOU SHOULD KNOW

Ethan Pullman (Carnegie Mellon University)

Denise Novak (Carnegie Mellon University)

Kristen Garlock (Ithaka.org)

Patricia Cleary (Springer US)

NASIG annual
Saturday, June 11, 2016 10:30 am

Page 2: Text mining 101 what you should know

Working with Your Constituents

Ethan Pullman
Humanities Liaison & Library Instruction Coordinator
Carnegie Mellon University

Page 3: Text mining 101 what you should know

Audience Survey

On a scale from 1-5, novice to experienced, how familiar are you with text mining?

1 -- 2 -- 3 -- 4 -- 5

What is your role in providing Text Mining (TM) Services?

A. We have an expert librarian/service center
B. I work directly with my department(s)
C. Other: (Please describe)

My TM population mainly consists of:

● Faculty
● PhDs/TAs
● Undergrads

How many of you have used TM for your own research/project?

Page 4: Text mining 101 what you should know

Text Mining Briefly

● What is it?
● What is its purpose?

Photo adapted from Text Mine ‘01

Text mining is the automated processing of large amounts of structured digital texts

Purpose: retrieval, analysis, and interpretation of texts.

Note: Mining non-textual information falls under “data mining”. Although it is often grouped with text mining as “Text & Data Mining”, data mining is different and requires tools and methodologies distinct from those of text mining.

Page 5: Text mining 101 what you should know

Text Mining Examples

Visualization tools build word clouds from words mined from large texts.
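A word cloud is essentially a picture of word frequencies, so the counting step behind it can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular visualization tool: the stopword list, the tokenizing rule, and the function name `word_frequencies` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "from"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stopword tokens as (word, count) pairs."""
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)


sample = "Text mining is the mining of text. Mining text reveals patterns in text."
print(word_frequencies(sample, top_n=2))
```

A word-cloud tool would then scale each word's font size by its count.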

SDFB (Six Degrees of Francis Bacon) mines British early modern texts to trace “social connections” between individuals from that period.
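One simple way to suggest connections between people from text is to count how often two names are mentioned in the same document. The sketch below shows only that basic idea; it is not SDFB's actual method, and the names and sentences are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def co_mentions(documents, names):
    """Count how often each pair of names appears in the same document."""
    pair_counts = Counter()
    for doc in documents:
        # Names found in this document, sorted so each pair has a stable order.
        present = sorted(n for n in names if n in doc)
        pair_counts.update(combinations(present, 2))
    return pair_counts


# Hypothetical people and sentences, invented for this example.
names = ["Alice Carter", "Bram Holt", "Cecily Ward"]
docs = [
    "Alice Carter wrote a letter to Bram Holt.",
    "Bram Holt dined with Alice Carter and Cecily Ward.",
    "Cecily Ward travelled north alone.",
]
pairs = co_mentions(docs, names)
print(pairs.most_common())
```

Pairs with higher counts are candidates for a closer look; a real project would also need to handle name variants and chance co-occurrence.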

A class project used text mining to analyze case documents and briefs submitted by the Authors’ Guild in Authors’ Guild vs. Google. The analysis shed light on the rhetorical strategy used by the Authors’ Guild lawyers and informed outcome prediction.

It Ain’t About the Money, Money, Money… or is it? Authors’ Guild vs. Google Books: A Rhetorical Analysis

Page 6: Text mining 101 what you should know

The Role of Library Liaisons

What is new?
Acquiring texts?
Providing access?

Librarians need to understand:

> how texts are used in the digital age
> what tools are available
> issues impacting acquisition and access

Page 7: Text mining 101 what you should know

How I stay informed ...

Faculty Profiles:

● Curriculum Vitae
● Publications
● Syllabi

Page 8: Text mining 101 what you should know

How I stay informed

● Attend departmental lectures
● Visit research showcases
● Read about campus initiatives

Page 9: Text mining 101 what you should know

How I stay informed ...

● Maintaining our Text-Mining Website
● Professional participation:
  ● Organizations & Conferences: for example, Text Analytics World
  ● Social Networks/Email lists, blogs
● Seek continuing education opportunities
● Collaborate with our acquisition and data services librarians

Page 10: Text mining 101 what you should know

Acquisitions Point of View

Denise Novak
Acquisitions Librarian
Carnegie Mellon University

Page 11: Text mining 101 what you should know
Page 12: Text mining 101 what you should know
Page 13: Text mining 101 what you should know
Page 14: Text mining 101 what you should know

Supporting Text Mining of the JSTOR Digital Library

Kristen Garlock
Associate Director of Education and Outreach, JSTOR
Ithaka

Page 15: Text mining 101 what you should know

What is Data for Research?

Data for Research is a self-service website for generating datasets from the content on JSTOR.

http://dfr.jstor.org

Page 16: Text mining 101 what you should know

How it works

Service is free, permitted under Terms & Conditions.

● Data for Research: Researcher creates free account on site, defines parameters of dataset, submits request, downloads dataset.
● Full-text datasets: Letter agreement (may be established with individuals or libraries). Datasets not limited by licenses or institutional affiliation.

Page 17: Text mining 101 what you should know

Support for Text Mining

Why?

● Supporting new types of scholarship is part of our mission
● Opportunities to build beneficial partnerships
● Increasing value of publications; corpus in and of itself has value as a scholarly tool

NOTE: For a bibliography of projects and research that incorporated datasets from JSTOR, please contact Kristen Garlock ([email protected]).

Page 18: Text mining 101 what you should know

Challenges

Biggest challenges:

● Staffing and support
● Keeping up with evolving researcher needs

Trends:

● Increasing numbers of requests
● Requests for larger and more complex datasets
● Interest from non-technologists
● Scholars not anticipating/understanding gaps or data issues in datasets
● Desire to combine datasets from multiple sources

Page 19: Text mining 101 what you should know

Springer TDM policy

Patricia Cleary
Global eProduct Development Manager
Springer US

Page 20: Text mining 101 what you should know

Springer TDM Policy Update (June 2016)

• This presentation provides an overview of the current Springer TDM policy.

• Springer is currently working on a new combined TDM policy for Springer Nature.

• The new TDM policy will be announced sometime in the near future.


Page 21: Text mining 101 what you should know

Springer’s TDM policy was introduced in 2014

• The volume of scientific publications is increasing and TDM software tools continue to improve

• Springer acknowledges the need for a more formalized process to enable TDM

• Springer strives to make it as simple as possible for researchers

Springer grants text- and data-mining rights to subscribed content, provided the purpose is non-commercial research

Page 22: Text mining 101 what you should know

For researchers with subscription access

• Individual researchers can download subscription and open access content for TDM purposes directly from the SpringerLink platform

• No registration or API key is required

• Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI)
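To illustrate what a DOI-based "friendly URL" might look like in practice, here is a minimal Python sketch that turns a DOI into a full-text address. The base URL and the `/content/pdf/` path layout are assumptions for illustration, not confirmed by the slides, and the DOI is a made-up placeholder; Springer's TDM policy page should be consulted for the actual patterns.

```python
from urllib.parse import quote

# Assumption: illustrative base URL and path layout, not taken from the slides.
BASE_URL = "https://link.springer.com"


def fulltext_pdf_url(doi):
    """Build a candidate full-text PDF URL from a DOI (hypothetical pattern)."""
    doi = doi.strip()
    # A DOI has the form "10.<registrant>/<suffix>".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash separators.
    return f"{BASE_URL}/content/pdf/{quote(doi, safe='/')}.pdf"


# Placeholder DOI, invented for this example.
print(fulltext_pdf_url("10.1007/s00000-000-0000-0"))
```

Because the URL is derived from the DOI alone, a script can work through a list of DOIs without a separate lookup step.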

Page 23: Text mining 101 what you should know

For researchers with no subscription access

• Researchers who do not have subscription access to SpringerLink can send requests for TDM access to a contact within Springer

• These inquiries will be considered on a case by case basis

Page 24: Text mining 101 what you should know

Implementation by academic and government institutions

• For subscribers at academic and government institutions, these rights will be included in all new and renewed SpringerLink subscription agreements as an additional TDM clause

• Existing subscribers may also add the TDM clause before their agreement is up for renewal

Page 25: Text mining 101 what you should know

Use of text and data mining results and research output

• Publications or analyses resulting from TDM of subscribed content may include quotations from the original text of up to 200 characters, or 20 words, or 1 complete sentence

• Such publications should cite the original Springer content in the form of a DOI link

• Permission to reproduce images may be granted on a case-by-case basis

• For Open Access (OA) publications from Springer, BioMed Central and SpringerOpen, TDM is usually allowed without restrictions since the majority of Springer's OA content is licensed under CC-BY
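The quotation limits above can be checked mechanically before an excerpt goes into a paper. The sketch below assumes the "or" means a quotation is acceptable if it fits any one of the three limits; that reading, the rough sentence splitter, and the function names are assumptions, so the policy text itself remains the authority.

```python
import re

MAX_CHARS = 200  # limits as stated on the slide
MAX_WORDS = 20


def count_sentences(text):
    """Rough sentence count: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def quotation_allowed(passage):
    """True if the passage fits at least one limit (assumed reading of 'or')."""
    passage = passage.strip()
    if not passage:
        return False
    within_chars = len(passage) <= MAX_CHARS
    within_words = len(passage.split()) <= MAX_WORDS
    single_sentence = count_sentences(passage) == 1
    return within_chars or within_words or single_sentence


def doi_link(doi):
    """Format a DOI as a resolvable link for the citation."""
    return f"https://doi.org/{doi}"


print(quotation_allowed("A short quoted sentence."))
```

The sentence splitter is naive (abbreviations such as "e.g." will be miscounted), so borderline cases still need a human check.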

Page 26: Text mining 101 what you should know

Technical guide to downloading content

• For TDM researchers interested in cross-publisher automated downloading, the CrossRef TDM initiative may be useful

• Springer is actively collaborating with CrossRef on this project and we expect Springer content to be fully supported soon

• Guidelines for performing TDM of Springer content are located on Springer’s text- and data-mining policy page on Springer.com
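For cross-publisher downloading, the usual approach is to look up each DOI's metadata record and pick out the full-text links that the publisher has flagged for text mining. The sketch below works on an already-fetched record held as a Python dictionary. The field names (`link`, `URL`, `content-type`, `intended-application`) follow the CrossRef metadata format as understood here and should be treated as assumptions; the sample record is invented.

```python
def text_mining_links(work):
    """Return (content_type, url) pairs flagged for text mining in a work record."""
    return [
        (link.get("content-type", "unspecified"), link["URL"])
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining" and "URL" in link
    ]


# Hypothetical record with placeholder DOI and URLs, for illustration only.
sample_work = {
    "DOI": "10.1007/s00000-000-0000-0",
    "link": [
        {
            "URL": "https://example.org/fulltext.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://example.org/landing",
            "content-type": "text/html",
            "intended-application": "similarity-checking",
        },
    ],
}
print(text_mining_links(sample_work))
```

Keeping the parsing separate from the network request makes the logic easy to test and lets the same function be reused across publishers.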

Page 27: Text mining 101 what you should know

Springer Metadata API

• Springer provides the free Springer Metadata API for searching within Springer content

• Provides rich searching for the vast majority of Springer, BioMed Central and SpringerOpen documents, including all journal content, book chapters and protocols

• The Springer Book Archives will soon be searchable through this API as well
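A search against the Metadata API is an HTTP request with the query in the URL, so building that URL is the first step of any script. The endpoint address and the parameter names (`q`, `p`, `api_key`) in this sketch are assumptions for illustration, as is the need for a key; the API portal at dev.springer.com documents the real interface.

```python
from urllib.parse import urlencode

# Assumption: illustrative endpoint; check the API portal for the real one.
API_BASE = "https://api.springer.com/metadata/json"


def metadata_query_url(query, api_key, page_size=10):
    """Build a search URL for the (assumed) metadata endpoint."""
    params = {"q": query, "p": page_size, "api_key": api_key}
    # urlencode escapes spaces and special characters in the query string.
    return f"{API_BASE}?{urlencode(params)}"


# "YOUR_KEY" is a placeholder, not a working credential.
url = metadata_query_url("text mining", api_key="YOUR_KEY")
print(url)
```

The response would then be fetched with any HTTP client and parsed as JSON.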

Page 28: Text mining 101 what you should know

Q&A

[Q] Do publishers prefer to sign agreements directly with researchers, or with the libraries that either have an active subscription or have purchased the corpus to be mined?

[A] So far, Springer has only signed licenses with libraries. We are currently focused on customers who have an active subscription with us. TDM access covers content that researchers can access through their institutional subscription, as well as OA content.

Page 29: Text mining 101 what you should know

Q&A (cont’d)

[Q] If libraries do sign agreements on behalf of researchers, does Springer expect libraries to track or monitor researcher activities, either for compliance with the terms of the agreement, or for reporting purposes?

[A] Springer doesn't expect libraries to monitor researcher TDM activities separately from regular content access activities. From the point of view of the library-researcher relationship, TDM access is subject to the same restrictions as regular content access.

Page 30: Text mining 101 what you should know

Q&A (cont’d)

[Q] What drives publisher decisions to host data vs. send the data to libraries for hosting? What types of costs are associated with hosting? How can libraries support an infrastructure for text mining if the data is sent on drives, and do publishers mind if researchers get copies of this data (sort of like a dataset that we buy for them)?

[A] This varies by publisher. Since Springer provides content that is DRM-free, we can host content on our native site SpringerLink, or offline at the library.

The advantage of SpringerLink is that the library does not have to constantly receive updated data from us, and doesn’t have to build a GUI or API to query the dataset. Researchers can perform a search using any tool they prefer, and request content access from SpringerLink, which is the same as downloading individual articles.

Page 31: Text mining 101 what you should know

Useful links

Springer's Text and Data Mining Policy

https://www.springer.com/gp/rights-permissions/springer-s-text-and-data-mining-policy/29056

Springer / BioMed Central API Portal

https://dev.springer.com/

CrossRef TDM Initiative

http://tdmsupport.crossref.org/

Page 32: Text mining 101 what you should know

Thank You!

[email protected]