Discovering the World’s Research Ron Snyder Director of Advanced Technology, ITHAKA/JSTOR NASIG Annual Conference - 2012 June 9, 2012

Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

DESCRIPTION

In the age of networked information, we've seen major changes in expectations of how bibliographic data is searched and serves research. Summon is a web-scale discovery service that indexes and provides relevancy ranking across 1 billion items from thousands of collections and makes them accessible to researchers from a single search box in 450 institutions in over 40 countries. JSTOR is a not-for-profit provider of high-quality scholarly content spanning more than 300 years and covering nearly 60 disciplines. JSTOR provides online access to nearly 1,600 journals for more than 7,500 institutions in 166 countries. This presentation will discuss similarities in the mission and differences in the scope of these two services, including how they work together. We'll delve into the inner workings of each, including treatment of data, analysis of search, and the challenges each service faces in its mission. Presenters: Laura Robinson, Serials Solutions, and Ron Snyder, ITHAKA

Page 1: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Discovering the World’s Research

Ron Snyder

Director of Advanced Technology, ITHAKA/JSTOR

NASIG Annual Conference - 2012

June 9, 2012

Page 2: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Who we are

ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We pursue this mission by providing innovative services that aid in the adoption of these technologies and that create lasting impact.

JSTOR is a research platform that enables discovery, access, and preservation of scholarly content.

Page 3: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

JSTOR Factoids

• Started in 1997
• Journals online: 1,604
• Articles online: 7.5 million
• Disciplines covered: 60
• Participating institutions: 7,800
• Countries with participating institutions: 167

Page 4: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

JSTOR site activity

User Sessions (visits)
» New Sessions (per hour): 70k peak, 38k average
» Simultaneous Sessions: 44k peak, 21k average

Page Views
» 3.5M per day, 6.7M peak

Content Accesses
» 430k per day, 850k peak

Searches
» 456k per day, 1.13M peak

Page 5: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

ITHAKA/JSTOR Discovery Initiatives

• Overhaul of JSTOR Search Infrastructure
• Coming Soon (Summer 2012), watch for it…
• Analytics and data warehouse
• Capability for ingesting, organizing, and analyzing billions of usage events since JSTOR inception
• Improved external discoverability
• Various SEO, Google/GS, MS-Academic projects
• Local Discovery Integration (LDI) Pilot
• Machine-based document classification

Page 6: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

JSTOR Search Data Analysis: Some Early Findings

Page 7: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Site Search Activity, by type

6.3M Sessions, 19.8M Searches (from Mar-5 to Apr-16, 2012)

Basic, 78.20%
Advanced, 21.49%
Locator, 0.27%
Saved, 0.05%

            2009    2010    2011    2012
Basic       68.8%   71.3%   77.4%   78.2%
Advanced    30.5%   28.1%   22.3%   21.5%
Locator      0.6%    0.6%    0.3%    0.2%

Page 8: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Search Pages Viewed

1, 85.1%

2, 9.4%

3, 2.6%

4, 1.2%

5+, 1.7%

1.3 Search Results Pages Viewed per Search
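
The per-search average can be roughly cross-checked against the distribution above. The sketch below is only an illustration: it counts the "5+" bucket as exactly 5 pages (an assumption, since the slide does not break that bucket down), so the result is a lower bound. It comes out at about 1.25, which is consistent with the reported 1.3.

```python
# Share of searches by number of result pages viewed, as listed on the slide.
# Assumption: the "5+" bucket is counted as exactly 5 pages, so the computed
# mean is a lower bound on the true average.
pages_viewed_share = {1: 0.851, 2: 0.094, 3: 0.026, 4: 0.012, 5: 0.017}


def mean_pages_viewed(share_by_pages):
    """Weighted mean of pages viewed, normalised by the total share."""
    total_share = sum(share_by_pages.values())
    weighted = sum(pages * share for pages, share in share_by_pages.items())
    return weighted / total_share


lower_bound = mean_pages_viewed(pages_viewed_share)
print(f"At least {lower_bound:.2f} result pages viewed per search")
```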

Page 9: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Search Expression Profile

Average # of Terms Entered 3.9
Use of phrases 6.9%
Use of Boolean expressions 7.0%
Use of fielded expressions 3.5%

Page 10: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Searches Per Search Session

[Bar chart: percentage (0% to 60%) by searches performed (1, 2, 3, 4, 5+); series: Successful, Unsuccessful]

Average searches per session:
• Overall 3.1
• Successful 3.8
• Unsuccessful 2.1

Persistence pays off…

Page 11: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Click Thru Rates by Search Position

[Chart: click-through rate (0% to 25%) by search result position (1 to 25)]

JSTOR – 20M searches from March 5 – Apr 16

Page 12: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Local Discovery Integration Pilot: JSTOR and Summon

Page 13: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Problem Statement:

» Research has shown time and again that both students and faculty are beginning their research at places other than the library OPAC, most notably Google/Google Scholar and discipline-specific electronic databases, and that the trend is continuing

[Chart: Starting point for research, identified by faculty in 2003, 2006, and 2009 (2009 Faculty Study, ITHAKA). Scale: 0% to 100%. Categories: The library building; Your online library catalog; A general-purpose search engine; A specific electronic research resource. Series: 2003, 2006, 2009]

Page 14: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Where is discovery happening?

Where JSTOR ‘sessions’ originated | Jan 2011 – Dec 2011

Page 15: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Problem Statement:

» As web-scale discovery services are being purchased and implemented by institutions, the value of those implementations is somewhat limited because they are (for the most part) only addressing that limited population of researchers who begin at a library-designated starting point (e.g. OPAC)

JSTOR usage | Australia | 2010 Nov.

[Chart values as extracted: 76%, 16%, 6%, 2%, 9%. Legend: JSTOR; Google/Google Scholar; Known Linking Partner; Library]

Page 16: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Research Behavior: Students

What is the easiest place to start research according to students?

[Bar chart: Google vs. Library Databases, scale 0 to 70]

Source: ProQuest survey of student research habits, 2007

Page 17: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Research Behavior: Faculty

[Chart: Starting Point for Research, identified by faculty in 2003, 2006, and 2009. Scale: 0% to 100%. Categories: The library building; Your online library catalog; A general-purpose search engine; A specific electronic research resource. Series: 2003, 2006, 2009]

Source: ITHAKA 2009 Faculty Survey, 2010

Page 18: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Concept:

» If we can more effectively reach the users at the place(s) where they normally begin their research, then we can begin to more effectively build their awareness of the resources that the institution has licensed/purchased for their purposes

» The local discovery integration (LDI) pilot study will attempt to measure changes in the student/faculty research experience by ‘embedding’ the institution’s selected web-scale discovery service in strategically-selected places in the JSTOR interface where – we believe – the user would naturally want to ‘cast a wider net’ for discovery

2010 JSTOR Usage Highlights

Total Significant Accesses 594,888,001

Articles Downloaded 74,901,344

Articles Viewed 112,751,906

Searches Performed 168,720,887

Inbound Links from Licensed Partners 13,013,904

Inbound Links from Google/Scholar 157,903,053

Page 19: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

How it works

We placed links at various places along the research workflow in JSTOR to allow students and researchers to “Cast a wider net”

Links Out

• Search Results

Advanced Search Page

Search Results View

• 3rd Page “Lightbox” pop-up

• Article View - Incoming from Google

• Article View - All other non-Google

• Zero Results Page

Page 20: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Search results page

» JSTOR may not be the most appropriate starting place in every instance, but it is a trusted and familiar interface. This will allow the user to ‘flowback’ to another starting place (e.g. the library)

• Uses the familiar university logo to grab attention
• Inserts search terms into link text to notify user of customized behavior
• Positioned proximate to search results; relevant during the search result evaluation phase

Page 21: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Empty results page

» In this instance, the user has found nothing and the most typical web response is to hit the ‘Back’ button. If we allow the user – at this point – to execute a search in the local discovery interface, we might improve the user experience

• One of the key places where a user is likely to want to try a different, broader search
• Larger placement takes advantage of available real estate and cognitive space
• Users typically do not spend time on this page so it is important to increase notice-ability and self-explanation

Page 22: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Article page after Google search

» In 2010, over 32M Google/Google Scholar searches brought users directly to an article page. They may or may not have found what they really wanted, so we’d like to give them an alternative discovery choice

• Visible when coming from a Google or Google Scholar search
• Captures basic search terms from the search
• Provides an opportunity to convert a user from a Google/Google Scholar user to a Summon user

Page 23: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Article page after JSTOR search

» In 2010, almost 113M articles were viewed in JSTOR. Again, they may not have found what they really wanted, so we’d like to give them an alternative discovery choice

• Visible when coming from a JSTOR search
• Raises visibility of the feature by exposing it to a large number of users
• Inserts search terms into link text to notify user of customized behavior

Page 24: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Results View: All Pages

Link out from the bottom of all pages of the search results view. This will allow more opportunities to link out for students/researchers combing through large sets of results.

Page 25: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Results View: 3rd Page

Pop-up on the third page of search results. Prompts the student/researcher to indicate whether they wish to link out through the LDI. This will enable us to measure whether students wish to “cast a wider net” or not. In the other link scenarios we don’t have a baseline of how many students do not notice the link vs. choose not to use it.

Page 26: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Link out to Discovery Platform

Page 27: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

What we intend to track:

1. To which Summon domain JSTOR is sending the user (usyd, ncsu, asu, etc.)

2. Where the local discovery request originated within JSTOR (search results page, null results page, article view page)
    • outpage=searchresults
    • outpage=noresults
    • outpage=pageview

3. If the user is identified as having come from Google or Google Scholar and clicks on the “pageview” link, we will capture that information

4. We will be providing the following identifiers to Summon to allow tracking on their end:
    • origin=JSTORpagesummon
    • origin=JSTORsearchsummon
    • origin=JSTORnoresultssummon
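
The tracking scheme above could be sketched roughly as follows. This is a minimal illustration, not the pilot's actual implementation: the host pattern `summon.example.org`, the `q` parameter name, the helper names, and the pairing of each `outpage` value with an `origin` identifier are all assumptions made for the example. Only the `outpage` and `origin` values themselves come from the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed pairing of the JSTOR page type ("outpage") with the identifier
# passed to Summon ("origin"). The slide lists both sets of values but does
# not state which goes with which.
ORIGIN_BY_OUTPAGE = {
    "searchresults": "JSTORsearchsummon",
    "noresults": "JSTORnoresultssummon",
    "pageview": "JSTORpagesummon",
}


def build_ldi_link(institution, outpage, search_terms):
    """Build a hypothetical link-out URL to an institution's discovery service."""
    if outpage not in ORIGIN_BY_OUTPAGE:
        raise ValueError(f"unknown outpage: {outpage!r}")
    # Placeholder host and query parameter name; not the real Summon URL scheme.
    base = f"https://{institution}.summon.example.org/search"
    params = {"q": search_terms, "origin": ORIGIN_BY_OUTPAGE[outpage]}
    return f"{base}?{urlencode(params)}"


def jstor_log_record(institution, outpage, came_from_google):
    """Fields JSTOR might record on its side for one link-out click."""
    return {
        "summon_domain": institution,  # item 1: which Summon domain
        "outpage": outpage,            # item 2: where the request originated
        # item 3: Google referral is only noted for the article-view link
        "from_google": bool(came_from_google and outpage == "pageview"),
    }


url = build_ldi_link("usyd", "noresults", "coral reef bleaching")
record = jstor_log_record("usyd", "noresults", came_from_google=False)
print(url)
print(record)
```

Keeping the `outpage` value on the JSTOR side and the `origin` value in the outbound URL lets each party count link-outs independently and compare totals afterwards.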

Page 28: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Some Preliminary Results

Data shown is for all institutions participating in Summon LDI

Date range: July 2011 – February 2012

» Highest usage occurred in Zero Results scenario

Page 29: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Machine-Based Article Classifier: Assigning Articles to Disciplines

Page 30: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

The Problem

JSTOR Corpus
• 60 disciplines
• 1,600 journals
• Nearly 8 million articles
• Disciplines are associated at the Journal level
• All articles in a Journal inherit the Journal-assigned disciplines
• Using this approach, many articles have incomplete and/or incorrect discipline tagging, hindering discovery
• How to assign disciplines to articles?

Page 31: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Topic Models

• Human classification and tagging is not feasible
• A machine-based classification process is desired
• Topic models are a way of finding structure in a set of documents
• They allow us to find “latent” themes
• A topic model is not a topic map
• Some topic modeling approaches include
    • Latent Semantic Analysis (LSI/LSA) (Deerwester 1990)
    • Probabilistic LSA (Hofmann 1999)
    • Latent Dirichlet Allocation (LDA) (Blei 2003)

Page 32: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Topic Modeling – our approach

LDA – Latent Dirichlet Allocation
• A generative probabilistic model for analyzing collections of documents
• A Bayesian model where each document is modeled as a mixture of topics (disciplines)
• Models semantic relationships between documents based on word co-occurrences
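
For reference, the generative process behind LDA is usually written as below. This is the standard smoothed formulation from the literature, not something shown on the slides; the symbols are assumptions of this note: K topics (disciplines), D documents, N_d words in document d, and Dirichlet hyperparameters α and β.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K
  && \text{(word distribution of topic } k\text{)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(topic mixture of document } d\text{)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d
  && \text{(topic of the } n\text{-th word)} \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
  &&&& \text{(the observed word)}
\end{align*}
```

In this notation, the "mixture of topics" for a document is θ_d, and the word co-occurrence structure is captured by the topic-word distributions φ_k.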

Page 33: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

The Process

• We select the most representative documents from each JSTOR discipline to build a topic model (from the vocabulary of the document sample)
    • This sampling and vocabulary modeling is the most important part of the process!
    • We’re still experimenting with this, but find the citation network provides a good means for identifying core documents in a discipline
    • Also considering whether usage data might be leveraged here
• Each document in the corpus is then analyzed and compared to the topic model to determine how well it matches each topic
    • A probability distribution is generated providing discipline weights
    • The top weighted discipline(s) are associated with each article
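
The last two steps (discipline weights, then top-weighted disciplines) can be illustrated with a deliberately simplified sketch. It does not perform LDA inference: it scores each discipline with a plain unigram word model and a uniform prior, then normalises the scores into a probability distribution. The word probabilities, the 0.2 threshold, and all function names are made up for the example.

```python
import math

# Hypothetical per-discipline word probabilities, standing in for a topic
# model built from representative documents. All values are invented.
TOPIC_WORD_PROBS = {
    "Economics": {"market": 0.5, "price": 0.3, "species": 0.1, "habitat": 0.1},
    "Ecology": {"market": 0.1, "price": 0.1, "species": 0.5, "habitat": 0.3},
}


def discipline_weights(words, topic_word_probs, floor=1e-6):
    """Return a probability distribution over disciplines for one document.

    Simplification: unigram scoring with a uniform prior, not LDA inference.
    Words missing from a discipline's vocabulary get a small floor probability.
    """
    log_scores = {
        discipline: sum(math.log(probs.get(word, floor)) for word in words)
        for discipline, probs in topic_word_probs.items()
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    top = max(log_scores.values())
    exp_scores = {d: math.exp(s - top) for d, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {d: v / total for d, v in exp_scores.items()}


def top_disciplines(weights, threshold=0.2):
    """Keep disciplines whose weight meets the threshold, highest first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [discipline for discipline, weight in ranked if weight >= threshold]


document = "species habitat species market".split()
weights = discipline_weights(document, TOPIC_WORD_PROBS)
assigned = top_disciplines(weights)
print(weights)
print(assigned)
```

A threshold rather than a single arg-max is used here so that an article can carry more than one discipline, matching the "top weighted discipline(s)" wording above.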

Page 34: Discovery and analysis of the world's research collections: JSTOR and Summon under the hood

Application

• On-site discovery
    • Will be a key element of our overhauled search infrastructure, tentatively scheduled for beta release mid-summer
• Use in article-level discipline/subject/topic mappings for better integration with aggregated indexes
    • Will support a richer data feed for Summon, for instance