IDENTIFYING OPEN ACCESS ARTICLES:
VALID AND INVALID METHODS
David Goodman, Palmer School of Library and Information Science, Long Island University
Kristin Antelman, Associate Director for Information Technology, NCSU Libraries
Nisa Bakkalbasi, General Science Librarian, Yale Library
XXV Charleston Conference, Nov. 4, 2005
Nov. 4, 2005 Charleston 2
We gratefully acknowledge:
• the courtesy of ISI to Stevan Harnad in supplying their citation data
• the courtesy of Stevan Harnad in supplying his analyzed data to us
• his acceptance of our offer to carry out a manual evaluation
• the assistance of Chawki Hajjem and Stevan Harnad in explaining the details of their methodology, and their helpful comments on our measurements
Why do we want to identify OA?
So users can find it (findability)
So people can link to it (linkability)
To measure the % of articles that are OA
To measure the OA Advantage (OAA): the increase in citations from being published OA
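As a rough illustration of what an OAA percentage expresses, the sketch below computes the relative increase in mean citations of OA articles over non-OA articles. The slides do not give a formula, so the mean-ratio definition, the function name `oa_advantage`, and the citation counts are all assumptions made for illustration only.

```python
from statistics import mean


def oa_advantage(oa_citations, non_oa_citations):
    """Relative increase in mean citations of OA over non-OA articles.

    Assumed definition (not stated in the slides):
        (mean OA citations - mean non-OA citations) / mean non-OA citations
    """
    baseline = mean(non_oa_citations)
    if baseline == 0:
        raise ValueError("non-OA baseline has zero mean citations")
    return (mean(oa_citations) - baseline) / baseline


# Hypothetical citation counts, invented purely for this example.
hypothetical_oa = [6, 9, 3, 6]       # mean 6
hypothetical_non_oa = [4, 5, 3, 4]   # mean 4

print(f"OAA = {oa_advantage(hypothetical_oa, hypothetical_non_oa):.0%}")  # OAA = 50%
```

Under this assumed definition, an OAA of 50% means OA articles in the sample were cited half again as often, on average, as non-OA articles.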
Specific Fields
Lawrence 2001 OA of conference proceedings in electrical engineering and computer science
Location by Research Index using Google
Matched pairs
Kurtz 2003-5 Astronomy papers in ADS (Astrophysics Data System)
Specific Journals
Wren 2005 References to articles in selected medical journals
Other individual journal studies...
Multiple fields
Antelman, 2004 selected subject areas in many academic fields
manual identification
OAA = 15% - 40%, depending on subject
Multiple fields
Brody, Harnad et al, 2004-5 selected subject areas in science
automated identification in arXiv by algorithm
automated citation check in WoS
OAA = up to 300%, depending on subject
Multiple fields
Brody, Stamerjohanns, Harnad et al, 2004-5 selected subject areas in social science
automated identification in arXiv or web by algorithm
automated citation check in WoS
OAA = up to 200%, depending on subject
Multiple fields
Hajjem, Harnad et al., 2005- all subject areas in science & social science
automated identification on web by algorithm
automated citation check in WoS
OAA = up to 200%, depending on subject
(This data has been posted by Hajjem at Soton, but is still unpublished)
Our Purpose
to confirm validity of algorithmic OA/non-OA determinations,
to verify measurements of OA and OAA
Our Technique
Selected years and subject fields from Hajjem, Harnad, et al., with OA determination and citations
Sample from the algorithm's OA and non-OA sets
Manual check on the web to either confirm OA or fail to find OA (Google, authors' sites, etc.)
Tabulation of ISI citations to determine OA Advantage
(more complete details forthcoming)
OA articles include
Published authentic text in OA Journals
Posted authentic text--publisher's PDF
Posted author's corrected manuscript
Dubious OA items include:
Embargoed published articles, after embargo ends
Embargoed author's manuscripts, after embargo ends
Editorials, Letters, Review articles
Abstract-only publication
Non-OA articles include:
Published authentic text in subscription journals
Abstracts on publisher's site
Listings in title pages, alerting services, blogs, course notes, references in other articles, or links from a posting to the publisher's site
Set One Examined
Articles from year 2002
(Classical) Biology (ISI category)
1% sample
Decision Table (biology)
Rows: algorithm detection. Columns: manual detection.

Algorithm    Manual OA    Manual non-OA    TOTAL
OA              106           160           266
non-OA           32           239           271
TOTAL           138           399           537
Interpretation (biology):
• Of the 266 items labeled "OA" by the algorithm
only 106 were actually OA
• Of the 138 items actually OA,
the algorithm missed 32
• Of the total 537 items,
the algorithm got 345 right, and 192 wrong.
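The three statements above correspond to standard summary rates of a decision table. The following minimal sketch recomputes them from the biology counts; the class and field names are hypothetical, and the rates describe only the manually checked sample, not the full set of articles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionTable:
    """Counts from comparing algorithmic and manual OA detection."""
    both_oa: int          # algorithm OA, manual OA
    alg_only_oa: int      # algorithm OA, manual non-OA (over-coded)
    manual_only_oa: int   # algorithm non-OA, manual OA (missed)
    both_non_oa: int      # algorithm non-OA, manual non-OA

    @property
    def total(self) -> int:
        return (self.both_oa + self.alg_only_oa
                + self.manual_only_oa + self.both_non_oa)

    def precision(self) -> float:
        """Share of items labeled OA by the algorithm that were actually OA."""
        return self.both_oa / (self.both_oa + self.alg_only_oa)

    def recall(self) -> float:
        """Share of actually OA items that the algorithm found."""
        return self.both_oa / (self.both_oa + self.manual_only_oa)

    def accuracy(self) -> float:
        """Share of all sampled items the algorithm classified correctly."""
        return (self.both_oa + self.both_non_oa) / self.total


# Counts taken from the biology decision table.
biology = DecisionTable(both_oa=106, alg_only_oa=160,
                        manual_only_oa=32, both_non_oa=239)

print(f"precision = {biology.precision():.1%}")  # precision = 39.8%
print(f"recall    = {biology.recall():.1%}")     # recall    = 76.8%
print(f"accuracy  = {biology.accuracy():.1%}")   # accuracy  = 64.2%
```

In these terms, roughly 40% of the algorithm's "OA" labels in the biology sample were confirmed by manual checking.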
Set Two Examined
Articles from year 2000
Sociology (ISI category)
8% sample
Decision Table (sociology)
Rows: algorithm detection. Columns: manual detection.

Algorithm    Manual OA    Manual non-OA    TOTAL
OA               29           148           177
non-OA           25           151           176
TOTAL            54           299           353
Interpretation (sociology):
• Of the 177 items labeled "OA" by the algorithm only 29 were actually OA
• Of the 54 items actually OA, the algorithm missed 25
• Of the total 353 items, the algorithm got 180 right, and 173 wrong.
Our Determinations:
                     Biology 2002 %OA    Sociology 2000 %OA    Biology 2002 OAA
Algorithm's value          14%                  23%                  51%
Our value                  16%                  15%                  63%
(Sociology OAA not measurable due to error in matching ISI data)
Comparison: note that the apparent similarity is due to compensating errors. There are many more non-OA articles than OA, so the small error on missed OA cancels out the big error on over-coded OA.
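The compensating errors noted above can be made concrete with a small calculation. The slides do not say how "Our value" for %OA was derived; the sketch below shows one plausible reading, in which the algorithm's %OA is corrected using the confirmation rate among its "OA" labels and the miss rate among its "non-OA" labels from the decision tables. The function name and this weighting scheme are assumptions, although the results happen to round to the reported 16% and 15%.

```python
def reweighted_oa_share(alg_oa_share, confirmed_rate, missed_rate):
    """Estimate the true OA share from the algorithm's OA share.

    alg_oa_share   : fraction of all articles the algorithm labels OA
    confirmed_rate : fraction of sampled algorithm-OA items found OA manually
    missed_rate    : fraction of sampled algorithm-non-OA items found OA manually
    """
    return (alg_oa_share * confirmed_rate
            + (1 - alg_oa_share) * missed_rate)


# Rates taken from the biology and sociology decision tables.
biology = reweighted_oa_share(0.14, 106 / 266, 32 / 271)
sociology = reweighted_oa_share(0.23, 29 / 177, 25 / 176)

print(f"Biology 2002:   {biology:.1%}")    # Biology 2002:   15.7%
print(f"Sociology 2000: {sociology:.1%}")  # Sociology 2000: 14.7%
```

In biology, only about 40% of the algorithm's "OA" labels survive, but the roughly 12% miss rate applies to the much larger non-OA group, so the two corrections nearly offset each other.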
All Determinations (including ours): Unavoidable Systematic Errors
Problematic material types
Delayed OA articles
Articles posted long after publication
Variation in titles
Different publications with same titles
Articles removed from web
Invisibility to the search engines
Errors due to ISI inaccuracy
All Determinations based on Soton data (including ours): Source of possible confusion
OA Journals consistently omitted (all journals with 100% OA)*
Journals without OA consistently omitted (all journals with 0% OA)
* Thus, all his OA and OAA determinations are for "Green" self-archiving only, not including "Gold" OA Journals
At least some Soton Data: other known possible sources of confusion or error
All journals given equal weight regardless of size
Data averaged by journal
Google not usually used in search
Just arXiv used in some searches (whether or not appropriate)
Inadequate testing of algorithms
Determinations of % OA and OAA
Depend on the accuracy of identification of individual items
Therefore, algorithmic determinations are at best accurate only by accident
Conclusions:
I. Accuracy is still only possible
a. with manual determinations (which are too tedious for practical use), or
b. with well-defined searches in well-defined fields (such as particular journals or repositories)
II. Generalized algorithmic search engines of acceptable accuracy have yet to be developed (and tested)