IDENTIFYING OPEN ACCESS ARTICLES:

VALID AND INVALID METHODS

David Goodman, Palmer School of Library and Information Science, Long Island University

Kristin Antelman, Associate Director for Information Technology, NCSU Libraries

Nisa Bakkalbasi, General Science Librarian, Yale Library

XXV Charleston Conference, Nov. 4, 2005

We gratefully acknowledge the courtesy of ISI to Stevan Harnad in supplying their citation data

the courtesy of Stevan Harnad in supplying his analyzed data to us

his acceptance of our offer to carry out a manual evaluation

the assistance of Chawki Hajjem and Stevan Harnad in explaining the details of their methodology, and

their helpful comments on our measurements


Why do we want to identify OA?

So users can find it (findability)

So people can link to it (linkability)

To measure % of articles OA

To measure OA Advantage (OAA)

(increase in citations by being published OA)


Specific Fields

Lawrence 2001 OA of conference proceedings in electrical engineering and computer science

Location by Research Index using Google

Matched pairs

Kurtz 2003-5 Astronomy papers in ADS (Astrophysics Data System)


Specific Journals

Wren 2005 References to articles in selected medical journals

Other individual journal studies...


Multiple fields

Antelman, 2004 selected subject areas in many academic fields

manual identification

OAA = 15% - 40%, depending on subject


Multiple fields

Brody, Harnad et al, 2004-5 selected subject areas in science

automated identification in arXiv by algorithm

automated citation check in WoS

OAA = up to 300%, depending on subject


Multiple fields

Brody, Stamerjohanns, Harnad et al, 2004-5 selected subject areas in social science

automated identification in arXiv or web by algorithm



Multiple fields

Hajjem, Harnad et al., 2005- all subject areas in science & social science

automated identification on web by algorithm


(This data has been posted by Hajjem at Soton, but is still unpublished)


Our Purpose

to confirm validity of algorithmic OA/non-OA determinations,

to verify measurements of OA and OAA


Our Technique

Selected years and subject fields from Hajjem, Harnad, et al., with OA determination and citations

Sample from algorithm's OA and non-OA

Manual check on the web to either confirm OA or not find OA (Google, author's sites, etc.)

Tabulation of ISI citations to determine OA Advantage

(more complete details forthcoming)


OA articles include

Published authentic text in OA Journals

Posted authentic text--publisher's PDF

Posted author's corrected manuscript


Dubious OA items include:

Embargoed published articles, after embargo ends

Embargoed author's manuscripts, after embargo ends

Editorials, Letters, Review articles

Abstract-only publication


Non-OA articles include:

Published authentic text in subscription journals

Abstracts on publisher's site

Listing in:

title pages

alerting services

blogs

course notes

references in other articles

links from a posting to publisher's site


Set One Examined

Articles from year 2002, (Classical) Biology (ISI category)

1% sample


Decision Table (biology)

                        Manual detection
Algorithm detection     OA      non-OA    TOTAL
OA                      106     160       266
non-OA                  32      239       271
TOTAL                   138     399       537


Interpretation (biology):

• Of the 266 items labeled "OA" by the algorithm, only 106 were actually OA

• Of the 138 items actually OA, the algorithm missed 32

• Of the total 537 items, the algorithm got 345 right, and 192 wrong.
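The counts in the interpretation above follow directly from the four cells of the decision table. A minimal sketch of that arithmetic, using the biology figures from the slide, is given here. The class and method names are illustrative only, and the terms "precision" and "recall" are standard labels that the slides themselves do not use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionTable:
    """2x2 table: algorithm label versus manual check (names are illustrative)."""
    alg_oa_manual_oa: int    # algorithm said OA, manual check confirmed OA
    alg_oa_manual_non: int   # algorithm said OA, manual check found no OA copy
    alg_non_manual_oa: int   # algorithm said non-OA, manual check found OA
    alg_non_manual_non: int  # algorithm said non-OA, manual check agreed

    @property
    def total(self) -> int:
        return (self.alg_oa_manual_oa + self.alg_oa_manual_non
                + self.alg_non_manual_oa + self.alg_non_manual_non)

    @property
    def right(self) -> int:
        # Items where algorithm and manual check agree.
        return self.alg_oa_manual_oa + self.alg_non_manual_non

    @property
    def wrong(self) -> int:
        return self.total - self.right

    def precision(self) -> float:
        # Share of algorithm-labeled OA items that were actually OA.
        return self.alg_oa_manual_oa / (self.alg_oa_manual_oa + self.alg_oa_manual_non)

    def recall(self) -> float:
        # Share of actually-OA items that the algorithm found.
        return self.alg_oa_manual_oa / (self.alg_oa_manual_oa + self.alg_non_manual_oa)

    def accuracy(self) -> float:
        return self.right / self.total


# Biology 2002 counts taken from the decision table on the slide.
biology = DecisionTable(106, 160, 32, 239)
print(biology.right, biology.wrong)
print(f"precision={biology.precision():.2f} "
      f"recall={biology.recall():.2f} "
      f"accuracy={biology.accuracy():.2f}")
```

The same class can be filled with the sociology counts (29, 148, 25, 151) to compare the two sets on equal terms.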


Signal theory Distributions:


Set Two Examined

Articles from year 2000, Sociology (ISI category)

8% sample


Decision Table (sociology)

                        Manual detection
Algorithm detection     OA      non-OA    TOTAL
OA                      29      148       177
non-OA                  25      151       176
TOTAL                   54      299       353


Interpretation (sociology):

• Of the 177 items labeled "OA" by the algorithm, only 29 were actually OA

• Of the 54 items actually OA, the algorithm missed 25

• Of the total 353 items, the algorithm got 180 right, and 173 wrong.


Signal theory Distributions:


Our Determinations:

Comparison: note that the apparent similarity is due to compensating errors: There are many more non-OA articles than OA, so the small error on missed OA cancels out the big error on over-coded OA.

                     Biology 2002 %OA    Sociology 2000 %OA    Biology 2002 OAA
Algorithm's value    14%                 23%                   51%
Our value            16%                 15%                   63%

(Sociology OAA not measurable due to error in matching ISI data)
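The slides do not state how the corrected %OA values were computed. One plausible reading, offered here only as an assumption, is that the algorithm's %OA is reweighted by the error rates found in the manual samples: the confirmed fraction of the algorithm's "OA" items plus the missed-OA fraction of its "non-OA" items. The function name and parameters below are hypothetical; the inputs are the rounded figures from the slides.

```python
def reweighted_oa_share(alg_oa_share: float,
                        sample_alg_oa: int, sample_alg_oa_true: int,
                        sample_alg_non: int, sample_alg_non_true: int) -> float:
    """Estimate the true OA share from the algorithm's OA share and a manual
    check of samples drawn from each of the algorithm's two groups.

    Assumed method, not confirmed by the source.
    """
    # Fraction of algorithm-labeled OA items confirmed as OA by hand.
    confirmed = sample_alg_oa_true / sample_alg_oa
    # Fraction of algorithm-labeled non-OA items found to be OA by hand.
    missed = sample_alg_non_true / sample_alg_non
    return alg_oa_share * confirmed + (1 - alg_oa_share) * missed


# Algorithm's %OA and sample counts as reported on the slides.
biology = reweighted_oa_share(0.14, 266, 106, 271, 32)
sociology = reweighted_oa_share(0.23, 177, 29, 176, 25)
print(f"biology: {biology:.1%}, sociology: {sociology:.1%}")
```

Under this assumption the estimates round to the 16% and 15% shown in the table. It also makes the compensating errors visible: most of the corrected total comes from the small missed-OA fraction applied to the much larger non-OA group.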


All Determinations (including ours): Unavoidable Systematic Errors

Problematic material types

Delayed OA articles

Articles posted long after publication

Variation in titles

Different publications with same titles

Articles removed from web

Invisibility to the search engines

Errors due to ISI inaccuracy


All Determinations based on Soton data (including ours): Source of possible confusion

OA Journals consistently omitted (all journals with 100% OA)*

Journals without OA consistently omitted (all journals with 0% OA)

* thus, all his OA and OAA determinations are for "Green" self-archiving only, not including "Gold" OA Journals


At least some Soton Data: other known possible sources of confusion or error

All journals given equal weight regardless of size

Data averaged by journal

Google not usually used in search

Just arXiv used in some searches (whether or not appropriate)

Inadequate testing of algorithms


Determinations of % OA and OAA

Depend on the accuracy of identification of individual items

Therefore, algorithmic determinations are at best accurate only by accident


Conclusions:

I. Accuracy is now still only possible

a. with manual determinations (which are too tedious for practical use) or

b. with well-defined searches in well-defined fields (such as particular journals or repositories)

II. Generalized algorithmic search engines of acceptable accuracy have yet to be developed (and tested)

