Guest lecture: Library and data? www.slideshare.net/hugobesemer (use on WURNET Chrome, Firefox) 20160920, Hugo Besemer

Library and data lecture for inf21306

Page 1: Library and data lecture for  inf21306

Page 2: Library and data lecture for  inf21306

Two different things

●An example of data modelling challenges for the library, or if you wish: data is dirty ...

●Data management planning at Wageningen University

Page 3: Library and data lecture for  inf21306

Data is dirty

Page 4: Library and data lecture for  inf21306

The problem

I am in the tenure track, and the university wants me to publish in “Q1” journals

My research is funded by NWO/EU/..., and they want me to publish in “Open access” journals

Page 5: Library and data lecture for  inf21306

Tables, linked on issn:

●Journals catalogue: issn, title, topics
●Open_access: issn, open access status (boolean)
●Quartiles: issn, quartile

Query:

SELECT j.title, j.issn
FROM Journals AS j
INNER JOIN Open_access AS o ON o.issn = j.issn
INNER JOIN Quartiles AS q ON q.issn = j.issn
WHERE j.topics = 'mine' AND o.status = 'yes' AND q.quartile = 'Q1';
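
The query on this slide can be tried out against a small in-memory database. The following is only a sketch under assumptions: the lower-case table names, the column types, the integer encoding of the boolean status and the demo rows (including the made-up ISSNs) are all invented for illustration and are not taken from the library's actual catalogue.

```python
import sqlite3


def build_demo_db():
    """Create the three tables from the slide with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE journals (issn TEXT PRIMARY KEY, title TEXT, topics TEXT);
        CREATE TABLE open_access (issn TEXT PRIMARY KEY, status INTEGER);
        CREATE TABLE quartiles (issn TEXT PRIMARY KEY, quartile TEXT);
    """)
    # All rows below are fictitious; the ISSNs are placeholders, not real ones.
    conn.executemany("INSERT INTO journals VALUES (?, ?, ?)", [
        ("0000-0001", "Journal A", "mine"),
        ("0000-0002", "Journal B", "mine"),
        ("0000-0003", "Journal C", "other"),
        ("0000-0004", "Journal D", "mine"),
    ])
    conn.executemany("INSERT INTO open_access VALUES (?, ?)", [
        ("0000-0001", 1), ("0000-0002", 0), ("0000-0003", 1), ("0000-0004", 1),
    ])
    conn.executemany("INSERT INTO quartiles VALUES (?, ?)", [
        ("0000-0001", "Q1"), ("0000-0002", "Q1"),
        ("0000-0003", "Q1"), ("0000-0004", "Q2"),
    ])
    return conn


def find_journals(conn, topic):
    """Return (title, issn) for open access Q1 journals on the given topic."""
    query = """
        SELECT j.title, j.issn
        FROM journals AS j
        INNER JOIN open_access AS o ON o.issn = j.issn
        INNER JOIN quartiles AS q ON q.issn = j.issn
        WHERE j.topics = ? AND o.status = 1 AND q.quartile = 'Q1'
        ORDER BY j.title
    """
    return conn.execute(query, (topic,)).fetchall()


if __name__ == "__main__":
    print(find_journals(build_demo_db(), "mine"))
```

The join only works this cleanly because every table in the sketch uses exactly the same ISSN string for the same journal, which is the assumption the following slides call into question.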

Page 6: Library and data lecture for  inf21306

Let’s look in Nottingham for online status

Page 7: Library and data lecture for  inf21306

But we can also go to Lund

Page 8: Library and data lecture for  inf21306

Confusion from Amsterdam

Page 9: Library and data lecture for  inf21306

Things change all the time

Page 10: Library and data lecture for  inf21306

So we have learned....

ISSN (primary key) is ambiguous
●so you need to harmonize data

Open access status is ambiguous
●Gold, Green or Hybrid
●Discussion: which one do we take?

There are several sources for online status
●Discussion: which one do we take?
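
One possible harmonization step is to bring every ISSN into a single canonical form before using it as a key. The sketch below assumes that part of the ambiguity is purely typographic (missing hyphen, lower-case "x", an "ISSN:" label); the slides do not say which kinds of ambiguity were found, and the function names are hypothetical. The check-digit rule used is the standard ISSN one: weights 8 down to 2 on the first seven digits, modulo 11.

```python
import re


def issn_check_digit(first_seven):
    """Compute the ISSN check character for the first seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def normalize_issn(raw):
    """Return the ISSN as 'NNNN-NNNC', or None if it is not a valid ISSN."""
    # Drop whitespace and hyphens, then an optional leading "ISSN" label.
    compact = re.sub(r"[\s\-]", "", raw).upper()
    if compact.startswith("ISSN"):
        compact = compact[4:].lstrip(":")
    if not re.fullmatch(r"\d{7}[\dX]", compact):
        return None
    if issn_check_digit(compact[:7]) != compact[7]:
        return None
    return compact[:4] + "-" + compact[4:]


if __name__ == "__main__":
    variants = ["0378-5955", "03785955", " ISSN: 0378-5955 "]
    # All three spellings collapse to one key.
    print({normalize_issn(v) for v in variants})
```

Normalizing the spelling does not resolve cases where one journal legitimately carries more than one ISSN; that would need a separate mapping table.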

Page 11: Library and data lecture for  inf21306

[E-R diagram, revised: tables linked on issn]

●Journals catalogue: issn, title, topics
●Quartiles: issn, quartile
●Romeo Sherpa (colours): issn, colours
●DOAJ (Romeo gold): issn, APC
●Hybrid publisher: issn, APC

Page 12: Library and data lecture for  inf21306

Now for the quartiles

Page 13: Library and data lecture for  inf21306

[Figure: the four quartiles Q1, Q2, Q3 and Q4]

Page 14: Library and data lecture for  inf21306

How do we compare numbers?

Scientist Z. Math has a publication from 2003 with 17 citations

Scientist M. Biology has a publication from 2009 with 24 citations

Page 15: Library and data lecture for  inf21306

Baselines for Mathematics

Page 16: Library and data lecture for  inf21306

Baselines for Molecular Biology

[Figure: cumulative no. of citations (axis 0 to 400) against years after publication (axis 0 to 12); series: baseline, top 10%, top 1%]
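
One common way to compare the two publications is to divide each citation count by the baseline for its own field and publication year. The sketch below illustrates only the arithmetic: the baseline values are HYPOTHETICAL numbers invented for this example and are not read off the baseline charts, and the function name is an assumption.

```python
def relative_impact(citations, baseline):
    """Citations divided by the field baseline for the same publication age."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return citations / baseline


# HYPOTHETICAL baselines (average citations for a paper of that field and
# year); invented for illustration, not taken from the slides.
HYPOTHETICAL_BASELINES = {
    ("Mathematics", 2003): 8.0,
    ("Molecular Biology", 2009): 40.0,
}

# The two publications mentioned in the lecture.
math_score = relative_impact(17, HYPOTHETICAL_BASELINES[("Mathematics", 2003)])
bio_score = relative_impact(24, HYPOTHETICAL_BASELINES[("Molecular Biology", 2009)])

if __name__ == "__main__":
    print(f"Z. Math: {math_score:.2f} times the baseline")
    print(f"M. Biology: {bio_score:.2f} times the baseline")
```

With these invented baselines the paper with fewer raw citations scores higher relative to its field, which is the kind of reversal the baseline charts are meant to make visible.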

Page 17: Library and data lecture for  inf21306

What does that mean for our E-R diagram?

Quartile distribution depends on topic
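
A minimal sketch of what "depends on topic" implies for the data: the quartile is a property of the pair (journal, topic), not of the journal alone. The ranking rule, the scores and the journal identifiers below are all hypothetical, chosen only to show one journal landing in different quartiles for different topics.

```python
from collections import defaultdict


def quartiles_by_topic(journals):
    """Map (issn, topic) to 'Q1'..'Q4', ranking journals within each topic.

    `journals` is an iterable of (issn, topic, score) tuples, where a higher
    score is better.
    """
    by_topic = defaultdict(list)
    for issn, topic, score in journals:
        by_topic[topic].append((score, issn))

    result = {}
    for topic, entries in by_topic.items():
        entries.sort(reverse=True)  # best score first
        n = len(entries)
        for rank, (_score, issn) in enumerate(entries):
            # rank 0 is the best journal; split the ranking into four bands.
            result[(issn, topic)] = f"Q{rank * 4 // n + 1}"
    return result


if __name__ == "__main__":
    # Invented identifiers and scores; J1 is listed under two topics.
    demo = [
        ("J1", "math", 3.0), ("J2", "math", 2.0),
        ("J3", "math", 1.5), ("J4", "math", 1.0),
        ("J1", "biology", 3.0), ("J5", "biology", 9.0),
        ("J6", "biology", 7.0), ("J7", "biology", 5.0),
    ]
    table = quartiles_by_topic(demo)
    print(table[("J1", "math")], table[("J1", "biology")])
```

In relational terms this suggests a quartiles table keyed on issn and topic together rather than on issn alone.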

Page 18: Library and data lecture for  inf21306

[E-R diagram, revised again: Quartiles now also carries topics]

●Journals catalogue: issn, title, topics
●Quartiles: issn, topics, quartile
●Romeo Sherpa (colours): issn, colours
●DOAJ (Romeo gold): issn, APC
●Hybrid publisher: issn, APC

Page 19: Library and data lecture for  inf21306

Data management planning at Wageningen University

Page 20: Library and data lecture for  inf21306

Wageningen UR policy – What’s in place

●Data management plan for PhD projects
●Data management plans for research groups
●Data management planning course
●Options for data publishing
●Code Repository
●“Support hub”

Page 21: Library and data lecture for  inf21306

Wageningen UR data policy – What needs to be resolved

●Registration and accessibility of data for ongoing research
●Storage (security, “getting rid of external hard drives”)
●Research notes
●Legal issues?

Page 22: Library and data lecture for  inf21306

Day-to-day issues (from a workshop for PE&RC)

●We are human
●Synchronizing between different platforms
●Relationships between files
●What is a logical file / folder structure?
●Collaborating on files

Page 23: Library and data lecture for  inf21306
Page 24: Library and data lecture for  inf21306
Page 25: Library and data lecture for  inf21306
Page 26: Library and data lecture for  inf21306
Page 27: Library and data lecture for  inf21306
Page 28: Library and data lecture for  inf21306
Page 29: Library and data lecture for  inf21306
Page 30: Library and data lecture for  inf21306
Page 31: Library and data lecture for  inf21306
Page 32: Library and data lecture for  inf21306
Page 33: Library and data lecture for  inf21306

Some terminology: retention

Retention: the obligation to produce, upon request, the data underlying publications for a certain time

●For verification purposes or as a basis for further work
●Often required by scientific organizations or publishers
●The “Netherlands Code of Conduct for Academic Practice” requires 10 years
●The rule is seldom enforced

Page 34: Library and data lecture for  inf21306

More terminology: ‘long term storage’

‘Long term storage’ is the term used in the DMP format. ‘Long term’ means:

●With sufficient documentation on project, file and parameter / variable level

●In a format that is usable in the future (so preferably “flat files”)
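
As a rough illustration of both points, a dataset can be exported as a plain CSV file together with a second CSV that documents every variable. This is only a sketch: the function name, the codebook layout and the example variables are hypothetical and are not prescribed by the DMP format.

```python
import csv
import io


def to_flat_files(records, codebook):
    """Return (data_csv, codebook_csv) as text.

    `records` is a list of dicts; `codebook` maps each variable name to a
    (unit, description) pair. Every variable in the data must be documented.
    """
    fields = list(codebook)
    undocumented = {key for record in records for key in record} - set(fields)
    if undocumented:
        raise ValueError(f"undocumented variables: {sorted(undocumented)}")

    # The data itself as a flat file.
    data = io.StringIO()
    writer = csv.DictWriter(data, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)

    # Variable-level documentation as a second flat file.
    doc = io.StringIO()
    doc_writer = csv.writer(doc, lineterminator="\n")
    doc_writer.writerow(["variable", "unit", "description"])
    for name, (unit, description) in codebook.items():
        doc_writer.writerow([name, unit, description])
    return data.getvalue(), doc.getvalue()


if __name__ == "__main__":
    # Invented example data.
    records = [{"plot": "A1", "yield_kg": 12.5}, {"plot": "A2", "yield_kg": 9.0}]
    codebook = {
        "plot": ("", "Plot identifier"),
        "yield_kg": ("kg", "Harvested yield per plot"),
    }
    data_csv, codebook_csv = to_flat_files(records, codebook)
    print(data_csv)
    print(codebook_csv)
```

Refusing to write data that contains an undocumented variable is one simple way to keep the documentation from drifting behind the files.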

Page 35: Library and data lecture for  inf21306

More terminology: ‘publishing data’

We prefer “Data Publishing” as it implies making the data persistently accessible

●That’s only possible in a service with a long-term mission
●It should come with a persistent identifier independent of its current or future location

Page 36: Library and data lecture for  inf21306

Persistent identifiers

http://hdl.handle.net/1902.1/UOVMCPSWOL

http://dx.doi.org/10.1594/PANGAEA.701380

●Scheme / Resolver: http://hdl.handle.net/ and http://dx.doi.org/
●Prefix (identifying institution): 1902.1 and 10.1594
●Suffix (identifying this dataset): UOVMCPSWOL and PANGAEA.701380

To get a persistent identifier for your dataset you need to store it with a service, and the resolver will redirect users there
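
The three parts can also be separated programmatically. The sketch below uses only the Python standard library and the two identifiers shown on this slide; the function name is hypothetical, and it assumes the identifier is written as a resolver URL with the prefix as the first path segment.

```python
from urllib.parse import urlsplit


def split_persistent_id(url):
    """Split a resolver URL into (resolver, prefix, suffix)."""
    parts = urlsplit(url)
    path = parts.path.lstrip("/")
    # The prefix ends at the first slash; the suffix may itself contain slashes.
    prefix, sep, suffix = path.partition("/")
    if not parts.netloc or not sep or not prefix or not suffix:
        raise ValueError(f"not a resolver URL with prefix/suffix: {url!r}")
    resolver = f"{parts.scheme}://{parts.netloc}/"
    return resolver, prefix, suffix


if __name__ == "__main__":
    for pid in (
        "http://hdl.handle.net/1902.1/UOVMCPSWOL",
        "http://dx.doi.org/10.1594/PANGAEA.701380",
    ):
        print(split_persistent_id(pid))
```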

Page 37: Library and data lecture for  inf21306

An example

Page 38: Library and data lecture for  inf21306

An example (continued)

Page 39: Library and data lecture for  inf21306

An example (continued 2)

Page 40: Library and data lecture for  inf21306

Publish all data?

Page 41: Library and data lecture for  inf21306

Services (1)

Disciplinary services with a specific data model
●EBI, NCBI (bioinformatics), example: SRA
●Pangaea (spatial)
●GBIF (Biodiversity)

Generic (multidisciplinary) services

Page 42: Library and data lecture for  inf21306

Services (2)

DANS
●URL: http://www.dans.knaw.nl/en/
●Single file size: unknown
●Total disk space: n.a.
●Paid: € 2.85 per GB (WUR covers first 500 GB)
●Private/public: Public (part of the Royal Netherlands Academy of Arts and Sciences, KNAW)
●Special relationships: Wageningen UR Library acts as front office

3TU Datacentrum
●URL: http://datacentrum.3tu.nl/en/home/
●Single file size: -
●Total disk space: n.a.
●Paid: € 3.50 per GB (WUR covers first 500 GB)
●Private/public: Public, owned by Dutch Technical Universities
●Special relationships: Wageningen UR Library acts as front office

Dryad
●URL: http://datadryad.org/
●Single file size: 2GB
●Total disk space: Extra charge for larger sets
●Paid: $120 (> 20 GB extra charge)
●Private/public: Not-for-profit company governed by members
●Special relationships: Reduced fee or free for certain journals, see http://datadryad.org/pages/journalLookup

Figshare
●URL: https://figshare.com/
●Single file size: 5GB
●Total disk space: 20 GB
●Paid: No
●Private/public: Private, Macmillan Inc.
●Special relationships: Embedded in PLOS article submission workflow

Zenodo
●URL: http://www.zenodo.org/
●Single file size: 2GB
●Total disk space: “Please be aware that we cannot offer infinite space for free, so donations from heavy users towards sustainability are encouraged”
●Paid: No
●Private/public: Public, CERN
●Special relationships: EU (output of the OpenAIREplus project and used for data in the EU data management pilot)
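
For the two services with a per-GB fee, the cost to a researcher can be estimated with a one-line formula. The sketch below rests on an assumption the slides do not spell out: that "WUR covers first 500 GB" means only the gigabytes beyond 500 are charged, at the flat listed rate. Whether the fee is one-off or recurring is not stated either, and the function name is hypothetical.

```python
def deposit_cost(size_gb, rate_per_gb, covered_gb=500.0):
    """Estimated cost in euros: only gigabytes beyond `covered_gb` are charged.

    ASSUMPTION: the covered amount is simply subtracted from the dataset size;
    the actual billing rules may differ.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return max(0.0, size_gb - covered_gb) * rate_per_gb


if __name__ == "__main__":
    # The two per-GB rates listed in the overview above.
    for rate in (2.85, 3.50):
        print(f"600 GB at EUR {rate:.2f}/GB: EUR {deposit_cost(600, rate):.2f}")
```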

Page 43: Library and data lecture for  inf21306

That’s all
