HKU Data Curation MLIM7350 Class 7

Hong Kong: an Open Access update

Class 8: making things FAIR
'If I have seen further it is by standing on the shoulders of giants.'
Scott Edmunds, HKU Data Curation MLIM7350

Communicating in-class
Chat channel: http://backchannelchat.com/chat/dw131
Let me know to slow down/speed up

https://osf.io/cgpzb/

Open Science (Open Access & Open Data) survey of Hong Kong
Reading/Reflection

Most people mentioned training of librarians:

Tak Hei Lam: Training should be provided to librarians so that they have adequate knowledge about data curation and can provide professional support and advice to researchers on the sharing of data. Also, librarians can provide training and workshops to change the mindset of researchers, so that they rely not on the impact factor but on other, more comprehensive research metrics such as PlumX.

Lijia Yu: At the same time, in the big data era, research will be increasingly migrating to the cloud, so this should be done in an organized manner.

Lots of talk on incentive systems & policy, but little on infrastructure other than:
NEED FOR A PLAN/LEADERSHIP

HKU Repeatability in HK Research Experiment (homework)
Feedback?

What have we found?

HKU Repeatability in HK Research Experiment (homework)

Interesting examples

http://hub.hku.hk/handle/10722/208585 Is data in a HKU thesis sufficient?

Interesting examples
Several examples of restrictions with ID data

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0165978

Interesting examples
Several examples of restrictions with ID data
http://www.vox.com/2015/6/17/8796225/mers-virus-data-sharing

Interesting examples
Lots of data in Dryad, but 1 H7N9 example isn't resolving

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0148506

Story so far
HKU publishing a lot of survey based research in PLOS

3 examples from Children of 1997 birth cohort. Access to data involves emailing DAC

External databases: 2 examples in Dryad data (one not working), 1 example in OSF, 1 example in scholarhub, lots in figshare

So far 2 have data with broken URLs, 1/3 are controlled access, 1/4 have summary but not raw data

What exactly is "research data"?

Research Data 1665?
"Scholarly articles are merely advertisement of scholarship. The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible."
Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995

Esoteric formats, poorly structured,

Tabular, often spreadsheet based

Issues the open data community is well used to (data cleaning, scraping, etc.)

The long tail of scientific data?

(A) Cumulative base pairs in INSDC over time, excluding the Trace Archive (raw data from capillary sequencing platforms). (B) Base pairs in INSDC over time since 1980, broken down into selected data components. Cumulative data volume in base pairs broken down into assembled sequence (whole genome shotgun methods and others) and raw next-generation-sequence data.

Science Data Volumes
[Figure: data volumes by field, on a scale of Petabytes, 100s of Petabytes and Exabytes. Labels: Sequencing, Mass Spec, Imaging, Astrophysics, HE Physics, Biology, Square Kilometer Array, Large Hadron Collider]

Big Data in Healthcare

http://dx.doi.org/10.1186/s13742-016-0117-6

Big Data in Healthcare: challenges

80% of health data unstructured (100s of forms/formats)

Medical Imaging archives increasing 20-40% per year

Genomics data will increase data volumes exponentially

Patients expect extra privacy protection if they are going to fully participate in data driven research

Source: https://www.healthcare.siemens.com/magazine/mso-big-data-and-healthcare-2.html

Open Data in Physics

1961: CERN pre-prints shelf
http://cerncourier.com/cws/article/cern/28654

1991-date: arXiv
http://arxiv.org/

Open Data in Earth Sciences

https://pangaea.de/

Established 1987 (online since 1995)

Open Data in Earth Sciences

#Climategate: UEA emails scandal

Is it possible to be too open?

Closed Data in Chemistry

Open Data in Biology
1934: newsletter era
1980: database era
1987: online era
2010s: bioinformatics bingo era

BGI HK Chamber O' Illuminas
The LHC of Biology?
20PB of storage

Post-Human Genome Project

[Figure: sequencing costs over time, 1st Gen vs 2nd (next) Gen]
Source: http://www.genome.gov/sequencingcosts/ (with apologies)

Omes & more omes!

Other Ome(s): mass spectrometry data
https://en.wikipedia.org/wiki/Mass_spectrometry

Nadina Wirkiewicz

Rise of mass spectrometry data
https://doi.org/10.1093/nar/gkv1352

Challenges: Rise of big imaging data
http://www.nature.com/nmeth/journal/v12/n1/full/nmeth.3222.html
https://openi.nlm.nih.gov/detailedresult.php?img=PMC3171117_JCB_201108095_RGB_Fig2&req=4
http://journals.sagepub.com/doi/10.1177/1087057114528537

HCS: High Content Screens

AKA High Throughput Screening: High volumes, growing uptake TBs of data

New ways of sharing/publishing data with OMERO/JCB data viewer

Imaging Challenges: 100s of formats
http://www.openmicroscopy.org/site/products/bio-formats

Genomics: open-data success story?

Sharing/reproducibility helped by stability of:

Platforms

Repositories

Standards

[Figure labels: 1st Gen, 2nd Gen]

Genomics Data Sharing Policies

Bermuda Accords 1996/1997/1998:
Automatic release of sequence assemblies within 24 hours. Immediate publication of finished annotated sequences. Aim to make the entire sequence freely available in the public domain for both research and development in order to maximise benefits to society.

Fort Lauderdale Agreement, 2003:
Sequence traces from whole genome shotgun projects are to be deposited in a trace archive within one week of production. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.

Toronto International data release workshop, 2009:
The goal was to reaffirm and refine, where needed, the policies related to the early release of genomic data, and to extend, if possible, similar data release policies to other types of large biological datasets whether from proteomics, biobanking or metabolite research.

https://doi.org/10.1093/gigascience/giw003

Three decades of sharing infrastructure: Genbank

Scaling up of sharing: 1000 genomes

http://www.internationalgenome.org/

Three decades of sharing infrastructure: INSDC

http://www.insdc.org/

Sharing aids individuals

Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308

Every 10 datasets collected contributes to at least 4 papers in the following 3 years.
Piwowar HA, Vision TJ, Whitlock MC (2011) Data archiving is a good investment. Nature 473(7347): 285. doi:10.1038/473285a

Sharing aids fields
Rice v Wheat: consequences of publicly available genome data.

Sharing aids growth of databases

http://scienceblogs.com/digitalbio/2015/01/30/bio-databases-2015/

Sharing aids growth of standards
Why do we need standards?

https://xkcd.com/927/

Sharing aids growth of standards
Why do we need standards?
http://www.biochemsoctrans.org/content/36/1/33

Checklists aid the growth of sharing
http://www.equator-network.org/

There are over 860 databases & 675 standards in the life sciences

Formats, Terminologies, Guidelines

Some of these are created by formal standards organisations, often for a fee, others are community driven.

The point of all these standards is to structure, enrich and report the description of the datasets and the experimental context under which they were produced, i.e. to capture the metadata surrounding the data in a consistent and controlled manner.

Guidelines = Minimum information reporting requirements, checklists
Report the same core, essential information
e.g. ARRIVE guidelines

Terminologies = Controlled vocabularies, taxonomies, thesauri, ontologies etc.
Unambiguously refer to an entity
e.g. Gene Ontology

Models/Formats = Conceptual model, conceptual schema, exchange formats
Allow data to flow from one system to another
e.g. FASTA
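To illustrate how a shared exchange format lets data flow from one system to another, here is a minimal sketch of a FASTA reader. It assumes the common convention that each record starts with a ">" header line followed by one or more sequence lines. The function name `parse_fasta` and the two example records are made up for illustration and are not from the course material.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header.
            records[current_id] += line
    return records


# Hypothetical example input, not real data.
example = """>seq1 hypothetical example record
ACGT
TTGA
>seq2
GGCC
"""
sequences = parse_fasta(example.splitlines())
```

Because the format is agreed on, any tool that writes these few lines can hand data to any tool that reads them, which is the point of the Models/Formats category.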

Enablers: to better describe, share and query data

Formats, Terminologies, Guidelines

https://biosharing.org/

Need for databases of databases

Exercise: Use Biosharing to answer the following
https://biosharing.org/

Potential collaborators would like to use your data. To share your work, are there standards you should follow? Are there specialized curated databases you can use?

You work in the area of functional MRI imaging and are producing 100s of GBs of fMRI brain scan data.

You are an immunologist using flow cytometry to sort cells.

You are a chemist looking at the 3D crystal structure of proteins using NMR.

Sabban, Sari

Rewarding the Sharing
[Diagram labels: Idea, Study, Metadata, Methods, Data, Analysis (Pipelines), software, Workflows/Environments, Answer; Open Data, Executable; Publication, DOI, etc.]

Reward Sharing of Workflows
gigagalaxy.net
Visualisations & DOIs for workflows
http://www.gigasciencejournal.com/series/Galaxy

Facilitate reproducibility, reuse & sharing & publish outputs of: Knitr, Sweave, Jupyter/iPython Notebook, etc.

Reward Open/Dynamic Workbooks
Open Documents

Virtual Machines/containers

http://dx.doi.org/10.1186/s13742-015-0087-0 :standardised containers

Open Source v Open Data Licenses
https://opensource.org/licenses

Same ethos (open source begat open data), different contexts

OSS designed for continuing development, OD for making objects available

IP issues. Software can be patented, data (generally) can't

More business models for software than data (so far)

Wider selection of OSS licenses, and more options to fine-tune access (Linking, Distribution, Modification, Sublicensing, Patents/Trademarks, etc.)

Now researchers are producing such large & heterogeneous datasets, what do you think the challenges are for producers and users?

What are the legal implications of mixing data and software?

What do you think the security issues of accessing these complex combined research objects are?

Questions to ask?

Questions? | 15 minute break

Research Data: Pop Quiz

What was #climategate?

What is the INSDC, and who are the three INSDC partners?

What is the estimated yearly growth of medical imaging data?

What are bioboxes?

How many databases are currently listed in biosharing?

Which of the reporting guidelines/checklists are for A) animals, B) biological science, and C) clinical research: MIBBI, ARRIVE and Equator

Ethics & Data security Issues
RECAP

Ethics: needs approval
http://www.rss.hku.hk/integrity/ethics-compliance

Ethics: clinical trials need registration
http://www.hkuctr.com/

Ethics: need informed consent
http://www.med.hku.hk/images/document/04research/institution/5QMH_IRB_GUIDANCE_NOTES_FOR_THE_PREPARATION_OF_PATIENT_CONSENT.pdf
Where does data sharing fit into this?

HKU Guideline Notes - for Preparation of Subject Information Sheet & Informed Consent Form:

WILL MY TAKING PART IN THIS STUDY BE KEPT CONFIDENTIAL?
You will need to obtain the patient's permission to allow restricted access to their medical records and to the information collected about them in the course of the study. You should explain that all information collected about them will be kept strictly confidential. A suggested form of words is:

"All information which is collected about you during the course of the research will be kept strictly confidential. Any information about you which leaves the hospital/surgery will have your name and address removed so that you cannot be recognised from it."

Ethics: includes animal research

http://www.med.hku.hk/research/research-ethics/animal-ethics-culatr

Ethics: includes animal research
https://www.nc3rs.org.uk/arrive-guidelines

Lots of tools available: anonymisation
https://www.ukdataservice.ac.uk/manage-data/tools-and-templates

Lots of tools available: encryption
https://www.brookes.ac.uk/Research/Research-ethics/Encrypting-files/

Lots of tools available: DAC & brokering
https://blog.repositive.io/getting-data-out-of-the-ega/

Lots of tools available: DAC & brokering
http://www.ckbiobank.org/site/

Kinds of identifying information

Direct identifiers
Names, addresses, postcode information, telephone numbers or pictures

Indirect identifiers
In combination with other information, would identify e.g. information on workplace, occupation or exceptional values of characteristics like salary or age
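The distinction between direct and indirect identifiers can be sketched in a few lines of code. This is a minimal illustration only: the field names, the choice of 10-year age bands, and the helper names `age_band` and `deidentify` are all assumptions made for the example, not part of any toolkit mentioned in the slides.

```python
from typing import Any, Dict

# Assumed field names for direct identifiers (hypothetical schema).
DIRECT_IDENTIFIERS = {"name", "address", "postcode", "telephone", "picture"}


def age_band(age: int, width: int = 10) -> str:
    """Coarsen an exact age into a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record: Dict[str, Any]) -> Dict[str, Any]:
    """Drop direct identifiers and coarsen age, an indirect identifier."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "age" in cleaned:
        cleaned["age"] = age_band(cleaned["age"])
    return cleaned


# Hypothetical participant record, not real data.
participant = {
    "name": "A. Example",
    "postcode": "000000",
    "age": 47,
    "occupation": "teacher",
    "score": 3.2,
}
shared = deidentify(participant)
```

Note that `occupation` survives in the shared record. Removing direct identifiers does not by itself guarantee anonymity, because remaining indirect identifiers may still identify someone in combination.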

http://www.data-archive.ac.uk/create-manage/consent-ethics/anonymisation
RECAP

De-identification #101

Anonymising audio-visual data

Anonymisation of audio-visual data, such as editing of digital images or audio recordings, should be done sensitively. Bleeping out real names or place names is acceptable, but disguising voices by altering the pitch in a recording, or obscuring faces by pixellating sections of a video image significantly reduces the usefulness of data. These processes are also highly labour intensive and expensive.

If confidentiality of audio-visual data is an issue, it is better to obtain the participant's consent to use and share the data unaltered. Where anonymisation would result in too much loss of data content, regulating access to data can be considered as a better strategy.

We urge researchers to consider and judge at an early stage the implications of depositing materials containing confidential information and to get in touch to consult on any potential issues.

https://www.ukdataservice.ac.uk/manage-data/legal-ethical/anonymisation/qualitative

Considerations for medical imaging

https://openfmri.org/de-identification/
https://sourceforge.net/projects/privacyguard/

Need to ensure DICOM (Digital Imaging and Communications in Medicine) metadata also passes through a de-identification toolkit

MRI brain scans first undergo skull stripping

Automated Defacing Tools required beyond this

Considerations for medical images
https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-15-21
https://bmcmedgenet.biomedcentral.com/articles/10.1186/1471-2350-11-26

Sharing of clinical images crucial in understanding phenotypes
Require consent to publish, but challenges doing this with ill people, children, elderly, and disadvantaged
Further challenges in era of social media, open access and Wikipedia
Security issues protecting signed consent forms

Not just a metadata problem
http://science.sciencemag.org/content/339/6117/321

Extra considerations for HK

Hospital Authority restrictions on data

Have to apply to Hospital Authority to access public health data

Only approved 14 data requests (as of May 2016)

If approved requires data recovery charges (collect $250,000 HKD a year from this)

Can publish aggregate/summary data in journals, but not share data

Only approves academic use, not citizens or industry/pharma

Via FOI request: https://accessinfo.hk/en/request/request_for_statistics_on_data_c

Extra considerations for China

Human genetic data needs MOST approval

Article 2: The term "human genetic resources" in the Measures refers to the genetic materials such as human organs, tissues, cells, blood specimens, preparations of any types or recombinant DNA constructs, which contain human genome, genes or gene products as well as to the information related to such materials.

Second, any international collaborative project involving Chinese human genetic resources, for example international research cooperation and exporting human genetic resources or taking such resources outside of the territory of China, shall apply to MOST for examination and approval prior to entering into an official contract. And the Chinese collaborating party shall be responsible for going through the due formalities of application for approval. (See Article 11)

Extra considerations for China
Foshan, 2010
http://www.chinadaily.com.cn/china/2010-08/12/content_11141879.htm

In 2010, just across the border in mainland China, one of the world's first major cases of genetic discrimination occurred. People studying for the civil service exam were given genetic tests without their knowledge. Over 30 people lost jobs due to very basic misunderstandings of genetics and what constitutes a genetic disease.

Can this data be easily de-identified & shared?
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152381

The individual in this manuscript has given written informed consent (as outlined in PLOS consent form) to publish their images. Following approval by the Institutional Review Board (IRB) of The University of Hong Kong and Hospital Authority Hong Kong West Cluster (UW 14159); 20 individuals, 10 male and 10 female volunteers, were properly instructed and gave consent to participate in this study by signing the appropriate informed consent paperwork.

FAIR or unfair? Principled publishing for data.

What is FAIR?

fair

Adverb
Without cheating or trying to achieve unjust advantage.
'no one could say he played fair'

Adjective
Treating people equally without favouritism or discrimination.
'the group has achieved fair and equal representation for all its members'
'a fairer distribution of wealth'

Nature 475, 267 (2011)

http://www.nature.com/news/2011/110720/full/475267a.html

Wide distribution of information is key to scientific progress, yet traditionally, Chinese scientists have not systematically released data or research findings, even after publication.

There have been widespread complaints from scientists inside and outside China about this lack of transparency.

Usually incomplete and unsystematic, [what little supporting data released] are of little value to researchers and there is evidence that this drives down a paper's citation numbers.

Is this FAIR?

FAIR questions to ask?

Is the raw data publicly available?

Are the reagents (plasmids, cells, antibodies, etc.) available?

Are detailed protocols available?

Can I access the processed data & results (supporting the figures)?

Was this all available BEFORE publication to the peer reviewers?

Can I inspect the peer reviews?

Can I publish/link +/-ve replication experiments to this?

A more FAIR approach: Open Data?

Research Objects: a concept & model

http://www.researchobject.org/

Supporting publication of more than just PDFs, making data, code, & other resources first class citizens of scholarship.

Recognizing that there is often a need to publish collections of these resources together as one shareable, cite-able resource.

Enriching these resources and collections with any & all additional information required to make research reusable, & reproducible!

Importance of metadata: context (& discoverability)

https://library.stanford.edu/research/data-management-services/data-best-practices/best-practices-file-naming
https://twitter.com/AlisonMcNab/status/751375987624009728/photo/1

Novel tools/formats for data interoperability/handling: ISA

Importance of metadata: context (& discoverability)

Where do you set it?

Experiment (e.g. International Cancer Genome Consortium): e.g. doi:10.5524/100001
Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2
Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz
Smaller still?

Importance of granularity
Papers
Data/Micropubs
Nanopubs
Facts/Assertions (~10^13 in literature)

Importance of granularity
http://www.nature.com/ng/journal/v43/n4/full/ng.785.html

[Figure: anatomy of a nanopublication. A Nanopublication URL groups three parts: Assertion, Provenance and PublicationInfo. Terms shown in the figure:]
assertion opm:wasDerivedFrom http://rdf.biosemantics.org/profiles_matching_1980_2010
opm:wasGeneratedBy
thisnanopub dcterms:created 2012-03-28T11:32^^xsd:dateTime
pav:authoredBy researcherid.com/rB-6035-2012
association a sio:statisticalAssociation
sio:has-measurementValue Association_1_p_value a sio:probability-value
sio:has-value 6.56e-5^^xsd:float
sio:refers-to http://bio2rdf.org/omim:210600
http://bio2rdf.org/geneid:55835
dcterms:DOI http://dx.doi.org/.

Integrity Key
An individual association between concepts:
statement or declaration
measurement
hypothetical inference
quantitative or qualitative
Guarantee immutability after publication
Unique, persistent and resolvable identifier
How this assertion came to be: methods, evidence, context, etc.
Detailed attribution for authors, institutions, lab technicians, curators
License info
Publication date

A nanopub represents structured data along with its provenance in a single publishable & citable entity.
http://nanopub.org/
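As a rough sketch of the idea, a nanopublication can be pictured as one immutable object that bundles an assertion with its provenance and publication info, plus a key that changes if any part is altered. Real nanopublications are expressed as RDF graphs and use their own integrity scheme. The `Nanopub` class, its field names, the example.org URI and the SHA-256 digest below are assumptions chosen only to illustrate the concept.

```python
import hashlib
import json
from dataclasses import dataclass, replace
from typing import Any, Dict


@dataclass(frozen=True)
class Nanopub:
    """Illustrative stand-in for a nanopublication (not the RDF model)."""
    uri: str                          # unique, persistent identifier
    assertion: Dict[str, Any]         # the statement being made
    provenance: Dict[str, str]        # how the assertion came to be
    publication_info: Dict[str, str]  # attribution, date, license

    def integrity_key(self) -> str:
        """Digest over all parts; any change yields a different key."""
        payload = json.dumps(
            [self.uri, self.assertion, self.provenance, self.publication_info],
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Values echo the slide's figure; the URI is a hypothetical placeholder.
np1 = Nanopub(
    uri="http://example.org/nanopub/1",
    assertion={
        "type": "sio:statisticalAssociation",
        "refers_to": ["http://bio2rdf.org/omim:210600",
                      "http://bio2rdf.org/geneid:55835"],
        "p_value": 6.56e-5,
    },
    provenance={"opm:wasDerivedFrom":
                "http://rdf.biosemantics.org/profiles_matching_1980_2010"},
    publication_info={"dcterms:created": "2012-03-28T11:32"},
)
key = np1.integrity_key()

# Changing the assertion after "publication" produces a different key.
tampered = replace(np1, assertion={**np1.assertion, "p_value": 0.5})
```

Comparing `tampered.integrity_key()` with `key` shows how a key of this kind lets a reader detect that a published assertion has been modified.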

Lots of models/standards/guidelines
Where does that leave us?

5 star open data

A mnemonic to remember: FAIR
http://www.nature.com/articles/sdata201618
http://www.datafairport.org/

Findable
Accessible
Interoperable
Reusable

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available
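Principle A1 can be illustrated with DOIs, which resolve over plain HTTPS. The sketch below only builds the request and does not send it. It assumes the doi.org resolver and assumes that the registration agency behind the DOI supports content negotiation for CSL JSON metadata, which is not guaranteed for every DOI. The function name `metadata_request` is made up for this example.

```python
from urllib.parse import quote
from urllib.request import Request

DOI_RESOLVER = "https://doi.org/"
# Media type requested via content negotiation (assumed to be supported).
CSL_JSON = "application/vnd.citationstyles.csl+json"


def metadata_request(doi: str) -> Request:
    """Build an HTTP GET request asking the resolver for metadata about a DOI."""
    url = DOI_RESOLVER + quote(doi, safe="/")
    return Request(url, headers={"Accept": CSL_JSON})


# One of the dataset DOIs mentioned in this class.
request = metadata_request("10.5524/100044")
# Sending it needs network access: urllib.request.urlopen(request)
```

The identifier plus an open, free protocol (HTTP) is all a client needs, which is what A1 and A1.1 ask for. Where access must be controlled (A1.2), the same request could carry credentials.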

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

To be Reusable:

R1. (meta)data are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards
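Some of the principles above lend themselves to a simple automated first pass. The following is a minimal sketch that checks whether a metadata record has non-empty values for a handful of fields. The field names and their mapping to principles are assumptions made for illustration, and passing such a check says nothing about whether the metadata is actually rich or accurate.

```python
from typing import Dict, List

# Hypothetical field names mapped to the principle each one loosely supports.
REQUIRED_FIELDS = {
    "identifier": "F1: globally unique and persistent identifier",
    "description": "F2: rich metadata",
    "access_protocol": "A1: standardized communications protocol",
    "license": "R1.1: clear and accessible data usage license",
    "provenance": "R1.2: detailed provenance",
}


def missing_fair_fields(metadata: Dict[str, str]) -> List[str]:
    """List the principles whose supporting field is absent or empty."""
    return [
        principle
        for key, principle in REQUIRED_FIELDS.items()
        if not metadata.get(key, "").strip()
    ]


# Hypothetical record: everything present except provenance.
record = {
    "identifier": "doi:10.5524/100001",
    "description": "Example dataset description",
    "access_protocol": "https",
    "license": "CC0",
}
gaps = missing_fair_fields(record)
```

Here `gaps` flags the missing provenance, which is the kind of gap that is far cheaper to fix before publication than after.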

Beyond a mnemonic: FAIR ecosystems FAIRifier tool

Beyond a mnemonic: FAIR ecosystems

A particular class of FAIR Data System to provide support for data interoperability.
Supports publication, search and access to FAIR data.
Fosters an ecosystem of applications and services.
Federated architecture: different FAIRports (and other FAIR Data Systems) are interconnectable.
Supports citations of datasets and data items.
Provides metrics for data usage and citation.

A FAIRpoint or FAIRport can be any specific data instance following FAIR data principles. http://www.datafairport.org/

Beyond a mnemonic: FAIR ecosystems
https://www.fair-access.net.au/fair-statement

By 2020, Australian publicly funded researchers and research organisations will have in place policies, standards and practices to make publicly funded research outputs findable, accessible, interoperable and reusable.

More FAIR mnemonics: BYODs

DTL/ELIXIR-NL: Bring Your Own Data Party
GigaScience/BGI HK: Metabolomics ISA-TAB-athon

FAIR Data in the wild

Taking a microscope to the publication process

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127612

How FAIR can we get?
[Diagram: data sets and analyses, each with a DOI, linked to the paper]

Open-Paper, Open-Review: DOI:10.1186/2047-217X-1-18
>50,000 accesses & 885 citations
7 reviewers tested data in ftp server & named reports published

Open-Data: 78GB CC0 data
Open-Pipelines, Open-Workflows
DOI:10.5524/100044
DOI:10.5524/100038

Open-Code: code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/
>40,000 downloads
Enabled code to be picked apart by bloggers in wiki: http://homolog.us/wiki/index.php?title=SOAPdenovo2

Can we reproduce results? SOAPdenovo2 S. aureus pipeline

The SOAPdenovo2 Case study
Subject to and test with 3 models:

Types of resources in an RO: Data; Method/Experimental protocol; Findings
Models to describe each resource type: ISA-TAB/ISA2OWL; Nanopublication; Wfdesc/ISA-TAB/ISA2OWL

CORRECTION
1. While there are huge improvements to the quality of the resulting assemblies, other than in the tables it was not stressed in the text that the speed of SOAPdenovo2 can be slightly slower than SOAPdenovo v1.
2. In the testing and assessment section (page 3), based on the correct results in table 2, where we say the scaffold N50 metric is an order of magnitude longer from SOAPdenovo2 versus SOAPdenovo1, this was actually 45 times longer.
3. Also in the testing and assessment section, based on the correct results in table 2, where we say SOAPdenovo2 produced a contig N50 1.53 times longer than ALL-PATHS, this should be 2.18 times longer.
4. Finally in this section, where we say the correct assembly length produced by SOAPdenovo2 was 3-80 fold longer than SOAPdenovo1, this should be 3-64 fold longer.
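Several of these corrections are fold changes in the N50 metric, which can be recomputed by anyone with access to the assembly output. As a minimal sketch, N50 is taken here in its usual sense: the length L such that contigs (or scaffolds) of length L or longer cover at least half of the total assembly length. The two lists of lengths are toy numbers invented for illustration, not values from the SOAPdenovo2 paper.

```python
from typing import List


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Toy assemblies (hypothetical lengths).
old_assembly = [2, 2, 2, 3, 3, 4, 8, 8]
new_assembly = [10, 20, 30, 40, 80]
fold_change = n50(new_assembly) / n50(old_assembly)
```

Recomputing a headline ratio like `fold_change` directly from shared data is exactly the kind of check that surfaced the errors listed above.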

Lessons Learned
Most published research findings are false. Or at least have errors.
With enough effort it is possible to push button(s) & recreate a result from a paper with current tools.
Being FAIR can be COSTLY. How much are you willing to spend? Who will build FAIR infrastructure?
Much easier to make things FAIR before rather than after publication. BYODs are a useful intermediate here.

http://www.nature.com/ng/journal/v48/n4/full/ng.3544.html

The question to ask in order to be a data steward, to handle data or to simplify a set of standards is the same: is it FAIR?

http://content.iospress.com/articles/information-services-and-use/isu824

Levels of FAIRness: A-F of FAIR data
In class activity: How FAIR is this data?

Data from: Live poultry exposure and public response to influenza A(H7N9) in urban and rural China during two epidemic waves in 2013-2014
http://hub.hku.hk/cris/dataset/dataset93128

Supporting data for "Genomic analyses reveal FAM84B and the NOTCH pathway are associated with the progression of esophageal squamous cell carcinoma"
http://dx.doi.org/10.5524/100181

Linked Drug-Drug Interactions (LIDDI) https://datahub.io/dataset/linked-drug-drug-interactions-liddi

http://content.iospress.com/articles/information-services-and-use/isu824

Reflection: how fair is FAIR?
Read the FAIR principles paper.
http://www.nature.com/articles/sdata201618

Do you think they are applicable and feasible for HK? If it is feasible, what is needed to implement them?

Any questions?
Does anyone have BYO data for the curation/cleaning workshop?

Final Project
For the final project for this course, you can choose from 3 assignment options. The assignment is due on the 15th May and it is worth 40% of your grade. Time will be set aside for presenting on this during the final class on the 24th April: covering why you chose the option, what discipline/dataset/topic you are covering, and what work you've done so far (5 mins per student including any group feedback).

Final Project: Option 1
Write an Annotated Bibliography about data curation practices in an academic discipline of your choosing.

Choose a discipline (sciences, social sciences, & humanities) OR choose the topic of open data.
Summarize data practices in your chosen discipline or topic. (5-7 sentences)
Find 7-10 sources that relate that discipline or topic to data creation, management, and/or curation.
Provide a citation for the source in APA style.
Write a short annotation that summarizes the content of the source. You may include quotes from the source sparingly, but the annotations should be mostly, if not entirely, in your own words. (3-5 sentences)
Explain the relevance of the source with relation to the data practices of your chosen discipline or topic. (1-2 sentences)
Find a few example public datasets to demonstrate the above points. Cite the data in the relevant places in the Bibliography according to the Data Citation Principles.
Refer to this guide for more information about annotated bibliographies: http://sites.umuc.edu/library/libhow/bibliography_tutorial.cfm
Your annotation should be in the Descriptive style.

Final Project: Option 2
Using a relevant dataset (this can either be from the literature curation exercise, a BYO dataset, or one given to you), write a report that includes a description of the dataset, a Data Management Plan, and a guidelines document for the researcher(s).

Describe the dataset that explains the form of the data and the academic discipline in which it was created. This paragraph should provide context for the (3-5 sentences)
1-2 page Data Management Plan following the guidelines from HKU or granting body such as NSF.
1 page guidelines document that could be presented to the researcher(s) that provides guidelines for their data (extant and forthcoming):
Preservation
Appraisal
Documentation
For the DMP and the guidelines document, you can extrapolate from your dataset to imagine additional details about the research practices that created the dataset and will create more data in the future.
Look for suitable data repositories that can host this data (institutional, general purpose, or subject specific), and if there is one relevant then publish the data if you have permission, and correctly cite the data in the relevant places in your report. [disclaimer: if have permission]

Final Project: Option 3
Prepare a 30 minute data curation workshop that you could teach to researchers that would provide them the necessary details to understand why data curation is relevant to them and best practices they should follow.

Slide deck that introduces data curation for a researcher audience. (No more than 40 slides.)
Presenter outline that describes the important points for each slide.
Topics that might be addressed in your workshop: the value of data management, writing a data management plan, data repository options.
You can assume your audience is researchers at HKU.
Make sure all of the content is copyright free, and share the final material openly (e.g. figshare, scholarhub, OER commons, etc.), and with sufficient metadata to make it discoverable.

Looking ahead
Submit 1 paragraph reflection on FAIR principles through moodle forum
Next class (22nd April) is hands-on curation workshop with Dr Chris Hunter
Bring laptops and any data you may have for a data cleaning exercise
Final project due 15th May
Need to present preliminary version on 26th April to get feedback before completion. Send me slides by the 25th April so I can get them ready for the class
