Preparing Scientific Data for Archiving and Sharing

Part One

Slide 1 [Preparing Scientific Data for Archiving & Sharing]

Hello. Welcome to the training video on preparing scientific data for archiving and sharing.

Slide 2 [Video Outline]

This video is broken up into three separate parts. Part One covers topics that should be considered when planning a project with data sharing in mind. These include what the NIDILRR requirements for data sharing are, what we mean by “data,” why data sharing is important, the components of a data management plan, and language to use and avoid when crafting informed consent documents for your institutional review board.

Part Two covers topics involved with preparing the data and documentation to be shared. This includes a discussion of the various formats of data that can be shared, how to structure your data files, what type of documentation should accompany your data, naming conventions for variables, and an overview of how to handle restricted or sensitive data.

Part Three covers the actual data sharing process, including an overview of the Inter-university Consortium for Political and Social Research, or ICPSR, which is NIDILRR’s preferred repository for data sharing, how to identify a repository that follows best practices, a walkthrough of ICPSR’s data deposit process, and data citation or data usage features available through ICPSR.

Slide 3 [Part 1: Planning for Data Sharing]

Let’s begin with Part One: Planning for Data Sharing

Slide 4 [NIDILRR Data Sharing Requirements]

To start, we want to outline the data sharing requirements for NIDILRR grantees.

Beginning October 1, 2017, all research funded by NIDILRR falls within the Administration for Community Living (or ACL) requirements for public access to scientific data. The ACL Public Access Plan lists the following requirements:

Data must be publicly available no later than 24 months after an award’s end date.

Data must be stored in such a way that enables both retrieval and meaningful use by interested parties at no cost.

All scientific data that result from an award must be made publicly available. If an award funds more than one research project or a research project generates more than one type of scientific data, all data sets from all projects must be made publicly available.

Each data set must have a Digital Object Identifier (or DOI) for future reference and citation.

In the Final Report for your award, you must include the DOIs for all data sets, as well as the release date for your data if you choose to embargo them. This means that you must deposit your data at a repository before submitting your final report.

The public access plan identifies ICPSR as the preferred repository for NIDILRR-funded research. ICPSR is a data repository housed at the University of Michigan and is capable of satisfying the requirements for data sharing outlined in the ACL Public Access Plan, including providing a DOI for each data set and giving the option to researchers to embargo their deposited data for up to 24 months. More detailed information about ICPSR will be provided in Part Three of this video. A link to the website containing the ACL Public Access Plan can be found in the References section below this video.

Slide 5 [What do we mean by “data”?]

Now that you know the data sharing requirements for NIDILRR-funded research, it is important to understand what is meant by the term “data” for the purpose of public access, and particularly what data must be shared to meet the ACL requirements. According to the ACL Public Access Plan, scientific data are defined as “digitally recorded, factual material commonly accepted in the scientific community as necessary to validate research findings, including data sets used to support scholarly publications.” [Source: ACL Public Access Plan]

Slide 6 [What is shared in “data sharing”?]

In general, scientific data that should be shared include:

The raw data you collect from any NIDILRR-funded project.

Any data that you derive or construct from other original data. For example, if you develop a composite variable by summing the values of several other variables in a data set, that new composite variable should be included in what is shared.

The metadata should also be shared. Metadata refers to information and documentation that makes the data more useable or easier to understand. This includes documents like a codebook, a user guide, any descriptions of created or derived variables, information about sampling and weighting procedures, and information about the study itself. These elements will be discussed in greater detail in Part Two of this video.

It is important to note that personally identifiable information (or PII) is not typically included in data that are shared. There are some instances where certain identifying information is necessary so that the value of the data for secondary users is not compromised, but for the most part, only de-identified data should be shared and only de-identified data are required to be shared under the ACL public access plan. Ways to de-identify a data set prior to sharing it will also be discussed in Part Two of this video.

Slide 7 [What does not need to be shared?]

Elements of the research process that do not constitute scientific data, and do not need to be shared include lab notebooks, preliminary analyses, drafts of scientific papers or manuscripts, plans for future research, peer review reports, communications with research team members or other colleagues, and physical objects, like lab specimens.

Slide 8 [What does not need to be shared?]

There are some data that are exceptions to the ACL public access plan. These include personally identifiable data, any proprietary trade data, and any other data whose release is limited by law, regulation, security requirements, or policy. If you have questions about whether your data are included in one of these exceptions, you should contact your program officer at NIDILRR.

Slide 9 [Reasons to Share Data]

There are many reasons that researchers would want to share their data. One reason to share your data is that it is required. Funding agencies within the government, academic institutions, and journals are all increasingly requiring affiliated researchers to share their data as a condition for continued funding or publication. In February of 2013, the Director of the Office of Science and Technology Policy, John Holdren, released a memo that directed all federal agencies with an annual research and development budget over $100 million to develop a public access plan for disseminating the results of their research.

Sharing data also improves the return on investment for public money. It allows more people to utilize data that have already been collected. It also broadens the impact of research funding and makes research funding more efficient.

Another reason to share data is to benefit science. The scientific process involves building off of previous studies and results. Data sharing encourages transparency, which is important to verify results and ensure their validity across space and time.

Data sharing is also quickly becoming the new norm in many disciplines. A search on “data sharing” in The New England Journal of Medicine returns 564 manuscripts from the past five years alone.

Slide 10 [Benefits to Original Researchers]

For the original researchers, sharing data allows more people to use their data set. This increases the impact of the original researcher’s work. Researchers who share data also receive credit or attribution for their work through data citation. Increasingly, institutions are allowing data products to be included in reports that demonstrate the impact of researchers’ work. This can be especially important when applying for tenure and promotion, as well as to show current and prospective funding agencies the productivity of funding the researcher’s projects. Often, the data use agreement by which secondary users must abide includes a requirement that the data must be cited in the secondary user’s publications. Another benefit of sharing data at a repository like ICPSR is that it allows data usage metrics to be captured so that researchers can easily access information about secondary usage of their data.

Slide 11 [Benefits to Secondary Users]

For secondary users, having data already collected and available minimizes the resources they have to use for their own research projects. Since they do not have to spend time collecting the data, this may shorten the time to new findings and accelerate new discoveries. Additionally, as research budgets are being stretched more and more, and grants are becoming increasingly competitive, sharing data helps researchers who may not otherwise have the opportunity to gather the data necessary for their projects due to funding limitations.

Secondary users can also bring unique perspectives to the data to expand their usage and utilize the data in new and important ways. Sharing data helps reduce scientific inefficiency by allowing researchers to see what others have already done and avoid duplicating their efforts. By seeing what others have done, secondary users can build upon others’ work to advance science more rapidly.

Sharing data also allows researchers to increase sample sizes by combining data resources, which can be particularly valuable for studying smaller subpopulations. Lastly, instructors are able to access the data to incorporate into their lessons to enhance undergraduate and graduate training.

Slide 12 [Planning your Data Sharing]

Now that you know your requirements for sharing data and the benefits of data sharing, let’s talk about practical considerations to keep in mind as you plan to archive and share your data. Each of these points will be discussed in the following slides, but to give an overview, you will want to write and finalize your data management plan, include your intent and approach for data sharing with your institutional review board submission, and make sure that you add language about sharing data to your informed consent documents.

Slide 13 [Data Management Plan]

A Data Management Plan is a written document that describes how researchers will provide for long-term preservation of, and access to, scientific data in digital formats. As a NIDILRR grantee, you are required to submit a data management plan as outlined in the ACL public access plan.

Slide 14 [Elements of Data Management Plans]

What needs to be included in the data management plan that you submit to NIDILRR?

First, there needs to be a description of the types and format of the data that you will collect, including a description of how they will be organized, stored, and preserved. For example, will you be collecting data through surveys? Interviews? Focus groups? Will the data be stored as an Excel spreadsheet? Audio or video file? Will they be stored on your computer? External hard drive?

You must also include a description of the documentation and metadata that will be included as part of your submission to a repository. Again, documentation and metadata refer to information that describes the data, such as a codebook that describes your variables and how the data are coded, survey instruments or question text, user guides for the data, descriptions of weighting or sampling procedures, calculations for created or derived variables, and other related information that helps to understand the data and how they can be used.

If ICPSR is the repository you choose for data sharing, you must indicate this in your data management plan. If you do not choose ICPSR, you must provide a justification in the plan for why you chose a different repository. You must explain why the other repository would be more appropriate than ICPSR for the type of data you collected, and ensure that it is comparable to ICPSR in terms of industry standards.

Your data management plan must also include a description of the process for obtaining study participants’ consent, which will be discussed in the next few slides. You must also indicate the estimated cost to implement the data management plan, which is allowed as part of the award’s direct costs.

Slide 15 [Institutional Review Boards (IRBs)]

When you submit your study to your institutional review board (or IRB), you should indicate your intent to share the data. IRBs differ across institutions in what they require researchers to submit, but it is best practice to be as transparent as possible to your IRB about how data will be collected and shared. In addition to information in your data management plan, your IRB may ask you to provide answers to other questions about your methods to protect the identity of study participants. Ultimately, your IRB must approve your informed consent document, which provides information to study participants about how the information they provide to you as part of the study will be used and shared. It is important to ensure that you have IRB approval to share your data through a repository before you begin collecting data.

Slide 16 [Informed Consent]

To clarify informed consent a bit more, informed consent refers to the communication process that allows individuals to make informed choices about participating in a research study. As part of the document that you provide to participants, you will want to be sure to include an assurance that the data will be de-identified prior to sharing so that participants’ identities will be protected. You should convey as much information as possible about the process of handling data from the start to finish of the project. For example, tell participants who will have access to the raw data that may contain identifying information, such as members of the research team, how these data will be stored, and explain that you will protect their identity before sharing the data with other researchers outside of the research team.

It is important to avoid using language that would unnecessarily prohibit or limit the sharing of data with sponsors or other researchers. The goal of this document is to inform participants about the study and how the information they provide will be used, stored, and shared. Overall, be honest with them that you will share the data with others, but assure them that you will take all the necessary actions to make sure their identity is protected.

Slide 17 [Example Language]

This is just one example of recommended language to include in a consent form. It is important to note that this is not the only information that should be included in the consent form, and there are many ways to phrase consent forms to still allow data sharing. The ICPSR website provides several recommendations on language to use and language to avoid when crafting these documents. Additionally, your own IRB is likely a good resource for questions about what should and should not be included in your consent forms to ensure that you can share your data.

You can see in this example, in the first two sentences, that the researchers are explicit that the data may be shared with other researchers, but that the data will only be used in ways that protect participants’ identities. Importantly, the researchers do not use any language or make any guarantees that would prevent them from sharing the data.

[Example provided in the slide states: “The information in this study will only be used in ways that will not reveal who you are. You will not be identified in any publication from this study or in any data files shared with other researchers. Your participation in this study is confidential. Federal or state laws may require us to show information to university or government officials (or sponsors), who are responsible for monitoring the safety of this study.” Source: Pienta and Marz example at Qualitative Data Repository]

Slide 18 [Informed Consent and IRBs]

To recap, it is in your best interest to consult with your own IRB early on about any special requirements that they may have for sharing data. This can be done even before submitting your NIDILRR proposal so that you can prepare your data management plan accordingly. What we have provided to you are general guidelines, but ultimately, each IRB may vary in what information they require researchers to provide to both the IRB and to study participants to ensure data sharing is allowed.

ICPSR has a variety of online resources on topics like data management plans, data confidentiality, informed consent language, and IRBs. Links to these resources can be found in the References section below this video.

To conclude Part One of this video, we want to remind you that the sooner you consider data sharing in the research process, the easier it will be at the end. Consulting with NIDILRR, your IRB, and ICPSR at the start of your project will minimize issues that may occur later on in the data collection or data sharing processes.

Part Two

Slide 19 [Part 2: Preparing Data & Documentation]

Part Two of this video focuses on preparing your data and the accompanying documentation for sharing.

Slide 20 [Part 2 Outline]

This part of the video will cover best practices for preparing your data for sharing, including details you should consider when creating variables, considerations for sharing derivative or constructed data sets, how to best organize your data files, and what information should be shared about the study itself. How to handle data that contain sensitive or identifying information will also be discussed.

Slide 21 [Data and Documentation]

The main idea to keep in mind when you are preparing your data for sharing is that a well-prepared data collection contains information intended to be complete and self-explanatory for future users. When preparing your data, it can be helpful to consider a secondary user’s perspective by asking yourself the question: “How should I organize the data so that secondary users can independently understand the data collection?”

Slide 22 [What Should My Deposit Include?]

Part One of the video outlined what are considered scientific data that need to be shared, and elements of the research process that do not need to be shared. As a reminder, the elements that should be included in a data deposit to a repository include the files that contain the data, accompanying documentation that explains the data, like the codebook or user guide, and documents that describe the study and the data, or what is known as metadata.

Slide 23 [Structuring Data for Sharing]

Early in the research process, even before you collect data, you should think about how to organize your data to make them understandable by secondary users. The recommendations for organizing your data in the next several slides are not only helpful when it comes to sharing your data with others, but these are also general best practices for data organization even for yourself and other members of the research team.

When you think through the organization of your data, it can be helpful to think about the data at three different levels: the variable level, the file level, and the study level.

We will discuss each of these levels in the following slides, but it is important to note that the precise organization of your data collection will be driven by the nature of your specific data and research.

Slide 24 [Variable-Level Structure]

Let’s start with the first level of your data collection: the variable level. There are several variable-level elements to think through prior to collecting data, including naming conventions, variable and value labels, missing data, and variable-level documentation. Each of these will be discussed in the following slides.

Importantly, at this level of the data, there are some differences between statistical and non-statistical data, which we will point out along the way. By statistical data, we mean any data that are numerical. Statistical data are most commonly arranged in tabular form, with rows and columns. By non-statistical data, we mean any data that are not numerical. These may include text data, images, audio, or video.

Slide 25 [Variable Naming Conventions]

At the start of your project, something you will want to consider is the variable naming convention that you will use throughout your data collection. This is true of both statistical data and text data. Here are some common naming conventions.

The first is one-up numbering, like V1 for the first variable, V2 for the second, and so on. These are simple, but they do not really convey any information or content about the data contained in each variable.

A similar convention is using question numbers. For example, Q1 corresponds to the first question on your survey, and so on. Again, this is simple to do, but has the same issue as the one-up numbers. Additionally, you may have multiple variables corresponding to the same survey question if there are multiple parts to the question. You may end up with Q1a, Q1b, and so on, which could get confusing. This can also be problematic if you have data from several different questionnaires and each questionnaire has questions with the same number. In this case, you would need to add yet another identifier to the variable name to differentiate between the various questionnaires, which again, can get confusing, especially for secondary users who are not as familiar with the questionnaires as you might be.

Another option is to use mnemonic names, or something that corresponds to or describes the variable directly. This can be a good approach, but keep in mind that names that make sense to you may or may not make sense to other users, so you will want to be careful with the names you choose.

The last common convention is prefix, root, or suffix systems. This is common with more complex datasets. For example, multi-wave or multi-level data collections may use this type of naming convention. You might have data collected from both parents, and you may develop a naming schema such that the letter or number in the first position indicates whether the variable comes from the mother or father, the letter or number in the second position might indicate the wave in which the variable was collected, and so on. If using this method, it is important to document to what each part of the variable name refers.

These are just some examples of naming conventions. There are several others that you may choose to use, but regardless of the exact method you choose, it is important that you remain consistent and apply the same naming convention throughout your entire data set.
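As an illustration of a prefix, root, and suffix system, here is a minimal Python sketch. The schema itself is an assumption made for this example and does not come from the video: the first character is M or F for mother or father, the second character is a one-digit wave number, and the root follows an underscore. The function name parse_variable_name is also hypothetical.

```python
import re

# Hypothetical naming schema (an assumption for illustration only):
#   position 1: M = mother, F = father
#   position 2: wave number (single digit)
#   after "_":  root that describes the content, e.g. "health"
SOURCE_CODES = {"M": "mother", "F": "father"}
NAME_PATTERN = re.compile(r"^(?P<source>[MF])(?P<wave>\d)_(?P<root>[A-Za-z]\w*)$")


def parse_variable_name(name):
    """Split a variable name into the parts defined by the schema above."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the naming schema")
    return {
        "source": SOURCE_CODES[match["source"]],
        "wave": int(match["wave"]),
        "root": match["root"],
    }


print(parse_variable_name("M2_health"))
```

A check like this can be run over every variable name in a file to confirm that the convention has been applied consistently before the data are deposited.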

Slide 26 [Variable Labels]

In addition to the naming of the variables, labeling the variables is also important for both statistical and text data. There are three pieces of information that are helpful to include in variable labels.

First, if the variable corresponds directly to a survey or questionnaire item, include the question or item number to which the variable corresponds.

Second, you should always include a clear description of the variable’s content.

Finally, it is helpful to indicate whether the variable is an original variable or a variable that has been derived or constructed by the research team through recoding or scaling of other variables. For example, your survey may ask a series of five questions related to the respondent’s health. You may want to sum the responses from these five questions to create a health index. You will want to indicate that this health index variable was constructed from other variables and not asked directly on the survey. An easy way to indicate that a variable has been constructed from other variables in the data set is to include “underscore C” at the end of whatever name you choose for the constructed variable.

You can see here in the first example for an original variable, the variable name is Rhealth and the variable label contains both the corresponding question number in the survey, as well as a complete description of the data contained in this variable.

[Example one on the slide states: “Rhealth: Q14 Self-Assessment of Respondent’s Health”]

In the second example, you can easily identify that the variable was constructed because the variable name includes “underscore C” at the end. The variable label then describes the content of that variable, which includes a clear indication that the variable was constructed.

[Example two on the slide states: “Rhealth_C: Index of Respondent’s Health (constructed)”]

Again, you can choose other ways to indicate this, but whatever you choose should be documented somewhere and should be consistent throughout your data files. You should also document exactly how the variable was constructed. For example, what variables did you sum together to get this constructed variable?
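The health index example can be sketched in Python as follows. This is only a sketch under stated assumptions: the five item names (health1 through health5) are hypothetical, since the video does not name them, and the construction note is one possible way to record how the variable was built.

```python
# Hypothetical names for the five survey items that feed the index.
HEALTH_ITEMS = ["health1", "health2", "health3", "health4", "health5"]

# Variable label for the constructed variable, following the slide's example.
VARIABLE_LABELS = {"Rhealth_C": "Index of Respondent's Health (constructed)"}

# Record exactly how the constructed variable was derived.
CONSTRUCTION_NOTES = {"Rhealth_C": "Sum of " + ", ".join(HEALTH_ITEMS)}


def add_health_index(record):
    """Return a copy of one respondent's record with Rhealth_C added."""
    updated = dict(record)
    updated["Rhealth_C"] = sum(record[item] for item in HEALTH_ITEMS)
    return updated


respondent = {"health1": 1, "health2": 2, "health3": 3, "health4": 4, "health5": 5}
print(add_health_index(respondent)["Rhealth_C"])
print(CONSTRUCTION_NOTES["Rhealth_C"])
```

Keeping the construction note next to the code that builds the variable makes it easier to copy the same description into the codebook.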

Slide 27 [Value Labels]

In addition to the variable labels, value labels are also important. Variable labels describe the data contained in the variable, while value labels describe all the response choices for each variable and what they represent. Unless you are grouping text data into different categories and assigning a code to each category, value labels will typically only be necessary for statistical data.

Value labels should be mutually exclusive, exhaustive, and precisely defined.

Here is an example of what value labels look like for a variable that captures the respondent’s employment status. You can see, each of the three options that respondents could choose from has its own label and corresponds to how it is coded in the data. So if an individual sees a 1 for this variable, that means the respondent is unemployed. It is important that every possible response option has its own unique code and label.

[Example on the slide states: “R_Employ: Respondent’s Employment Status. Unemployed (1), Self-employed (2), Employed, not by self (3)”]
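One way to check that value labels are exhaustive is sketched below in Python. The labels come from the slide's example; the function name unlabeled_values and the sample data are assumptions made for illustration. Using a dictionary keyed by code also guarantees that each code has exactly one label.

```python
# Value labels for R_Employ, taken from the slide's example.
R_EMPLOY_LABELS = {
    1: "Unemployed",
    2: "Self-employed",
    3: "Employed, not by self",
}


def unlabeled_values(values, labels):
    """Return the sorted codes that appear in the data but have no label."""
    return sorted(set(values) - set(labels))


# Hypothetical responses; 4 and 9 have no label and would need attention.
responses = [1, 4, 2, 9, 3, 1]
print(unlabeled_values(responses, R_EMPLOY_LABELS))
```

An empty result means every code found in the data has a label.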

Slide 28 [Missing Data]

Something that the values for the employment variable in the previous slide did not capture is an instance where there might not be data for that variable for a particular respondent. When developing your coding scheme and value labels, it is important to consider whether there will be missing data and how these will be labeled. Researchers are encouraged to standardize missing data and to differentiate between the different types of missing data.

A variable might have missing data for a number of reasons. For example, the question might be inapplicable to a respondent, the respondent may not know the answer to the question, or the respondent may refuse to answer the question. Even if the differences between various types of missing data are not important to you, a secondary user might be interested in using this information in their analysis.

You can see in the example here that each type of missing data has its own unique value.

[Slide states: “Are there missing data? Are missing data labeled?” The example states: “77 = Inapplicable, 88 = Don’t Know, 99 = Refused to Answer”]

It can be helpful to code missing data in a slightly different way than non-missing data, for example, by having two or three of the same digit in a row. This makes missing data easy to identify and reduces the risk of mistaking missing data for real data.

However you choose to code your missing data, it is important to remain consistent throughout your data set. It is also important to note that if “don’t know” or “inapplicable” is a response option given to participants along with other response options, this can be coded on the same scale as the other choices, and does not necessarily need to be declared as missing data like in the example here.
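The missing data codes from the slide can be handled as in the following Python sketch. The codes 77, 88, and 99 are from the example above; the function names and the sample values are assumptions for illustration.

```python
from collections import Counter

# Missing data codes from the slide's example.
MISSING_CODES = {77: "Inapplicable", 88: "Don't Know", 99: "Refused to Answer"}


def summarize_missing(values):
    """Count how often each type of missing data occurs."""
    return dict(Counter(MISSING_CODES[v] for v in values if v in MISSING_CODES))


def valid_values(values):
    """Return only the values that are not declared missing."""
    return [v for v in values if v not in MISSING_CODES]


# Hypothetical responses for one variable.
responses = [1, 3, 88, 2, 99, 99, 77]
print(summarize_missing(responses))
print(valid_values(responses))
```

Because each type of missing data keeps its own code, a secondary user can still separate refusals from inapplicable questions if that distinction matters to their analysis.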

Slide 29 [Documentation]

No matter the data type, documentation is very important to include in what you submit to a repository to be shared. Documentation refers to any files with information that explains the data and the data collection processes to help secondary users independently understand the data.

The most common types of documents that are shared along with the data include codebooks or data dictionaries, which are documents that describe what the variables are and how they are organized and contain the variable names, variable labels, and value labels, a user guide that instructs secondary users how to use the data, and the original data collection instruments or questionnaires that include the complete text given to study participants, if applicable.

Generally, this information is submitted in Word or PDF format, but most repositories accept several other file types.

[Slide states: “Format: MS Word, PDF, ASCII, DDI XML”]

The more thoroughly documented your data are, the easier it will be for secondary users to understand your data and use them appropriately.

Slide 30 [Derivative/Constructed Data]

Now let’s consider an instance where you have constructed or derived data from a data set that was collected by other researchers. For example, you may have combined several data points from different researchers to produce a data set that is appropriate for your study, or you may have created several new index variables from variables in an existing data set. What should you share and what are you allowed to share in these instances?

First, recall that under the ACL Public Access Plan, you are required to share any scientific data that results from your study. This means that any variables that you constructed or derived from other data must be shared. Whether you can share the original data set from which your constructed variables came along with your data depends on the level of access to the original data set.

Before sharing any data from a data set that was collected by another researcher or organization, you will want to ensure that you have the rights to redistribute the original data. This is typically not a problem if the original data are unrestricted-use or public access data, but you should verify with the original source of the data before sharing them.

If the original data are restricted-use data or proprietary data that you do not have the rights to redistribute, you still need to deposit the data that you constructed or derived, as well as the source information about the original data. If this is the case, it is most helpful for secondary users if you also include the statistical code that you used in your statistical program to construct the new data, and a document that explains how you constructed the new data in non-statistical terms. This way, if secondary users are interested in doing something similar, they can obtain the original data from the original data source on their own. They can then match that data up with the information that you provide about your new constructed variables or look at your statistical code to see what you did to recreate the derivative data set.
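To illustrate the kind of construction code that could accompany derived data, here is a small Python sketch, offered as an assumption rather than a required method. It builds one index variable as the mean of several source items. The item names, the missing codes, and the variable `function_index` are all hypothetical.

```python
# Hypothetical source items from an original data set (names are illustrative)
SOURCE_ITEMS = ["q1_mobility", "q2_selfcare", "q3_activity"]
MISSING = {77, 88, 99}  # assumed missing-data codes, not real responses

def build_index(record):
    """Derived variable: mean of the non-missing source items, or None."""
    answered = [record[item] for item in SOURCE_ITEMS if record[item] not in MISSING]
    if not answered:
        return None  # no usable items, so the index is left undefined
    return sum(answered) / len(answered)

# One hypothetical respondent from the original data set
original = {"id": 101, "q1_mobility": 2, "q2_selfcare": 4, "q3_activity": 99}

# The derived record holds the ID and the new variable only, so it can be
# deposited even when the original items cannot be redistributed
derived = {"id": original["id"], "function_index": build_index(original)}
```

A secondary user who obtains the original data on their own could run the same function to recreate the derived variable.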

Slide 31 [Data Files]

Now let’s move on to the file-level structure of data. First, data and data files can come in a wide variety of formats and appearances. Again, with the nature of your data in mind, you will want to ask yourself what file structure will make the most sense for your data and for secondary users.

Slide 32 [Common Statistical Data File Formats]

There are several common file formats that can be accepted by repositories for sharing statistical data. It is easiest to share your data in a statistical package format, like SPSS, SAS, or Stata files. ASCII text format with statistical setups is also acceptable, and can easily be converted into other formats. Data in relational databases, Access, Excel, or text files that are comma delimited or tab delimited are also acceptable.

Slide 33 [Non-Statistical File Formats]

For non-statistical files, there are some formats that are easy to share and are more readily convertible to other formats. There is always some risk that a format may become obsolete over time as technology advances. This is particularly true for proprietary file formats. Repositories like ICPSR often differentiate formats that they can fully support from those that they cannot guarantee over time.

The formats listed on this slide are generally the easiest to share.

[Slide lists: Heading “Text:” Text, Rich Text, PDF (+OCR), MS Word; Heading “Image:” TIFF, JPEG2000, PNG, JPEG/JFIF, GIF; Heading “Audio:” MPEG, WAV, AIFF; Heading “Video:” JPEG2000, MOV, AVI]

Since technology advances quickly, it is a good idea to contact your repository before collecting data to discuss any changes in recommended formats that may have taken place recently in order to fully support access to the data over longer periods of time, especially if you are planning to collect non-statistical data.

For image and video formats in particular, file size and storage needs may be important to consider depending on the scale of your data collection. Again, you should consult with your repository before collecting data to ensure it can handle video and audio files at the scale your data will demand.

Slide 34 [File-Level Organization]

How you arrange your files can also affect how understandable your data collection is. There are several ways to organize your files. Again, this will largely depend on the nature of your data.

The information in this slide is relevant for both statistical and non-statistical data files. However, there are some special considerations for non-statistical data files, which will be discussed in the next slide.

You might have one large data file that contains all of your variables in one place. You may have one large file that separates your data into different sections, like using different sheets within an Excel file. You may also have several separate files that each contain a different type of data. You may put different data files in different folders, or you could nest folders within other folders, or arrange them hierarchically, if that makes sense for your data.

Any of these formats are acceptable, but the more complex the file structure is, the more information you will have to provide to secondary users about how to navigate your file or folder structure. This can be done in a separate Word document that also gets shared alongside your data files. [Slide states: “If the structure of your data files is complex, consider creating a user guide.”]

Slide 35 [Importance of File Organization for Non-Statistical Data]

Unlike statistical data where you might have hundreds of data points in one spreadsheet or file, data points in formats such as video or image are often represented by an entire file. One of the best ways to ensure your data are easy to use by secondary users is to organize and label your files.

For smaller studies with 40-50 files or fewer, it is sufficient to select descriptive file names that help the user understand how the files are organized and that include any descriptive information about the file that is easy to convey, for example, Female_1, Female_2, and so on. For larger studies, you should also consider using well-labeled folders and subfolders to group similar files together and make it easier for a user to navigate and find what they are looking for.

The best practice is to include a document called a “data listing” that lists all the files and indicates various characteristics associated with each file. For example, all of the image files in a study may be listed in a Microsoft Excel spreadsheet, and the columns can be used to indicate various demographic characteristics or treatment conditions for each image file. This kind of “data listing” can help users more easily identify a subset of files with relevant characteristics, for example, male participants at first interview who were age 45 and older.
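The following Python sketch shows how a secondary user might use such a data listing to find a subset of files. It assumes the listing has been saved as comma-delimited text; the file names, column names, and values are hypothetical.

```python
import csv
import io

# Hypothetical listing: one row per image file, with characteristics as columns
LISTING_CSV = """file_name,sex,age,interview
Male_1.tif,male,52,1
Male_2.tif,male,38,1
Female_1.tif,female,47,1
Male_3.tif,male,45,2
"""

def select_files(listing_text, **criteria):
    """Return file names whose listing row satisfies every criterion."""
    rows = csv.DictReader(io.StringIO(listing_text))
    return [row["file_name"] for row in rows
            if all(test(row[column]) for column, test in criteria.items())]

# Male participants at first interview who were age 45 and older
subset = select_files(
    LISTING_CSV,
    sex=lambda v: v == "male",
    interview=lambda v: v == "1",
    age=lambda v: int(v) >= 45,
)
```

With one row per file and one column per characteristic, the listing can be filtered without opening any of the image files themselves.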

Slide 36 [Study-Level Metadata]

Finally, let’s turn to discussing the study-level information that you should prepare to share along with your data. Study-level metadata is equally important and works in the same way for all data, regardless of format. When we talk about study-level metadata, we mean study descriptions that explain the study purpose, collection methods, sample, and other information that help the secondary user understand why the data were collected. The study-level metadata should focus on the broader study and the data collection efforts, so in that regard, it is not the same thing as a publication abstract.

Slide 37 [Study-Level Metadata]

When thinking about what information you should have ready to share about your study, it is helpful to think about the common “W” questions.

Who is involved in the study? This includes the principal investigators and other members of the research team, the participants of the study, and the funding organization of your project.

What is the study about? You should prepare a summary of the study that describes what you did throughout the project.

Where did the study and/or data collection take place?

When was the study conducted, or when were the data collected? To what time period do the data refer?

Why were the data collected? Describe the purpose, goals, and scope or aims of the study.

And finally, how did you collect the data? What design or methodology did you use? What was your sample and how was it chosen? Did you use any weighting techniques? And if so, what steps do secondary users need to take to apply the same weights?

You should have all of this information, as well as any additional information that you think would help others understand and use your data, in a separate document to share along with your data.

Slide 38 [Disclosure Risk]

Now that we have discussed setting up the structure of your data to enable data sharing, you may be concerned about how to share your data appropriately. You may be collecting health or educational data that are protected under federal statutes or laws, and you may be concerned about protecting the identity of your study’s participants. It is good to be concerned about these things and to think about how protecting participants fits in with data sharing.

When a data record contains information about an individual that can potentially identify them, we call this a disclosure risk. Some examples of data that have the potential to identify participants include detailed data on geography, exact birth date or other exact dates, exact occupations held, or a combination of several variables that secondary users may be able to use together to identify study participants.

Slide 39 [Direct Identifiers]

There are two types of identifying data: direct and indirect. Direct identifiers are variables that point explicitly to particular individuals or units. They may have been collected in the process of survey administration and are usually easily recognized.

Some examples of direct identifiers include: participants’ names, photos or videos of participants, their addresses and phone numbers, Social Security Numbers, or other unique numbers that are linked with an individual, like driver’s license numbers or employee ID numbers.

Slide 40 [Indirect Identifiers]

Indirect identifiers are not necessarily problematic on their own, but when combined with other variables, may reveal the identity of study participants. For instance, a United States ZIP code field may not be troublesome on its own, but when combined with other attributes like race and annual income, a ZIP code may identify unique individuals within that ZIP code, which means the answers the participant thought would be private are no longer private.

[Slide states: Geographic information (from state-level down, including zip and area codes)]

Other examples of indirect identifiers include information on employment or social organization membership, educational histories that include school names and/or years of attendance, and detailed income information.
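One way to see how indirect identifiers combine is to count how many records share each combination of values. The Python sketch below does this for a few hypothetical records. The field names, the values, and the threshold of two records are assumptions for illustration, not a standard prescribed by this video.

```python
from collections import Counter

# Hypothetical records; zip_code, race, and income_band are indirect identifiers
records = [
    {"zip_code": "48104", "race": "A", "income_band": "50-75k"},
    {"zip_code": "48104", "race": "A", "income_band": "50-75k"},
    {"zip_code": "48104", "race": "B", "income_band": "100k+"},
    {"zip_code": "48109", "race": "A", "income_band": "50-75k"},
    {"zip_code": "48109", "race": "A", "income_band": "50-75k"},
]

def risky_combinations(rows, keys, min_count=2):
    """Return key combinations shared by fewer than min_count records."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return sorted(combo for combo, n in counts.items() if n < min_count)

# ZIP code alone versus ZIP code combined with other attributes
zip_only = risky_combinations(records, ["zip_code"])
combined = risky_combinations(records, ["zip_code", "race", "income_band"])
```

In this example no ZIP code is unique on its own, but one combination of ZIP code, race, and income band points to a single record.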

Slide 41 [Do you have identifiable data?]

As the ACL Public Access Plan states, scientific data that must be shared only include de-identified data. This means that before sharing your data, you need to ensure that you have removed all identifying information from the data.

If your data are regulated or protected by certain laws or statutes, like HIPAA for health-related data, or FERPA for education-related data, these laws might identify specific variables that are considered personally identifiable information. For example, HIPAA has determined 18 identifiers that must be removed in order for a data set to be considered de-identified. This list includes several variables mentioned in the previous slides, including name, telephone number, and Social Security Number, but also things like exact age if you have participants aged 89 years or older.

[Slide states: “Example: HIPAA 18 identifiers that must be removed: name, telephone number, SSN, IP address, exact age if over 89, etc.”]

A link to the NIH website that lists all 18 identifiers under HIPAA can be found in the References section below this video.

If your data are protected by these types of laws, you should review them before sharing your data to ensure you understand what constitutes personally identifiable information.

Slide 42 [How to De-Identify your Data]

If your data include identifying information, there are various ways to de-identify your data set before sharing it. For statistical data with direct identifiers, like names or social security numbers, you should remove the variable from the data set completely. For text data, this information should be completely redacted.

For some identifying variables, however, you can recode the data. Age, for example, is only considered to be personally identifiable information if you have participants aged 89 years or older. You can recode your age variable so that all of the ages in your data set that are younger than 89 remain the same, but then all of the ages 89 or older are grouped together and represented by one value.
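The two steps just described, removing direct identifiers and recoding age, can be sketched in Python as follows. The variable names and sample records are hypothetical, and the threshold of 89 follows the description in this video; check the rule that governs your own data before relying on it.

```python
# Assumed variable names; adjust to match your own data set
DIRECT_IDENTIFIERS = {"name", "ssn", "phone"}
AGE_TOP_CODE = 89  # threshold as described in this video

def deidentify(record):
    """Drop direct identifiers and group ages at or above the top code."""
    clean = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if clean.get("age") is not None and clean["age"] >= AGE_TOP_CODE:
        clean["age"] = AGE_TOP_CODE  # one value represents every age 89 or older
    return clean

# Hypothetical raw records with made-up identifier values
raw = [
    {"id": 1, "name": "Participant A", "ssn": "000-00-0000",
     "phone": "555-0100", "age": 34, "score": 12},
    {"id": 2, "name": "Participant B", "ssn": "000-00-0001",
     "phone": "555-0101", "age": 93, "score": 9},
]
shared = [deidentify(r) for r in raw]
```

If you recode age this way, your codebook should state that the value 89 means “89 or older” so that secondary users do not read it as an exact age.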

Other data types, such as video and image, can be very rich in context, and reducing disclosure risk may be more time consuming and complex than it is for statistical data. For example, images and video typically cannot be de-identified without affecting their use. However, you can still share de-identified transcripts or coded data instead of the raw data.

Part Three

Slide 43 [Part 3: Sharing Data]

We have reached the third and final part of this video, which covers the practical elements of sharing your data.

Slide 44 [Part 3 Outline]

First, you will be introduced to ICPSR, which stands for the Inter-university Consortium for Political and Social Research. Following that, we will identify the elements of a high-quality data repository. We will then walk through the process of depositing your data at ICPSR, discuss data citation and data usage tracking, and end with some important reminders as you begin your preparation for data archiving and sharing.

Slide 45 [ICPSR]

What is ICPSR?

It is the preferred data repository for all scientific data that come from NIDILRR-funded projects. ICPSR is housed within the Institute for Social Research at the University of Michigan and was established in 1962.

ICPSR is one of the oldest and largest social science data archives in the world, and defines social science broadly to encompass and accept data from many disciplines.

Its current collection includes data from over 10,000 studies, with over 5 million variables. ICPSR’s website has a searchable bibliography that contains more than 75,000 citations to publications that use ICPSR-archived data.

Slide 46 [What ICPSR Does]

ICPSR takes in research data, archives them for long-term preservation, and shares them with secondary users. Any researcher, including all NIDILRR grantees, has two options for archiving and sharing data with ICPSR.

The first option is to use ICPSR’s basic archival and data sharing services at no cost. This is done through a service called openICPSR. This means that you can deposit your data and all accompanying documentation, and ICPSR will archive your data for the long-term and will share your data as-is with secondary users at no cost to you, or the users.

The second option is to add ICPSR’s curation services, which come at a cost to the researcher, but still at no cost to secondary users. Data curation means that ICPSR will enhance your data for usability. This service is added to the basic services of archiving your data and sharing them with secondary users.

Both ICPSR’s free services and curation services will satisfy the data sharing requirements in the ACL Public Access Plan, and are covered in more detail in later slides.

In addition to its data services, ICPSR also provides user support to all researchers who want to archive their data at ICPSR. Its staff can assist researchers at all stages of the research process, from helping to write data management plans, to walking researchers through the deposit process after the data have been collected. Additionally, ICPSR provides support to secondary users who wish to access data archived at ICPSR.

Finally, ICPSR provides a variety of education and training opportunities related to statistical methods and data management. Information on these resources is available on ICPSR’s website.

Slide 47 [What Data Fit at ICPSR]

What types of data are a good fit to be archived at ICPSR?

The short answer is almost all data on or about people. ICPSR accepts data from almost all disciplines, including the social and behavioral sciences, broadly defined, health and medical sciences, as well as rehabilitation sciences.

Data can be archived at ICPSR in almost all formats. ICPSR has the most experience with tabular, statistical data and text data, but it also accepts other types of data, including geospatial data, videos, images, audio files, and sensor data.

Data sets of all sizes are welcome at ICPSR, including from both small and large-scale studies.

Finally, ICPSR accepts both de-identified data for public use and sensitive data that may need to be restricted-use. If you have any questions about whether your data are a good fit at ICPSR, please contact them at [email protected], or 734-647-2200.

Slide 48 [Free ICPSR Services (openICPSR)]

As mentioned, ICPSR offers a number of free services to any researcher, including all NIDILRR grantees. This is done through openICPSR. The free services offered by openICPSR will satisfy all the requirements of the ACL Public Access Plan, including providing researchers the option to embargo their data for up to 24 months and providing a DOI for each data set produced. ICPSR will archive the data for the long-term and will provide access to the data to secondary users at no cost to them.

It is important to note that the data will be preserved and shared with secondary users in the same format and condition in which they are deposited at ICPSR. As mentioned earlier in the video, some data formats, particularly those in proprietary formats, have some risk of becoming unreadable over long periods of time as technology advances.

Slide 49 [Paid Curation Services at ICPSR]

What is paid curation?

At ICPSR, data curation refers to a variety of services that can be provided to enhance the data to make sure that people can find and use them, both now and long into the future.

For NIDILRR grantees, data must be made available for the public to use at no cost to the user. As mentioned, that can be accomplished by using ICPSR’s free services, but data shared through ICPSR’s free services are not curated and are provided to users as-is, in the condition in which they are uploaded. To ensure the data are accessible at no cost to users and are enhanced for quality to optimize the use of those data, you may ask ICPSR to provide its curation services at a cost.

The fee for curation is an allowable cost, and can be included in your proposed budget to NIDILRR. The fee is based on various attributes of your expected data, such as the number of variables in the data set. ICPSR typically estimates a single fee that may cover: a review of the data for personally identifiable information and mitigation of this information to ensure the data set is de-identified, standardization of missing data codes, review and adjustments for outliers or data coding errors, review and enhancements to information about your study and its variables, development of a standardized, enhanced codebook, and dissemination of statistical data in multiple formats to secondary users. ICPSR also takes numerous steps to mitigate data obsolescence as technology changes.

[Slide states: “Curation services: Review for PII, Missing Data Standardization, Check and Adjust Outliers, Coding Errors, Review of Study-level and Variable-level Metadata, Standardized ICPSR Codebook, Dissemination in SAS, SPSS, Stata, R, ASCII formats, Preservation”]

Slide 50 [Elements of a High-Quality Repository]

In the case that there is a valid reason as to why your data will not be appropriate for ICPSR, we provide some information about how to evaluate an alternative repository, and determine if it offers a similar level of high-quality data archiving and sharing.

Remember, if you choose a repository for your data that is not ICPSR, you must provide justification to NIDILRR as to why ICPSR is not a good fit for your data, as well as explain the alternative repository’s comparability to ICPSR for best practices, including the ability to provide an embargo to your data for up to 24 months, and the ability to issue DOIs for data sets.

The repository should adhere to the FAIR data principles, which are a set of guiding principles for scientific data management that promote maximum use of research data. FAIR stands for findable, accessible, interoperable, and reusable. A high-quality repository will adhere to these principles and have transparent policies and practices to ensure data are FAIR.

ICPSR and other best practice repositories typically follow the Open Archival Information System, or OAIS, reference model.

The emerging certification for repositories is CoreTrustSeal. [Slide states: (formerly Data Seal of Approval)] CoreTrustSeal has 16 different requirements for repositories to meet, which equate to getting a designation as a trustworthy repository. You may consult CoreTrustSeal to learn more about these 16 requirements that the worldwide community of repositories has agreed to as being important.

Slide 51 [How to Deposit Data with ICPSR]

Now let’s talk about how to actually deposit your data at ICPSR. Depositing your data at ICPSR is all done online, through ICPSR’s data deposit manager.

To get to the data deposit manager, you can first go to ICPSR’s main webpage. The web address is simply www.icpsr.umich.edu or you can search for ICPSR in your search engine and follow the link from there.

Toward the top of ICPSR’s homepage, there will be a link to start sharing data. Note that the terms “sharing data” and “depositing data” are used somewhat interchangeably on ICPSR’s website. Once you click on the link to start sharing data, you will be taken to the main deposit page.

On the data deposit homepage, there are instructions for what you will need to complete your data deposit. Once you click on the “start deposit” button, you must first create a MyData account if you do not already have one through ICPSR. A MyData account is free and easy to create. Alternatively, you can log in using your credentials from other sites like LinkedIn or Facebook.

Additional information for depositing your NIDILRR-funded data at ICPSR can be found through the website of one of ICPSR’s topical archives, ADDEP. ADDEP stands for the Archive of Data on Disability to Enable Policy and research. ADDEP’s website, www.icpsr.umich.edu/addep, is set up similarly to the main ICPSR website.

On the “Deposit Data” page on the ADDEP website, you will find a link to a webpage that is dedicated to NIDILRR grantees. This is where ICPSR will update any information about the deposit process in the event that changes are made to the deposit form in the future. This is also where you will be able to find additional resources to assist you with preparing your data for sharing.

Slide 52 [Deposit Form]

Once you have created and logged in to your MyData account, you will be taken to the deposit form. There are a couple of features of the deposit form that are important to note.

First, you do not need to complete your deposit in one sitting. Anything you enter or upload into the form will be saved so you can log out and return to the form at a later date to complete it.

Second, you can share the deposit form with others. For example, if you have colleagues who assisted you on the project, you can share the form with them before submitting it to ICPSR. This allows your colleagues to help you complete the form or upload any data or accompanying documentation that they may have. You can also share the deposit form with an ICPSR staff member to assist you in completing the deposit. If you need any assistance at this stage, simply contact ICPSR and someone there can inform you of the appropriate staff member with whom you can share your deposit prior to submitting it.

The form will include several sections that you will need to fill out, including PI information, a summary of the study, the dates of data collection, and sampling information.

For the field that asks for the funding source, you should type in “NIDILRR” exactly as it appears on the screen inside the quotation marks. Please also include your award number in the corresponding field.

Some items that you should have prepared to help make the deposit form quicker and easier to complete include the final version of each data set generated during your project, your codebook or data dictionary that describes your variables and values, and information about the study itself, such as a summary, the purpose or goals of the study, and information about your sample. You should also include a copy of the data collection instrument, like the survey or telephone script used, and any other documents that could help secondary users understand and use your data.

If you have distinct data sets from more than one project under your award, you should start and submit a new, separate deposit form for each data set to ensure each one gets its own DOI.

Slide 53 [Deposit Agreement]

Once you have completed the deposit form by filling in all of the information asked about your study and uploading all of your data and accompanying documentation, you can proceed to submit the deposit to ICPSR.

As part of the submission process, you will be asked to agree to ICPSR’s deposit terms. This includes confirming that you have obtained all the necessary rights to share the data and make them available for the public. The terms also grant permission to ICPSR to share the data with secondary users and promote or advertise the data to the broader scientific community.

You also grant permission for ICPSR to preserve the data. ICPSR will permanently archive and continue to share all deposited files in perpetuity. There may be some exceptional circumstances under which you may request that your data no longer be shared with secondary users. For example, if a serious violation in the consent process is discovered or if a new law or policy is enacted that prevents your data from being shared, you may contact ICPSR to request that your data no longer be shared with secondary users.

Additionally, you must confirm that the data you submitted are de-identified.

You have the option to choose an embargo period, meaning the data you submit will not be released to secondary users until the date you specify on this form. As a reminder, the ACL Public Access Plan allows you up to 24 months to embargo your data after your award’s end date.
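As a small worked example of the 24-month limit, the Python sketch below computes the latest release date from an award end date. The function name and the sample date are hypothetical, and the calculation simply adds calendar months; confirm the exact date with ICPSR or NIDILRR if your release timing is close to the limit.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 24  # maximum embargo length stated in this video

def latest_release_date(award_end, months=EMBARGO_MONTHS):
    """Return the date `months` after award_end, clamped to a valid day."""
    total = award_end.month - 1 + months
    year = award_end.year + total // 12
    month = total % 12 + 1
    # Clamp the day for shorter months (for example, Feb 29 becomes Feb 28)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(award_end.day, last_day))

# Hypothetical award end date
release_by = latest_release_date(date(2024, 9, 29))
```

Any embargo date you enter on the form should fall on or before the date this calculation returns.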

Slide 54 [Data Citation]

Once your data are deposited, each data collection will receive a DOI, or digital object identifier. A DOI is a link to your data that is persistent and stable over time. Each data collection submitted to ICPSR receives its own homepage and DOI, and each file submitted as part of the data collection can also receive its own DOI upon a user’s request. Having a DOI is important so that you and other researchers can cite your data with a persistent identifier, increasing the findability of your work.

Once your data are released for the public, a secondary user who wants to work with your data must agree to a set of terms through ICPSR as well. Part of the terms to which secondary users must agree requires the secondary user to reference your data collection’s citation, which includes the DOI, in any resulting publication. Secondary users are also required to send citations of their published work that result from your data to ICPSR. ICPSR will then include these publications in the bibliography section on your data collection’s homepage, so that researchers interested in the same topic as your data can see all of the related publications in one place.

Slide 55 [Tracking Usage]

A benefit of depositing your data at ICPSR is that you are able to view some usage statistics for your data directly on your data collection’s homepage. When you visit your data collection’s homepage, you can view how many times your data collection has been viewed, how many times your data have been downloaded, and the number of related publications using your data.

Slide 56 [Recap]

This concludes this video on preparing scientific data for archiving and sharing. To recap, you should start thinking about data sharing from the conception of your project, as well as while writing your grant proposal. Information about what you will do with the data and how you will share them is important to include in your data management plan.

Review the instructions on data management plans and the data sharing requirements in the ACL Public Access Plan to make sure that you address all required elements. You should also contact your own institutional review board in the early stage of planning to make sure you understand what you are required to submit to them, to ensure you can share your data at the end of your project.

As you plan your project, review the best practices for data management that were discussed in this video. Not only are these important considerations for sharing your data with others, but they are also good practices for your own data management.

As a NIDILRR grantee, the ACL Public Access Plan requires you to share all scientific data that come from your project. ICPSR at the University of Michigan is the preferred repository to share your data. You may contact ICPSR at any time throughout your project if you need assistance, from getting help with writing your data management plan, to requesting a quote for curation services, to helping you with the online data deposit form. Contact information and many other additional resources can be found both at ICPSR’s main website and at the webpage dedicated to NIDILRR grantees available through the deposit page on ICPSR’s ADDEP website: www.icpsr.umich.edu/ADDEP.

Slide 57 [Credit Slide]

Thank you for watching!

[Slide states: “Video produced by the Inter-university Consortium for Political and Social Research (ICPSR); Narrated by Johanna Bleckman. The contents of this video were developed under a contract from the National Institute on Disability, Independent Living, and Rehabilitation Research (contract number HHSP233201860097A). NIDILRR is a Center within the Administration for Community Living (ACL), Department of Health and Human Services (HHS).”]