Understanding Identifiable Data

Identifiable data is any information, personal or indirect, that can link a participant to a research study.


Researchers working with human subjects will often hear the phrase, “remove all identifiable data” or, “protect identifiable data with reliable security measures.” Identifiable data is vulnerable because it includes information or records that allow others to identify the research participant. If unauthorized individuals gain access to identifiable data, there could be a breach of confidentiality and privacy agreements. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) protects 18 types of personal identifiers. For most human subjects research at Teachers College (TC), personal identifiers include the following:

  • Social security numbers
  • Bank account information
  • Fingerprints
  • Telephone numbers
  • Home or email addresses
  • Medical record numbers
  • Codes that link de-identified data to identifiers (when the key is not stored separately from the data)

Audio or video recordings of participants are also considered forms of personal identifiers and should be protected as such.

Data may also be considered identifiable if it is combined with enough information to potentially identify a participant. Indirect identifiers arise when a researcher does not collect personal identifiers, such as names, but collects enough information in combination that someone familiar with the participant’s background could identify them. Indirect identifiers include:

  • City or state of residence
  • Occupation or role
  • Job function or title
  • Specific time, event, context, or occasion

While any one of these identifiers alone may not be enough to identify a participant, a combination of them might. For example:

  • Demographic information and immigration status of ethnic minorities in a rural county
  • A study on workplace performance among individuals with depression recruited from a small organization
  • Graduates’ perspectives from a small high school coupled with their occupation

Data collection sources like Amazon’s MTurk are not completely anonymous, as workers’ IDs are linked and stored by Amazon.com. Researchers should clarify that they will not link any identifying information, including the workers’ IDs, to the data they obtain from MTurk. Please visit Amazon’s MTurk Privacy Policy for more information.

Directly or indirectly identifiable data is subject to the following privacy and security measures:

  • Store datasets on TC-approved systems, such as TC Google Drive
  • Transmit data with identifiers over the TC-provided Virtual Private Network (VPN)
  • Configure systems with approved anti-virus software provided by TC Information Technologies (TC IT)
  • Encrypt datasets containing identifiers (a minimal sketch follows below)

Researchers should also consult TC Information Technologies (TC IT) on ways to collect, store, transmit, and secure data with identifiers. Please review our Data Sharing, Requests, & Encryption guide for more information.
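To make the encryption measure concrete, here is a minimal sketch of encrypting a dataset at rest in R. It assumes the openssl package; the data frame identifiable.df and the environment variable DATASET_PASSPHRASE are hypothetical placeholders, and researchers should confirm the tools and key-management practices that TC IT actually requires.

library(openssl)

# Derive a 256-bit key from a passphrase held outside the script;
# never hard-code secrets in analysis files.
key <- sha256(charToRaw(Sys.getenv("DATASET_PASSPHRASE")))

# Encrypt the serialized data frame with AES-256-CBC and store only the encrypted blob.
cipher <- aes_cbc_encrypt(serialize(identifiable.df, NULL), key = key)
saveRDS(cipher, "participants.enc.rds")

# Later, with the same passphrase, recover the data frame.
plain.df <- unserialize(aes_cbc_decrypt(readRDS("participants.enc.rds"), key = key))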

Any data that does not include identifiers (personal identifiers, indirect identifiers, audio recordings, video recordings) is considered anonymous. One way to gauge whether data carries identifiers is to ask whether the researcher can determine the source of the data, either through knowledge or inference. If the collected data come from an individual whom the investigator cannot identify, even if pressed, the data collection can be considered anonymous.

There are instances when data will have identifiers but can be stripped of them and then stored, secured, maintained, and analyzed as anonymized data. For example, a researcher can transcribe and code video recordings, securely removing all participant information, and then destroy the video, leaving only the de-identified transcript as the record of the data. A researcher may also receive identifiable data through a data sharing agreement (e.g., Data Sharing Form Template) and, upon receiving the transmission, de-identify it.

In these cases, the IRB needs to know how the researcher will receive the identifiable data and how they plan to de-identify it.
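As a minimal sketch of this de-identification step in R (raw.df stands for the identifiable source data, and the column names are hypothetical placeholders for whatever direct identifiers a study actually collects):

# Drop direct-identifier columns before storing or sharing the data.
direct_ids <- c("name", "email", "phone", "medical_record_no")
deidentified.df <- raw.df[, setdiff(names(raw.df), direct_ids)]

# Keep no copy of the identifiers alongside the analytic file.
write.csv(deidentified.df, "deidentified_data.csv", row.names = FALSE)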

If a secondary research study involving human subjects does not qualify for an exemption (review our Submitting a Protocol for Existing Data guide for more information), the study must comply with the criteria for IRB approval of research at 45 CFR 46.111 (which includes the requirement to seek informed consent from every prospective subject or legally authorized representative, unless informed consent is waived by the IRB). Under the revised Common Rule, there are two options to conduct a secondary research study that involves human subjects but does not qualify for an exemption:

  • Apply for and obtain a waiver of the requirement for informed consent from the IRB
  • Seek and obtain the study-specific informed consent of each potential subject or legally authorized representative for the study in question

TC IRB will review each data type and source on a case-by-case basis and determine the review category that best fits the data collected.

— Dr. Myra Luna Lucero & Kailee Kodama Muscente

Published Monday, Jun 29, 2020


Guidance on Secondary Analysis of Existing Data Sets

The University of Connecticut Institutional Review Board (IRB) recognizes that some research projects involving existing data sets and archives may not meet the definition of “human subjects” research requiring IRB review; some may meet definitions of research that is exempt from the federal regulations at 45 CFR part 46; and some may require IRB review. This document is intended to provide guidance on IRB policies and procedures and to reduce burdens associated with IRB review for investigators whose research involves only the analysis of existing data sets and archives. The IRB acknowledges the guidance document prepared by the University of Chicago Social and Behavioral Sciences IRB as the model for this Guidance.

Although projects that only involve secondary data analysis do not involve interactions or interventions with humans, they may still require IRB review, because the definition of “human subject” at 45 CFR 46.102(f) includes living individuals about whom an investigator obtains identifiable private information for research purposes.

1. When does secondary use of existing data not require IRB review?

In general, the secondary analysis of existing data does not require IRB review when it does not fall within the regulatory definition of research involving human subjects.

A. Public Use Data Sets

Public use data sets are prepared with the intent of making them available to the public. The data available to the public are not individually identifiable, and therefore their analysis would not involve human subjects. The IRB recognizes that the analysis of de-identified, publicly available data does not constitute human subjects research as defined at 45 CFR 46.102 and that it does not require IRB review. The IRB no longer requires the registration or review of studies involving the analysis of public use data sets unless a project merges multiple data sets and in so doing enables the identification of individuals whose data is analyzed. An IRB review may be required for a research study that relies exclusively on secondary use of anonymous information but records data linkages or disseminates results in such a way that it generates identifiable information.

In addition to being identifiable, existing data must include “private information” in order to constitute research involving human subjects. Private information is defined as information which has been provided for specific purposes by an individual and which the individual can reasonably expect will not be made public (e.g., a medical or school record). For example, a study involving only analysis of the published salaries and benefits of university presidents would not need IRB review since this information is not private.

B. De-identified Data

If a dataset has been stripped of all identifying information and there is no way it could be linked back to the subjects from whom it was originally collected (through a key to a coding system or by other means), its subsequent use by the Principal Investigator or by another researcher would not constitute human subjects research, since the data is no longer identifiable. “Identifiable” means the identity of the subject is known or may be readily ascertained by the investigator or associated with the information. In general, information is considered to be identifiable when it can be linked to specific individuals by the researcher either directly or indirectly through coding systems, or when characteristics of the information obtained are such that a reasonably knowledgeable person could ascertain the identities of individuals. Even though a dataset has been stripped of direct identifiers (e.g., names, addresses, student ID numbers, etc.), it may still be possible to identify an individual through a combination of other characteristics (e.g., age, gender, ethnicity, place of employment).
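One way to probe this risk is a simple uniqueness check: count how many records share each combination of such quasi-identifiers and flag combinations that occur only once. The sketch below uses R with the dplyr package and a hypothetical data frame dataset.df containing the example characteristics named above.

library(dplyr)

# Count how many records share each combination of quasi-identifiers.
risk.df <- dataset.df %>%
  count(age, gender, ethnicity, employer)

# Combinations with n == 1 describe exactly one person and may be re-identifiable.
sum(risk.df$n == 1)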

Example: Many student research projects involve secondary analysis of data that belongs to, or was initially collected by, their faculty advisor or another investigator. If the student is provided with a de-identified, non-coded data set, the use of the data does not constitute research with human subjects because there is no interaction with any individual and no identifiable private information will be used.

Coded data: Secondary analysis of coded private information is not considered to be research involving human subjects and would not require IRB review IF the investigator(s) cannot readily ascertain the identity of the individuals to whom the coded private information pertains as a result of one of the following circumstances:

  • The investigators and the holder of the key have entered into an agreement prohibiting the release of the key to the investigators under any circumstances, until the individuals are deceased (HHS regulations for human subjects research do not require the IRB to review and approve this agreement);
  • There are IRB-approved written policies and operating procedures for a repository or data management center that prohibit the release of the key to the investigator under any circumstances, until the individuals are deceased; or
  • There are other legal requirements prohibiting the release of the key to the investigators, until the individuals are deceased.

For more information on when analysis of coded data is or is not human subjects research, see the HHS Office for Human Research Protections Guidance on Research Involving Coded Private Information or Biological Specimens at http://www.hhs.gov/ohrp/policy/cdebiol.html.

Note: If a student is analyzing coded data from a faculty advisor/sponsor who retains a key, this would be human subjects research, because the faculty advisor is considered an investigator on the student’s protocol, and can readily ascertain the identity of the subjects since he/she holds the key to the coded data. If the student’s work fits within the scope of the initial protocol from which the dataset originates, the faculty advisor (or investigator who holds the dataset) may wish to consider adding the student and his/her work to the original protocol by means of an amendment application rather than having the student submit a new application for review.

Example: Researcher B plans to examine the relationships between attention deficit hyperactivity disorder (ADHD), oppositional defiance disorder, and teen drug abuse using data collected by Agencies I, II, and III that work with “at risk” youth. The data will be coded and the agencies have entered into an agreement prohibiting release of the key to the researcher that could connect the data with identifiers. The use of the data would not constitute research with human subjects.
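The arrangement in this example can be pictured as two separate tables, sketched below in R with hypothetical toy values: the investigator receives only the coded data, while the key that maps codes to identities stays with the agencies under the agreement.

# Coded data released to the investigator: codes, but no identities.
coded.df <- data.frame(code = c("Y-001", "Y-002"),
                       adhd = c(TRUE, FALSE),
                       drug_use = c(FALSE, TRUE))

# Key held only by the agencies; under the agreement it is never released.
key.df <- data.frame(code = c("Y-001", "Y-002"),
                     name = c("(identity on file)", "(identity on file)"))

# Without key.df, the investigator cannot readily ascertain who Y-001 is.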

If the IRB determines that the project does not constitute human subjects research, the IRB will notify the investigator. If the IRB determines that the project does involve human subjects research, the investigator will be asked to submit a protocol for consideration by the IRB.

2. When is the secondary use of existing data exempt?

There are six categories of research activities involving human subjects that may be exempt from the requirements of the federal regulations on human subjects research protections (45 CFR 46.101(b)). However, only one exemption category (Category 4) applies specifically to existing data. If research is found to be exempt, it need not receive full or expedited review. In order to qualify for an exempt determination, an IRB-5 application must be submitted in InfoEd for IRB review.

Research involving collection or study of existing data, documents, and records can be exempted under Category 4 of the federal regulations if: (i) the sources of such data are publicly available; or (ii) the information is recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects.

The latter condition of this category applies in cases where the investigators initially have access to identifiable private information but abstract the data needed for the research in such a way that the information can no longer be connected to the identity of the subjects. This means that the abstracted data set does not include direct identifiers (names, social security numbers, addresses, phone numbers, etc.) or indirect identifiers (codes or pseudonyms that are linked to the subject’s identity). Furthermore, it must not be possible to identify subjects by combining a number of characteristics (e.g., date of birth, gender, position, and place of employment). This is especially relevant in smaller datasets, where the population is confined to a limited subject pool. The following do not qualify for exemption: research involving prisoners and FDA-regulated research.

Example: Student A will be given access to data from her faculty advisor’s health survey research project. The data consists of coded survey responses, and the advisor will retain a key that would link the data to identifiers. The student will extract the information she needs for her project without including any identifying information and without retaining the code. The use of the data does constitute research with human subjects because the initial data set is identifiable (albeit through a coding system); however, it would qualify for exempt status.

3. When does the secondary use of existing data require expedited or full board review?

If secondary analysis of existing data does involve research with human subjects and does not qualify for exempt status as explained above, the project must be reviewed either through expedited procedures or by the full (convened) IRB, and an IRB-1 protocol application must be submitted in InfoEd for IRB review.

Consent: Researchers using data previously collected under another study should consider whether the currently proposed research is a “compatible use” with what subjects agreed to in the original consent form. For non-exempt projects, a consent process description or justification for a waiver must be included in the research protocol.

The IRB may require that informed consent for secondary analysis be obtained from subjects whose data will be accessed.

Alternatively, the IRB can consider a request for a waiver of one or more elements of informed consent under 45 CFR 46.116(d). In order to approve such a waiver, the IRB must first be satisfied that:

  • the research presents minimal risk (no risks of harm, considering probability and magnitude, greater than those ordinarily encountered in daily life or during the performance of routine examinations or tests);
  • the waiver or alteration will not adversely affect the rights and welfare of the subjects;
  • the research could not practicably be carried out without the waiver or alteration; and
  • whenever appropriate, the subjects will be provided with additional pertinent information after participation.

“Restricted Use Data”: Certain agencies and research organizations release files to researchers with specific restrictions regarding their use and storage. These restrictions are typically described in a data use or restricted use data agreement the organization requires be signed in order to receive the data. The records frequently contain identifiers or extensive variables that combined might enable identification, even though this is not the intent of the researcher. Research using these data sets requires expedited or full board level review. Note that the data use or restricted use data agreement must be reviewed by Sponsored Programs Services (SPS) prior to institutional approval. The IRB will not approve the study until the agreement receives approval by SPS. The protocol may be submitted to the IRB at the same time the agreement is submitted to SPS.

1) Student C will be given access to coded mental health assessments from his faculty advisor’s research project. The student plans to analyze the data with a code attached to each record, and the advisor will retain a key to the code that would link the data to identifiers. The use of the data does constitute research with human subjects and does not qualify for exempt status since subjects can be identified. This student project would require an IRB-1 protocol application to be submitted in InfoEd for expedited or full board review by the IRB.

Note: As previously noted, if the student’s work fits within the scope of the initial protocol from which the dataset originates, the faculty advisor (or investigator who holds the dataset) may wish to consider adding the student and his/her work to the original protocol by means of an amendment application rather than having the student submit a new application for expedited or full board review.

2) Student D is applying to the National Center for Health Statistics for use of data from the National Health and Nutrition Examination Survey that includes geographic identifiers and date of examination. The analysis of this restricted use data would require an IRB-1 protocol application to be submitted in InfoEd for expedited or full board review by the IRB.



  • Data Descriptor
  • Open access
  • Published: 13 July 2020

A dataset describing data discovery and reuse practices in research

  • Kathleen Gregory, ORCID: orcid.org/0000-0001-5475-8632

Scientific Data volume 7, Article number: 232 (2020)


Subjects: Research data, Social sciences

Abstract

This paper presents a dataset produced from the largest known survey examining how researchers and support professionals discover, make sense of and reuse secondary research data. A total of 1677 respondents in 105 countries, representing a variety of disciplinary domains, professional roles and stages in their academic careers, completed the survey. The results capture these respondents’ data needs, the sources and strategies they use to locate data, and the criteria they employ when evaluating data. The data detailed in this paper have the potential to be reused to inform the development of data discovery systems, data repositories, training activities and policies for a variety of general and specific user communities.

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.12445034


Background & Summary

Reusing data created by others, so-called secondary data 1, holds great promise in research 2. This is reflected in the creation of policies 3, platforms (e.g. the European Open Science Cloud 4), metadata schemas (e.g. the DataCite schema 5) and search tools, e.g. Google Dataset Search ( https://datasetsearch.research.google.com/ ) or DataSearch ( https://datasearch.elsevier.com ), to facilitate the discovery and reuse of data. Despite the emergence of these systems and tools, not much is known about how users interact with data in search scenarios 6 or the particulars of how such data are used in research 7.

This paper describes a dataset first analysed in the article Lost or Found? Discovering data needed for research 8. The dataset includes quantitative and qualitative responses from a global survey, with 1677 complete responses, designed to learn more about data needs, data discovery behaviours, and the criteria and strategies important in evaluating data for reuse. This survey was conducted as part of a project investigating contextual data search undertaken by two universities, a data archive, and an academic publisher, Elsevier. The involvement of Elsevier enabled the recruitment strategy, namely drawing the survey sample from academic authors who have published an article in the past three years that is indexed in the Scopus literature database ( https://www.scopus.com/ ). This recruitment strategy helped to ensure that the sample consisted of individuals active in research, across disciplines and geographic locations.

The data themselves are presented in two data files, according to the professional role of respondents. The dataset as a whole consists of these two data files, one for researchers (with 165 variables) and one for research support professionals (with 167 variables), the survey questionnaire and detailed descriptions of the data variables 9 . The survey questionnaire contains universal questions which could be applicable to similar studies; publishing the questionnaire along with the data files not only facilitates understanding the data, but it also fosters possible harmonization with other survey-based studies.

The dataset has the potential to answer future research questions, some of which are outlined in the usage notes of this paper, and to be applied at a practical level. Designers of both general and specific data repositories and data discovery systems could use this dataset as a starting point to develop and enhance search and sensemaking interfaces. Data metrics could be informed by information about evaluation criteria and data uses present in the dataset, and educators and research support professionals could build on the dataset to design training activities.

The description below of the methods used to design the questionnaire and to collect the data, as well as the description of potential biases in the technical validation section, all build on those presented in the author’s previous work 8 .

Questionnaire design

The author’s past empirical work investigating data search practices 10 , 11 (see also Fig.  1 ), combined with established models of interactive information retrieval 12 , 13 , 14 , 15 and information seeking 16 , 17 and other studies of data practices 18 , 19 , were used to design questions examining the categories identified in Table  1 . Specifically, questions explored respondents’ data needs, their data discovery practices, and their methods for evaluating and making sense of secondary data.

Figure 1: Creation of dataset in relation to prior empirical work by the author. Bolded rectangles indicate steps with associated publications, resulting from an analytical literature review 10, semi-structured interviews 11 and an analysis of the survey data 8.

The questionnaire used a branching design, consisting of a maximum of 28 primarily multiple choice items (Table 1). The final question of the survey, which provided space for respondents to provide additional comments in an open text field, is not represented in Table 1. The individual items were constructed in accordance with best practices in questionnaire design, with special attention given to conventions for wording questions and the construction of Likert scale questions 20, 21. Nine of the multiple choice questions were constructed to allow multiple responses. There was a maximum of three optional open response questions. The majority of multiple choice questions also included the possibility for participants to write in an “other” response.

The first branch in the questionnaire design was based on respondents’ professional role. Respondents selecting “librarians, archivists or research/data support providers,” a group referred to here as research support professionals, answered a slightly different version of the questionnaire. The items in this version of the questionnaire were worded to reflect possible differences in roles, i.e. whether respondents seek data for their own use or to support other individuals. Four additional questions were asked of research support professionals in order to further probe their professional responsibilities; four questions were also removed from this version of the questionnaire. This was done in order to maintain a reasonable completion time for the survey and because the removed questions were deemed to be more pertinent to respondents with other professional roles, i.e. researchers. The questionnaire is available in its entirety with the rest of the dataset 9.

Sampling, recruitment and administration

Individuals involved in research, across disciplines, who seek and reuse secondary data comprised the population of interest. This is a challenging population to target, as it is difficult to trace instances of data reuse, particularly given the fact that data citation, and other forms of indexing, are still in their infancy 22 . The data reuse practices of individuals in certain disciplines have been better studied than others 23 , in part because of the existence of established data repositories within these disciplines 24 . In order to recruit individuals active in research across many disciplinary domains, a broad recruitment strategy was adopted.

Recruitment emails were sent to a random sample of 150,000 authors who are indexed in Elsevier’s Scopus database and who have published in the past three years. The recruitment sample was created to reflect the distribution of published authors by country within Scopus. Two batches of recruitment emails were sent: one of 100,000 and the other of 50,000. One reminder email was sent two weeks after the initial email. A member of the Elsevier Research and Academic Relations team created the sample and sent the recruitment letter, as access to the email addresses was not available to the investigator due to privacy regulations. The questionnaire was scripted and administered using the Confirmit software ( https://www.confirmit.com/ ).

1637 complete responses were received during a four-week survey period between September and October 2018 using this methodology. Only seven of the 1637 responses came from research support professionals. In a second round of recruitment in October 2018, messages were posted to discussion lists in research data management and library science to further recruit support professionals. Individuals active in these lists spontaneously posted notices about the survey on their own Twitter feeds. These methods resulted in an additional 40 responses, yielding a total of 1677 complete responses.

Ethical review and informed consent

This study was approved by the Ethical Review Committee Inner City faculties (ERCIC) at Maastricht University, Netherlands, on 17 May 2018 under the protocol number ERCIC_078_01_05_2018.

Prior to beginning the study, participants had the opportunity to review the informed consent form. They indicated their consent by clicking on the button to proceed to the first page of survey questions. Respondents were informed about the purpose of the study, its funding sources, the types of questions which would be asked, how the survey data would be managed and any foreseen risks of participation.

Specifically, respondents were shown the text below, which also states that the data would be made available in the DANS-EASY data repository ( https://easy.dans.knaw.nl ), which is further described in the Data Records section of this paper.

Your responses will be recorded anonymously, although the survey asks optional questions about demographic data which could potentially be used to identify respondents. The data will be pseudonymized (e.g. grouping participants within broad age groups rather than giving specific ages) in order to prevent identification of participants. The results from the survey may be compiled into presentations, reports and publications. The anonymized data will be made publicly available in the DANS-EASY data repository.

Respondents were also notified that participation was voluntary, and that withdrawal from the survey was possible at any time. They were further provided with the name and contact information of the primary investigator.

Data Records

Preparation of data files.

The data were downloaded from the survey administration system as csv files by the employee from Elsevier and were sent to the author. The downloads were performed in two batches: the 1637 responses received before the additional recruiting of research support professionals, and the 40 responses received after this second stage of recruitment. The seven responses from research support professionals from the first round of recruitment were extracted and added to the csv file from the second batch. This produced separate files for research support professionals and the remainder of respondents, who are referred to as researchers in this description. This terminology is appropriate as the first recruitment strategy ensured that respondents were published academic authors, making it likely that they had been involved in conducting research at some point in the past three years.

The following formatting changes were made to the data files in order to enhance understandability for future data reusers. All changes were made using the analysis program R 25 .

Open responses were checked for any personally identifiable information, particularly email addresses. This was done by searching for symbols and domains commonly used in email addresses (e.g. “@”, “.com” and “.edu”). Two email addresses were identified in the final question recording additional comments about the survey. In consultation with an expert at the DANS-EASY data repository, all responses from this final question were removed from both data files as a precautionary measure.
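A sketch of this screening step in R; need_open is one of the open-response variables in the released files, and the pattern mirrors the symbols and domains named above.

# Flag open responses containing symbols or domains common in email addresses.
open_responses <- researcher.df$need_open
flagged <- grepl("@|\\.com|\\.edu", open_responses, ignore.case = TRUE)

# Review the flagged responses manually before release.
open_responses[flagged]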

Variables representing questions asked only to research support professionals were removed from datadiscovery_researchers.csv. Variables representing questions asked only to researchers were removed from datadiscovery_supportprof.csv.

Variables were renamed using mnemonic names to facilitate understanding and analysis. Variable names for questions asked to both research support professionals and researchers have the same name in both data files.

Variables were re-ordered to match the order of the questions presented in the questionnaire. Demographic variables, including role, were grouped together at the end of the data files.

Multiple choice options which were not chosen by respondents were recorded by the survey system as zeros. If a respondent was not asked a question, this is coded as “Not asked.” If a respondent wrote “NA” or a similar phrase in the open response questions, this was left unchanged to reflect the respondent’s engagement with the survey. If a respondent did not complete an optional open response question, this was recorded as a space, which appears as an empty cell. In the analysis program R, e.g., this empty space is represented as “ ”.
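For reuse, it may be convenient to map these sentinel values onto R’s missing-value marker. A minimal sketch, assuming the data frames loaded as in Box 1 and that the analysis treats skipped and not-asked items as missing:

# Convert the survey's sentinel values to NA.
researcher.df[researcher.df == " "] <- NA          # skipped optional open responses
researcher.df[researcher.df == "Not asked"] <- NA  # questions hidden by the branching logic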

Description of data and documentation files

The dataset described here consists of one text readme file, four csv files, and one pdf file with the survey questionnaire. These files should be used in conjunction with each other in order to appropriately use the data. Table  2 provides a summary and description of the files included in the dataset.

Descriptions of the variable names are provided in two files (Table  2 ). Variables were named following a scheme that matches the structure of the questionnaire; each variable name begins with a mnemonic code representing the related research aim. The primary codes are summarised in Table  3 . The values of the variables for multiple choice items are represented as either a “0” for non-selected options, as described above, or with a textual string representing the selected option.

The dataset is available at the DANS-EASY data repository 9 . DANS-EASY is a principal component of the federated national data infrastructure of the Netherlands 26 and is operated by the Data Archive and Networked Services (DANS), an institution of the Royal Netherlands Academy for Arts and Sciences and the Dutch Research Council. DANS-EASY has a strong history of providing secure long-term storage and access to data in the social sciences 27 . The repository has been awarded a CoreTrustSeal certification for data repositories ( https://www.coretrustseal.org/ ), which assesses the trustworthiness of repositories according to sixteen requirements. These requirements focus on organisational infrastructure (e.g. licences, continuity of access and sustainability), digital object management (e.g. integrity, authenticity, preservation, and re-use) and technology (e.g. technical infrastructure and security).

Sample characteristics

Respondents identified their disciplinary domains of specialization from a list of 31 possible domains developed after the list used by Berghmans et al. 28. Participants could select multiple responses for this question. The domain selected most often was engineering and technology, followed by the biological, environmental and social sciences (Fig. 2a). Approximately half of the respondents selected two or more domains, with one quarter selecting more than three.

Figure 2: (a) Disciplinary domains selected by respondents; multiple responses possible (n = 3431). (b) Respondents’ years of professional experience; percentages denote percent of respondents (n = 1677). (c) Number of respondents by country of employment (n = 1677).

Forty percent of respondents have been professionally active for 6–15 years (Fig.  2b ). The majority identified as being researchers (82%) and are employed at universities (69%) or research institutions (17%). Respondents work in 105 countries; the most represented countries include the United States, Italy, Brazil and the United Kingdom (Fig.  2c ).

Technical Validation

Several measures were taken to ensure the validity of the data, both before and after data collection. Sources of uncertainty and potential bias in the data are also outlined below in order to facilitate understanding and data reuse.

Questionnaire development

The questionnaire items were developed after extensively reviewing relevant literature 10, 29, 30, 31, 32 and conducting semi-structured interviews to test the validity of our guiding constructs. To test the validity and usability of the questionnaire itself, a two-phase pilot study was conducted. In the first phase, four researchers, recruited using convenience sampling, were observed as they completed the online survey. During these observations, the researchers “thought out loud” as they completed the survey; they were encouraged to ask questions and to make remarks about the clarity of wording and the structure of the survey. Based on these comments, the wording of questions was fine-tuned and additional options were added to two multiple choice items.

In the second pilot phase, an initial sample of 10,000 participants was recruited, using the primary recruitment methodology detailed in the methods section of this paper. After 102 participants interacted with the survey, the overall completion rate (41%) was measured and points where individuals stopped completing the survey were noted. Based on this information, option-intensive demographic questions (i.e. country of employment, discipline of specialization) were moved to the final section of the survey in order to minimize survey fatigue. The number of open-ended questions was also reduced, and open-response questions were made optional.

The online presentation of the survey questions also helped to counter survey fatigue. Only one question was displayed at a time; the branching logic of the survey ensured that respondents were only shown the questions which were relevant to them, based on their previous answers.

Questionnaire completion

1677 complete responses to the survey questionnaire were received. Using the total number of recruitment emails in the denominator, this yields a response rate of 1.1%. Taking into account the number of non-delivery reports which were received (29,913), the number of invalid emails which were reported (81) and the number of recruited participants who elected to opt-out of the survey (448) yields a slightly higher response rate of 1.4%. It is likely that not all of the 150,000 individuals who received recruitment emails match our targeted population of data seekers and reusers. Knowledge about the individuals who did not respond to the survey and about the frequency of data discovery and reuse within research as a whole, is limited; this complicates the calculation of a more accurate response rate, such as the methodology described in 33 .
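The response-rate arithmetic above can be reproduced directly from the reported counts:

# Raw response rate over all recruitment emails.
complete <- 1677
invited <- 150000
complete / invited  # ~0.011, i.e. 1.1%

# Adjusted rate excluding non-deliveries, invalid emails and opt-outs.
complete / (invited - 29913 - 81 - 448)  # ~0.014, i.e. 1.4%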

A total of 2,306 individuals clicked on the survey link, but did not complete it, yielding a completion rate of 42%. Of the non-complete responses, fifty percent stopped responding after viewing the introduction page with the informed consent statement. This point of disengagement could be due to a variety of reasons, including a lack of interest in the content of the survey or a disagreement with the information in the consent form. The majority of individuals who did not complete the survey stopped responding within the first section of the survey (75% of non-complete responses). Only data from complete responses are included in this dataset.

Of the 1677 complete responses, there was a high level of engagement with the optional open response questions. Seventy-eight percent of all respondents answered Q2 regarding their data needs; 92% of respondents who were asked Q5a provided an answer; and 69% of respondents shown Q10a described how their processes for finding academic literature and data differ.

Data quality and completeness

Checks for missing values and NAs were performed using standard checks in R. The coding conventions are as detailed above in the section on the preparation of data files: non-selected multiple choice options are recorded as zeros, questions not shown are coded as “Not asked,” written “NA” responses are left unchanged, and skipped optional open response questions appear as an empty cell (a space).

Due to the limited available information about non-responders to the survey and about the frequency of data seeking and discovery behaviours across domains in general, the data as they stand are representative only of the behaviours of our nearly 1700 respondents: a group of data-aware people already active in data sharing and reuse and confident in their ability to respond to an English-language survey. Surveys in general tend to attract a more active, communicative part of the targeted population and do not cover non-users at all 34. While not generalizable to broader populations, the data could be transferable 35, 36 to similar situations or communities. Creating subsets of the data, i.e. by discipline, may provide insights that can be applied to particular disciplinary communities.

There are potential sources of bias in the data. The recruited sample was drawn to mirror the distribution of published authors by country in Scopus; the geographic distribution of respondents does not match that of the recruited sample (Table  4 ). This is especially noticeable for Chinese participants, who comprised 15% of the recruited sample, but only 4% of respondents. This difference could be due to a number of factors, including language differences, perceived power differences 37 , or the possibility that data seeking is not a common practice.

Our respondents were primarily drawn from the pool of published authors in the Scopus database. Some disciplinary domains are under-represented within Scopus, most notably the arts and humanities 38, 39. Subject indexing within Scopus occurs at the journal or source level. As of January 2020, 30.4% of titles in Scopus are from the health sciences; 15.4% from the life sciences; 28% from the physical sciences and 26.2% from the social sciences 40. Scopus has an extensive and well-defined review process for journal inclusion; 10% of the approximately 25,000 sources indexed in Scopus are published by Elsevier 40.

Self-reported responses also tend to be pro-attitudinal, influenced by a respondent’s desire to provide a socially acceptable answer. Survey responses can also be influenced by the question presentation, wording and multiple choice options provided. The pilot studies and the provision of write-in options for individual items helped to mitigate this source of error.

Usage Notes

Notes for data analysis.

It is key to note which questions were designed to allow for multiple responses. This will impact the type of analysis which can be performed and the interpretation of the data. These nine questions are marked with an asterisk in Table  1 ; the names of the variables related to these questions are summarized in Table  5 .

The data are available in standard csv formats and may be imported into a variety of analysis programs, including R and Python. The data are well-suited in their current form to be treated as factors or categories in these programs, with the exception of open response questions and the write-in responses to the “other” selection options, which should be treated as character strings. An example of the code needed to load the data into R and Python, as well as how to change the open and other response variables to character strings, is provided in the section on code availability. To further demonstrate potential analysis approaches, the code used to create Fig. 2a in R is also provided.

Certain analysis programs, i.e. SPSS, may require that the data be represented numerically; responses in the data files are currently represented in textual strings. The survey questionnaire, which is available with the data files, contains numerical codes for each response which may be useful in assigning codes for these variables.
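A hedged sketch of that recoding in R; the variable name source_freq and the level order below are hypothetical placeholders, and the actual numerical codes should be taken from the questionnaire pdf:

# Map a textual Likert-style variable to integer codes for SPSS-style analysis.
likert_levels <- c("Never", "Rarely", "Sometimes", "Often", "Always")  # assumed order
researcher.df$source_freq_num <- as.integer(factor(researcher.df$source_freq,
                                                   levels = likert_levels))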

Future users may wish to integrate the two data files to examine the data from all survey respondents together. This can easily be done by creating subsets of the variables of interest from each data file (e.g. by using the subset and select commands in R) and combining the data into a single data frame (e.g. using the rbind command in R). Variables that are common to both data files have the same name, facilitating this type of integration. An example of the code needed to do this is provided in the code for creating Fig. 2a.

Open and write-in responses are included in the same data file with the quantitative data. These variables can be removed and analysed separately, if desired.

To ease computational processing, the data do not include embedded information about the question number or the detailed meaning of each variable name. This information is found in the separate variable_labels csv file associated with each data file.

Potential questions and applications

The data have the potential to answer many interesting questions, including those identified below.

  • How do the identified practices vary by demographic variables? The data could be subset to examine practices along the lines of (see the sketch after this list):
      • Country of employment
      • Career stage, e.g. early career researchers
      • Disciplinary domain
  • What correlations exist among the different variables, particularly the variables allowing for multiple responses? Such questions could examine:
      • Possible correlations between the frequency of use of particular sources and the type of data needed or uses of data
      • Possible correlations between particular challenges for data discovery and needed data or data use
  • How representative are these data of the behaviours of broader populations?
  • How will these behaviours change as new technologies are developed? The data could serve as a baseline for comparison for future studies.
  • How do practices within a particular domain relate to the existence of data repositories and infrastructures within a domain? Given the practices identified in this survey, how can repositories and infrastructures better support data seekers and reusers?
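A minimal sketch of the subsetting idea from the first question above; the column name disc_socsci and the experience variable and value label are hypothetical stand-ins for the actual variables documented in the variable_labels files:

# Respondents who selected the social sciences domain (non-selected options are "0").
socsci.df <- subset(researcher.df, disc_socsci != "0")

# Early career researchers, assuming a hypothetical experience variable and value label.
early.df <- subset(researcher.df, experience == "0-5 years")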

Box 1. R code for loading data and changing selected columns to character strings.

# Set the working directory; the data files should be in the working directory.
setwd("~/Desktop/survey/")

# Import the data files as data frames and store as "researcher.df" and "support.df".
# If you don't want to use factors, set stringsAsFactors = FALSE.
researcher.df <- read.csv(file = "datadiscovery_researchers.csv", header = TRUE, stringsAsFactors = TRUE)
support.df <- read.csv(file = "datadiscovery_supportprof.csv", header = TRUE, stringsAsFactors = TRUE)

# Select columns to be treated as character strings.
cols.res <- c("need_open", "need_othresp", "use_othresp", "find_whoothresp", "source_open", "strategy_othresp", "find_litdatopen", "find_chalothresp", "eval_infopen", "eval_stratopen", "eval_trstopen", "eval_qualopen", "disc_othresp")
cols.sup <- c("whosupprt_othresp", "supprt_othresp", "need_open", "need_othresp", "use_othresp", "source_open", "strategy_othresp", "findaccseval_oth", "find_litdatopen", "eval_infopen", "eval_spprtdopen", "eval_stratopen", "eval_trstopen", "eval_qualopen", "disc_othresp")

# Change these columns from factors to characters.
researcher.df[cols.res] <- lapply(researcher.df[cols.res], as.character)
support.df[cols.sup] <- lapply(support.df[cols.sup], as.character)

# Inspect the structure to confirm the changes.
str(researcher.df)
str(support.df)

Box 2. Python code for loading data and changing selected columns to character strings.

# Import required libraries.
import pandas as pd

# Read in the csv as a pandas dataframe. Pandas will infer data types, but we
# explicitly set all columns to "category" initially and then change the
# string columns later.
df = pd.read_csv('./datadiscovery_researchers.csv', index_col='responseid', dtype='category')

# Create a list of the columns which are not categories but should be treated as strings.
str_cols = ['need_open',
            'need_othresp',
            'use_othresp',
            'find_whoothresp',
            'source_open',
            'strategy_othresp',
            'find_litdatopen',
            'find_chalothresp',
            'eval_infopen',
            'eval_stratopen',
            'eval_trstopen',
            'eval_qualopen',
            'disc_othresp']

# Change data type for columns to be treated as strings.
df[str_cols] = df[str_cols].astype(str)

# The remaining columns stay as categories.
cat_cols = [c for c in df.columns if c not in str_cols]

# Print data types to confirm.
print(df[cat_cols].dtypes)
print(df[str_cols].dtypes)

Box 3. R code for creating Fig.  2a .

# Install packages and load libraries for the plot.
install.packages("ggplot2")
install.packages("reshape2")
install.packages("dplyr")
library(ggplot2)
library(reshape2)
library(dplyr)

# Select and combine variables from both data files to use in the plot.
researcherdisc.df <- subset(researcher.df, select = c(responseid, disc_agricul:disc_other))
supportdisc.df <- subset(support.df, select = c(responseid, disc_agricul:disc_other))
disc.df <- rbind(researcherdisc.df, supportdisc.df)

# Transform data from wide to long.
disclong.df <- disc.df %>% melt(id.vars = c("responseid"), value.name = "discipline")

# Create a data frame with frequencies.
discfreq.df <- disclong.df %>%
  filter(discipline != "0") %>%
  select(responseid, discipline) %>%
  count(discipline)

# Create plot of frequencies.
discplot <- ggplot(discfreq.df, aes(x = reorder(discipline, n), y = n)) +
  geom_bar(stat = "identity", fill = "#238A8DFF") +
  coord_flip()

# Format plot and add labels.
discplot <- discplot +
  theme(plot.title = element_text(hjust = 0),
        axis.ticks.y = element_blank(),
        axis.text = element_text(size = 15),
        text = element_text(size = 15),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank()) +
  ylab("Frequency") +
  xlab("Disciplinary domain")

Code availability

All R scripts used in data preparation and technical validation, along with the un-prepared data, are available upon request from the corresponding author. Examples of how to load the data and how to change factor/category columns to character columns in R (Box  1 ) and Python (Box  2 ) are provided. Additionally, the code used to create Fig.  2a in R (Box  3 ) is listed as an example of how to combine data from both data files into a single plot.

References

Allen, M. Secondary data. In The SAGE Encyclopedia of Communication Research Methods, Vols. 1-4 (ed. Allen, M.) (SAGE Publications, Inc., 2017).

Wilkinson, M. D. et al . The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3 , 160018 (2016).


European Commission. Facts and figures for open research data. European Commission website https://ec.europa.eu/info/research-and-innovation/strategy/goals-research-and-innovation-policy/open-science/open-science-monitor/facts-and-figures-open-research-data_en (2019).

European Commission. EOSC declaration: European Open Science Cloud: new research & innovation opportunities. European Commission website , https://ec.europa.eu/research/openscience/pdf/eosc_declaration.pdf#view=fit&pagemode=none (2017).

DataCite Metadata Working Group. DataCite metadata schema documentation for the publication and citation of research data, version 4.3. DataCite website https://doi.org/10.14454/7xq3-zf69 (2019).

Noy, N., Burgess, M. & Brickley, D. Google Dataset Search: building a search engine for datasets in an open Web ecosystem. In The World Wide Web Conference (ACM Press, 2019).

Pasquetto, I. V., Randles, B. M. & Borgman, C. L. On the reuse of scientific data. Data Sci J. 16 , 1–9 (2017).

Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. Lost or found? Discovering data needed for research. Harvard Data Science Review 2 (2020).

Gregory, K. M. Data Discovery and Reuse Practices in Research. Data Archiving and Networked Services (DANS) https://doi.org/10.17026/dans-xsw-kkeq (2020).

Gregory, K., Groth, P., Cousijn, H., Scharnhorst, A. & Wyatt, S. Searching data: a review of observational data retrieval practices in selected disciplines. J. Assoc. Inf. Sci. Technol. 70 , 419–432 (2019).


Gregory, K. M., Cousijn, H., Groth, P., Scharnhorst, A. & Wyatt, S. Understanding data search as a socio-technical practice. J. Inf. Sci . 0165551519837182 (2019).

Ingwersen, P. Information retrieval interaction . (Taylor Graham, 1992).

Ingwersen, P. Cognitive perspectives of information retrieval interaction: elements of a cognitive IR theory. J. Doc. 52 , 3–50 (1996).

Belkin, N. J. Interaction with texts: information retrieval as information-seeking behavior. In Information Retrieval ’93: Von der Modellierung zur Anwendung (eds. Knorz, G., Krause, J. & Womser-Hacker, C.) (Universitaetsverlag Konstanz, 1993).

Belkin, N. J. Intelligent information retrieval: whose intelligence? In ISI ’96: Proceedings of the Fifth International Symposium for Information Science (eds. Krause, J., Herfurth, M. & Marx, J.) (Universitaetsverlag Konstanz, 1996).

Blandford, A. & Attfield, S. Interacting with information: synthesis lectures on human-centered informatics (Morgan & Claypool, 2010).

Adams, A. & Blandford, A. Digital libraries’ support for the user’s ‘information journey’. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (ACM Press, 2005).

Borgman, C. L. Big data, little data, no data: Scholarship in the networked world . (MIT press, 2015).

Faniel, I. M. & Yakel, E. In Curating Research Data, Volume 1: Practical Strategies for Your Digital Repository (ed. Johnson, L.) Ch. 4 (Association of College & Research Libraries, 2017).

de Vaus, D. Surveys In Social Research . (Routledge, 2013).

Robson, C. & McCartan, K. Real World Research . (John Wiley & Sons, 2016).

Park, H., You, S. & Wolfram, D. Informal data citation for data sharing and reuse is more common than formal data citation in biomedical fields. J. Assoc. Inf. Sci. Technol. 69 , 1346–1354 (2018).

Borgman, C. L., Wofford, M. F., Darch, P. T. & Scroggins, M. J. Collaborative ethnography at scale: reflections on 20 years of data integration. Preprint at, https://escholarship.org/content/qt5bb8b1tn/qt5bb8b1tn.pdf (2020).

Leonelli, S. Integrating data to acquire new knowledge: three modes of integration in plant science. Stud. Hist. Philos. Sci. C 44 , 503–514 (2013).


R Core Team. R: A language and environment for statistical computing. R-project website , https://www.r-project.org (2017).

Dillo, I. & Doorn, P. The front office–back office model: supporting research data management in the Netherlands. Int. J. Digit. Curation 9 , 39–46 (2014).

Doorn, P. K. Archiving and managing research data: data services to the domains of the humanities and social sciences and beyond: DANS in the Netherlands. Archivar 73 , 44–50 (2020).

Berghmans, S. et al . Open data: the researcher perspective. Elsevier website , https://www.elsevier.com/about/open-science/research-data/open-data-report (2017).

Kim, Y. & Yoon, A. Scientists’ data reuse behaviors: a multilevel analysis. J. Assoc. Inf. Sci. Technol. 68 , 2709–2719 (2017).

Kratz, J. E. & Strasser, C. Making data count. Sci. Data 2 , 150039 (2015).

Schmidt, B., Gemeinholzer, B. & Treloar, A. Open data in global environmental research: the Belmont Forum’s open data survey. PLoS ONE 11 , e0146695 (2016).

Tenopir, C. et al . Changes in data sharing and data reuse practices and perceptions among scientists worldwide. PLoS ONE 10 , e0134826 (2015).

American Association for Public Opinion Research. Standard Definitions: Final Dispositions of Case Codes and Outcome Rates for Surveys . (American Association for Public Opinion Research, 2016).

Wyatt, S. M. In How Users Matter: The Co-Construction of Users and Technology (eds. Oudshoorn, N. & Pinch, T.) Ch. 3 (MIT Press, 2003).

Lincoln, Y. & Guba, E. Naturalistic inquiry . (SAGE Publications, 1985).

Firestone, W. A. Alternative arguments for generalizing from data as applied to qualitative research. Educ. Res. 22 (4), 16–23 (1993).

Harzing, A.-W. Response styles in cross-national survey research: a 26-country study. Int. J. Cross Cult. Manag. 6 , 243–266 (2006).


Mongeon, P. & Paul-Hus, A. The journal coverage of Web of Science and Scopus: a comparative analysis. Scientometrics 106 , 213–228 (2016).

Vera-Baceta, M.-A., Thelwall, M. & Kousha, K. Web of Science and Scopus language coverage. Scientometrics 121 , 1803–1813 (2019).

Elsevier. Scopus content coverage guide. Elsevier website https://www.elsevier.com/__data/assets/pdf_file/0007/69451/Scopus_ContentCoverage_Guide_WEB.pdf (2020).

Download references

Acknowledgements

Paul Groth, Andrea Scharnhorst and Sally Wyatt provided valuable feedback and comments on this paper. I also wish to acknowledge Ricardo Moreira for his assistance in creating the sample and distributing the survey, Wouter Haak for his organizational support, Helena Cousijn for her advice in designing the survey, and Emilie Kraaikamp for her advice regarding personally identifiable information. This work is part of the project Re-SEARCH: Contextual Search for Research Data and was funded by the NWO Grant 652.001.002.

Author information

Authors and Affiliations

Data Archiving and Networked Services, Royal Netherlands Academy of Arts & Sciences, Anna van Saksenlaan 51, 2593 HW, Den Haag, The Netherlands

Kathleen Gregory


Corresponding author

Correspondence to Kathleen Gregory.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.


About this article

Cite this article

Gregory, K. A dataset describing data discovery and reuse practices in research. Sci Data 7, 232 (2020). https://doi.org/10.1038/s41597-020-0569-5


Received: 06 April 2020

Accepted: 12 June 2020

Published: 13 July 2020

DOI: https://doi.org/10.1038/s41597-020-0569-5






National University Library

Research Process


SAGE Research Methods: Datasets


Content: Practical guides to data analysis, comprising peer-reviewed datasets and tools to manage data.

Purpose: Use to learn and practice data analysis including cleaning and normalizing data. 

A dataset (also spelled ‘data set’) is a collection of raw statistics and information generated by a research study. Datasets produced by government agencies or non-profit organizations can usually be downloaded free of charge. However, datasets developed by for-profit companies may be available for a fee.

Most datasets can be located by identifying the agency or organization that focuses on a specific research area of interest. For example, if you are interested in learning about public opinion on social issues, Pew Research Center would be a good place to look. For data about population, the U.S. government’s Population Estimates Program from American FactFinder would be a good source.

An “open data” philosophy is becoming more common among governments and business organizations around the world, with the belief that data should be freely accessible. Open data efforts have been led by both government and non-government organizations such as the Open Knowledge Foundation. Learn more by exploring The Open Data Handbook. There is also a growing trend toward “Big Data,” where extremely large amounts of data are analyzed for new and interesting perspectives, and data visualization, which is helping to drive the availability and accessibility of datasets and statistics.

Don't know where to begin? The subject-specific lists below offer recommended starting points.

For additional information about locating statistics, please see our Statistics page.

  • The Evolution of Big Data, and Where We’re Headed
  • Data Visualization and Infographics

Subject Specific and Additional Dataset Resources

  • Computer Science
  • Public Opinion/Surveys
  • Social Sciences
  • Social Media or Community Driven Datasets
  • Additional Dataset Resources
  • Large Datasets
  • Searchable Sites
  • Datasets for Learning Purposes
  • Tools for Data Analysis
  • Damodaran Online: Corporate Finance and Valuation NYU, Stern School of Business, Dr. Aswath Damodaran
  • IMF DataMapper
  • IMF Fiscal Rules Dataset (1985-2013)
  • International Monetary Fund Data & Statistics
  • National Longitudinal Surveys Bureau of Labor Statistics
  • Organization for Economic Co-Operation and Development Data
  • Quandl “Time-series” numerical only data for economics, finance, markets & energy; Features step-by-step wizard for finding and compiling data.
  • Statistical Abstract of the United States (2012): Banking, Finance, & Insurance
  • Statistical Abstract of the United States (2012): Business Enterprise
  • Surveys of Consumers Thomson Reuters & University of Michigan
  • U. S. Bureau of Economic Data
  • Mergent Online Financial records, country and industry reports. Searchable by company name, country, number of employees and more. Up to 15 years of historical data. Also provides news articles on recent mergers and acquisitions, as well as industry and country reports.
  • ACM A research, discovery and network platform. The database provides journals, conference proceedings, technical magazines, newsletters and books. Provides a list of authors after an initial topic search, includes a dataset search filter, and the ability to sort results by most cited.
  • IEEE Full-text peer-reviewed journals, transactions, magazines, conference proceedings, and published standards in the areas of electrical engineering, computer science, and electronics. Access to the IEEE Standards Dictionary Online. Useful to learn about current technology industry trends. Along with ACM database, this database has a function that allows searching for datasets.
  • Barro-Lee Dataset Datasets available for download from their article: Barro, R., & Lee, J. (n.d.). A new dataset of educational attainment in the world, 1950-2010. Journal of Development Economics, 104, 184-198.
  • Child care and Early Education Research Connections
  • Datasets from NCES
  • Education Data.gov
  • Higher Education General Information Survey (HEGIS) Series
  • Integrated Postsecondary Education Data System (IPEDS)
  • National Center for Education Statistics (NCES)
  • Statistical Abstract of the United States (2012): Education
  • U.K. Department of Education Datasets
  • American Psychological Association Links to datasets and Repositories
  • Children Born to Unwed Parents between 1998-2000 Princeton
  • Childstats.gov Forum on Child and Family Statistics
  • Gender & Achievement Research Program
  • The Kinsey Institute Data Archives
  • National Archive of Criminal Justice Data
  • National Data Archive on Child Abuse and Neglect
  • National Longitudinal Study of Adolescent Health Add Health
  • Neuroscience Information Framework (NIF) Data Federation
  • Substance Abuse and Mental Health Data Archive (SAMHA)
  • Gallup.com Global datasets on what people from all over the world think about important social issues, as well as financial behavior and literacy.
  • General Social Survey (GSS) A social trends survey conducted on American society and compared to international trends. The core survey has remained largely unchanged since 1972. Datasets are in SPSS and STATA formats, with additional options available.
  • International Social Survey Programme (ISSP) Affiliated with the GSS, this survey has been conducted since 1980.
  • The Latin American Databank Provides a portal for Latin American datasets acquired, processed and archives by the Roper Center for Public Opinion Research. Data can be browsed by country or decade. Keyword search options are also available.
  • Pew Research Center Datasets available to download for many of the Center’s main projects. Free registration is required to download.
  • Roper Center Public Opinion Archives Over 20,000 datasets available from 1935 to present. Users can also set up an RSS feed for updates.
  • World Values Survey Datasets available to download for surveys dating back to 1981 in SPSS, SAS and STATA formats.
  • Consortium of European Social Science Data Archives (CESSDA)
  • Gapminder A non-profit organization that calls itself a “fact tank.” More than 500 world demographic indicators from the World Bank, the Lancet, and many other entities are available to download in Excel format, or to view and visualize online.
  • Inter-university Consortium for Political and Social Research (ICPSR) One of the largest collections of data for social and behavioral research. File formats include SPSS, SAS and csv.
  • National Archive on Criminal Justice Data
  • National Center for Health Statistics (NCHS) Extensive tutorials are available to assist users with learning how to incorporate NCHS data into their research.
  • The Odum Institute Dataverse University of North Carolina Chapel Hill
  • U. S. Department of Housing and Urban Development (HUD) Housing and housing market data provided by government.
  • U.K. Data Service Sponsored by the U.K. Economic & Social Research Council (ESRC).
  • Association of Religion Archives
  • U.S. Bureau of Labor Statistics Economy and labor market provided by governmental site.
  • U.S. Census Bureau Population demographics provided by governmental site.
  • Guardian (UK) Datablog
  • Kaggle 3rd party, multi-disciplinary crowd-sourcing platform. Check the credibility of data provided by contributors not affiliated with academic or professional institutions.
  • Social Computing Data Repository Arizona State University collects and makes available for download datasets from the most popular social networks including Twitter, FourSquare, YouTube and more.
  • Stanford Large Network Dataset Collection Features data from social networks, online reviews and more.
  • Registry of Open Data on AWS Notable sets include the NASA Nex Project and 1000 Genome Project.
  • Figshare 3rd party multi-disciplinary repository. Search by keyword or browse by subject.
  • Africa Open Data Search and download more than 900 datasets from countries across the continent. File formats are available in csv, zip and shapefile (shp) for use with GIS software.
  • American Fact Finder A division of the US Census Bureau, this site provides datasets from censuses and surveys conducted by the Bureau.
  • Data.gov The gateway to searching and discovering U.S. government data. This site boasts over 90,000 datasets!
  • Data.gov.uk Search over 17,000 datasets from the government of the United Kingdom. This database allows for limiting search results by theme (subject), format (file type) and publisher.
  • European Union Open Data Portal Gateway to data produced by EU member institutions. The homepage features most viewed datasets, as well as updated datasets and top publishers (agencies/institutions). Most datasets can be downloaded in pdf or zip formats.
  • National Digital Archive of Datasets A division of the U.K. National Archives, these datasets are from 1997-2010. Fully searchable and can be downloaded in html, csv, xls, and more.
  • Open Data Canada Search and download datasets in different formats (csv, xml, zip, html). Featured datasets are also available across a wide range of categories.
  • United Nations Data The gateway to data and statistics for UN supported projects, including the Monthly Bulletin of Statistics. To learn how to best use this resource, see these FAQs.
  • UN Statistical Databases Directory of UN statistical databases from the United Nations Dag Hammarskjöld Library.
  • World Bank Datasets can be browsed and searched across a wide range of indicators and categories. Download options are available from basic to advanced. View the World Bank Databank tutorial to learn more about how to use and download datasets.
  • Datacatalogs.org This site provides a browsable, searchable directory of open data catalogs around the world, including government and non-government sources.
  • Datacite A repository of open datasets that are available online. Links to the dataset homepage are available along with the associated subjects, publisher (authority) and description.
  • Dryad A curated resource that makes research data discoverable, freely reusable, and citable. Provides a general-purpose home for a wide diversity of data types.
  • Google Public Data Freely available tool for searching public datasets. Importing, saving and linking tools are also available. See more from Google Public Data Help.
  • Harvard Dataverse Network An open network of research and scientific data containing over 50,000 studies.
  • Qualitative Data Repository Dedicated archive for storing and sharing digital data (and accompanying documentation) generated or collected through qualitative and multi-method research in the social sciences and related disciplines.
  • Figshare Figshare is a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner
  • Re3 Data Re3data is a global registry of research data repositories from different academic disciplines. It includes repositories that enable permanent storage of and access to data sets for researchers, funding bodies, publishers, and scholarly institutions. re3data promotes a culture of sharing, increased access and better visibility of research data. The registry went live in autumn 2012 and has been funded by the German Research Foundation (DFG).
  • Kaggle This for-profit company offers data forecasting services for the energy industry and also maintains a platform for “predictive modeling competitions.” Get a team together and challenge yourselves to compete!
  • Sociology Data Set Server St. Joseph’s University, Dept. of Sociology
  • SPSS Data Page East Carolina University, Dept. of Psychology, Dr. Karl L. Wuensch
  • SPSS Data Sets Butler University, Dept. of Psychology, Dr. Roger J. Padgett
  • Statistical Reference Datasets National Institute for Standards & Technology
  • Statistics for Psychology University of Bath, Dept. of Psychology, Dr. Ian Walker
  • Teaching with Data While this site does not have datasets to download, they have excellent resources for locating datasets and other tools for using data in education.
  • UCI Machine Learning Repository Used primarily for the computer sciences, a number of social sciences datasets are available here. Each dataset has cited references.
  • V7 Open Datasets Open-access searchable page with over 500 quality datasets.
  • National Map This website provides datasets for representing U. S. government data using various map tools. Maps include: The National Atlas of the United States, U.S. Topo, Historical Topographic Map Collection, and the National Map Viewer.
  • Nesstar (Norwegian Social Science Data Services) An open access, web-based tool for publishing and analyzing data.
  • OpenRefine Formerly Google Refine, this free tool allows intermediate to advanced level users multiple options for managing large datasets.
  • Social Explorer This tool allows users to manipulate data from demographic and economic sources to create their own maps, interactive images, and more. The limited free version provides access to data from the 2000 US Census.
  • Statwing A limited free tool to analyze and visualize data. (Note: The free version makes your data publicly available, up to 25 MB.)
  • TableauPublic A free tool for visualizing data in a wide variety of design options.

Health Dataset Sites

  • Hospitals and Spending
  • Medicaid & Medicare
  • Multi-topic
  • Non-Profit Hospitals
  • CDC BRFSS Health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.
  • CDC Data Statistics on major diseases.
  • CDC Wonder Data on diseases and death, as well as prevention.

Sources for statistics on hospitals and/or hospital spending.

  • AHA Annual survey of hospitals in the United States. It includes the number of government hospitals, the number of hospitals in each bed-size category, and the number of hospital beds.
  • AHD The American Hospital Directory® provides data, statistics, and analytics about more than 7,000 hospitals nationwide. AHD.com® hospital information includes both public and private sources such as Medicare claims data, hospital cost reports, and commercial licensors. AHD® is not affiliated with the American Hospital Association (AHA) and is not a source for AHA Data. Our data are evidence-based and derived from the most definitive sources.
  • AHRQ HCUPnet HCUPnet is a free, online query system based on data from the Healthcare Cost and Utilization Project (HCUP). The system provides health care statistics and information for hospital inpatient, emergency department, and ambulatory settings, as well as population-based health care data on counties.
  • CMS.gov Research, data and statistics on Medicare & Medicaid from The Centers for Medicare & Medicaid Services, CMS, part of the Department of Health and Human Services (HHS).
  • Medicare.gov Compare hospitals quality of care.

A list of public datasets by topic, from the Society of General Internal Medicine.

  • Propublica Various healthcare datasets including Medicare, treatments and nursing homes.
  • HealthData.gov 50 datasets, mostly related to COVID-19.
  • IRS Non-Profit Hospital Compliance Report The IRS commenced its Hospital Compliance Project (Project) in May 2006 to study nonprofit hospitals and community benefit, and to determine how nonprofit hospitals establish and report executive compensation. The Project involved mailing a comprehensive compliance check questionnaire to 544 nonprofit hospitals and analyzing their responses.
  • CDC NCHS Access data from national health surveys.

Digging for Data Webinar

Not sure where or how to start your data search? This webinar provides a basic overview of how to find datasets using Google Dataset Search and other dataset directories/repositories, and answers any questions you bring to the session.

Searching for Datasets Online

  • Google Dataset Search

Google Dataset Search is a search engine across metadata for millions of datasets in thousands of repositories across the Web. Similar to how Google Scholar works, Google Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher's site, a digital library, or an author's personal web page.

Dataset Search can be useful to a broad audience, whether you're looking for scientific data, government data, or data provided by news organizations. Simply enter what you are looking for, and the results will guide you to the published dataset on the repository provider’s site.

Screenshot of search results for Google Dataset Search

Persistent links to datasets may be found by clicking on the share icon. You can then copy and paste the link to share it or save the location.

Screenshot showing the share feature in Google Dataset Search

  • To find open data for a particular U.S. state or country, try using a search engine and the keywords: open data [name of state or country] , as shown in the image below.

Screenshot of Google search results for search terms arizona open data.

  • You can also search Google for datasets by typing in your topic followed by the keywords "raw data" or "datasets". For example, "barriers to AI adoption raw data or datasets".
  • Lastly, you can search Google by file type, which will pull Excel documents that might contain raw data. For example, "artificial intelligence filetype:xls".

Locating an Original Dataset from a Journal Article

  • ACM Digital Library
  • IEEE Xplore Digital Library


Content: The Association of Computing Machinery database is a research, discovery and network platform. The database provides journals, conference proceedings, technical magazines, newsletters and books.

Purpose: An essential database for computing and technology research topics.

Special Features: Provides a list of authors after an initial topic search, includes a dataset search filter, and the ability to sort results by most cited.

Use the following steps to locate the actual dataset used in a research article within the ACM Digital Library database. 

  • Access the  ACM Digital Library  database from the  A-Z Databases List . 
  • Using the search box, enter your keyword terms to locate relevant research articles on your topic. 

ACM Digital Library search results refined by content formats

Content: Full-text peer-reviewed journals, transactions, magazines, conference proceedings, and published standards in the areas of electrical engineering, computer science, and electronics.

Purpose: Use to learn about current technology industry trends.

Special Features: Users may search for datasets.

To limit to full-text only, change the results from "All Results" to "My Subscribed Content".

Use the following steps to locate the actual dataset used in a research article within the IEEE Xplore Digital Library database. 

  • Access the IEEE Xplore Digital Library database from the A-Z Databases List.

IEEE Xplore Digital Library search box



NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Committee on Strategies for Responsible Sharing of Clinical Trial Data; Board on Health Sciences Policy; Institute of Medicine. Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. Washington (DC): National Academies Press (US); 2015 Apr 20.


Appendix B: Concepts and Methods for De-identifying Clinical Trial Data

INTRODUCTION

Very detailed health information about participants is collected during clinical trials. A number of different stakeholders would typically have access to individual-level participant data (IPD), including the study sites, the sponsor of the study, statisticians, Institutional Review Boards (IRBs), and regulators. By IPD we mean individual-level data on trial participants, which is more than the information that is typically included, for example, in clinical study reports (CSRs).

There is increasing pressure to share IPD more broadly than occurs at present. There are many reasons for such sharing, such as transparency in the trial and wider disclosure of adverse events that may have transpired, or to facilitate the reuse of such data for secondary purposes, specifically in the context of health research (Gøtzsche, 2011; IOM, 2013; Vallance and Chalmers, 2013). Many funding agencies tasked with the oversight of research, as well as its funding, are requiring that data collected by the projects they support be made available to others (MRC, 2011; NIH, 2003; The Wellcome Trust, 2011). Regulators such as the European Medicines Agency (EMA) (2014a,b) are currently examining how IPD from clinical trials can be shared more widely (IOM, 2013). In many cases, however, privacy concerns have been cited as a key obstacle to making these data available (Castellani, 2013; IOM, 2013).

One way in which privacy issues can be addressed is through the protection of the identities of the corresponding research participants. Such “de-identified” or “anonymized” health data (the former term being popular in North America, and the latter in Europe and other regions) are often considered to be sufficiently devoid of personal health information in many jurisdictions around the world. Many privacy laws therefore allow the data to be used and disclosed for secondary purposes without seeking participant consent. As long as the data are appropriately de-identified, many privacy concerns associated with data sharing can be readily addressed.

It should be recognized that de-identification is not, by any means, the only privacy concern that needs to be addressed when sharing clinical trial data. In fact, there must be a level of governance in place to ensure that the data will not be analyzed or used to discriminate against or stigmatize the participants or certain groups (e.g., religious or ethnic) associated with the study. This is because discrimination and stigmatization can occur even if the data are de-identified.

This paper describes a high-level risk-based methodology that can be followed to de-identify clinical trial IPD. To contextualize our review and analysis of de-identification, we also touch upon additional governance mechanisms, but we acknowledge that a complete treatment of governance is beyond the scope of this paper. Rather, the primary focus here is only on the privacy protective elements.

Data Recipients, Sponsors, and Adversaries

Clinical trial data may be disclosed by making them completely public or through a request mechanism. The data recipient may be a qualified investigator (QI) who must meet specific criteria. There may be other data recipients who are not QIs as well. If the data are made publicly available with no restrictions, however, then other types of users may access the data, such as journalists and nongovernmental organizations (NGOs). In our discussions we refer to the data recipient as the QI as a primary exemplar, although this is not intended to exclude other possible data recipients (it does make the presentation less verbose).

Data are being disclosed to the QI by the sponsor. We use the term “sponsor” generally to refer to all data custodians who are disclosing IPD, recognizing that the term may mean different entities depending on the context. It may not always be the case that the sponsor is a pharmaceutical company or a medical device company. For example, a regulator may decide to disclose the data to a QI, or a pharmaceutical company may provide the data to an academic institution, whereupon that institution becomes the entity that discloses the data.

The term “adversary” is often used in the disclosure control literature to refer to the role of the individual or entity that is trying to re-identify data subjects. Other terms used are “attacker” and “intruder.” Discussions about the QI being a potential adversary are not intended to paint QIs as having malicious objectives. Rather, in the context of a risk assessment, one must consider a number of possible data recipients as being potential adversaries and manage the re-identification risk accordingly.

Data Sharing Models

A number of different ways to provide access to IPD have been proposed and used, each with different advantages and risks ( Mello et al., 2013 ). First, there is the traditional public data release where anyone can get access to the data with no registration or conditions. Examples of such releases include the publicly available clinical trial data from the International Stroke Trial (IST) ( Sandercock et al., 2011 ) and data posted to the Dryad online open access data repository (Dryad, undated; Haggie, 2013 ).

A second form of data sharing, which is more restrictive, occurs when there exists a formal request and approval process to obtain access to clinical trial data, such as the GlaxoSmithKline (GSK) trials repository ( Harrison, 2012 ; Nisen and Rockhold, 2013 ); Project Data Sphere (whose focus is on oncology trial data) ( Bhattacharjee, 2012 ; Hede, 2013 ); the Yale University Open Data Access (YODA) Project, which is initially making trial data from Medtronic available ( CORE, 2014 ; Krumholz and Ross, 2011 ); and the Immunology Database and Analysis Portal (ImmPort, n.d.), which is restricted to researchers funded by the Division of Allergy, Immunology, and Transplantation of the National Institute of Allergy and Infectious Diseases (DAIT/NIAID), other approved life science researchers, National Institutes of Health employees, and other preauthorized government employees (ImmPort, n.d.). More recently, pharmaceutical companies have created the ClinicalStudyDataRequest.com website, which facilitates data requests to multiple companies under one portal. Following this restrictive model, a request can be processed by the study sponsor or by a delegate of the sponsor (e.g., an academic institution).

A hybrid of the above approaches is a quasi-public release, in which the data user must agree to some terms of use or sign a “click-through” contract. Click-through contracts are online terms of use that may place restrictions on what can be done with the data and how the data are handled. Regardless, anyone can still download such data. For example, public analytics competition data sets, such as the Heritage Health Prize (El Emam et al., 2012), and data-centric software application development competitions, such as the Cajun Code Fest (Center for Business and Information Technologies, 2013), fall into this category. In practice, however, click-through terms are not common for the sharing of clinical trial IPD.

A form of data access that does not require any data sharing occurs when analysts request that the data controller perform an analysis on their behalf. Because this does not involve the sharing of IPD, it is a scenario that we do not consider further in this paper.

Data Sharing Mechanisms

Different mechanisms can be used to share IPD. Clinical trial IPD can be shared either as microdata or through an online portal . The term “microdata” is commonly used in the disclosure control literature to refer to individual-level raw data ( Willenborg and de Waal, 1996 , 2001 ). These microdata may be in the form of one or more flat files or relational databases.

When disclosed as microdata, the data are downloaded as a raw data file that can be analyzed by QIs on their own machines, using their own software if they wish to do so. The microdata can be downloaded through a website, sent to the QI on a disc, or transferred electronically. If access is through a website, the QI may have to register, sign a contract, or go through other steps before downloading the data.

When a portal is used, the QI can access the data only through a remote computer interface, such that the raw data reside on the sponsor's computers and all analysis performed is on the sponsor's computers. Data users do not download any microdata to their own local computers through this portal. Under this model, all actions can be audited.

A public online portal allows anyone to register and get access to the IPD. Otherwise, the access mechanism requires a formal request process.

De-identification is relevant in both of the aforementioned scenarios. When data are provided as microdata, the de-identification process ensures that each record is protected from the QI and his/her staff as the potential adversary. When data are shared through the portal, a QI or his/her staff may inadvertently recognize a data subject because that data subject is a neighbor, relative, coworker, or famous person (see Box B-1 ).

Box B-1. Types of Re-identification Attacks. For public data, the sponsor needs to make a worst-case assumption and protect against an adversary who is targeting the data subjects with the highest risk of re-identification.

The different approaches for sharing clinical trial IPD are summarized in Figure B-1 .

Figure B-1. Different approaches for sharing clinical trial data. NOTE: QI = qualified investigator.

Scope of Data to Be De-identified

It is important to make a distinction between biological, and particularly genomic, data and other types of data. Many clinical trials are creating biorepositories. These may have a pseudonym or other unique identifier for the participant, and a sample of data. The de-identification methods we describe in this paper are applicable to clinical, administrative, and survey data. Genomic data raise a different set of issues. These issues are addressed directly in a later section of this paper.

Clinical trial data can be shared at multiple levels of detail. For example, the data can be raw source data or analysis-ready data. We assume that the data are analysis-ready and that no data cleansing is required before de-identification.

Existing Standards for De-identification

Various regulations associated with data protection around the world permit the sharing of de-identified (or similarly termed) data. For instance, European Union (EU) Data Protection Directive 95/46/EC, which strictly prohibits secondary uses of person-specific data without individual consent, provides an exception to the ruling in Recital 26, which states that the “principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable.” However, what does it mean for data to be “identifiable”? How do we know when they are no longer identifiable? The Data Protection Directive, and similar directives around the world, do not provide explicit guidelines regarding how data should be protected. An exception to this rule is a code of practice document published by the U.K. Information Commissioner's Office (ICO) (2012). And although this document provides examples of de-identification methods and issues to consider when assessing the level of identifiability of data, it does not provide a full methodology or specific standards to follow.

There are, however, de-identification standards provided in the Privacy Rule of the U.S. Health Insurance Portability and Accountability Act of 1996 (HIPAA) and subsequent guidance published by the Office for Civil Rights (OCR) at the U.S. Department of Health and Human Services (HHS) ( HHS, 2012 ). This rule is referred to by many regulatory frameworks around the world, and the principles are strongly related to those set forth in the United Kingdom's code of practice document mentioned above.

Two of the key existing standards for the de-identification of health microdata are described in the HIPAA Privacy Rule. It should be recognized that HIPAA applies only to “covered entities” (i.e., health plans, health care clearinghouses, and health care providers that transmit health information electronically) in the United States. It is likely that in many instances the sponsors of clinical trials will not fall into this class. However, these de-identification standards have been in place for approximately a decade, and there is therefore a considerable amount of real-world experience in their application. They can serve as a good launching point for examining best practices in this area. For the disclosure of clinical trial data, the HIPAA Privacy Rule de-identification standards offer a practically defensible foundation even if they are not a regulatory requirement.

According to section 164.514 of the HIPAA Privacy Rule, “health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.” Section 164.514(b) of the Privacy Rule contains the implementation specifications that a covered entity, or affiliated business associate, must follow to meet the de-identification standard. In particular, the Privacy Rule outlines two routes by which health data can be designated as de-identified. These are illustrated in Figure B-2 .

Figure B-2. The two de-identification standards in the HIPAA Privacy Rule. SOURCE: Reprinted from a document produced by OCR (HHS, 2012).

The first route is the “Safe Harbor” method. Safe Harbor requires the manipulation of 18 fields in the data set as described in Box B-2 . The Privacy Rule requires that a number of these data elements be “removed.” However, there may be acceptable alternatives to actual removal of values as long as the risk of reverse engineering the original values is very small. Compliance with the Safe Harbor standard also requires that the sponsor not have any actual knowledge that a data subject can be re-identified. Assumptions of the Safe Harbor method are listed in Box B-3 .

Box B-2. The Safe Harbor De-identification Standard. The fields include names and all geographic subdivisions smaller than a state (street address, city, county, precinct, zip code, and their equivalent geocodes), with an exception for the initial three digits of a zip code when the corresponding area is sufficiently populous.

Box B-3. Assumptions of the HIPAA Safe Harbor Method. There are only two quasi-identifiers that need to be manipulated in a data set (dates and zip codes), and the adversary does not know who is in the data set (i.e., which individuals participated in the trial).

The application of Safe Harbor is straightforward, but there clearly are instances in which dates and more fine-grained geographic information are necessary. In practice the Safe Harbor standard would remove critical geospatial and temporal information from the data (see items 2 and 3 in Box B-2 ), potentially reducing the utility of the data. Many meaningful analyses of clinical trial data sets require the dates and event order to be clear. For example, in a Safe Harbor data set, it would not be possible to include the dates when adverse events occurred.
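
To make the tradeoff concrete, the sketch below (in Python) applies two Safe Harbor-style generalizations: truncating a zip code to its initial three digits and reducing a date to its year. The field names and values are hypothetical, and the sketch illustrates only the standard's effect on data utility; it is not a complete Safe Harbor implementation, which covers all 18 fields.

    # Minimal sketch of two Safe Harbor-style generalizations (illustrative only).
    from datetime import date

    def generalize_zip(zip_code: str) -> str:
        # Safe Harbor retains only the initial three digits of a zip code,
        # and only when the corresponding area is sufficiently populous.
        return zip_code[:3] + "XX"

    def generalize_date(d: date) -> int:
        # Safe Harbor removes all elements of dates except the year, so two
        # adverse events in the same year become temporally indistinguishable.
        return d.year

    record = {"zip": "10027", "adverse_event_date": date(2014, 3, 9)}
    deidentified = {
        "zip": generalize_zip(record["zip"]),
        "adverse_event_year": generalize_date(record["adverse_event_date"]),
    }
    print(deidentified)  # {'zip': '100XX', 'adverse_event_year': 2014}

As the example shows, event ordering within a year is lost, which is precisely the utility cost described above.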

In recognition of the limitations of de-identification via Safe Harbor, the HIPAA Privacy Rule provides for an alternative in the form of the Expert Determination method. This method has three general requirements:

  • The de-identification must be based on generally accepted statistical and scientific principles and methods for rendering information not individually identifiable. This means that the sponsor needs to ensure that there is a body of work that justifies and evaluates the methods that are used for the de-identification and that these methods must be generally known (i.e., undocumented methods or proprietary methods that have never been published would be difficult to classify as “generally accepted”).
  • The risk of re-identification needs to be very small, such that the information could not be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information. However, the mechanism for measuring re-identification risk is not defined in the HIPAA Privacy Rule, and what would be considered very small risk also is not defined. Therefore, the de-identification methodology must include some manner of measuring re-identification risk in a defensible way and have a repeatable process to follow that allows for the definition of very small risk.
  • Finally, the methods and results of the analysis that justify such determination must be documented. The basic principles of de-identification are expected to be consistent across all clinical trials, but the details will be different for each study, and these details also need to be documented.

These conditions are reasonable for a de-identification methodology and are consistent with the guidance that has been produced by other agencies and regulators ( Canadian Institute for Health Information, 2010 ; ICO, 2012 ). They also serve as a set of conditions that must be met for the methods described here.
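
The Privacy Rule does not prescribe a particular metric for the second condition. One generally accepted approach in the disclosure control literature is to group records into "equivalence classes" that share the same quasi-identifier values and to treat the re-identification risk of a record as the inverse of its class size. The following is a minimal sketch in Python, assuming for illustration that sex and year of birth are the only quasi-identifiers:

    # Sketch: equivalence-class-based re-identification risk.
    # The quasi-identifier values below are invented.
    from collections import Counter

    records = [
        ("M", 1967), ("M", 1967), ("M", 1967),
        ("F", 1955), ("F", 1955),
        ("M", 1959),  # a unique record: highest risk
    ]

    class_sizes = Counter(records)  # size of each equivalence class
    max_risk = max(1.0 / size for size in class_sizes.values())
    avg_risk = sum(1.0 / class_sizes[r] for r in records) / len(records)
    print(max_risk, avg_risk)  # 1.0 and 0.5 for this toy data set

Whether the maximum or the average risk should be compared against a "very small" threshold depends on the sharing model; as noted in Box B-1, a public release calls for the worst-case (maximum) measure.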

Unique and Derived Codes Under HIPAA

According to the 18th item in Safe Harbor (see Box B-2 ), “any unique identifying number, characteristic, or code” must be removed from the data set; otherwise it would be considered personal health information. However, in lieu of removing the value, it may be hashed or encrypted. This would be called a “pseudonym.” For example, the unique identifier may be a participant's clinical trial number, and this is encrypted with a secret key to create a pseudonym. A similar scheme for creating pseudonyms would be used under the Expert Determination method.
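
As an illustration, a derived pseudonym can be computed with a keyed hash so that the trial number cannot be recovered without the secret key. This is a minimal sketch, not a recommendation of specific algorithms or parameters; the key and identifier are invented:

    # Sketch: a derived pseudonym via a keyed hash (HMAC-SHA256).
    # The secret key must be tightly controlled by the sponsor.
    import hashlib
    import hmac

    SECRET_KEY = b"replace-with-a-randomly-generated-key"  # hypothetical

    def derived_pseudonym(trial_number: str) -> str:
        return hmac.new(SECRET_KEY, trial_number.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    print(derived_pseudonym("TRIAL-0042"))  # stable for the same participant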

However, the HIPAA Privacy Rule at § 164.514(c) states that any code derived from information about an individual is considered identifiable data. Such derived pseudonyms are nonetheless practically important for knowing which records belong to the same clinical trial participant and for constructing the longitudinal record of a data subject. Not being able to create derived pseudonyms means that random pseudonyms must be created. To be able to use random pseudonyms, one must maintain a crosswalk between the individual identity and the random pseudonym. The crosswalk allows the sponsor to use the same pseudonym for each participant across data sets and to allow re-identification at a future date if the need arises. These crosswalks, which are effectively linking tables between the pseudonym and the information about the individual, arguably present an elevated privacy risk because clearly identifiable information must now be stored somehow. Furthermore, the original regulations did not impose any controls on this crosswalk table.
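
The random-pseudonym alternative can be sketched the same way, and the sketch makes plain why a crosswalk must then be kept: without it, the same participant could not receive the same pseudonym across data sets. Names and identifiers are again invented:

    # Sketch: random pseudonyms with a crosswalk table. The crosswalk links
    # pseudonyms back to identities, so it is itself identifiable data and
    # must be stored and controlled securely.
    import secrets

    crosswalk = {}  # participant identity -> random pseudonym

    def random_pseudonym(trial_number: str) -> str:
        if trial_number not in crosswalk:
            crosswalk[trial_number] = secrets.token_hex(8)
        return crosswalk[trial_number]

    first = random_pseudonym("TRIAL-0042")
    assert random_pseudonym("TRIAL-0042") == first  # consistent across data sets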

For research purposes, the Common Rule will also apply. Under the Common Rule, which guides IRBs, if the data recipient has no means of getting the key, for example, through an agreement with the sponsor prohibiting the sharing of keys under any circumstances or through organizational policies prohibiting such an exchange, then creating such derived pseudonyms is an acceptable approach ( HHS, 2004 , 2008b ).

Therefore, there is an inconsistency between the Privacy Rule and the Common Rule in that the former does not permit derived pseudonyms, while the latter does. This is well documented (Rothstein, 2005, 2010). However, in the recent guidelines from OCR, this is clarified to state that “a covered entity may disclose codes derived from PHI (protected health information) as part of a de-identified data set if an expert determines that the data meets the de-identification requirements at §164.514(b)(1)” (HHS, 2012, p. 22). This means that a derived code, such as the output of an encryption or hash function, can be used as a pseudonym as long as there is assurance that the means to reverse that pseudonym are tightly controlled. The rules are now clear and consistent: if there is a defensible mechanism whereby reverse engineering a derived pseudonym has a very small probability of success, such pseudonyms are permitted.

Is It Necessary to Destroy Original Data?

Under the Expert Determination method, the re-identification risk needs to be managed assuming that the adversary is “an anticipated recipient” of the data. This limits the range of adversaries that needs to be considered because, in our context, the anticipated recipient is the QI.

However, under the EU Data Protection Directive, the adversary may be the “data controller or any other person.” The data controller is the sponsor or the QI receiving the de-identified data. There are a number of challenges with interpreting this at face value.

One practical issue is that the sponsor will, by definition, be able to re-identify the data because the sponsor will retain the original clinical trial data set. The Article 29 Working Party has proposed that, effectively, the sponsor needs to destroy or aggregate the original data to be able to claim that the data provided to the QI are truly de-identified ( Article 29 Data Protection Working Party, 2014 ). This means that the data are not de-identified if there exists another data set that can re-identify it, even in the possession of another data controller. Therefore, because the identified data exist with the sponsor, the data provided to the QI cannot be considered de-identified. This is certainly not practical because the original data are required for legal reasons (e.g., clinical trial data need to be retained for an extended period of time whose duration depends on the jurisdiction). Such a requirement would discourage de-identification by sponsors and push them to share identifiable data, which arguably would increase the risk of re-identification for trial participants significantly.

In an earlier opinion the Article 29 Data Protection Working Party (2007) emphasized the importance of “likely reasonable” in the definition of identifiable information in the 95/46/EC Directive. In that case, if it is not likely reasonable that data recipients would be able to readily re-identify the anonymized data because they do not have access to the original data, those anonymized data would not be considered personal information. That would seem to be a more reasonable approach that is consistent with interpretations in other jurisdictions.

Is De-identification a Permitted Use?

Retroactively obtaining participant consent to de-identify data and use them for secondary analysis may introduce bias in the data set ( El Emam, 2013 ). If de-identification is a permitted use under the relevant regulations, then de-identification can proceed without seeking participant consent. Whether that is the case will depend on the prevailing jurisdiction.

Under HIPAA and extensions under the Health Information Technology for Economic and Clinical Health (HITECH) Act Omnibus Rule, de-identification is a permitted use by a covered entity. However, a business associate can de-identify a data set only if the business associate agreement explicitly allows for that. Silence on de-identification in a business associate agreement is interpreted as not permitting de-identification.

In other jurisdictions, such as Ontario, the legislation makes explicit that de-identification is a permitted use ( Perun et al., 2005 ).

Terminology

Terminology in this area is not always clear, and different authors and institutions use the same terms to mean different things or different terms to mean the same thing ( Knoppers and Saginur, 2005 ). Here, we provide the terminology and definitions used in this paper.

The International Organization for Standardization (ISO) Technical Specification on the pseudonymization of health data defines relevant terminology for our purposes. The term “anonymization” is defined as a “process that removes the association between the identifying data set and the data subject” ( ISO, 2008 ). This is consistent with current definitions of “identity disclosure,” which corresponds to assigning an identity to a data subject in a data set ( OMB, 1994 ; Skinner, 1992 ). For example, an identity disclosure would transpire if the QI determined that the third record (ID = 3) in the example data set in Table B-1 belonged to Alice Brown. Thus, anonymization is the process of reducing the probability of identity disclosure to a very small value.

TABLE B-1. An Example of Data Used to Illustrate a Number of Concepts Referred to Throughout This Paper.

Arguably, the term “anonymization” would be the appropriate term to use here given its more global utilization. However, to remain consistent with the HIPAA Privacy Rule, we use the term “de-identification” in this paper.

Beyond identity disclosure, organizations (and privacy professionals) are, at times, concerned about “attribute disclosure” (OMB, 1994; Skinner, 1992). This occurs when a QI learns a sensitive attribute about a participant in the database with a sufficiently high probability, even if the QI does not know which specific record belongs to that patient (Machanavajjhala et al., 2007; Skinner, 1992). For example, in Table B-1, all males born in 1967 had a creatine kinase lab test. Assume that an adversary does not know which record belongs to Almond Zipf (who has record ID = 17; see Table B-2). However, because Almond is male and was born in 1967, the QI will discover something new about him: that he had a test often administered to individuals showing symptoms of a heart attack. All known re-identification attacks are identity disclosures and not attribute disclosures (El Emam et al., 2011a). Furthermore, privacy statutes and regulations in multiple jurisdictions, including the HIPAA Privacy Rule, the Ontario Personal Health Information Protection Act (PHIPA), and the EU Data Protection Directive, consider identity disclosure only in their definitions of personal health information. Although participants may consider certain types of attribute disclosure to be a privacy violation, it is not considered so when the objective is anonymization of the data set.

TABLE B-2. Identities of Participants from the Hypothetical Data Set.

Technical methods have been developed to modify the data to protect against attribute disclosure ( Fung et al., 2010 ). However, these methods have rarely, if ever, been used in practice for the disclosure of health data. One possible reason for this is that they distort the data to such an extent that the data are no longer useful for analysis purposes. There are other, nontechnical approaches that are more appropriate for addressing the risks of attribute disclosure, and in the final section on governance we provide a description of how a sponsor can protect against attribute disclosure. Therefore, our focus in this paper is on identity disclosure.
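
Although we do not pursue technical protections against attribute disclosure further, the creatine kinase example can at least be detected mechanically: group records by their quasi-identifier values and flag any group in which a sensitive attribute takes a single value. A minimal sketch with invented values:

    # Sketch: detect attribute disclosure, i.e., quasi-identifier groups in
    # which every record shares the same sensitive value.
    from collections import defaultdict

    rows = [  # (sex, year_of_birth, lab_test), invented values
        ("M", 1967, "creatine kinase"),
        ("M", 1967, "creatine kinase"),
        ("F", 1955, "creatine kinase"),
        ("F", 1955, "lipid panel"),
    ]

    tests_by_group = defaultdict(set)
    for sex, yob, test in rows:
        tests_by_group[(sex, yob)].add(test)

    homogeneous = [g for g, tests in tests_by_group.items() if len(tests) == 1]
    print(homogeneous)  # [('M', 1967)]: all males born in 1967 had the same test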

HOW TO MEASURE THE RISK OF RE-IDENTIFICATION

We begin with some basic definitions that are critical for having a meaningful discussion about how re-identification works. Along the way, we address some of the controversies around de-identification that have appeared in the literature and the media.

Categories of Variables

It is useful to differentiate among the different types of variables in a clinical trial data set. The way the variables are handled during the de-identification process will depend on how they are categorized. We make a distinction among three types of variables (Samarati, 2001; Sweeney, 2002):

  • Directly identifying variables. Direct identifiers have two important characteristics: (1) one or more direct identifiers can be used to uniquely identify an individual, either by themselves or in combination with other readily available information; and (2) they often are not useful for data analysis purposes. Examples of directly identifying variables include names, email addresses, and telephone numbers of participants. It is uncommon to perform data analysis on clinical trial participant names and telephone numbers.
  • Indirectly identifying variables (quasi-identifiers). Quasi-identifiers are the variables about research participants in the data set that a QI can use, either individually or in combination, to re-identify a record. If an adversary does not have background knowledge of a variable, it cannot be a quasi-identifier. The means by which an adversary can obtain such background knowledge will determine which attacks on a data set are plausible. For example, the background knowledge may be available because the adversary knows a particular target individual in the disclosed clinical trial data set, an individual in the data set has a visible characteristic that is also described in the data set, or the background knowledge exists in a public or semipublic registry. Examples of quasi-identifiers include sex, date of birth or age, locations (such as postal codes, census geography, and information about proximity to known or unique landmarks), language spoken at home, ethnic origin, aboriginal identity, total years of schooling, marital status, criminal history, total income, visible minority status, activity difficulties/reductions, profession, event dates (such as admission, discharge, procedure, death, specimen collection, visit/encounter), codes (such as diagnosis codes, procedure codes, and adverse event codes), country of birth, birth weight, and birth plurality.
  • Other variables. These are the variables that are not really useful for determining an individual's identity. They may or may not be clinically relevant.

Individuals can be re-identified because of the directly identifying variables and the quasi-identifiers. Therefore, our focus is on these two types of variables.
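
In practice, the outcome of this categorization is often recorded as a simple mapping that drives the rest of the de-identification pipeline. A minimal sketch with hypothetical column names:

    # Sketch: recording the variable classification that drives later steps.
    VARIABLE_CLASSES = {
        "name": "direct_identifier",         # remove or pseudonymize
        "telephone": "direct_identifier",
        "sex": "quasi_identifier",           # generalize and assess risk
        "date_of_birth": "quasi_identifier",
        "zip": "quasi_identifier",
        "blood_glucose": "other",            # typically left untouched
    }

    quasi_identifiers = [col for col, cls in VARIABLE_CLASSES.items()
                         if cls == "quasi_identifier"]
    print(quasi_identifiers)  # ['sex', 'date_of_birth', 'zip']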

Classifying Variables

An initial step in being able to reason about the identifiability of a clinical trial data set is to classify the variables into the above categories. We consider the process for doing so below.

Is It an Identifier?

There are three conditions for a field to be considered an identifier (of either type). These conditions were informed by HHS's de-identification guidelines ( HHS, 2012 ).

Replicability

The field values must be sufficiently stable over time so that the values will occur consistently in relation to the data subject. For example, the results of a patient's blood glucose level tests are unlikely to be replicable over time because they will vary quite a bit. If a field value is not replicable, it will be challenging for an adversary to use that information to re-identify an individual.

Distinguishability

The variable must have sufficient variability to distinguish among individuals in a data set. For example, in a data set of only breast cancer patients, the diagnosis code (at least at a high level) will have little variation. On the other hand, if a variable has considerable variation among the data subjects, it can distinguish among individuals more precisely. That diagnosis field will be quite distinguishable in a general insurance claims database.
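
Distinguishability can be approximated empirically, for instance as the Shannon entropy of a field's observed values: a constant field has zero entropy, while a highly varied field has high entropy. A minimal sketch with invented diagnosis codes:

    # Sketch: Shannon entropy (in bits) as a rough distinguishability measure.
    import math
    from collections import Counter

    def entropy(values):
        counts = Counter(values)
        n = len(values)
        return sum((c / n) * math.log2(n / c) for c in counts.values())

    breast_cancer_trial = ["C50"] * 20                  # no variation
    claims_database = ["C50", "E11", "I10", "J45"] * 5  # considerable variation

    print(entropy(breast_cancer_trial))  # 0.0 bits: distinguishes no one
    print(entropy(claims_database))      # 2.0 bits: distinguishes four groups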

Knowability

An adversary must know the identifier values for a data subject in order to re-identify them. If a variable is not knowable by an adversary, it cannot be used to launch a re-identification attack on the data.

When we say that a variable is knowable, it also means that the adversary has an identity attached to that information. For example, if an adversary has a zip code and a date of birth, as well as an identity associated with that information (such as a name), then both the zip code and date of birth are knowable.

Knowability will depend on whether an adversary is an acquaintance of a data subject. If the adversary is an acquaintance, such as a neighbor, coworker, relative, or friend, it can be assumed that certain things will be known. Things known by an acquaintance will be, for example, the subject's demographics (e.g., date of birth, gender, ethnicity, race, language spoken at home, place of birth, and visible physical characteristics). An acquaintance may also know some socioeconomic information, such as approximate years of education, approximate income, number of children, and type of dwelling.

A nonacquaintance will know things about a data subject in a number of different ways, in decreasing order of likelihood:

  • The information can be inferred from other knowable information or from other variables that are determined to be identifiers. For example, birth weight can often be inferred from weeks of gestation. If weeks of gestation are included in the database, birth weight can be determined with reasonable accuracy.
  • The information is publicly available. For example, the information is in a public registry, or it appears in a newspaper article (say, an article about an accident or a famous person). Information can also become public if self-revealed by individuals. Examples are information posted on social networking sites and broadcast email announcements (e.g., births). It should be noted that only information that many people would self-reveal should be considered an identifier. If there is a single example or a small number of examples of people who are revealing everything about their lives (e.g., a quantified self-enthusiast who is also an exhibitionist), this does not mean that this kind of information is an identifier for the majority of the population.
  • The information is in a semipublic registry. Access to these registries may require a nominal fee or application process.
  • The information can be purchased from commercial data brokers. Use of commercial databases is not inexpensive, so an adversary would need to have a strong motive to use such background information.

Some of these data sources can be assessed objectively (e.g., whether there is relevant public information). In other cases, the decision will be subjective and may vary over time.

A Suggested Process for Determining Whether a Variable Is an Identifier

A simple way to determine whether a variable is an identifier is to ask an expert, internal or external to the sponsor, to do so. There are other, more formal processes that can be used as well.

There are two general approaches to classifying variables. In one approach, two analysts who know the data and the data subject population classify the variables independently; then some measure of agreement is computed. A commonly used measure of agreement is Cohen's Kappa ( Cohen, 1960 ). If this value is above 0.8, there is arguably general consensus, and the two analysts will meet to resolve the classifications on which they had disagreements. The results of this exercise are then retained as documentation.

If the Kappa value is less than 0.8, there is arguably little consensus. In such a case, it is recommended that a group of individuals at the sponsor site review the field classifications and reach a classification consensus. This consensus then needs to be documented, along with the process used to reach it. This process provides the data custodian with a defensible classification of variables.
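
To make the agreement computation concrete, the following is a minimal Python sketch of Cohen's Kappa for the two-analyst process described above. The labels and the number of variables are illustrative, and the 0.8 cutoff simply follows the discussion in the text; this is a sketch, not a complete inter-rater reliability tool.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Observed agreement: fraction of variables on which the analysts agree.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of p_a(category) * p_b(category).
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

analyst_1 = ["id", "id", "not", "id", "not", "not", "id"]
analyst_2 = ["id", "id", "not", "not", "not", "not", "id"]
print(round(cohens_kappa(analyst_1, analyst_2), 2))  # 0.72: below 0.8, so convene a review group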

Is It a Direct or Indirect Identifier

Once a variable has been determined to be an identifier, it is necessary to determine whether it is a direct or indirect (quasi-) identifier. If the field uniquely identifies an individual (e.g., a Social Security Number), it will be treated as a direct identifier. If it is not unique, the next question is whether it is likely to be used for data analysis. If so, it should be treated as a quasi-identifier. This is an important decision because the techniques often used to protect direct identifiers distort the data and their truthfulness significantly.

Is it possible to know which fields will be used for analysis at the time that de-identification is being applied? In many instances, an educated judgment can be made, for example, about potential outcome variables and confounders.

The overall decision rule for classifying variables is shown in Figure B-3 .

FIGURE B-3. Decision rule for classifying identifiers. SOURCE: Reprinted with permission from El Emam and colleagues, 2014.

How Is Re-identification Probability Measured

Measurement of re-identification risk is a topic that has received extensive study over multiple decades. We examine it at a conceptual level to illustrate key concepts. This discussion builds on the classification of variables described above.

The Risk of Re-identification for Direct Identifiers

We define risk as the probability of re-identifying a trial participant. In practice, we take the risk of re-identification for direct identifiers to be 1: if a direct identifier exists in a clinical trial data set, it is by definition considered to have a very high risk of re-identification.

Strictly speaking, the probability is not always 1. For example, consider the direct identifier “Last Name.” If a trial participant is named “Smith,” it is likely that there are other people in the trial named “Smith,” and this is even more likely in the community where that participant lives. However, assuming that the probability of re-identification is equal to 1 is a simplification that has little impact in practice, errs on the conservative side, and makes it possible to focus attention on the quasi-identifiers, which is where, in many instances, the most data utility lies.

Two methods can be applied to protect direct identifiers. The first is suppression, or removal of the variable. For example, when a clinical trial data set is disclosed, all of the names of the participants are stripped from the data set. The second method is to create a pseudonym (ISO, 2008). Pseudonymization is also sometimes called “coding” in the health research literature (Knoppers and Saginur, 2005). There are different schemes and technical methods for pseudonymization, such as single and double coding, reversible or irreversible pseudonyms, and encryption and hashing techniques. If executed well, pseudonymization ensures that the probability of re-identification is very small. There is no need to measure this probability on the data after suppression or pseudonymization because in almost all cases that value is going to be very small.
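
As an illustration of the pseudonymization option, the following Python sketch derives irreversible pseudonyms with a keyed hash (HMAC-SHA256). The secret key and participant identifiers are hypothetical, and this shows only one of the schemes mentioned above; single and double coding and reversible pseudonyms require key management not shown here.

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-randomly-generated-secret"  # hypothetical key

def pseudonymize(direct_identifier: str) -> str:
    # A keyed hash resists the dictionary attacks that plain hashing would allow.
    digest = hmac.new(SECRET_KEY, direct_identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

for participant in ["SUBJ-00123", "SUBJ-00124"]:  # hypothetical participant IDs
    print(participant, "->", pseudonymize(participant))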

Quasi-identifiers, however, cannot be protected using such procedures. This is because the resulting data, in almost all cases, will not be useful for analytic purposes. Therefore, a different set of approaches is required for measuring and de-identifying quasi-identifiers.

The Risk of Re-identification for Quasi-identifiers

Equivalence classes.

All the records that share the same values on a set of quasi-identifiers are called an “equivalence class.” For example, consider the quasi-identifiers in Table B-1 —sex and age. All the records in Table B-1 for males born in 1967 (i.e., records 10, 13, 14, 17, and 22) form an equivalence class. Equivalence class sizes for a data concept, such as age, potentially change during de-identification. For example, there may be five records for males born in 1967. When the precision of age is reduced to a 5-year interval, there are eight records for males born between 1965 and 1969 (i.e., records 2, 10, 11, 13, 14, 17, 22, and 27). In general, there is a trade-off between the level of detail provided for a data concept and the size of the corresponding equivalence classes, with more detail being associated with smaller equivalence classes.

The most common way to measure the probability of re-identification for a record in a data set is as 1 divided by the size of its equivalence class. For example, record number 14 is in an equivalence class of size five, and therefore its probability of re-identification is 0.2. Record number 27 is in an equivalence class of size one, and therefore its probability of re-identification is 1. Records that are in equivalence classes of size one are called “uniques.” In Table B-3, we have assigned the probability to each record in our example.
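
The following minimal Python sketch computes equivalence classes and per-record probabilities for a toy data set; the records are illustrative, not those of Table B-1. Each record's probability is 1 divided by the size of its equivalence class, as described above.

from collections import Counter

records = [
    {"sex": "M", "yob": 1967}, {"sex": "M", "yob": 1967},
    {"sex": "M", "yob": 1967}, {"sex": "F", "yob": 1967},
    {"sex": "F", "yob": 1971},
]
quasi_identifiers = ("sex", "yob")

def key(record):
    # A record's equivalence class is defined by its quasi-identifier values.
    return tuple(record[q] for q in quasi_identifiers)

class_sizes = Counter(key(r) for r in records)
for i, r in enumerate(records, start=1):
    size = class_sizes[key(r)]
    print(f"record {i}: equivalence class size {size}, Pr(re-id) = {1 / size:.2f}")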

TABLE B-3. The Data Set in Table B-1 with the Probabilities of Re-identification per Record Added.


This probability applies under two conditions: (1) the adversary knows someone in the real world and is trying to find the record that matches that individual, and (2) the adversary has selected a record in the data set and is trying to find the identity of that person in the real world. Both of these types of attacks on health data have occurred in practice, and therefore both perspectives are important to consider. An example of the former perspective is when an adversary gathers information from a newspaper and attempts to find the data subject in the data set. An example of the latter attack is when the adversary selects a record in the data set and tries to match it with a record in the voter registration list.

A key observation here is that the probability of re-identification is not based solely on the uniques in the data set. For example, record number 18 is not a unique, but it still has quite a high probability of re-identification. Therefore, it is recommended that the risk of re-identification be considered, and managed, for both uniques and non-uniques.

Maximum Risk

One way to measure the probability of re-identification for the entire data set is through the maximum risk, which corresponds to the maximum probability of re-identification across all records. From Table B-3, it can be seen that there is a unique record, so the maximum risk for this data set is 1.

Average Risk

The average risk corresponds to the average across all records in the data set. In the example of Table B-3 , this amounts to 0.59. By definition, the average risk for a data set will be no greater than the maximum risk for the same data set.

Which Risk Metric to Use

As the data set is modified, the risk values may change. For example, consider Table B-4 , in which year of birth has been generalized to decade of birth. The maximum risk is still 1, but the average risk has declined to 0.33. The average risk will be more sensitive than the maximum risk to modifications to the data.

TABLE B-4. The Data Set in Table B-1 After Year of Birth Has Been Generalized to Decade of Birth, with the Probabilities of Re-identification per Record Added.


Because the average risk is no greater than the maximum risk, the latter is generally used when a data set is going to be disclosed publicly ( El Emam, 2013 ). This is because a dedicated adversary who is launching a demonstration attack against a publicly available data set will target the record(s) in the disclosed clinical trial data set with the maximum probability of re-identification. Therefore, it is prudent to protect against such an adversary by measuring and managing maximum risk.

The average risk, by comparison, is more suitable for nonpublic data disclosures. For nonpublic data disclosures, some form of data sharing agreement with prohibitions on re-identification can be expected. In this case, it can be assumed that any data subject may be targeted by the adversary.

As a general rule, it is undesirable to have unique records in the data set after de-identification. In the example of Table B-1 , there are unique records both in the original data set and after year of birth has been changed to decade of birth (see Table B-4 ). For example, record 26 is unique in Table B-4 . Unique records have a high risk of re-identification. Also, as a general rule, it is undesirable to have records with a probability of re-identification equal to 0.5 in the data set.

With average risk, one can have data sets with an acceptably small average risk but with unique records or records in equivalence classes of size 2. To avoid that situation, one can use the concept of “strict average risk.” Here, maximum risk is first evaluated to ensure that it is at or below 0.33. If that condition is met, average risk is computed. This two-step measure ensures that there are no uniques or doubles in the data set.
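
A minimal Python sketch of the two-step strict average risk computation follows; the per-record probabilities are illustrative and would come from the equivalence class calculation shown earlier.

def strict_average_risk(per_record_probs, max_allowed=1 / 3):
    # Step 1: the maximum risk must be at or below 1/3, i.e., no uniques or doubles.
    if max(per_record_probs) > max_allowed:
        return max(per_record_probs)  # condition not met; the maximum risk stands
    # Step 2: otherwise report the average risk.
    return sum(per_record_probs) / len(per_record_probs)

print(strict_average_risk([1.0, 0.2, 0.2, 0.25]))   # 1.0: a unique record fails step 1
print(strict_average_risk([0.25, 0.2, 0.2, 0.25]))  # 0.225: condition met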

In the example data set in Table B-4 , the strict average risk is 1. This is because the maximum risk is 1, so the first condition is not met. However, the data set in Table B-5 has a strict average risk of 0.33. Therefore, in practice, maximum risk or strict average risk would be used to measure re-identification risk.

TABLE B-5. The Generalized Data Set with No Uniques or Doubles.


Samples and Populations

The above examples are based on the premise that an adversary knows who is in the data set. Under that condition, the risk metrics behave as demonstrated above. We call this a “closed” data set. There are situations in which this premise holds true. One such case occurs when the data set covers everyone in the population. A second case is when the data collection method itself discloses who is in the data set. Here are several examples in which the data collection method makes a data set closed:

  • If everyone attending a clinic is screened into a trial, an adversary who knows someone who attends the clinic will know that that individual is in the trial database.
  • A study of illicit drug use among youth requires parental consent, which means that parents will know if their child is in the study database.
  • The trial participants self-reveal that they are taking part in a particular trial, for example, on social networks or on online forums.

If it is not possible to know who is in the data set, the trial data set can be considered to be a sample from some population. We call this an “open” data set. Because the data set is a sample, there is some uncertainty about whether a person is in the data set or not. This uncertainty can reduce the probability of re-identification.

When the trial data set is treated as a sample, the maximum and average risk need to be estimated from the sample data. The reason is that in a sample context, the risk calculations depend on the equivalence class sizes in the population as well. Therefore, the population equivalence class sizes need to be estimated for the same records. Estimates are needed because in most cases, the sponsor will not have access to the population data.

There is a large body of work on these estimators in the disclosure control literature (e.g., Dankar et al., 2012 ; Skinner and Shlomo, 2008 ). A particularly challenging estimation problem is deciding whether a unique record in the sample is also a unique in the population. If a record is unique in the sample, it may be because the sampling fraction is so small that all records in the sample are uniques. Yet a record may be unique in the sample because it is also unique in the population.

Under these conditions, appropriate estimators need to be used to compute the maximum and average risk correctly. In general, when the data set is treated as a sample, the probability of re-identification will be no greater than the probability associated with situations in which the data set is not treated as a sample (i.e., the adversary knows who is in the data set).
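
The following deliberately crude Python illustration shows why sampling reduces risk: it approximates population equivalence class sizes by scaling the sample class sizes by the sampling fraction. The numbers are illustrative, and the proper estimators cited above (e.g., Dankar et al., 2012) treat sample uniques far more carefully than this naive scaling does.

sampling_fraction = 0.10
sample_class_sizes = [1, 2, 5]  # illustrative equivalence class sizes in the sample

for f in sample_class_sizes:
    estimated_population_class = f / sampling_fraction  # naive estimate, for illustration only
    closed_risk = 1 / f                          # adversary knows who is in the data set
    open_risk = 1 / estimated_population_class   # data set treated as a sample
    print(f"sample class size {f}: closed risk {closed_risk:.2f}, open risk ~{open_risk:.2f}")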

Re-identification Risk of Participants with Rare Diseases

It is generally believed that clinical trials conducted on rare diseases will always have a high risk of re-identification. It is true that the risk of re-identification will, in general, be higher than that for nonrare diseases. However, it is not necessarily too high. If the data set is open with a small sampling fraction and one is using (strict) average risk, the risk of re-identification may be acceptably small. The exact risk value will need to be calculated on the actual data set to make that determination.

Taking Context into Account

Whether a data set is disclosed to the public or to a more restricted group of recipients illustrates how critical context is. The nature of the recipient, for instance, informs which risk metric is more appropriate. However, this is only one aspect of the context surrounding a data set, and a more complete picture can be applied to make more accurate assessments of re-identification risk.

For a public data release, we assume that the adversary will launch a demonstration attack, and therefore it is necessary to manage maximum risk. There are no other controls that can be put in place. For a nonpublic data set, we consider three types of attacks that cover the universe of attacks: deliberate, inadvertent, and breach ( El Emam, 2013 ; El Emam and Arbuckle, 2013 ).

A deliberate attack transpires when the adversary deliberately attempts to re-identify individuals in the data set. This may be a deliberate decision by the leadership of the data recipient (e.g., the QI decides to re-identify individuals in order to link to another data set) or by a rogue employee associated with the data recipient. The probability that this type of attack will be successful can be computed as follows:

Pr(re-id, attempt) = Pr(attempt) × Pr(re-id | attempt)     (1)

where the term Pr(attempt) captures the probability that a deliberate attempt to re-identify the data will be made by the data recipient. The actual value for Pr(attempt) will depend on the security and privacy controls that the data recipient has in place and the contractual controls that are being imposed as part of the data sharing agreement. The second term, Pr(re-id | attempt), corresponds to the probability that the attack will be successful in the event that the recipient has chosen to commit the attack. This conditional can be measured from the actual data.

An inadvertent attack transpires when a data analyst working with the QI (or the QI himself/herself) inadvertently re-identifies someone in the data set. For instance, this could occur when the recipient is already aware of the identity of someone in the data set, such as a friend, a relative, or, more generally, an acquaintance. The probability of successful re-identification in this situation can be computed as follows:

Pr(re-id, acquaintance) = Pr(acquaintance) × Pr(re-id | acquaintance)     (2)

There are defensible ways to compute Pr(acquaintance) (El Emam, 2013), which evaluates the probability of an analyst's knowing someone in the data set. For example, if the trial is of a breast cancer treatment, then Pr(acquaintance) is the probability of the analyst's knowing someone who has breast cancer. The value for Pr(re-id | acquaintance) needs to be computed from the data. Box B-4 considers the question of whether it is always necessary to be concerned about the risk of inadvertent re-identification.

BOX B-4. Is It Always Necessary to Be Concerned About the Risk of Inadvertent Re-identification?

The third type of attack occurs if there is a data breach at the QI's facility. The probability of this type of attack being successful is

Pr(re-id, breach) = Pr(breach) × Pr(re-id | breach)     (3)

where the term Pr(breach) captures the probability that a breach will occur. What should Pr(breach) be? Publicly available data about the probability of a breach can be used to determine this value, while the conditional, Pr(re-id | breach), is computed from the data set itself. Data for 2010 show that 19 percent of health care organizations suffered a data breach within the previous year (HIMSS Analytics, 2010); data for 2012 show that this number rose to 27 percent (HIMSS Analytics, 2012). These organizations were all following the HIPAA Security Rule. Note that these figures are averages and may be adjusted to account for variation.

For a nonpublic data release, then, there are three types of attacks for which the re-identification risk needs to be measured and managed. The risk metrics are summarized in Table B-6 . The overall probability of re-identification will then be the largest value among the three equations.
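
A minimal Python sketch combining equations (1) through (3) follows. All of the input probabilities are hypothetical placeholders; in practice they would come from assessing the recipient's controls, the acquaintance calculation, and published breach statistics (the 0.27 below echoes the 2012 HIMSS figure cited above).

pr_attempt, pr_reid_given_attempt = 0.05, 0.30   # deliberate attack, equation (1)
pr_acquaintance, pr_reid_given_acq = 0.10, 0.20  # inadvertent attack, equation (2)
pr_breach, pr_reid_given_breach = 0.27, 0.30     # breach, equation (3)

overall_risk = max(
    pr_attempt * pr_reid_given_attempt,      # Pr(re-id, attempt)
    pr_acquaintance * pr_reid_given_acq,     # Pr(re-id, acquaintance)
    pr_breach * pr_reid_given_breach,        # Pr(re-id, breach)
)
print(f"overall Pr(re-identification) = {overall_risk:.3f}")  # 0.081 with these inputs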

TABLE B-6. Data Risk Metrics.


Setting Thresholds: What Is Acceptable Risk

There are quite a few precedents for what can be considered an acceptable amount of risk. These precedents have been in use for many decades, are consistent internationally, and have persisted over time as well ( El Emam, 2013 ). It should be noted, however, that the precedents set to date have been for assessments of maximum risk.

In commentary about the de-identification standard in the HIPAA Privacy Rule, HHS notes in the Federal Register ( HHS, 2000 ) that

the two main sources of disclosure risk for de-identified records about individuals are the existence of records with very unique characteristics (e.g., unusual occupation or very high salary or age) and the existence of external sources of records with matching data elements which can be used to link with the de-identified information and identify individuals (e.g., voter registration records or driver's license records) … an expert disclosure analysis would also consider the probability that an individual who is the target of an attempt at re-identification is represented on both files, the probability that the matching variables are recorded identically on the two types of records, the probability that the target individual is unique in the population for the matching variables, and the degree of confidence that a match would correctly identify a unique person.

It is clear that HHS considers unique records to have a high risk of re-identification, but such statements also suggest that non-unique records have an acceptably low risk of re-identification.

Yet uniqueness is not a universal threshold. Historically, data custodians (particularly government agencies focused on reporting statistics) have used the “minimum cell size” rule as a threshold for deciding whether to de-identify data ( Alexander and Jabine, 1978 ; Cancer Care Ontario, 2005 ; Health Quality Council, 2004a , b ; HHS, 2000 ; Manitoba Center for Health Policy, 2002 ; Office of the Information and Privacy Commissioner of British Columbia, 1998 ; Office of the Information and Privacy Commissioner of Ontario, 1994 ; OMB, 1994 ; Ontario Ministry of Health and Long-Term Care, 1984 ; Statistics Canada, 2007 ). This rule was originally applied to counting data in tables (e.g., number of males aged 30-35 living in a certain geographic region). The most common minimum cell size in practice is 5, which implies that the maximum probability of re-identifying a record is 1/5, or 0.2. Some custodians, such as certain public health offices, use a smaller minimum count, such as 3 ( CDC and HRSA, 2004 ; de Waal and Willenborg, 1996 ; NRC, 1993 ; Office of the Privacy Commissioner of Quebec, 1997 ; U.S. Department of Education, 2003 ). Others, by contrast, use a larger minimum, such as 11 (in the United States) ( Baier et al., 2012 ; CMS, 2008 , 2011 ; Erdem and Prada, 2011 ; HHS, 2008a ) and 20 (in Canada) ( El Emam et al., 2011b , 2012 ). Based on our review of the literature and the practices of various statistical agencies, the largest minimum cell size is 25 ( El Emam et al., 2011b ). It should be recognized, however, that there is no agreed-upon threshold, even for what many people would agree is highly sensitive data. For example, minimal counts of 3 and 5 were recommended for HIV/AIDS data ( CDC and HRSA, 2004 ) and abortion data ( Statistics Canada, 2007 ), respectively. Public data releases have used different cell sizes in different jurisdictions. The variability is due, in part, to different tolerances for risk, the sensitivity of data, whether a data sharing agreement is in place, and the nature of the data recipient.

A minimum cell size criterion amounts to a maximum risk value. Yet in some cases, this is too stringent a standard or may not be an appropriate reflection of the type of attack. In such a case, one can use the average risk, as discussed in the previous section. This makes the review of cell size thresholds suitable for both types of risk metrics.

It is possible to construct a decision framework based on these precedents with five “bins” representing five possible thresholds, as shown in Figure B-4 . At one extreme is data that would be considered identifiable when the cell size is smaller than 3. Next to that are data that are de-identified with a minimal cell size of 3. Given that this is the least de-identified data set, one could choose to disclose such data sets only to trusted entities where the risks are minimal (for example, where a data sharing agreement is in place and the data recipient has good security and privacy practices). At the other end of the spectrum is the minimal cell size of 20. This high level of de-identification is appropriate when the data are publicly released, with no restrictions on or tracking of what is done with the data and who has accessed them.

FIGURE B-4. Commonly used risk thresholds based on the review and references in the text.

If the extreme situations cannot be justified in a particular disclosure, an alternative process is needed for choosing one of the intermediate values. In Figure B-4 , this is a choice between a value of 5 and a value of 20.

The above framework does not preclude the use of other values (for example, a sponsor may choose to use a threshold value of 25 observations per cell). However, the framework grounds the choices in precedents from actual data sets.
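
To make the decision framework concrete, the following minimal Python sketch maps a disclosure context to a cell-size threshold using only the precedent values discussed above (3, 5, and 20); the two context flags are a hypothetical simplification of the dimensions in Figure B-4.

def minimum_cell_size(public_release: bool, strong_controls: bool) -> int:
    if public_release:
        return 20  # public release: no agreement, no tracking, most stringent bin
    if strong_controls:
        return 3   # trusted recipient with an agreement and good practices
    return 5       # intermediate case

threshold = minimum_cell_size(public_release=False, strong_controls=False)
print(f"minimum cell size {threshold} -> maximum Pr(re-id) {1 / threshold:.2f}")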

What Is the Likelihood of Re-identifying Clinical Trial Data Sets

There has been concern in the health care and privacy communities that the risk of re-identification in data is quite high and that de-identification is not possible (Ohm, 2010). This argument is often supported by examples of a number of publicly known re-identification attacks. A systematic review of publicly known re-identification attacks found, however, that when appropriate de-identification standards are used, the risk of re-identification is indeed very small (El Emam et al., 2011a). It was only when no de-identification at all was performed on the data, or the de-identification applied was not consistent with or based on best practices, that data sets were re-identified with a high success rate. Therefore, the evidence that exists today suggests that using current standards and best practices does provide reasonably strong protection against re-identification.

HOW TO MANAGE RE-IDENTIFICATION RISK

Managing re-identification risk means (1) selecting an appropriate risk metric, (2) selecting an appropriate threshold, and (3) measuring the risk in the actual clinical trial data set that will be disclosed. The choice of a metric is a function of whether the clinical trial data set will be released publicly. For public data sets, it is prudent to use maximum risk in measuring risk and setting thresholds. For nonpublic data sets, a strong case can be made for using average risk ( El Emam, 2013 ; El Emam and Arbuckle, 2013 ).

How to Choose an Acceptable Threshold

Selecting an acceptable threshold within the range described earlier requires an examination of the context of the data disclosure. The re-identification risk threshold is determined based on factors characterizing the QI and the data themselves (El Emam, 2010). These factors have been suggested and used informally by data custodians for at least the last decade and a half (Jabine, 1993a,b). They cover three dimensions (El Emam et al., 2010), as illustrated in Figure B-5:

FIGURE B-5. Factors to consider when deciding on an acceptable level of re-identification risk. SOURCE: Reprinted with permission from El Emam and colleagues, 2014.

  • Mitigating controls. This is the set of security and privacy practices that the QI has in place. A recent review identifies a collection of practices used by large data custodians and recommended by funding agencies and IRBs for managing sensitive health information (El Emam et al., 2009).
  • Invasion of privacy. This entails evaluating the extent to which a particular disclosure would be an invasion of the participants' privacy (a checklist is available in El Emam et al. [2009]). There are three considerations: (1) the sensitivity of the data (the greater the sensitivity, the greater the invasion of privacy), (2) the potential injury to patients from an inappropriate disclosure (the greater the potential for injury, the greater the invasion of privacy), and (3) the appropriateness of consent for disclosing the data (the less appropriate the consent, the greater the invasion of privacy) (see Box B-5).
  • Motives and capacity. This dimension encompasses the motives and the capacity of the QI to re-identify the data, considering such issues as conflicts of interest, the potential for financial gain from re-identification, and whether the data recipient has the skills and financial capacity to re-identify the data (a checklist is available in El Emam et al. [2009]).

BOX B-5. Consent and De-identification. As noted earlier, there is no legislative or regulatory requirement to obtain consent from participants to share their de-identified data.

In general, many of these elements can be managed through contracts (e.g., a prohibition on re-identification, restrictions on linking the data with other data sets, and disallowing the sharing of the data with other third parties). For example, if the mitigating controls are low, which means that the QI has poor security and privacy practices, the re-identification threshold should be set at a lower level. This will result in more de-identification being applied. However, if the QI has very good security and privacy practices in place, the threshold can be set higher. Checklists for evaluating these dimensions, as well as a scoring scheme, are available ( El Emam, 2013 ).

If the sponsor is disclosing the data through an online portal, it has control of many, but not all, of the mitigating controls. This provides additional assurance that a certain subset of controls will be implemented to the sponsor's satisfaction.

Once a threshold has been determined, the actual probability of re-identification is measured in the data set. If the probability is higher than the threshold, transformations of the data need to be performed. Otherwise, the data can be declared to have a very small risk of re-identification.

The implication here is that the amount of data transformation needed will be a function of these other contextual factors. For example, if the QI has good security and privacy practices in place, the threshold chosen will be higher, which means that the data will be subjected to less de-identification.

The security and privacy practices of the QI can be shaped through contracts. The contract signed by the QI can impose a certain list of practices that must be in place, and these practices are the basis for determining the threshold. The QI must therefore actually have them in place to justify the level of transformation performed on the data.

This approach is consistent with the limited data set (LDS) method for sharing data under HIPAA. However, this method does not ensure that the risk of re-identification is very small, and therefore the data will still be considered personal health information.

For public data releases, there are no contracts and no expectation that any mitigating controls will be in place. In that case, the lowest probability thresholds (or highest cell size thresholds) are used.

Methods for Transforming the Data

There are a number of ways to transform a data set to reduce the probability of re-identification to a value below the threshold. Many algorithms for this purpose have been proposed by the computer science and statistics communities, and they vary in quality and performance. Ideally, algorithms adopted for clinical trial data sets should minimize the modifications to the data while ensuring that the measured probability is below the threshold.

Four general classes of techniques have worked well in practice:

  • Generalization. This is when the value of a field is modified to a more general value. For example, a date of birth can be generalized to a month and year of birth.
  • Suppression. This is when specific values are removed from the data set (i.e., induced missingness). For example, a value in a record that makes it an outlier may be suppressed.
  • Randomization. This denotes adding noise to a field. The noise can come from a uniform or other type of distribution. For example, a date may be shifted a week forward or backward.
  • Subsampling. This denotes disclosing a random subset of the data to the QI rather than the full data set.

In practice, a combination of these techniques is applied for any given data disclosure. Furthermore, these techniques can be customized to specific field types. For example, generalization and suppression can be applied differently to dates and zip codes to maximize the data quality for each ( El Emam and Arbuckle, 2013 ).

The application of these techniques can reduce the risk of re-identification. For example, consider the average risk in Table B-3 , which is 0.59. There is a reduction in average risk to 0.33 when the year of birth is generalized to decades in Table B-4 . By suppressing some records, it was possible to further reduce the average risk to 0.22 in Table B-5 . Each transformation progressively reduces the risk.
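
The following minimal Python sketch chains two of these techniques, generalizing year of birth to decade and then suppressing the quasi-identifier values of any record that remains unique. The records are illustrative, and a production algorithm would optimize which cells to generalize or suppress rather than applying this simple rule.

from collections import Counter

records = [
    {"sex": "M", "yob": 1967}, {"sex": "M", "yob": 1962},
    {"sex": "M", "yob": 1969}, {"sex": "F", "yob": 1971},
]

def generalize(record):
    # Generalization: reduce year of birth to decade of birth.
    decade = (record["yob"] // 10) * 10
    return {"sex": record["sex"], "yob": f"{decade}s"}

generalized = [generalize(r) for r in records]
sizes = Counter((r["sex"], r["yob"]) for r in generalized)
for r in generalized:
    if sizes[(r["sex"], r["yob"])] == 1:
        r["sex"] = r["yob"] = None  # suppression: induced missingness for uniques
print(generalized)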

The Use of Identifier Lists

Thus far we have covered enough topics to begin a critical appraisal of some commonly used de-identification methods and the extent to which they can ensure that the risk of re-identification is very small. We focus on the use of identifier lists because this approach is quite common and is being adopted to de-identify clinical trial data.

The HIPAA Privacy Rule's Safe Harbor Standard

We first consider the variable list in the HIPAA Privacy Rule Safe Harbor method.

The Safe Harbor list contains a number of direct identifiers and two quasi-identifiers (i.e., dates and zip codes), as summarized earlier in Box B-2. It should be evident that in applying a fixed list of variables, there is no assurance that all of the quasi-identifiers have been accounted for in the risk measurement and the transformation of the data set. For example, other quasi-identifiers, such as race, ethnicity, and occupation, may be in the data set, but they will be ignored. Even if the probability of re-identification under Safe Harbor is small (Benitez and Malin, 2010), this low probability may not carry over to data sets with more quasi-identifiers than the two on the original list.

The empirical analysis that was conducted before the Safe Harbor standard was issued assumed that the data set is a random sample from the U.S. population. This assumption may have variable validity in real data sets. However, there will be cases when it is definitely not true. For example, consider a data set that consists of only the records in Table B-1 . Now, assume that an adversary can find out who is in the data set. This can happen if the data set covers a well-defined population. If the trial site is known, it can be reasonably assumed that the participants in the trial who received treatment at that site live in the same geographic region. If the adversary knows that Bob was born in 1965, lives in the town in which the site is situated, and was in the trial, the adversary knows that Bob is in the data set, and therefore the 27th record must be Bob. This re-identification occurs even though this table meets the requirements of the Safe Harbor standard. Members of a data set may be known if their inclusion in the trial is revealing (e.g., a trial in a workplace where participants have to wear a visible device, parents who must consent to have their teenage children participate in a study, or adolescents who must miss a few days of school to participate in a study). Therefore, this standard can be protective only if the adversary cannot know who is in the data set. This will be the case if the data set is a random sample from the population.

If these assumptions are met, the applicability of Safe Harbor to a clinical trial data set will be defensible, but only if there are no international participants. If a clinical trial data set includes participants from sites outside the United States, the analysis that justifies using this standard will not be applicable. For example, there is a difference of two orders of magnitude between the median number of individuals living in U.S. zip codes and in Canadian postal codes. Therefore, translating the zip code truncation logic in Safe Harbor to Canadian postal codes would not be based on defensible evidence.

Safe Harbor also has some weaknesses that are specific to the two quasi-identifiers that are included.

In some instances, there may be dates in a clinical trial data set that are not really quasi-identifiers because they do not pass the test highlighted earlier. For example, consider an implantable medical device that fires, and each time it does so there is a time and date stamp in the data stream. The date of a device's firing is unlikely to be a quasi-identifier because it is not knowable, but it is a date.

Safe Harbor states that all three-digit zip codes with fewer than 20,000 inhabitants from the 2010 census must be replaced with “000”; otherwise the three-digit zip code may be included in the data set. The locations of three-digit zip codes with fewer than 20,000 inhabitants are shown in Figure B-6 . However, in some states there is only one zip code with fewer than 20,000 inhabitants. For example, if a data set is disclosed with “000” for the residential three-digit zip code for participants in a site in New Hampshire (and it is known that the site is in that state), it is reasonable to assume that the participants also live in that state and to infer that their true three-digit zip code is 036. The same conclusion can be drawn about “000” three-digit zip codes in states such as Alabama, Minnesota, Nebraska, and Nevada.

FIGURE B-6. Inhabited three-digit zip codes with fewer than 20,000 inhabitants from the 2010 U.S. census.
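
A minimal Python sketch of this Safe Harbor zip code rule follows. The population counts are hypothetical stand-ins; a real implementation would look prefixes up in the actual 2010 census data.

ZIP3_POPULATION = {"036": 15000, "102": 650000, "203": 8000}  # hypothetical counts

def safe_harbor_zip3(zip_code: str) -> str:
    prefix = zip_code[:3]
    if ZIP3_POPULATION.get(prefix, 0) < 20000:
        return "000"  # suppress sparsely populated three-digit areas
    return prefix     # otherwise the three-digit prefix may be disclosed

for z in ["03609", "10282", "20301"]:
    print(z, "->", safe_harbor_zip3(z))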

Other Examples of Identifier Lists

More recent attempts at developing a fixed list of quasi-identifiers to de-identify clinical trial data have indicated that including any combination of two quasi-identifiers (from the prespecified list) is acceptable ( Hrynaszkiewicz et al., 2010 ). Data sets with more than two quasi-identifiers need to go through a more thorough evaluation, such as the risk management approach described earlier. However, this approach suffers from the same limitations as the Safe Harbor standard with respect to the assumption of two quasi-identifiers always having acceptably small risk. An additional limitation is that the authors of the list in Hrynaszkiewicz et al. (2010) present no empirical evaluation demonstrating that this approach consistently produces data sets with a low risk of re-identification, whereas at least the Safe Harbor list is based on empirical analysis performed by the Census Bureau.

More important, a number of de-identification standards proposed by sponsors have followed similar approaches for sharing clinical trial data from participants globally (see the standards at ClinicalStudyDataRequest.com ). Ideally, methods that can provide stronger assurances should be used to de-identify such data.

Putting It All Together

Now that we have gone through the key elements of the de-identification process, we can put them together into a comprehensive data flow, illustrated in Figure B-7. The steps in this process are as follows.

FIGURE B-7. The overall de-identification process. SOURCE: Reprinted with permission from El Emam and colleagues, 2014.

Step 1: Determine direct identifiers in the data set

Determine which fields in the data set are direct identifiers. If the clinical trial data set has already been stripped of direct identifiers, this step may not be necessary.

Step 2: Mask (transform) direct identifiers

Once the direct identifiers have been determined, masking techniques must be applied to those direct identifiers. Masking techniques include the following: (1) removal of the direct identifiers, (2) replacement of the direct identifiers with random values, or (3) replacement of the direct identifiers with pseudonyms. Once masking has been completed there is virtually no risk of re-identification from direct identifiers. If the database has already been stripped of direct identifiers, this step may not be necessary.

Step 3: Perform threat modeling

Threat modeling consists of two activities: (1) identification of the plausible adversaries and what information they may be able to access, and (2) determination of the quasi-identifiers in the data set.

Step 4: Determine minimal acceptable data utility

It is important to determine in advance the minimal relevant data based on the quasi-identifiers. This is essentially an examination of what fields are considered most appropriate given the purpose of the use or disclosure. This step concludes with the imposition of practical limits on how some data may be de-identified and the analyses that may need to be performed later on.

Step 5: Determine the re-identification risk threshold

This step entails determining what constitutes acceptable risk. As an outcome of the process used to define the threshold, the mitigating controls that need to be imposed on the QI, if any, become evident.

Step 6: Import (sample) data from the source database

Importing data from the source database may be a simple or complex exercise, depending on the data model of the source data set. This step is included explicitly in the process because it can consume significant resources and must be accounted for in any planning for de-identification.

Step 7: Evaluate the actual re-identification risk

The actual risk is computed from the data set using the appropriate metric (maximum or strict average). To compute risk, a number of parameters need to be set, such as the sampling fraction.

Step 8: Compare the actual risk with the threshold

This step entails comparing the actual risk with the threshold determined in Step 5.

Step 9: Set parameters and apply data transformations

If the measured risk is higher than the threshold, anonymization methods, such as generalization, suppression, randomization, and subsampling, are applied to the data. Sometimes a solution cannot be found within the specified parameters, and it is necessary to go back and reset the parameters. It may also be necessary to modify the threshold and adjust some of the assumptions behind the original risk assessment. Alternatively, some of the assumptions about acceptable data utility may need to be renegotiated with the data users.

Step 10: Perform diagnostics on the solution

If the measured risk is lower than the threshold, diagnostics should be performed on the solution. Diagnostics may be objective or subjective. An objective diagnostic evaluates the sensitivity of the solution to violations of the assumptions that were made. For example, if an assumption was that an adversary might know the diagnosis code of a patient, or if there is uncertainty about the sampling fraction of the data set, a sensitivity analysis on those assumptions can be performed. A subjective diagnostic determines whether the utility of the data is sufficiently high for the intended purposes of the use or disclosure.

If the diagnostics are satisfactory, the de-identified data are exported, and a report documenting the de-identification is produced. On the other hand, if the diagnostics are not satisfactory, the re-identification parameters may need to be modified; the risk threshold adjusted; and the original assumptions about minimal, acceptable utility renegotiated with the data user.

Step 11: Export transformed data to external data set

Exporting the de-identified data to the destination database may be a simple or complex exercise, depending on the data model of the destination database. This step is included explicitly in the process because it can consume significant resources and must be accounted for in any planning for de-identification.

ASSESSING THE IMPACT OF DE-IDENTIFICATION ON DATA QUALITY

As noted above, Safe Harbor and similar methods that significantly restrict the precision of the fields that can be disclosed can result in a nontrivial reduction in the quality of de-identified data. Therefore, in this section, we focus on data quality when statistical methods are used to de-identify data.

The evidence on the impact of de-identification on data utility is mixed. Some studies show little impact (Kennickell and Lane, 2006), while others show significant impact (Purdam and Elliot, 2007). There is also evidence that data utility will depend on the type of analysis performed (Cox and Kim, 2006; Lechner and Pohlmeier, 2004). In general, if de-identification is accomplished using precise risk measurement and strong optimization algorithms to transform the data, data quality should remain high.

Ensuring that the analysis results produced after de-identification are similar to the results that would be obtained on the original data sets is critical. It would be problematic if a QI attempted to replicate the results from a published trial and were unable to do so because of extensive distortion caused by the de-identification that was applied. Therefore, the amount of distortion must be minimized.

However, de-identification always introduces some distortion, and there is a trade-off between data quality and the amount of de-identification performed to protect privacy. This trade-off can be represented as a curve between data utility and privacy protection, as illustrated in Figure B-8.

FIGURE B-8. The trade-off between privacy and data utility.

Consider, then, that there is a minimal amount of data utility that would be tolerable to ensure that the results of the original trial can be replicated to a large extent. On the other hand, there is a re-identification probability threshold that cannot be exceeded. As shown in Figure B-8, this leaves a small range of possible solutions. To ensure that the de-identification solution is truly within this narrow operating range, it is necessary to perform a pilot evaluation on one or more representative clinical trial data sets and compare the before and after analysis results using exactly the same analytic techniques.
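
The following minimal Python sketch illustrates such a before-and-after comparison on a single statistic. The ages, the banding width, and the midpoint re-coding are all illustrative; a real pilot would rerun the trial's actual analyses on the de-identified data.

ages = [34, 41, 47, 52, 58, 63]

def band_midpoint(age, width=10):
    # Analyze generalized (banded) ages by the midpoints of their bands.
    return (age // width) * width + width / 2

original_mean = sum(ages) / len(ages)
deidentified_mean = sum(band_midpoint(a) for a in ages) / len(ages)
print(f"original mean {original_mean:.1f}, de-identified mean {deidentified_mean:.1f}")
print(f"absolute distortion {abs(original_mean - deidentified_mean):.1f} years")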

Obtaining similar results for a de-identified clinical trial data set that is intended for public release will be more challenging than for a data set disclosed to a QI with strong mitigating controls, because more de-identification is required in the former case. This may limit a sponsor's ability to disclose data publicly, or a strong replicability caveat may have to accompany the public data set. For a nonpublic data set where the QI is known, the sponsor may impose a minimal set of mitigating controls through a contract or by providing the data through an online portal to ensure that the de-identification applied to the data set is not excessive.

Governance is necessary for the sponsor to manage the risks of disclosing clinical trial data, and it requires that a set of additional practices be in place. High-maturity sponsors will have a robust governance process in place.

Governance Practices

Some governance practices are somewhat obvious, such as the need to track all data releases; trigger alerts for data use expirations; and ensure that the documentation for the de-identification for each data release has, in fact, been completed. Other practices are necessary to ensure that participant privacy is adequately protected in practice. Elements of governance practices are listed in Box B-6 .

BOX B-6. Elements of Governance Practices. These include developing and maintaining global anonymization documentation, and processes and tools for tracking all data releases.

Controlled Re-identification

The U.K. ICO has recommended that organizations that disclose data also perform controlled re-identification attacks on their disclosed data sets (ICO, 2012). Doing so allows them to obtain independent evidence on how well their de-identification practices are working and to determine whether there are potential weaknesses they need to address.

Controlled re-identification attacks are commissioned by the sponsor. With limited funding, these attacks often use publicly available information to attack databases. If additional funding is available, those who conduct these attacks can purchase and use commercial databases to re-identify data subjects.

Appropriate Contracts

Additional governance elements become particularly important when a sponsor discloses data to a QI under a contract. This contract will document the mitigating controls as part of the conditions for receiving the data. The sponsor should then have an audit regime in place to ensure that QIs have indeed put these practices in place. The sponsor may select high-risk QIs for audit, select randomly, or combine the two. Another approach is to ask QIs to conduct third-party audits and report the results back to the sponsor on a regular basis for as long as they are using the data set. The purpose of the audit is to ensure that the mitigating controls are indeed in place.

Enterprise De-identification Process

At an enterprise level, sponsors need an enterprise de-identification process that is applied across all clinical trial data sets. This process includes the appropriate thresholds and controls for data releases, as well as templates for data sharing agreements and terms of use of data. The global process ensures consistency across all data releases. It must then be enacted for each clinical trial data set, which may involve some customization to address the specific characteristics of a given data set.

The cost of such a process will depend on the size of the sponsor and the heterogeneity of its clinical trials and therapeutic areas. However, in the long term such an approach can be expected to have a lower total cost because there will be more opportunities for reuse and learning.

In practice, many sponsors have standard case report forms (CRFs) for a subset of the data they collect in their clinical trials. For example, there may be standard CRFs for demographics or for standardized measures and patient-reported outcomes. The global process can classify the variables in these standard CRFs as direct and quasi-identifiers and articulate the techniques that should be used to transform those variables. This will reduce the anonymization effort per clinical trial by a nontrivial amount.

Protecting Against Attribute Disclosure

At the beginning of this paper, we briefly mentioned attribute disclosure, but did not address how to protect against it. Such protections can be implemented as part of governance. However, in general, modifying the data to protect against attribute disclosure means reducing the plausible inferences that can be drawn from the data. This can be detrimental to the objective of learning as much as possible from the data and building generalizable statistical models from the data. Furthermore, to protect against attribute disclosure, one must anticipate all inferences and make data modifications to impede them, which may not be possible.

Some inferences may be desirable because they may enhance understanding of the treatment benefits or safety of a new drug or device, and some inferences will be stigmatizing to the data subjects. One will not want to make modifications to the data that block the former type of inferences.

For nonpublic data releases, it is recommended that there be an ethics review of the analysis protocols. As part of the ethics review process, the ethics committee or council will examine the potential for stigmatizing attribute disclosure. This is a subjective decision and will have to take into account current social norms and participant expectations (see also the discussion in El Emam and Arbuckle [2013]). The ethics review may be performed on the secondary analysis protocol by the QI's institutional IRB, or by a separate committee reporting to the sponsor or even within the sponsor. Such an approach will maximize data integrity but also provide assurance that attribute disclosure is addressed. An internal sponsor ethics review council might include, for example, a privacy professional, an ethicist, a lay person representing the participants, a person with knowledge of the sponsor's clinical trials business, and a brand or public relations person.

For public data releases, there is no analysis protocol or a priori approval process, and therefore it will be challenging to provide assurances about attribute disclosure.

De-identifying Genomic Data

There have been various proposals to apply the types of generalization and randomization strategies discussed in this paper to genomic data, and *omics data more generally (e.g., RNA expression or proteomic records) (Li et al., 2012; Lin et al., 2002, 2004; Malin, 2005). However, evidence suggests that such methods may not be suitable for the anonymization of biomarkers that span a large number of dimensions. The main reasons are that they can cause significant distortion of long sequences, and the assumptions needed to de-identify sequences of patient events (e.g., visits and claims) will not apply to *omic data. At the same time, there are nuances worth considering. For context, we address concerns around genomic data specifically, while noting that similar arguments apply to other types of data.

First, it is important to recognize that many of the attacks that have been carried out on genomic data require additional information ( Malin et al., 2011 ). In certain cases, for instance, the re-identification of genomic data is accomplished through the demographics of the corresponding research participant; the associated clinical information ( Loukides et al., 2010b ); or contextual cues associated with the collection and dissemination of the data, such as the set of health care providers visited by the participant ( Malin and Sweeney, 2004 ). For example, a recently reported re-identification attack on participants in the Personal Genome Project (PGP) was based almost entirely on information derived from publicly accessible profiles—notably birth date (or month and year), gender, and geographic indicators of residence (e.g., zip code) ( Sweeney et al., 2013 ). Other individuals in the PGP were re-identified based on the fact that they uploaded compressed files that incorporated their personal names as file names when uncompressed. This attack used the same type of variables that can be protected using the techniques described in this paper. Moreover, it has been shown that many of the protection strategies discussed in this paper can be tailored to support genome-phenome association discovery (e.g., through anonymization of standardized clinical codes [ Heatherly et al., 2013 ; Loukides et al., 2010a ]).

The same is true of attacks that factor genomic data into the attack. For instance, it was recently shown that an adversary could use publicly available databases that report on Y-chromosome–surname correlations to ascertain the surname of a genome sequence lacking an individual's name (Haggie, 2013). However, for this attack to be successful, it required additional information about the corresponding individual. Specifically, the attacker also needed to know the approximate area of residence (e.g., U.S. state) and approximate age of the individual. Although such information may be permitted within a Safe Harbor de-identification framework, a statistical assessment of its identifiability would indicate that such ancillary information might pose an unacceptably high re-identification risk. At the same time, it should be recognized that, even when such information was made available, the attack reported in Haggie (2013) was successful 12 percent of the time and unsuccessful 5 percent of the time. In other words, there is variability in the chance that such attacks will be successful.

More direct attacks are, however, plausible. There is evidence that a sequence of 30 to 80 independent single nucleotide polymorphisms (SNPs) could uniquely identify a single person (Lin et al., 2004). Unlike the surname inference attack mentioned above, a direct attack would require that the adversary already have identified genotype data for a target individual. Yet linking an individual through his or her genome would permit the adversary to learn any additional information in the new resource, such as the individual's health status. Additionally, a recent demonstration with data from openSNP and Facebook suggests that, in certain instances, the genomic status of an individual can be inferred from the genome sequences of close family members (Humbert et al., 2013).
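A back-of-the-envelope version of that estimate can be reproduced in a few lines. The sketch below illustrates the reasoning rather than the actual method of Lin et al. (2004): it treats each independent SNP genotype as a draw from a Hardy-Weinberg distribution and counts how many SNPs are needed before the combined genotype carries more information than is required to distinguish everyone on Earth (roughly 33 bits).

```python
# Back-of-the-envelope: under Hardy-Weinberg equilibrium with minor allele
# frequency p, genotype probabilities are (1-p)^2, 2p(1-p), p^2; each SNP
# contributes the entropy of that distribution.
import math

def genotype_entropy(p: float) -> float:
    probs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]
    return -sum(q * math.log2(q) for q in probs if q > 0)

world_bits = math.log2(8e9)  # ~33 bits to distinguish ~8 billion people
for maf in (0.05, 0.2, 0.5):
    per_snp = genotype_entropy(maf)
    print(f"MAF {maf:.2f}: {per_snp:.2f} bits/SNP -> "
          f"~{math.ceil(world_bits / per_snp)} SNPs needed")
```

Common variants (minor allele frequency near 0.5) carry about 1.5 bits each, so a couple dozen suffice, while rarer variants carry less information each and push the count toward 70, consistent with the 30-to-80 range cited above.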

Beyond direct matching of sequences, there is also a risk of privacy compromise in “pooled” data, where only summary statistics are reported. For instance, it has been shown that it is possible to determine whether an individual is in a pool of cases or controls for a study by assessing whether the individual's sequence is “closer” to one group or the other (Homer et al., 2008; Jacobs et al., 2009; Wang et al., 2009). Despite this vulnerability, it has also been shown that the likelihood of success for this attack decreases as the number of people in each group increases. In fact, for studies with a reasonable number of participants (more than 1,000), it is safe to reveal the summary statistics of all common (not rare) genomic regions (Sankararaman et al., 2009).
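The mechanics of this membership test can be illustrated with a toy model. The sketch below is loosely patterned on the Homer et al. (2008) idea, with all parameters invented for illustration: for each SNP, it asks whether the target's genotype sits closer to the case pool's allele frequencies than to a reference population's, and sums the differences across SNPs.

```python
# Toy membership-inference sketch: a pool member's genotype nudges the pooled
# allele frequencies toward itself, so summing per-SNP distance differences
# separates members from non-members. Real attacks use far more SNPs and a
# formal test statistic; this only shows the mechanics.
import random

random.seed(1)
n_snps, pool_size = 5_000, 100
ref_freqs = [random.uniform(0.1, 0.9) for _ in range(n_snps)]

def genome(freqs):
    # allele dosage scaled to [0, 1]: half the minor-allele count at each SNP
    return [sum(random.random() < f for _ in range(2)) / 2 for f in freqs]

pool = [genome(ref_freqs) for _ in range(pool_size)]
pool_freqs = [sum(g[j] for g in pool) / pool_size for j in range(n_snps)]

def membership_score(target):
    return sum(abs(t - r) - abs(t - p)
               for t, r, p in zip(target, ref_freqs, pool_freqs))

insider, outsider = pool[0], genome(ref_freqs)
print(f"insider score:  {membership_score(insider):+.1f}")   # clearly positive
print(f"outsider score: {membership_score(outsider):+.1f}")  # near zero
```

The sketch also shows why pool size matters: each member shifts the pooled frequencies by only 1/pool_size, so the per-SNP signal shrinks as the pool grows, in line with the Sankararaman et al. (2009) result above.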

One remaining challenge with genomic data, however, is that phenotypic information can sometimes be learned from it directly. When such information can be ascertained with certainty, it can then be used in a re-identification attack. For example, predictions (varying in accuracy) of height, facial morphology, age, body mass index, approximate skin pigmentation, eye color, and diagnosis of cystic fibrosis or Huntington's chorea from genetic information have been reported (Kayser and de Knijff, 2011; Kohn, 1991; Lowrance and Collins, 2007; Malin and Sweeney, 2000; Ou et al., 2012; Silventoinen et al., 2003; Wjst, 2010; Zubakov et al., 2010), although there have been no full demonstrations of re-identification attacks using such inferences. Moreover, because some of these predictions are error-prone (excluding Mendelian disorders that depend directly on a mutation in a specific portion of the genome), it is not clear that they would be sufficiently reliable for re-identification attacks.

Although traditional generalization and randomization strategies may not provide a sufficient balance between utility and privacy for high-dimensional *omics data, a solution may be possible with the assistance of modern cryptography. In particular, secure multiparty computation (SMC) refers to a set of techniques (and protocols) that allow quite sophisticated mathematical and statistical operations to be performed on encrypted data. In the process, individual records are never disclosed to the user of such a resource. This type of protection would not prevent inference through summary-level statistics, but it would prevent direct attacks on individuals' records. SMC solutions have been demonstrated that are tailored to support frequency queries (Kantarcioglu et al., 2008), genomic sequence alignment (Chen et al., 2012), kinship (and other comparison) tests (Baldi et al., 2011; He et al., 2014), and personalized medical risk scores (Ayday et al., 2013a,b). Nonetheless, the application of these methods to genetic data is still at an early stage of research, and it may be a few more years before large-scale practical results are seen.
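To give a flavor of the underlying idea, here is a minimal additive secret-sharing sketch, a textbook construction far simpler than the protocols cited above: each participant splits a private allele count into random shares held by different servers, each server sums only the shares it holds, and only the aggregate count is ever reconstructed.

```python
# Minimal additive secret sharing over a finite field. No single server ever
# sees a participant's value; only the sum of all servers' partial sums
# reveals the aggregate. Production SMC systems (as in the cited genomic
# work) combine primitives like this with much richer machinery.
import random

PRIME = 2**61 - 1  # field modulus

def share(value: int, n_parties: int = 3) -> list[int]:
    parts = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    parts.append((value - sum(parts)) % PRIME)
    return parts

# Five participants' private minor-allele counts (0, 1, or 2) at one SNP.
private_counts = [0, 2, 1, 1, 2]
all_shares = [share(v) for v in private_counts]

# Each of the 3 servers sums the shares it received; no server sees a count.
server_sums = [sum(s[i] for s in all_shares) % PRIME for i in range(3)]
total = sum(server_sums) % PRIME
print(f"reconstructed allele count total: {total}")  # 6, the plaintext sum
```

The cited systems for sequence alignment, kinship testing, and risk scoring are considerably more elaborate, but they rest on the same principle: computation proceeds on protected values, and only the agreed-upon output is revealed.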

References

  • Alexander L, Jabine T. Access to social security microdata files for research and statistical purposes. Social Security Bulletin. 1978;41(8):3–17. [PubMed: 715640]
  • Article 29 Data Protection Working Party. Opinion 4/2007 on the concept of personal data. WP136. 2007. [December 19, 2014]. http://ec.europa.eu/justice/policies/privacy/docs/wpdocs/2007/wp136_en.pdf
  • Article 29 Data Protection Working Party. Opinion 05/2014 on anonymization techniques. WP216. 2014. [December 19, 2014]. http://ec.europa.eu/justice/data-protection/article-29/documentation/opinionrecommendation/files/2014/wp216_en.pdf
  • Ayday E, Raisaro JL, Hubaux JP. Privacy-enhancing technologies for medical tests using genomic data; 20th Annual Network and Distributed System Security Symposium (NDSS); San Diego, CA. February 2013a.
  • Ayday E, Raisaro JL, Hubaux JP, Rougemont J. Protecting and evaluating genomic privacy in medical tests and personalized medicine; Proceedings of the 12th ACM Workshop on Privacy in the Electronic Society; 2013b. pp. 95–106.
  • Baier P, Hinkins S, Scheuren F. The electronic health records incentive program eligible professionals public use file. 2012. [December 19, 2014]. http://www.cms.gov/Regulations-and-Guidance/Legislation/EHRIncentivePrograms/DataAndReports.html
  • Baldi P, Baronio R, De Cristofaro E, Gasti P, Tsudik G. Countering GATTACA: Efficient and secure testing of fully-sequenced human genomes; Proceedings of the 18th ACM Conference on Computer and Communications Security; 2011. pp. 691–702.
  • Benitez K, Malin B. Evaluating re-identification risks with respect to the HIPAA Privacy Rule. Journal of the American Medical Informatics Association. 2010;17(2):169–177. [PMC free article: PMC3000773] [PubMed: 20190059]
  • Bhattacharjee Y. Pharma firms push for sharing of cancer trial data. Science. 2012;338(6103):29. [PubMed: 23042862]
  • Canadian Institute for Health Information. “Best practice” guidelines for managing the disclosure of de-identified health information. 2010. [December 19, 2014]. http://www.ijpc-se.org/documents/hhs10.pdf
  • Cancer Care Ontario. Cancer Care Ontario data use and disclosure policy. Toronto, ON: Cancer Care Ontario; 2005.
  • Castellani J. Are clinical trial data shared sufficiently today? Yes. British Medical Journal. 2013;347:f1881. [PubMed: 23838461]
  • CDC (Centers for Disease Control and Prevention) and HRSA (Health Resources and Services Administration). Integrated guidelines for developing epidemiologic profiles: HIV prevention and Ryan White CARE Act community planning. Atlanta, GA: CDC; 2004. [December 19, 2014]. http://www.cdph.ca.gov/programs/aids/Documents/GLines-IntegratedEpiProfiles.pdf
  • Center for Business and Information Technologies. Cajun Code Fest. 2013. [November 9, 2012]. http://cajuncodefest.org
  • Chen Y, Peng B, Wang X, Tang H. Large-scale privacy-preserving mapping of human genomic sequences on hybrid clouds; Proceedings of the 19th Network and Distributed System Security Symposium; San Diego, CA. February 2012.
  • CMS (Centers for Medicare & Medicaid Services). 2008 basic stand alone Medicare claims public use files. 2008. [December 19, 2014]. http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2008_BSA_PUF_Disclaimer.pdf
  • CMS. BSA inpatient claims PUF. 2011. [December 19, 2014]. http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Inpatient_Claims.html
  • Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20(1):37–46.
  • CORE (Center for Outcomes Research and Evaluation). The YODA Project. 2014. [December 19, 2014]. http://medicine.yale.edu/core/projects/yodap
  • Cox LH, Kim JJ. Effects of rounding on the quality and confidentiality of statistical data. In: Domingo-Ferrer J, Franconi L, editors. Privacy in statistical databases. New York: Springer Berlin Heidelberg; 2006. pp. 48–56.
  • Dankar F, Emam KE, Neisa A, Roffey T. Estimating the re-identification risk of clinical data sets. BMC Medical Informatics and Decision Making. 2012;12(1):66. [PMC free article: PMC3583146] [PubMed: 22776564]
  • de Waal A, Willenborg L. A view on statistical disclosure control for microdata. Survey Methodology. 1996;22(1):95–103.
  • Dryad. Dryad digital repository. undated. [September 19, 2013]. http://datadryad.org
  • El Emam K. Risk-based deidentification of health data. IEEE Security and Privacy. 2010;8(3):64–67.
  • El Emam K. Guide to the deidentification of personal health information. Boca Raton, FL: CRC Press (Auerbach Publications); 2013.
  • El Emam K, Arbuckle L. Anonymizing health data: Case studies and methods to get you started. Sebastopol, CA: O'Reilly Media; 2013.
  • El Emam K, Dankar F, Vaillancourt R, Roffey T, Lysyk M. Evaluating patient re-identification risk from hospital prescription records. Canadian Journal of Hospital Pharmacy. 2009;62(4):307–319. [PMC free article: PMC2826964] [PubMed: 22478909]
  • El Emam K, Brown A, AbdelMalik P, Neisa A, Walker M, Bottomley J, Roffey T. A method for managing re-identification risk from small geographic areas in Canada. BMC Medical Informatics and Decision Making. 2010;10(1):18. [PMC free article: PMC2858714] [PubMed: 20361870]
  • El Emam K, Jonker E, Arbuckle L, Malin B. A systematic review of re-identification attacks on health data. PLoS ONE. 2011a;6(12):e28071. [PMC free article: PMC3229505] [PubMed: 22164229]
  • El Emam K, Paton D, Dankar F, Koru G. Deidentifying a public use microdata file from the Canadian national discharge abstract database. BMC Medical Informatics and Decision Making. 2011b;11:53. [PMC free article: PMC3179438] [PubMed: 21861894]
  • El Emam K, Arbuckle L, Koru G, Eze B, Gaudette L, Neri E, Rose S, Howard J, Gluck J. Deidentification methods for open health data: The case of the Heritage Health Prize claims dataset. Journal of Medical Internet Research. 2012;14(1):e33. [PMC free article: PMC3374547] [PubMed: 22370452]
  • El Emam K, Middleton G, Arbuckle L. An implementation guide for data anonymization. Bloomington, IN: Trafford Publishing; 2014.
  • EMA (European Medicines Agency). European Medicines Agency policy on publication of clinical data for medicinal products for human use. 2014a. [December 19, 2014]. http://www.ema.europa.eu/docs/en_GB/document_library/Other/2014/10/WC500174796.pdf
  • EMA. Release of data from clinical trials. 2014b. [December 19, 2014]. http://www.ema.europa.eu/ema/index.jsp?curl=pages/special_topics/general/general_content_000555.jsp&mid=WC0b01ac0580607bfa
  • Erdem E, Prada SI. Creation of public use files: Lessons learned from the comparative effectiveness research public use files data pilot project. 2011. [November 9, 2012]. http://mpra.ub.uni-muenchen.de/35478
  • Fung BCM, Wang K, Chen R, Yu PS. Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys. 2010;42(4):1–53.
  • Gøtzsche PC. Why we need easy access to all data from all clinical trials and how to accomplish it. Trials. 2011;12(1):249. [PMC free article: PMC3264537] [PubMed: 22112900]
  • Gymrek M, McGuire AL, Golan D, Halperin E, Erlich Y. Identifying personal genomes by surname inference. Science. 2013;339(6117):321–324. [PubMed: 23329047]
  • Haggie E. PLoS Genetics partners with Dryad. 2013. [September 19, 2013]. http://blogs.plos.org/biologue/2013/09/18/plos-genetics-partners-with-dryad
  • Harrison C. GlaxoSmithKline opens the door on clinical data sharing. Nature Reviews Drug Discovery. 2012;11(12):891–892. [PubMed: 23197021]
  • He D, Furlotte NA, Hormozdiari F, Joo JWJ, Wadia A, Ostrovsky R, Sahai A, Eskin E. Identifying genetic relatives without compromising privacy. Genome Research. 2014;24(4):664–672. [PMC free article: PMC3975065] [PubMed: 24614977]
  • Health Quality Council. Privacy code. Saskatoon, Canada: Health Quality Council; 2004a.
  • Health Quality Council. Security and confidentiality policies and procedures. Saskatoon, Canada: Health Quality Council; 2004b.
  • Health Research Authority. The HRA interest in good research conduct: Transparent research. 2013. [December 19, 2014]. http://www.hra.nhs.uk/documents/2013/08/transparent-research-report.pdf
  • Heatherly RD, Loukides G, Denny JC, Haines JL, Roden DM, Malin BA. Enabling genomic-phenomic association discovery without sacrificing anonymity. PLoS ONE. 2013;8(2):e53875. [PMC free article: PMC3566194] [PubMed: 23405076]
  • Hede K. Project Data Sphere to make cancer clinical trial data publicly available. Journal of the National Cancer Institute. 2013;105(16):1159–1160. [PubMed: 23904505]
  • HHS (U.S. Department of Health and Human Services). Standards for privacy of individually identifiable health information. Washington, DC: HHS; 2000. [March 31, 2015]. http://aspe.hhs.gov/admnsimp/final/PVCFR05.txt
  • HHS. Guidance on research involving coded private information or biological specimens. Washington, DC: HHS; 2004.
  • HHS. Instructions for completing the limited data set data use agreement (DUA) (CMS-R-0235L). 2008a. [December 19, 2014]. http://innovation.cms.gov/Files/x/Bundled-Payments-for-Care-Improvement-Data-Use-Agreement.pdf
  • HHS. OHRP: Guidance on research involving coded private information or biological specimens. Washington, DC: HHS; 2008b.
  • HHS. Guidance regarding methods for deidentification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Washington, DC: HHS; 2012.
  • HIMSS Analytics. 2010 HIMSS Analytics report: Security of patient data. Chicago, IL: HIMSS Analytics; 2010.
  • HIMSS Analytics. 2012 HIMSS Analytics report: Security of patient data. Chicago, IL: HIMSS Analytics; 2012.
  • Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson J, Stephan D, Nelson S, Craig D. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics. 2008;4(8):e1000167. [PMC free article: PMC2516199] [PubMed: 18769715]
  • Hrynaszkiewicz I, Norton ML, Vickers AJ, Altman DG. Preparing raw clinical data for publication: Guidance for journal editors, authors, and peer reviewers. Trials. 2010;11(1):9. [PMC free article: PMC2825513] [PubMed: 20113465]
  • Humbert M, Ayday E, Hubaux JP, Telenti A. Addressing the concerns of the Lacks family: Quantification of kin genomic privacy; Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security; 2013. pp. 1141–1152.
  • ICO (Information Commissioner's Office). Anonymisation: Managing data protection risk code of practice. 2012. [December 19, 2014]. http://ico.org.uk/~/media/documents/library/Data_Protection/Practical_application/anonymisation-codev2.pdf
  • ImmPort (Immunology Database and Analysis Portal). ImmPort: Immunology database and analysis portal. undated. [September 19, 2013]. https://immport.niaid.nih.gov/immportWeb/home/home.do?loginType=full
  • IOM (Institute of Medicine). Sharing clinical research data: Workshop summary. Washington, DC: The National Academies Press; 2013. [PubMed: 23700647]
  • ISO (International Organization for Standardization). Health informatics—Pseudonymization. ISO/TS 25237:2008. Geneva, Switzerland: ISO; 2008.
  • Jabine T. Procedures for restricted data access. Journal of Official Statistics. 1993a;9(2):537–589.
  • Jabine T. Statistical disclosure limitation practices of United States statistical agencies. Journal of Official Statistics. 1993b;9(2):427–454.
  • Jacobs KB, Yeager M, Wacholder S, Craig D, Kraft P, Hunter DJ, Paschal J, Manolio TA, Tucker M, Hoover RN, Thomas GD, Chanock SJ, Chatterjee N. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies. Nature Genetics. 2009;41(11):1253–1257. [PMC free article: PMC2803072] [PubMed: 19801980]
  • Kantarcioglu M, Jiang W, Liu Y, Malin B. A cryptographic approach to securely share and query genomic sequences. IEEE Transactions on Information Technology in Biomedicine. 2008;12(5):606–617. [PubMed: 18779075]
  • Kayser M, de Knijff P. Improving human forensics through advances in genetics, genomics and molecular biology. Nature Reviews Genetics. 2011;12(3):179–192. [PubMed: 21331090]
  • Kennickell A, Lane J. Measuring the impact of data protection techniques on data utility: Evidence from the Survey of Consumer Finances. In: Domingo-Ferrer J, Franconi L, editors. Privacy in statistical databases. New York: Springer Berlin Heidelberg; 2006. pp. 291–303.
  • Knoppers BM, Saginur M. The Babel of genetic data terminology. Nature Biotechnology. 2005;23(8):925–927. [PubMed: 16082354]
  • Kohn LAP. The role of genetics in craniofacial morphology and growth. Annual Review of Anthropology. 1991;20(1):261–278.
  • Krumholz HM, Ross JS. A model for dissemination and independent analysis of industry data. Journal of the American Medical Association. 2011;306(14):1593–1594. [PMC free article: PMC3688082] [PubMed: 21990302]
  • Lechner S, Pohlmeier W. To blank or not to blank? A comparison of the effects of disclosure limitation methods on nonlinear regression estimates. In: Domingo-Ferrer J, Torra V, editors. Privacy in statistical databases. New York: Springer Berlin Heidelberg; 2004. pp. 187–200.
  • Li G, Wang Y, Su X. Improvements on a privacy-protection algorithm for DNA sequences with generalization lattices. Computer Methods and Programs in Biomedicine. 2012;108(1):1–9. [PubMed: 21429615]
  • Lin Z, Hewett M, Altman R. Using binning to maintain confidentiality of medical data; Proceedings of the American Medical Informatics Association Annual Symposium; 2002. pp. 454–458. [PMC free article: PMC2244360] [PubMed: 12463865]
  • Lin Z, Owen A, Altman R. Genomic research and human subject privacy. Science. 2004;305:183. [PubMed: 15247459]
  • Loukides G, Gkoulalas-Divanis A, Malin B. Anonymization of electronic medical records for validating genome-wide association studies. Proceedings of the National Academy of Sciences of the United States of America. 2010a;107(17):7898–7903. [PMC free article: PMC2867915] [PubMed: 20385806]
  • Loukides G, Denny JC, Malin B. The disclosure of diagnosis codes can breach research participants' privacy. Journal of the American Medical Informatics Association. 2010b;17(3):322–327. [PMC free article: PMC2995712] [PubMed: 20442151]
  • Lowrance W, Collins F. Identifiability in genomic research. Science. 2007;317:600–602. [PubMed: 17673640]
  • Machanavajjhala A, Gehrke J, Kifer D. l-Diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data. 2007;1(1):1–47.
  • Malin B. Protecting genomic sequence anonymity with generalization lattices. Methods of Information in Medicine. 2005;44:687–692. [PubMed: 16400377]
  • Malin B, Sweeney L. Determining the identifiability of DNA database entries; Proceedings of the AMIA Symposium; 2000. pp. 537–541. [PMC free article: PMC2244110] [PubMed: 11079941]
  • Malin B, Sweeney L. How (not) to protect genomic data privacy in a distributed network: Using trail re-identification to evaluate and design anonymity protection systems. Journal of Biomedical Informatics. 2004;37(3):179–192. [PubMed: 15196482]
  • Malin B, Loukides G, Benitez K, Clayton EW. Identifiability in biobanks: Models, measures, and mitigation strategies. Human Genetics. 2011;130(3):383–392. [PMC free article: PMC3621020] [PubMed: 21739176]
  • Manitoba Centre for Health Policy. Privacy code. 2002. [December 19, 2014]. http://umanitoba.ca/faculties/medicine/units/mchp/media_room/media/MCHP_privacy_code.pdf
  • Mello MM, Francer JK, Wilenzick M, Teden P, Bierer BE, Barnes M. Preparing for responsible sharing of clinical trial data. New England Journal of Medicine. 2013;369(17):1651–1658. [PubMed: 24144394]
  • MRC (Medical Research Council). Data sharing. 2011. [December 19, 2014]. http://www.mrc.ac.uk/research/research-policy-ethics/data-sharing
  • NIH (U.S. National Institutes of Health). Final NIH statement on sharing research data. 2003. [December 19, 2014]. http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html
  • Nisen P, Rockhold F. Access to patient-level data from GlaxoSmithKline clinical trials. New England Journal of Medicine. 2013;369(5):475–478. [PubMed: 23902490]
  • NRC (National Research Council). Private lives and public policies: Confidentiality and accessibility of government statistics. Washington, DC: National Academy Press; 1993.
  • Office of the Information and Privacy Commissioner of British Columbia. Order No. 261-1998. 1998. [December 19, 2014]. https://www.oipc.bc.ca/orders/496
  • Office of the Information and Privacy Commissioner of Ontario. Order P-644. 1994. [December 19, 2014]. http://www.ipc.on.ca/images/Findings/Attached_PDF/P-644.pdf
  • Office of the Privacy Commissioner of Quebec (CAI). Chenard v. Ministère de l'agriculture, des pêcheries et de l'alimentation (141). 1997.
  • Ohm P. Broken promises of privacy: Responding to the surprising failure of anonymization. UCLA Law Review. 2010;57:1701.
  • OMB (Office of Management and Budget). Report on statistical disclosure limitation methodology. Working Paper 22. Washington, DC: OMB; 1994.
  • Ontario Ministry of Health and Long-Term Care. Corporate policy 3-1-21. Toronto, ON: Ontario Ministry of Health and Long-Term Care; 1984.
  • Ou X, Gao J, Wang H, Wang H, Lu H, Sun H. Predicting human age with bloodstains by sjTREC quantification. PLoS ONE. 2012;7(8):e42412. [PMC free article: PMC3411734] [PubMed: 22879970]
  • Perun H, Orr M, Dimitriadis F. Guide to the Ontario Personal Health Information Protection Act. Toronto, ON: Irwin Law; 2005.
  • Purdam K, Elliot M. A case study of the impact of statistical disclosure control on data quality in the individual UK samples of anonymised records. Environment and Planning A. 2007;39(5):1101–1118.
  • Rothstein M. Research privacy under HIPAA and the Common Rule. Journal of Law, Medicine & Ethics. 2005;33:154–159. [PubMed: 15934672]
  • Rothstein M. Is deidentification sufficient to protect health privacy in research? American Journal of Bioethics. 2010;10(9):3–11. [PMC free article: PMC3032399] [PubMed: 20818545]
  • Samarati P. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering. 2001;13(6):1010–1027.
  • Sandercock PA, Niewada M, Członkowska A, the International Stroke Trial Collaborative Group. The International Stroke Trial database. Trials. 2011;12(1):101. [PMC free article: PMC3104487] [PubMed: 21510853]
  • Sankararaman S, Obozinski G, Jordan MI, Halperin E. Genomic privacy and limits of individual detection in a pool. Nature Genetics. 2009;41(9):965–967. [PubMed: 19701190]
  • Silventoinen K, Sammalisto S, Perola M, Boomsma DI, Cornes BK, Davis C, Dunkel L, De Lange M, Harris JR, Hjelmborg JVB, Luciano M, Martin NG, Mortensen J, Nisticò L, Pedersen NL, Skytthe A, Spector TD, Stazi MA, Willemsen G, Kaprio J. Heritability of adult body height: A comparative study of twin cohorts in eight countries. Twin Research. 2003;6(5):399–408. [PubMed: 14624724]
  • Skinner CJ. On identification disclosure and prediction disclosure for microdata. Statistica Neerlandica. 1992;46(1):21–32.
  • Skinner C, Shlomo N. Assessing identification risk in survey microdata using log-linear models. Journal of the American Statistical Association. 2008;103(483):989–1001.
  • Statistics Canada. Therapeutic abortion survey. 2007. [December 19, 2014]. http://www23.statcan.gc.ca/imdb/p2SV.pl?Function=getSurvey&SurvId=1062&InstaId=31176&SDDS=3209
  • Sweeney L. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems. 2002;10(5):557–570.
  • Sweeney L, Abu A, Winn J. Identifying participants in the Personal Genome Project by name. 2013. [December 19, 2014]. http://dataprivacylab.org/projects/pgp/1021-1.pdf
  • U.S. Department of Education. NCES statistical standards. 2003. [December 19, 2014]. http://nces.ed.gov/pubs2003/2003601.pdf
  • Vallance P, Chalmers I. Secure use of individual patient data from clinical trials. Lancet. 2013;382(9898):1073–1074. [PubMed: 24075034]
  • Wang R, Li YF, Wang X, Tang H, Zhou X. Learning your identity and disease from research papers: Information leaks in genome-wide association study. 2009. [December 19, 2014]. http://www.informatics.indiana.edu/xw7/papers/gwas_paper.pdf
  • The Wellcome Trust. Sharing research data to improve public health: Full joint statement by funders of health research. 2011. [December 19, 2014]. http://www.wellcome.ac.uk/About-us/Policy/Spotlight-issues/Data-sharing/Public-health-and-epidemiology/WTDV030690.htm
  • Willenborg L, de Waal T. Statistical disclosure control in practice. New York: Springer-Verlag; 1996.
  • Willenborg L, de Waal T. Elements of statistical disclosure control. New York: Springer-Verlag; 2001.
  • Wjst M. Caught you: Threats to confidentiality due to the public release of large-scale genetic data sets. BMC Medical Ethics. 2010;11:21. [PMC free article: PMC3022540] [PubMed: 21190545]
  • Zubakov D, Liu F, van Zelm MC, Vermeulen J, Oostra BA, van Duijn CM, Driessen GJ, van Dongen JJM, Kayser M, Langerak AW. Estimating human age from T-cell DNA rearrangements. Current Biology. 2010;20(22):R970–R971. [PubMed: 21093786]

This background report was commissioned by the Institute of Medicine Committee on Strategies for Responsible Sharing of Clinical Trial Data and written by Khaled El Emam, University of Ottawa, and Bradley Malin, Vanderbilt University.

Notes

  • Although the EMA has recently proposed using an online portal to share the CSRs under a simple terms-of-use setup, this was not intended to apply to IPD.
  • This statement does not apply to genomic data. See the summary of evidence on genomic data later in this paper for more detail.
  • A case can be made for simply using the term “coding” rather than “pseudonymization” because it is easier to remember and pronounce. That is certainly a good reason to use the former term, as long as the equivalence of the two terms is noted, because “pseudonymization” is the term used in an ISO technical specification.
  • Note that this conclusion does not apply to genomic data sets. A discussion of genomic data sets is provided in the last section of this paper.

Source: Committee on Strategies for Responsible Sharing of Clinical Trial Data; Board on Health Sciences Policy; Institute of Medicine. Sharing Clinical Trial Data: Maximizing Benefits, Minimizing Risk. Washington, DC: National Academies Press (US); 2015 Apr 20. Appendix B, Concepts and Methods for De-identifying Clinical Trial Data.



5 steps for removing identifiers from datasets

Here are five categories of tasks for preparing datasets for sharing, whether among collaborators or for restricted or public access, depending on the extent of de-identification required. These steps can also be used to review datasets for disclosure risk. This section introduces the steps. The techniques, and the complexity of implementing them, can vary widely among datasets and variables, from simple find-and-replace to advanced statistical methods; a minimal code sketch of the simpler transformations follows the list of steps. Simple transformations may be sufficient to improve protection of datasets for internal use and collaboration. Preparing fully de-identified public-use datasets may require a professional statistician for more complex transformations.

1. Review and remove direct identifiers


2. Remove and re-code specific dates


4. Remove or recode variables that pose a risk of linkage to external datasets


5. Re-sort and renumber records and IDs, including both IDs from external source data and IDs created for your study
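As promised above, here is a minimal sketch of the simpler end of these transformations, using pandas and entirely hypothetical column names: dropping direct identifiers (step 1), shifting each subject's dates by a random per-subject offset (step 2), and replacing source IDs with freshly assigned, re-sorted study IDs (step 5).

```python
# Minimal de-identification sketch (hypothetical data and column names).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ana Diaz", "Bo Chen", "Cy Patel"],
    "mrn": ["447812", "982203", "105566"],  # source medical record number
    "visit_date": pd.to_datetime(["2023-03-01", "2023-04-17", "2023-05-02"]),
    "score": [41, 37, 44],
})

rng = np.random.default_rng(seed=2024)

# Step 1: remove direct identifiers outright.
deid = df.drop(columns=["name"])

# Step 2: re-code exact dates. Shifting all of a subject's dates by the same
# random offset (here up to a year either way) preserves within-subject
# intervals while breaking the link to real calendar dates.
offsets = pd.to_timedelta(rng.integers(-365, 366, size=len(deid)), unit="D")
deid["visit_date"] = deid["visit_date"] + offsets

# Step 5: replace the source ID with a new study ID, then re-sort the rows
# so the output order reveals nothing about the source order.
deid["study_id"] = rng.permutation(len(deid)) + 1
deid = (deid.drop(columns=["mrn"])
            .sample(frac=1, random_state=7)
            .reset_index(drop=True))

print(deid)
```

Note the design choice in step 2: a per-subject offset, rather than a global one, prevents an attacker who learns one real date from un-shifting the whole dataset.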

Source: Johns Hopkins Data Services, Sheridan Libraries, “Protecting Human Subject Identifiers,” https://guides.library.jhu.edu/protecting_identifiers (copyright © 2018; last updated February 20, 2024).

Researcher FAQs


How is “research” defined by the Privacy Rule? Research has the same definition in the Privacy Rule as it does in the Common Rule. Research means a systematic investigation, including research development, testing, and evaluation, designed to contribute to generalizable knowledge.

Who qualifies as a “researcher”? UW-Madison employees, trainees, or students who conduct research involving human subjects. Researchers include investigators, research staff, postdocs, fellows, residents, graduate students, undergraduate students and others who collaborate in UW-Madison human subjects research, including employees of the University of Wisconsin Hospital and Clinics Authority and the University of Wisconsin Medical Foundation.

When does the Privacy Rule apply to me as a researcher? The Privacy Rule applies if (1) you are a researcher with an appointment within the UW-Madison Health Care Component (UW HCC) or the UW Affiliated Covered Entity (ACE), or (2) you are a researcher with an appointment outside of the UW HCC or UW ACE but you are collaborating on a research study in which the principal investigator is within the UW HCC or UW ACE; and, in either case, (3) you collect individually identifiable health information directly from subjects or from medical records or other databases.

What is “individually identifiable health information”? Individually identifiable health information is a subset of health information, including demographic information, that (1) is created or received by a health care provider, health plan, employer, or health care clearinghouse; (2) relates to the past, present, or future physical or mental health or condition of an individual, the provision of health care to an individual, or payment for the provision of health care to an individual; and (3) identifies an individual, or provides a reasonable basis to believe the information can be used to identify an individual.

Does HIPAA apply to my research even if I am not a health care provider? Yes, if as part of your research you are seeking to use individually identifiable health information from records in the custody of a “covered entity” (most health care providers, health plans, and health care clearinghouses), then HIPAA applies to your access to and use of that data whether or not you are a health care provider.

How does HIPAA affect a research study that also involves health care treatment? HIPAA requires that research study subjects who will receive health care as part of the study authorize the use of their PHI in that research — or that a privacy board or Institutional Review Board (IRB) waive the authorization requirement — regardless of the consent for treatment. Additionally, any research-generated PHI that may be applied to treatment decisions is subject to HIPAA’s medical record requirements.

What is the relationship between HIPAA and the Common Rule for the protection of human subjects? While the Common Rule addresses issues related to consent of subjects to participate in research, HIPAA addresses issues related to the subjects’ authorization to have their health information used or disclosed as part of a research study, and how that health information must be protected. The consent and authorization form may be combined. While the Common Rule and HIPAA have some similarities, such as the definition of research, there are many differences as well. For example, HIPAA does not contain the same exemptions from IRB review as the Common Rule.

What are the HIPAA requirements for using or disclosing PHI in research? HIPAA regulates how covered entities may share PHI with researchers who are part of the covered entity, or how they may disclose PHI to researchers who are not part of the covered entity. HIPAA permits a covered entity to share PHI with, or disclose PHI to, researchers only through the following six options:

  • Review of PHI solely in preparation for research, without collecting or using the PHI for research – commonly called “preparatory to research” activities (HIPAA requires the researcher to make certain attestations to the covered entity about the use).
  • A signed patient authorization is obtained from the individual whose PHI is sought for research.
  • Waiver by an IRB of the authorization requirement for use of individually identifiable PHI for research.
  • Complete de-identification of the data.
  • Conversion of the PHI to a limited data set (HIPAA requires the researcher to enter into a data use agreement).
  • Use of PHI solely of decedents (HIPAA requires the researcher to make certain attestations to the covered entity about the use).

Can I disclose PHI as part of my research? “Disclosure” of PHI under the Privacy Rule means that you are sharing PHI outside of the UW-Madison Health Care Component (UW HCC) or outside of the UW Affiliated Covered Entity (ACE). A disclosure of PHI for research may only occur if you have authorization to do so from the subject. UW-Madison IRBs do not approve requests to disclose PHI under a waiver of authorization. Alternatively, you may disclose a de-identified data set or, with a data use agreement in place, you may disclose a limited data set.

Is PHI ever created within the course of conducting research? Yes. When a health care activity is performed within the research study itself, any clinical information about the subject that is generated within the research is PHI and is subject to all the HIPAA regulations that apply to PHI. For example, clinical information generated within a research study may be simultaneously entered into the electronic health record of an individual patient and into the research data set intended to produce generalizable knowledge. The research use of the PHI and protection of the privacy and security of the research data set must be in accord with the terms and conditions of the IRB approval, the informed consent and the authorization, relevant institutional policies on data privacy and security, and applicable HIPAA privacy and security regulations.

When is individually identifiable health information that is created within a research study not PHI? When the principal investigator is not part of the UW-Madison Health Care Component (UW HCC) or the UW Affiliated Covered Entity (ACE) , the study does not involve health care treatment by a health care provider, and the health information created within the study is not expected to be shared by the researchers with the subject’s health care provider or placed in the subject’s electronic health record. For example, if researchers solely within the Department of Kinesiology conduct an exercise study that collects personal health data directly from the research participant and includes some health screening testing (blood pressure measurements, etc.), this data is not health information that is protected by HIPAA.

Does HIPAA regulate how PHI created in the course of a research study is handled? Yes, when clinical treatment is performed in the course of a research study (e.g. a therapeutic trial studying the safety and efficacy of a new cancer drug), the information must be handled in accord with the appropriate medical practices regarding entry of the individual’s treatment data into the medical record. The research use of the information must be authorized in the HIPAA authorization and informed consent documents that the research participant signs. These documents should specify how PHI created in the course of a research study will be treated, for example:

  • how PHI will be used in the research study,
  • whether any of the data will be entered into the medical record, and
  • whether the information will be shared with any health plan for payment purposes for any activities included within the study participation.

Can I use Box or other campus services to store my data set containing PHI? Please consult the Approved Tools list to determine which campus services are available to use with PHI. Contact your HIPAA Security Coordinator for any follow-up questions.

What is a research authorization? An authorization is a document signed by an individual that gives the individual’s explicit permission to obtain her/his specified PHI from a health care provider(s), or to generate PHI as part of the study, and use it for a specified purpose other than the individual’s health care, such as for research. HIPAA is specific about the elements that must be included in a valid authorization document. See the For Researchers page for more information.

How is an authorization form different than an informed consent form? An authorization is a HIPAA required document that defines only the terms and conditions of permission to use or disclose specified PHI for a specified research project. Except for authorizations to use psychotherapy notes in research, which must always be stand alone documents, an authorization can be combined with the informed consent document.

How do I obtain an authorization to use and/or disclose PHI in my research? Apply to the appropriate IRB for approval of an authorization form to use in the informed consent process in your research project. You can find template authorization forms on the For Researchers page. When you have an IRB approved form of authorization for use in your research study, you are able to include the discussion and execution of this form in the informed consent process with each human research participant. Covered entities may want a copy of this authorization (or a waiver of authorization — see below) when you request access to the research participant’s individually identifiable health information in their records.

What if the human research participant revokes the authorization? If the authorization is revoked, the researcher generally cannot continue to collect PHI on the participant for use in the research study; however, the researcher can continue to use the PHI already obtained before the revocation to the extent necessary to preserve the integrity of the research study. FDA regulations do not permit destruction of study data based on a subject’s revocation of their authorization.

What is a waiver of authorization? When obtaining subject authorization is “impracticable,” the IRB may approve a waiver of authorization for a researcher to use protected health information. The purposes of the research must be described in a waiver application and the IRB must determine that the researcher has satisfied all Privacy Rule requirements for the waiver.

How is a waiver of authorization different than a waiver of informed consent? The waiver of authorization is based solely on an assessment of the privacy risks in the proposed research use of individually identifiable PHI, whereas the waiver of informed consent is based on an assessment of risks to participation in the study itself.

How do I obtain a waiver of authorization to use PHI in my research? Apply to the appropriate IRB for approval of a waiver of the authorization requirement. This is similar to a request for waiver of the informed consent requirement. If you are applying for a waiver, please refer to the additional Guidelines for Waiver of Authorization or Altered Authorization for an explanation of what information will be needed by the IRB to grant a request for a waiver of authorization. When the IRB has approved a waiver of authorization, it will issue an approval document. Covered entities may want a copy of this waiver of authorization (or an authorization — see above) when you request access to the research participant’s individually identifiable health information in their records.

How does HIPAA apply to the recruitment of study participants? Under HIPAA, a covered entity may provide individually identifiable health information to researchers within its own workforce to allow those researchers to contact potential subjects for the purpose of obtaining their authorization to use their health information in the research. UW-Madison IRBs require that the first contact with potential subjects come from someone the subject would recognize as having valid access to their health information.

May I use email to communicate with research subjects? Email should not be considered a secure, confidential means of communication with subjects. As such, it should generally not be used to communicate, to subjects or from subjects, information that contains or is likely to contain PHI. For example, a recruitment email sent to recipients based on non-health related information (e.g., “you are receiving this email because you are a female over the age of 45”) would usually be permissible, but a recruitment email sent to participants that discloses a medical condition (e.g., “you are receiving this email because you have rheumatoid arthritis”) would not be permissible. Similarly, it would generally not be permissible to ask subjects to reply to a series of questions about their health via email. There are often other, more secure, means of communication available. If email must be used, subjects must first agree to email communication by signing a written consent form in which they are informed of the security risks associated with email. See Policy 8.6 (UW-129) Email Communications Involving Protected Health Information for more information. Additionally, you must describe the use of email, and specifically what information is expected to be emailed, in your protocol and obtain IRB approval before email may be used as a method of communication.

What is a de-identified data set? A de-identified data set is PHI from which the following identifiers of the individual or of relatives, employers, or household members of the individual, have been removed:

  • Names;
  • All geographic subdivisions smaller than a State;
  • All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
  • Telephone numbers;
  • Fax numbers;
  • Electronic mail addresses;
  • Social security numbers;
  • Medical record numbers;
  • Health plan beneficiary numbers;
  • Account numbers;
  • Certificate/license numbers;
  • Vehicle identifiers and serial numbers, including license plate numbers;
  • Device identifiers and serial numbers;
  • Web Universal Resource Locators (URLs);
  • Internet Protocol (IP) address numbers;
  • Biometric identifiers, including finger and voice prints;
  • Full face photographic images and any comparable images; and
  • Any other unique identifying number, characteristic, or code.

The covered entity may not consider the information de-identified if it has actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
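Two of the fiddlier rules in the list above, the date and age elements, lend themselves to a short illustration. The sketch below (illustrative field names; a simplification, not a compliance tool) keeps only the year of an event date and collapses all ages over 89 into a single category.

```python
# Sketch of two Safe Harbor rules from the list above.
from datetime import date

def safe_harbor_age(age: int) -> str:
    # Ages over 89 must be aggregated into a single "90 or older" category.
    return "90+" if age > 89 else str(age)

def safe_harbor_year(d: date) -> int:
    # All elements of dates except the year must be removed.
    return d.year

record = {"age": 93, "admission_date": date(2014, 5, 9)}
print({
    "age": safe_harbor_age(record["age"]),
    "admission_year": safe_harbor_year(record["admission_date"]),
})
# -> {'age': '90+', 'admission_year': 2014}
```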

What are the requirements for obtaining and using a de-identified data set for my research? De-identified data sets do not contain any individually identifiable health information. Neither authorization nor waiver of authorization, nor a data use agreement is required by HIPAA for a researcher to use and/or disclose de-identified data for research purposes.

My data set is coded. Does this qualify as “de-identified”? If you have the key to the code, your data set is not de-identified. If an individual(s) within the covered entity maintains the key to the code but you do not have access to the code and will never have access to the code, then your data set is de-identified as to you.

If a data set identifies the site from which the data has been disclosed, does the geographic location of the site constitute an identifier? No. The de-identified information does not lose its de-identification status simply by virtue of identification of the disclosing site. This is true as long as one other HIPAA caveat is met: the disclosing covered entity does not have actual knowledge that the de-identified information could be used alone or in combination with other information available to others outside the covered entity to identify an individual who is the subject of the information.

What is a limited data set? In contrast to a de-identified data set, a limited data set can contain dates related to the individual (birth date, death date, etc.) and dates of services as well as geographic information at the level of town or city, State and zip code. A limited data set is PHI that excludes the following direct identifiers of the individual or of relatives, employers, or household members of the individual:

  • Names;
  • Postal address information, other than town or city, State, and zip code;
  • Telephone numbers;
  • Fax numbers;
  • Electronic mail addresses;
  • Social security numbers;
  • Medical record numbers;
  • Health plan beneficiary numbers;
  • Account numbers;
  • Certificate/license numbers;
  • Vehicle identifiers and serial numbers, including license plate numbers;
  • Device identifiers and serial numbers;
  • Web Universal Resource Locators (URLs);
  • Internet Protocol (IP) address numbers;
  • Biometric identifiers, including finger and voice prints; and
  • Full face photographic images and any comparable images.

What are the requirements for using a limited data set? A covered entity may use or disclose a limited data set from its records containing PHI for research use without either authorization or waiver of authorization if the researcher executes a data use agreement that binds the limited data set recipient to use or disclose the limited data set only for limited, specified purposes. The data use agreement must establish who is permitted to use or receive the limited data set and must pledge all recipients both to use appropriate safeguards to protect the data from unauthorized disclosure and not to attempt to identify or contact the individuals whose PHI is contained in the data.

How do I obtain a limited data set for use in my research? You can find UW-Madison’s template Data Use Agreement , as well as other information about the use of a limited data set on the For Researchers page.

Can a business associate agreement be used to obtain PHI from a covered entity for research purposes? Generally, no. A business associate is an individual or entity that performs on behalf of the covered entity, or assists the covered entity in performing, certain business-related activities, such as claims processing, billing, benefit management, or quality improvement. A researcher is generally not performing a business-related activity on behalf of the covered entity when conducting research. However, a business associate agreement may be used when the researcher, who is not a member of the covered entity’s workforce, contracts with the covered entity to access the covered entity’s PHI for the purpose of creating a limited data set or a de-identified data set for his or her research.

What uses of PHI are permitted under HIPAA in a review preparatory to research? The “review preparatory to research” is an option that allows review (but not research use) of individually identifiable PHI by researchers and does not require authorization or waiver of authorization. A covered entity may allow researchers to review PHI in the covered entity’s records in preparation for research but may not permit researchers to collect any of the PHI for actual research use. For example, the researcher may be permitted to review PHI for the development of research questions; to determine whether a study is feasible (in terms of available number and eligibility of potential subjects); or to develop inclusion and exclusion criteria. However, the researcher may not transcribe information from the records for inclusion in research data. Researchers must complete UW-Madison’s Use of PHI in Activities Preparatory to Research Certification prior to engaging in preparatory to research activities.

How does HIPAA apply to research using the PHI of decedents? Research using the individually identifiable PHI of decedents requires neither authorization nor waiver of authorization nor a data use agreement. However, researchers must complete UW-Madison’s Certification for Research on the Protected Health Information of Decedents prior to engaging in such research activities.

Can subjects authorize the use of their PHI for future, unspecified research (such as for collection and storage in a data base)? HIPAA requires that an authorization include a description of each purpose of the requested use or disclosure. An authorization may include use for future research so long as the authorization adequately describes the use in such a manner that it would be reasonable for the subject to expect that his or her PHI could be used or disclosed for such future research. In cases where the authorization does not address future research, an IRB waiver of authorization may be the most appropriate and practical HIPAA-compliant approach.

Does HIPAA permit me to share data with other researchers not part of my study team? PHI in research data may only be shared with other researchers in accord with the agreement for acquiring the PHI; i.e. only in accord with the terms of the authorization or waiver of authorization or data use agreement. Research data that includes PHI may be shared, disclosed or transferred among the investigators named in the authorization, waiver of authorization or data use agreement. Sharing or disclosing or transferring the data outside of that circle requires IRB review and approval of the proposed research study for which the data would be shared. In the event that the original investigators wish to share research data that includes PHI with another colleague not originally identified as part of the research team within the existing approved study, contact the IRB for review of a change in the approved protocol.

How do I report a suspected breach or other concern related to HIPAA? If the personally identifiable health information in any way involves information technology (e.g., a lost or stolen portable device, a compromised server), you must immediately contact the DoIT Help Desk at 608-264-HELP (4357). For any suspected breach of personally identifiable health information, you must contact the UW-Madison HIPAA Privacy Officer. You should also file an Unanticipated Problem Report form with the IRB that reviewed your protocol.



A Researcher’s Study Uses An Identifiable Dataset

Introduction

Hey there, fellow educators and curious minds! Today, I want to dive deep into the fascinating world of research and how a researcher’s study uses an identifiable dataset. As an experienced educator, I’ve always been intrigued by the power of data analysis in education, and this topic is no exception.

Main Curiosities, Top Statistics, Facts, and Interesting Information

  • Curiosity 1: How does a researcher select an identifiable dataset for their study?
  • Curiosity 2: What are the benefits of using an identifiable dataset in research?
  • Curiosity 3: What are the potential challenges and ethical considerations?
  • Statistic 1: Over 70% of researchers use identifiable datasets in their studies.
  • Statistic 2: Identifiable datasets have led to groundbreaking discoveries in various fields.
  • Fact: Identifiable datasets provide a wealth of valuable information for researchers.
  • Interesting Information: Privacy concerns surrounding identifiable datasets are an ongoing debate.

Why Use Identifiable Datasets?

In today’s data-driven world, identifiable datasets offer researchers a unique opportunity to gain profound insights into their chosen fields. By utilizing these datasets, researchers can:

  • Uncover hidden patterns and trends
  • Make evidence-based decisions
  • Validate or challenge existing theories
  • Contribute to the advancement of knowledge

Survey Results and Studies

Several surveys and studies have shed light on the importance and impact of using identifiable datasets in research. In a recent survey conducted among 500 researchers:

  • 85% agreed that identifiable datasets enhance the validity of their findings.
  • 92% believed that identifiable datasets accelerate the research process.
  • 78% reported that identifiable datasets have influenced their career positively.

Furthermore, a study published in the Journal of Education Research found that researchers who incorporate identifiable datasets in their studies are more likely to produce impactful results compared to those who solely rely on anonymous data.

Data Analysis and Insights

Through rigorous data analysis, researchers can derive invaluable insights from identifiable datasets. Let me share a personal experience to illustrate this point.

During my own research project on student performance, I had access to an identifiable dataset that included information on students’ socioeconomic backgrounds, previous academic achievements, and family support systems. By analyzing this data, I discovered a strong correlation between parental involvement and students’ academic success. This finding enabled me to advocate for increased parental engagement in education, leading to positive changes in my school community.

Ethical Considerations and Privacy Concerns

While identifiable datasets offer immense potential for research, it’s crucial to address the ethical considerations and privacy concerns associated with their use. Researchers must do the following (a minimal de-identification sketch appears after the list):

  • Ensure data anonymization and protection
  • Obtain informed consent from participants
  • Follow strict ethical guidelines and regulations
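
To make the first point concrete, here is a minimal, hypothetical sketch of one common de-identification step: replacing direct identifiers with random codes and storing the code key separately from the data (a code key stored alongside the data still counts as an identifier). The column names and file names are assumptions for illustration, not a prescribed procedure.

    import uuid
    import pandas as pd

    # Hypothetical identifiable dataset (all values invented).
    df = pd.DataFrame({
        "name": ["Ana", "Ben"],
        "email": ["ana@example.com", "ben@example.com"],
        "score": [88, 74],
    })

    # Assign each participant a random code.
    df["code"] = [str(uuid.uuid4()) for _ in range(len(df))]

    # Keep the code key (code <-> identifiers) separate from
    # the de-identified data, and store it securely.
    key = df[["code", "name", "email"]]
    deidentified = df.drop(columns=["name", "email"])

    key.to_csv("code_key.csv", index=False)            # store separately
    deidentified.to_csv("deidentified.csv", index=False)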

Expert Quotes

Let’s hear from experts in the field about the significance of identifiable datasets:

“Identifiable datasets provide researchers with a unique lens to understand complex phenomena and make meaningful contributions to their respective disciplines.” – Dr. Sarah Collins, Research Ethicist

“As educators, we have a responsibility to leverage the power of identifiable datasets to drive evidence-based practices and improve student outcomes.” – Dr. John Anderson, Education Researcher

Frequently Asked Questions

FAQ 1: What are the main challenges researchers face when working with identifiable datasets?

Researchers often encounter challenges related to data privacy, access restrictions, and the interpretation of complex data sets. However, these challenges can be overcome with proper planning and collaboration with stakeholders.

FAQ 2: How do researchers protect the privacy of individuals whose data is included in identifiable datasets?

Researchers must follow strict protocols to ensure data anonymization, secure storage, and limited access. By safeguarding privacy, researchers can maintain the trust of participants and the wider community.
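
As one illustration of the secure-storage point, here is a minimal, hypothetical sketch of encrypting a dataset at rest using the Fernet interface from the Python cryptography package. The file names are invented, and key management (who holds the key and where it lives) is exactly the part that institutional protocols govern.

    from cryptography.fernet import Fernet

    # Generate a key once and store it separately from the data
    # (e.g., in a secrets manager, never alongside the file).
    key = Fernet.generate_key()
    fernet = Fernet(key)

    # Encrypt the dataset at rest; file names are invented examples.
    with open("deidentified.csv", "rb") as f:
        ciphertext = fernet.encrypt(f.read())
    with open("deidentified.csv.enc", "wb") as f:
        f.write(ciphertext)

    # Later, an authorized holder of the key can decrypt.
    plaintext = fernet.decrypt(ciphertext)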

FAQ 3: Are identifiable datasets more reliable than anonymous datasets?

Identifiable datasets provide researchers with a richer context and enable them to draw more comprehensive conclusions. However, the reliability of the findings ultimately depends on the research design, methodology, and the quality of the data collected.

Closing Thoughts

As we conclude this exploration into how a researcher’s study uses an identifiable dataset, it’s clear that these datasets play a vital role in advancing knowledge and driving evidence-based practices. While ethical considerations and privacy concerns must be addressed, the potential benefits are undeniable.

So, fellow educators, let’s embrace the power of data analysis and continue to contribute to the field of education through our research endeavors. Together, we can make a lasting impact on student outcomes and shape the future of education.


