Chapter 5: Collecting data

Tianjing Li, Julian PT Higgins, Jonathan J Deeks

Key Points:

  • Systematic reviews have studies, rather than reports, as the unit of interest, and so multiple reports of the same study need to be identified and linked together before or after data extraction.
  • Because of the increasing availability of data sources (e.g. trials registers, regulatory documents, clinical study reports), review authors should decide on which sources may contain the most useful information for the review, and have a plan to resolve discrepancies if information is inconsistent across sources.
  • Review authors are encouraged to develop outlines of tables and figures that will appear in the review to facilitate the design of data collection forms. The key to successful data collection is to construct easy-to-use forms and collect sufficient and unambiguous data that faithfully represent the source in a structured and organized manner.
  • Effort should be made to identify data needed for meta-analyses, which often need to be calculated or converted from data reported in diverse formats.
  • Data should be collected and archived in a form that allows future access and data sharing.

Cite this chapter as: Li T, Higgins JPT, Deeks JJ (editors). Chapter 5: Collecting data. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (editors). Cochrane Handbook for Systematic Reviews of Interventions version 6.4 (updated August 2023). Cochrane, 2023. Available from www.training.cochrane.org/handbook .

5.1 Introduction

Systematic reviews aim to identify all studies that are relevant to their research questions and to synthesize data about the design, risk of bias, and results of those studies. Consequently, the findings of a systematic review depend critically on decisions relating to which data from these studies are presented and analysed. Data collected for systematic reviews should be accurate, complete, and accessible for future updates of the review and for data sharing. Methods used for these decisions must be transparent; they should be chosen to minimize biases and human error. Here we describe approaches that should be used in systematic reviews for collecting data, including extraction of data directly from journal articles and other reports of studies.

5.2 Sources of data

Studies are reported in a range of sources which are detailed later. As discussed in Section 5.2.1 , it is important to link together multiple reports of the same study. The relative strengths and weaknesses of each type of source are discussed in Section 5.2.2 . For guidance on searching for and selecting reports of studies, refer to Chapter 4 .

Journal articles are the source of the majority of data included in systematic reviews. Note that a study can be reported in multiple journal articles, each focusing on some aspect of the study (e.g. design, main results, and other results).

Conference abstracts are commonly available. However, the information presented in conference abstracts is highly variable in reliability, accuracy, and level of detail (Li et al 2017).

Errata and letters can be important sources of information about studies, including critical weaknesses and retractions, and review authors should examine these if they are identified (see MECIR Box 5.2.a ).

Trials registers (e.g. ClinicalTrials.gov) catalogue trials that have been planned or started, and have become an important data source for identifying trials, for comparing published outcomes and results with those planned, and for obtaining efficacy and safety data that are not available elsewhere (Ross et al 2009, Jones et al 2015, Baudard et al 2017).

Clinical study reports (CSRs) contain unabridged and comprehensive descriptions of the clinical problem, design, conduct and results of clinical trials, following a structure and content guidance prescribed by the International Conference on Harmonisation (ICH 1995). To obtain marketing approval of drugs and biologics for a specific indication, pharmaceutical companies submit CSRs and other required materials to regulatory authorities. Because CSRs also incorporate tables and figures, with appendices containing the protocol, statistical analysis plan, sample case report forms, and patient data listings (including narratives of all serious adverse events), they can be thousands of pages in length. CSRs often contain more data about trial methods and results than any other single data source (Mayo-Wilson et al 2018). CSRs are often difficult to access, and are usually not publicly available. Review authors could request CSRs from the European Medicines Agency (Davis and Miller 2017). The US Food and Drug Administration had historically avoided releasing CSRs but launched a pilot programme in 2018 whereby selected portions of CSRs for new drug applications were posted on the agency’s website. Many CSRs are obtained through unsealed litigation documents, repositories (e.g. clinicalstudydatarequest.com), and other open data and data-sharing channels (e.g. the Yale University Open Data Access Project) (Doshi et al 2013, Wieland et al 2014, Mayo-Wilson et al 2018).

Regulatory reviews such as those available from the US Food and Drug Administration or European Medicines Agency provide useful information about trials of drugs, biologics, and medical devices submitted by manufacturers for marketing approval (Turner 2013). These documents are summaries of CSRs and related documents, prepared by agency staff as part of the process of approving the products for marketing, after reanalysing the original trial data. Regulatory reviews often are available only for the first approved use of an intervention and not for later applications (although review authors may request those documents, which are usually brief). Using regulatory reviews from the US Food and Drug Administration as an example, drug approval packages are available on the agency’s website for drugs approved since 1997 (Turner 2013); for drugs approved before 1997, information must be requested through a freedom of information request. The drug approval packages contain various documents: approval letter(s), medical review(s), chemistry review(s), clinical pharmacology review(s), and statistical review(s).

Individual participant data (IPD) are usually sought directly from the researchers responsible for the study, or may be identified from open data repositories (e.g. www.clinicalstudydatarequest.com ). These data typically include variables that represent the characteristics of each participant, intervention (or exposure) group, prognostic factors, and measurements of outcomes (Stewart et al 2015). Access to IPD has the advantage of allowing review authors to reanalyse the data flexibly, in accordance with the preferred analysis methods outlined in the protocol, and can reduce the variation in analysis methods across studies included in the review. IPD reviews are addressed in detail in Chapter 26 .

MECIR Box 5.2.a Relevant expectations for conduct of intervention reviews

5.2.1 Studies (not reports) as the unit of interest

In a systematic review, studies rather than reports of studies are the principal unit of interest. Since a study may have been reported in several sources, a comprehensive search for studies for the review may identify many reports from a potentially relevant study (Mayo-Wilson et al 2017a, Mayo-Wilson et al 2018). Conversely, a report may describe more than one study.

Multiple reports of the same study should be linked together (see MECIR Box 5.2.b ). Some authors prefer to link reports before they collect data, and collect data from across the reports onto a single form. Other authors prefer to collect data from each report and then link together the collected data across reports. Either strategy may be appropriate, depending on the nature of the reports at hand. It may not be clear that two reports relate to the same study until data collection has commenced. Although sometimes there is a single report for each study, it should never be assumed that this is the case.

MECIR Box 5.2.b Relevant expectations for conduct of intervention reviews

It can be difficult to link multiple reports from the same study, and review authors may need to do some ‘detective work’. Multiple sources about the same trial may not reference each other, may not share common authors (Gøtzsche 1989, Tramèr et al 1997), or may report discrepant information about the study design, characteristics, outcomes, and results (von Elm et al 2004, Mayo-Wilson et al 2017a).

Some of the most useful criteria for linking reports are:

  • trial registration numbers;
  • authors’ names;
  • sponsor for the study and sponsor identifiers (e.g. grant or contract numbers);
  • location and setting (particularly if institutions, such as hospitals, are named);
  • specific details of the interventions (e.g. dose, frequency);
  • numbers of participants and baseline data; and
  • date and duration of the study (which also can clarify whether different sample sizes are due to different periods of recruitment), length of follow-up, or subgroups selected to address secondary goals.

Review authors should use as many trial characteristics as possible to link multiple reports. When uncertainties remain after considering these and other factors, it may be necessary to correspond with the study authors or sponsors for confirmation.
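
As an illustration of how these linkage criteria might be applied systematically, the hypothetical Python sketch below compares two report records on a few of the characteristics listed above and flags pairs that warrant manual checking. The field names, weights and threshold are assumptions for illustration only; the Handbook does not prescribe any particular tool or scoring scheme, and candidate links always require confirmation by the review authors.

```python
# Hypothetical sketch: flag pairs of reports that may describe the same study.
# Field names, scoring weights and the threshold are illustrative assumptions,
# not a prescribed Cochrane method; candidate links need manual confirmation.

def possible_same_study(report_a: dict, report_b: dict) -> bool:
    """Return True if two report records share enough characteristics to be
    checked manually as potential reports of a single study."""
    # A shared trial registration number is usually decisive on its own.
    reg_a, reg_b = report_a.get("registration_id"), report_b.get("registration_id")
    if reg_a and reg_b:
        return reg_a == reg_b

    score = 0
    if set(report_a.get("authors", [])) & set(report_b.get("authors", [])):
        score += 1  # overlapping author names
    if report_a.get("sponsor_id") and report_a.get("sponsor_id") == report_b.get("sponsor_id"):
        score += 1  # same sponsor or grant/contract identifier
    if report_a.get("setting") and report_a.get("setting") == report_b.get("setting"):
        score += 1  # same named institution or setting
    n_a, n_b = report_a.get("n_randomized"), report_b.get("n_randomized")
    if n_a and n_b and abs(n_a - n_b) <= 0.05 * max(n_a, n_b):
        score += 1  # similar numbers of randomized participants
    return score >= 2  # threshold chosen arbitrarily for illustration

r1 = {"authors": ["Smith", "Jones"], "sponsor_id": "HHS-123", "n_randomized": 240}
r2 = {"authors": ["Jones", "Lee"], "sponsor_id": "HHS-123", "n_randomized": 238}
print(possible_same_study(r1, r2))  # True: shared author, same sponsor, similar N
```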

5.2.2 Determining which sources might be most useful

A comprehensive search to identify all eligible studies from all possible sources is resource-intensive but necessary for a high-quality systematic review (see Chapter 4 ). Because some data sources are more useful than others (Mayo-Wilson et al 2018), review authors should consider which data sources may be available and which may contain the most useful information for the review. These considerations should be described in the protocol. Table 5.2.a summarizes the strengths and limitations of different data sources (Mayo-Wilson et al 2018). Gaining access to CSRs and IPD often takes a long time. Review authors should begin searching repositories and contact trial investigators and sponsors as early as possible to negotiate data usage agreements (Mayo-Wilson et al 2015, Mayo-Wilson et al 2018).

Table 5.2.a Strengths and limitations of different data sources for systematic reviews

5.2.3 Correspondence with investigators

Review authors often find that they are unable to obtain all the information they seek from available reports about the details of the study design, the full range of outcomes measured and the numerical results. In such circumstances, authors are strongly encouraged to contact the original investigators (see MECIR Box 5.2.c ). Contact details of study authors, when not available from the study reports, often can be obtained from more recent publications, from university or institutional staff listings, from membership directories of professional societies, or by a general search of the web. If the contact author named in the study report cannot be contacted or does not respond, it is worthwhile attempting to contact other authors.

Review authors should consider the nature of the information they require and make their request accordingly. For descriptive information about the conduct of the trial, it may be most appropriate to ask open-ended questions (e.g. how was the allocation process conducted, or how were missing data handled?). If specific numerical data are required, it may be more helpful to request them specifically, possibly providing a short data collection form (either uncompleted or partially completed). If IPD are required, they should be specifically requested (see also Chapter 26 ). In some cases, study investigators may find it more convenient to provide IPD rather than conduct additional analyses to obtain the specific statistics requested.

MECIR Box 5.2.c Relevant expectations for conduct of intervention reviews

5.3 What data to collect

5.3.1 What are data?

For the purposes of this chapter, we define ‘data’ to be any information about (or derived from) a study, including details of methods, participants, setting, context, interventions, outcomes, results, publications, and investigators. Review authors should plan in advance what data will be required for their systematic review, and develop a strategy for obtaining them (see MECIR Box 5.3.a ). The involvement of consumers and other stakeholders can be helpful in ensuring that the categories of data collected are sufficiently aligned with the needs of review users ( Chapter 1, Section 1.3 ). The data to be sought should be described in the protocol, with consideration wherever possible of the issues raised in the rest of this chapter.

The data collected for a review should adequately describe the included studies, support the construction of tables and figures, facilitate the risk of bias assessment, and enable syntheses and meta-analyses. Review authors should familiarize themselves with reporting guidelines for systematic reviews (see online Chapter III and the PRISMA statement (Liberati et al 2009)) to ensure that relevant elements and sections are incorporated. The following sections review the types of information that should be sought, and these are summarized in Table 5.3.a (Li et al 2015).

MECIR Box 5.3.a Relevant expectations for conduct of intervention reviews

Table 5.3.a Checklist of items to consider in data collection

*Full description required for assessments of risk of bias (see Chapter 8 , Chapter 23 and Chapter 25 ).

5.3.2 Study methods and potential sources of bias

Different research methods can influence study outcomes by introducing different biases into results. Important study design characteristics should be collected to allow the selection of appropriate methods for assessment and analysis, and to enable description of the design of each included study in a table of ‘Characteristics of included studies’, including whether the study is randomized, whether the study has a cluster or crossover design, and the duration of the study. If the review includes non-randomized studies, appropriate features of the studies should be described (see Chapter 24 ).

Detailed information should be collected to facilitate assessment of the risk of bias in each included study. Risk-of-bias assessment should be conducted using the tool most appropriate for the design of each study, and the information required to complete the assessment will depend on the tool. Randomized studies should be assessed using the tool described in Chapter 8 . The tool covers bias arising from the randomization process, due to deviations from intended interventions, due to missing outcome data, in measurement of the outcome, and in selection of the reported result. For each item in the tool, a description of what happened in the study is required, which may include verbatim quotes from study reports. Information for assessment of bias due to missing outcome data and selection of the reported result may be most conveniently collected alongside information on outcomes and results. Chapter 7 (Section 7.3.1) discusses some issues in the collection of information for assessments of risk of bias. For non-randomized studies, the most appropriate tool is described in Chapter 25 . A separate tool also covers bias due to missing results in meta-analysis (see Chapter 13 ).

A particularly important piece of information is the funding source of the study and potential conflicts of interest of the study authors.

Some review authors will wish to collect additional information on study characteristics that bear on the quality of the study’s conduct but that may not lead directly to risk of bias, such as whether ethical approval was obtained and whether a sample size calculation was performed a priori.

5.3.3 Participants and setting

Details of participants are collected to enable an understanding of the comparability of, and differences between, the participants within and between included studies, and to allow assessment of how directly or completely the participants in the included studies reflect the original review question.

Typically, aspects that should be collected are those that could (or are believed to) affect the presence or magnitude of an intervention effect and those that could help review users assess applicability to populations beyond the review. For example, if the review authors suspect important differences in intervention effect between different socio-economic groups, this information should be collected. If intervention effects are thought to be constant across such groups, and if such information would not be useful to help apply results, it should not be collected. Participant characteristics that are often useful for assessing applicability include age and sex. Summary information about these should always be collected unless they are not obvious from the context. These characteristics are likely to be presented in different formats (e.g. ages as means or medians, with standard deviations or ranges; sex as percentages or counts for the whole study or for each intervention group separately). Review authors should seek consistent quantities where possible, and decide whether it is more relevant to summarize characteristics for the study as a whole or by intervention group. It may not be possible to select the most consistent statistics until data collection is complete across all or most included studies. Other characteristics that are sometimes important include ethnicity, socio-demographic details (e.g. education level) and the presence of comorbid conditions. Clinical characteristics relevant to the review question (e.g. glucose level for reviews on diabetes) also are important for understanding the severity or stage of the disease.

Diagnostic criteria that were used to define the condition of interest can be a particularly important source of diversity across studies and should be collected. For example, in a review of drug therapy for congestive heart failure, it is important to know how the definition and severity of heart failure was determined in each study (e.g. systolic or diastolic dysfunction, severe systolic dysfunction with ejection fractions below 20%). Similarly, in a review of antihypertensive therapy, it is important to describe baseline levels of blood pressure of participants.

If the settings of studies may influence intervention effects or applicability, then information on these should be collected. Typical settings of healthcare intervention studies include acute care hospitals, emergency facilities, general practice, and extended care facilities such as nursing homes, offices, schools, and communities. Sometimes studies are conducted in different geographical regions with important differences that could affect delivery of an intervention and its outcomes, such as cultural characteristics, economic context, or rural versus city settings. Timing of the study may be associated with important technology differences or trends over time. If such information is important for the interpretation of the review, it should be collected.

Important characteristics of the participants in each included study should be summarized for the reader in the table of ‘Characteristics of included studies’.

5.3.4 Interventions

Details of all experimental and comparator interventions of relevance to the review should be collected. Again, details are required for aspects that could affect the presence or magnitude of an effect or that could help review users assess applicability to their own circumstances. Where feasible, information should be sought (and presented in the review) that is sufficient for replication of the interventions under study. This includes any co-interventions administered as part of the study, and applies similarly to comparators such as ‘usual care’. Review authors may need to request missing information from study authors.

The Template for Intervention Description and Replication (TIDieR) provides a comprehensive framework for full description of interventions and has been proposed for use in systematic reviews as well as reports of primary studies (Hoffmann et al 2014). The checklist includes descriptions of:

  • the rationale for the intervention and how it is expected to work;
  • any documentation that instructs the recipient on the intervention;
  • what the providers do to deliver the intervention (procedures and processes);
  • who provides the intervention (including their skill level), how (e.g. face to face, web-based) and in what setting (e.g. home, school, or hospital);
  • the timing and intensity;
  • whether any variation is permitted or expected, and whether modifications were actually made; and
  • any strategies used to ensure or assess fidelity or adherence to the intervention, and the extent to which the intervention was delivered as planned.

For clinical trials of pharmacological interventions, key information to collect will often include routes of delivery (e.g. oral or intravenous delivery), doses (e.g. amount or intensity of each treatment, frequency of delivery), timing (e.g. within 24 hours of diagnosis), and length of treatment. For other interventions, such as those that evaluate psychotherapy, behavioural and educational approaches, or healthcare delivery strategies, the amount of information required to characterize the intervention will typically be greater, including information about multiple elements of the intervention, who delivered it, and the format and timing of delivery. Chapter 17 provides further information on how to manage intervention complexity, and how the intervention Complexity Assessment Tool (iCAT) can facilitate data collection (Lewin et al 2017).

Important characteristics of the interventions in each included study should be summarized for the reader in the table of ‘Characteristics of included studies’. Additional tables or diagrams such as logic models ( Chapter 2, Section 2.5.1 ) can assist descriptions of multi-component interventions so that review users can better assess review applicability to their context.

5.3.4.1 Integrity of interventions

The degree to which specified procedures or components of the intervention are implemented as planned can have important consequences for the findings from a study. We describe this as intervention integrity ; related terms include adherence, compliance and fidelity (Carroll et al 2007). The verification of intervention integrity may be particularly important in reviews of non-pharmacological trials such as behavioural interventions and complex interventions, which are often implemented in conditions that present numerous obstacles to idealized delivery.

It is generally expected that reports of randomized trials provide detailed accounts of intervention implementation (Zwarenstein et al 2008, Moher et al 2010). In assessing whether interventions were implemented as planned, review authors should bear in mind that some interventions are standardized (with no deviations permitted in the intervention protocol), whereas others explicitly allow a degree of tailoring (Zwarenstein et al 2008). In addition, the growing field of implementation science has led to an increased awareness of the impact of setting and context on delivery of interventions (Damschroder et al 2009). (See Chapter 17, Section 17.1.2.1 for further information and discussion about how an intervention may be tailored to local conditions in order to preserve its integrity.)

Information about integrity can help determine whether unpromising results are due to a poorly conceptualized intervention or to an incomplete delivery of the prescribed components. It can also reveal important information about the feasibility of implementing a given intervention in real life settings. If it is difficult to achieve full implementation in practice, the intervention will have low feasibility (Dusenbury et al 2003).

Whether a lack of intervention integrity leads to a risk of bias in the estimate of its effect depends on whether review authors and users are interested in the effect of assignment to intervention or the effect of adhering to intervention, as discussed in more detail in Chapter 8, Section 8.2.2 . Assessment of deviations from intended interventions is important for assessing risk of bias in the latter, but not the former (see Chapter 8, Section 8.4 ), but both may be of interest to decision makers in different ways.

An example of a Cochrane Review evaluating intervention integrity is provided by a review of smoking cessation in pregnancy (Chamberlain et al 2017). The authors found that process evaluation of the intervention occurred in only some trials and that the implementation was less than ideal in others, including some of the largest trials. The review highlighted how the transfer of an intervention from one setting to another may reduce its effectiveness when elements are changed, or aspects of the materials are culturally inappropriate.

5.3.4.2 Process evaluations

Process evaluations seek to evaluate the process (and mechanisms) between the intervention’s intended implementation and the actual effect on the outcome (Moore et al 2015). Process evaluation studies are characterized by a flexible approach to data collection and the use of numerous methods to generate a range of different types of data, encompassing both quantitative and qualitative methods. Guidance for including process evaluations in systematic reviews is provided in Chapter 21 . When it is considered important, review authors should aim to collect information on whether the trial accounted for, or measured, key process factors and whether the trials that thoroughly addressed integrity showed a greater impact. Process evaluations can be a useful source of factors that potentially influence the effectiveness of an intervention.

5.3.5 Outcomes

An outcome is an event or a measurement value observed or recorded for a particular person or intervention unit in a study during or following an intervention, and that is used to assess the efficacy and safety of the studied intervention (Meinert 2012). Review authors should indicate in advance whether they plan to collect information about all outcomes measured in a study or only those outcomes of (pre-specified) interest in the review. Research has shown that trials addressing the same condition and intervention seldom agree on which outcomes are the most important, and consequently report on numerous different outcomes (Dwan et al 2014, Ismail et al 2014, Denniston et al 2015, Saldanha et al 2017a). The selection of outcomes across systematic reviews of the same condition is also inconsistent (Page et al 2014, Saldanha et al 2014, Saldanha et al 2016, Liu et al 2017). Outcomes used in trials and in systematic reviews of the same condition have limited overlap (Saldanha et al 2017a, Saldanha et al 2017b).

We recommend that only the outcomes defined in the protocol be described in detail. However, a complete list of the names of all outcomes measured may allow a more detailed assessment of the risk of bias due to missing outcome data (see Chapter 13 ).

Review authors should collect all five elements of an outcome (Zarin et al 2011, Saldanha et al 2014), as illustrated in the sketch after this list:

1. outcome domain or title (e.g. anxiety);

2. measurement tool or instrument (including definition of clinical outcomes or endpoints); for a scale, the name of the scale (e.g. the Hamilton Anxiety Rating Scale), its upper and lower limits, whether a high or low score is favourable, and definitions of any thresholds if appropriate;

3. specific metric used to characterize each participant’s results (e.g. post-intervention anxiety, or change in anxiety from baseline to a post-intervention time point, or post-intervention presence of anxiety (yes/no));

4. method of aggregation (e.g. mean and standard deviation of anxiety scores in each group, or proportion of people with anxiety);

5. timing of outcome measurements (e.g. assessments at end of eight-week intervention period, events occurring during eight-week intervention period).
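
To keep these five elements together during data extraction, they can be captured as one structured record per outcome. The Python sketch below is illustrative only: the class and field names are assumptions, and any form or data system that records the same five elements would serve equally well.

```python
# Illustrative record of the five elements of a fully specified outcome.
# The class and field names are assumptions for this sketch, not a template.
from dataclasses import dataclass

@dataclass
class OutcomeSpecification:
    domain: str       # 1. outcome domain or title, e.g. "anxiety"
    instrument: str   # 2. measurement tool/instrument, including scale range and direction
    metric: str       # 3. metric characterizing each participant, e.g. change from baseline
    aggregation: str  # 4. method of aggregation, e.g. mean and SD per group
    timing: str       # 5. timing of the outcome measurement

anxiety = OutcomeSpecification(
    domain="Anxiety",
    instrument="Hamilton Anxiety Rating Scale (0-56; lower score is better)",
    metric="Change from baseline to end of intervention",
    aggregation="Mean and standard deviation per intervention group",
    timing="End of 8-week intervention period",
)
```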

Further considerations for economics outcomes are discussed in Chapter 20 , and for patient-reported outcomes in Chapter 18 .

5.3.5.1 Adverse effects

Collection of information about the harmful effects of an intervention can pose particular difficulties, discussed in detail in Chapter 19 . These outcomes may be described using multiple terms, including ‘adverse event’, ‘adverse effect’, ‘adverse drug reaction’, ‘side effect’ and ‘complication’. Many of these terminologies are used interchangeably in the literature, although some are technically different. Harms might additionally be interpreted to include undesirable changes in other outcomes measured during a study, such as a decrease in quality of life where an improvement may have been anticipated.

In clinical trials, adverse events can be collected either systematically or non-systematically. Systematic collection refers to collecting adverse events in the same manner for each participant using defined methods such as a questionnaire or a laboratory test. For systematically collected outcomes representing harm, data can be collected by review authors in the same way as efficacy outcomes (see Section 5.3.5 ).

Non-systematic collection refers to the collection of information on adverse events using methods such as open-ended questions (e.g. ‘Have you noticed any symptoms since your last visit?’) or spontaneous reports by participants. In either case, adverse events may be selectively reported based on their severity, and on whether the participant suspected that the effect may have been caused by the intervention, which could lead to bias in the available data. Unfortunately, most adverse events are collected non-systematically rather than systematically, creating a challenge for review authors. The following pieces of information are useful and worth collecting (Nicole Fusco, personal communication), and a structured sketch follows this list:

  • any coding system or standard medical terminology used (e.g. COSTART, MedDRA), including version number;
  • name of the adverse events (e.g. dizziness);
  • reported intensity of the adverse event (e.g. mild, moderate, severe);
  • whether the trial investigators categorized the adverse event as ‘serious’;
  • whether the trial investigators identified the adverse event as being related to the intervention;
  • time point (most commonly measured as a count over the duration of the study);
  • any reported methods for how adverse events were selected for inclusion in the publication (e.g. ‘We reported all adverse events that occurred in at least 5% of participants’); and
  • associated results.
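
A structured record can help ensure these items are captured in the same way for every adverse event result. The sketch below mirrors the list above; the field names and example values are assumptions, not a standard Cochrane template.

```python
# Minimal sketch of an adverse event record mirroring the items listed above.
# Field names and example values are illustrative assumptions only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdverseEventResult:
    coding_system: Optional[str]        # e.g. "MedDRA v23.0", if one was used
    event_name: str                     # e.g. "dizziness"
    intensity: Optional[str]            # e.g. "mild", "moderate", "severe"
    serious: Optional[bool]             # categorized as 'serious' by the investigators?
    related_to_intervention: Optional[bool]
    time_point: str                     # most commonly a count over the study duration
    selection_rule: Optional[str]       # any reported rule for which events were published
    events: Optional[int]               # associated results: number of events
    participants_at_risk: Optional[int]

dizziness = AdverseEventResult(
    coding_system="MedDRA v23.0", event_name="Dizziness", intensity="moderate",
    serious=False, related_to_intervention=True,
    time_point="Count over 12-week study period",
    selection_rule="Events occurring in at least 5% of participants",
    events=14, participants_at_risk=120,
)
```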

Different collection methods lead to very different accounting of adverse events (Safer 2002, Bent et al 2006, Ioannidis et al 2006, Carvajal et al 2011, Allen et al 2013). Non-systematic collection methods tend to underestimate how frequently an adverse event occurs. It is particularly problematic when the adverse event of interest to the review is collected systematically in some studies but non-systematically in other studies. Different collection methods introduce an important source of heterogeneity. In addition, when non-systematic adverse events are reported based on quantitative selection criteria (e.g. only adverse events that occurred in at least 5% of participants were included in the publication), use of reported data alone may bias the results of meta-analyses. Review authors should be cautious of (or refrain from) synthesizing adverse events that are collected differently.

Regardless of the collection methods, precise definitions of adverse effect outcomes and their intensity should be recorded, since they may vary between studies. For example, in a review of aspirin and gastrointestinal haemorrhage, some trials simply reported gastrointestinal bleeds, while others reported specific categories of bleeding, such as haematemesis, melaena, and proctorrhagia (Derry and Loke 2000). The definition and reporting of severity of the haemorrhages (e.g. major, severe, requiring hospital admission) also varied considerably among the trials (Zanchetti and Hansson 1999). Moreover, a particular adverse effect may be described or measured in different ways among the studies. For example, the terms ‘tiredness’, ‘fatigue’ or ‘lethargy’ may all be used in reporting of adverse effects. Study authors also may use different thresholds for ‘abnormal’ results (e.g. hypokalaemia diagnosed at a serum potassium concentration of 3.0 mmol/L or 3.5 mmol/L).

The absence of any mention of adverse events in a trial report does not necessarily mean that no adverse events occurred; it is usually safest to assume that they were not reported. Quality of life measures are sometimes used as a measure of the participants’ experience during the study, but these are usually general measures that do not look specifically at particular adverse effects of the intervention. While quality of life measures are important and can be used to gauge overall participant well-being, they should not be regarded as substitutes for a detailed evaluation of safety and tolerability.

5.3.6 Results

Results data arise from the measurement or ascertainment of outcomes for individual participants in an intervention study. Results data may be available for each individual in a study (i.e. individual participant data; see Chapter 26 ), or summarized at arm level, or summarized at study level into an intervention effect by comparing two intervention arms. Results data should be collected only for the intervention groups and outcomes specified to be of interest in the protocol (see MECIR Box 5.3.b ). Results for other outcomes should not be collected unless the protocol is modified to add them. Any modification should be reported in the review. However, review authors should be alert to the possibility of important, unexpected findings, particularly serious adverse effects.

MECIR Box 5.3.b Relevant expectations for conduct of intervention reviews

Reports of studies often include several results for the same outcome. For example, different measurement scales might be used, results may be presented separately for different subgroups, and outcomes may have been measured at different follow-up time points. Variation in the results can be very large, depending on which data are selected (Gøtzsche et al 2007, Mayo-Wilson et al 2017a). Review protocols should be as specific as possible about which outcome domains, measurement tools, time points, and summary statistics (e.g. final values versus change from baseline) are to be collected (Mayo-Wilson et al 2017b). A framework should be pre-specified in the protocol to facilitate making choices between multiple eligible measures or results. For example, a hierarchy of preferred measures might be created, or plans articulated to select the result with the median effect size, or to average across all eligible results for a particular outcome domain (see also Chapter 9, Section 9.3.3 ). Any additional decisions or changes to this framework made once the data are collected should be reported in the review as changes to the protocol.

Section 5.6 describes the numbers that will be required to perform meta-analysis, if appropriate. The unit of analysis (e.g. participant, cluster, body part, treatment period) should be recorded for each result when it is not obvious (see Chapter 6, Section 6.2 ). The type of outcome data determines the nature of the numbers that will be sought for each outcome. For example, for a dichotomous (‘yes’ or ‘no’) outcome, the number of participants and the number who experienced the outcome will be sought for each group. It is important to collect the sample size relevant to each result, although this is not always obvious. A flow diagram as recommended in the CONSORT Statement (Moher et al 2001) can help to determine the flow of participants through a study. If one is not available in a published report, review authors can consider drawing one (available from www.consort-statement.org ).

The numbers required for meta-analysis are not always available. Often, other statistics can be collected and converted into the required format. For example, for a continuous outcome, it is usually most convenient to seek the number of participants, the mean and the standard deviation for each intervention group. These are often not available directly, especially the standard deviation. Alternative statistics enable calculation or estimation of the missing standard deviation (such as a standard error, a confidence interval, a test statistic (e.g. from a t-test or F-test) or a P value). These should be extracted if they provide potentially useful information (see MECIR Box 5.3.c ). Details of recalculation are provided in Section 5.6 . Further considerations for dealing with missing data are discussed in Chapter 10, Section 10.12 .
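
For instance, when only a standard error or a 95% confidence interval for a mean is reported, the standard deviation for that group can often be back-calculated. The sketch below shows the two most common conversions under the usual assumptions (a conventional 95% interval and a sample large enough for the normal approximation); Section 5.6 gives the details of recalculation.

```python
# Back-calculating a group's standard deviation for meta-analysis.
# Assumes a 95% confidence interval and a sample large enough for the normal
# approximation; for small samples replace 3.92 with twice the t-value for n-1 df.
import math

def sd_from_se(se: float, n: int) -> float:
    """Standard deviation from the standard error of a mean."""
    return se * math.sqrt(n)

def sd_from_ci(lower: float, upper: float, n: int) -> float:
    """Standard deviation from a 95% confidence interval for a mean."""
    return math.sqrt(n) * (upper - lower) / 3.92

print(sd_from_se(se=0.5, n=100))                 # 5.0
print(sd_from_ci(lower=8.5, upper=12.5, n=60))   # approximately 7.9
```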

MECIR Box 5.3.c Relevant expectations for conduct of intervention reviews

5.3.7 Other information to collect

We recommend that review authors collect the key conclusions of the included study as reported by its authors. It is not necessary to report these conclusions in the review, but they should be used to verify the results of analyses undertaken by the review authors, particularly in relation to the direction of effect. Further comments by the study authors, for example any explanations they provide for unexpected findings, may be noted. References to other studies that are cited in the study report may be useful, although review authors should be aware of the possibility of citation bias (see Chapter 7, Section 7.2.3.2 ). Documentation of any correspondence with the study authors is important for review transparency.

5.4 Data collection tools

5.4.1 Rationale for data collection forms

Data collection for systematic reviews should be performed using structured data collection forms (see MECIR Box 5.4.a ). These can be paper forms, electronic forms (e.g. Google Form), or commercially or custom-built data systems (e.g. Covidence, EPPI-Reviewer, Systematic Review Data Repository (SRDR)) that allow online form building, data entry by several users, data sharing, and efficient data management (Li et al 2015). All different means of data collection require data collection forms.

MECIR Box 5.4.a Relevant expectations for conduct of intervention reviews

The data collection form is a bridge between what is reported by the original investigators (e.g. in journal articles, abstracts, personal correspondence) and what is ultimately reported by the review authors. The data collection form serves several important functions (Meade and Richardson 1997). First, the form is linked directly to the review question and criteria for assessing eligibility of studies, and provides a clear summary of these that can be used to identify and structure the data to be extracted from study reports. Second, the data collection form is the historical record of the provenance of the data used in the review, as well as the multitude of decisions (and changes to decisions) that occur throughout the review process. Third, the form is the source of data for inclusion in an analysis.

Given the important functions of data collection forms, ample time and thought should be invested in their design. Because each review is different, data collection forms will vary across reviews. However, there are many similarities in the types of information that are important. Thus, forms can be adapted from one review to the next. Although we use the term ‘data collection form’ in the singular, in practice it may be a series of forms used for different purposes: for example, a separate form could be used to assess the eligibility of studies for inclusion in the review to assist in the quick identification of studies to be excluded from or included in the review.

5.4.2 Considerations in selecting data collection tools

The choice of data collection tool is largely dependent on review authors’ preferences, the size of the review, and resources available to the author team. Potential advantages and considerations of selecting one data collection tool over another are outlined in Table 5.4.a (Li et al 2015). A significant advantage that data systems have is in data management ( Chapter 1, Section 1.6 ) and re-use. They make review updates more efficient, and also facilitate methodological research across reviews. Numerous ‘meta-epidemiological’ studies have been carried out using Cochrane Review data, resulting in methodological advances which would not have been possible if thousands of studies had not all been described using the same data structures in the same system.

Some data collection tools, such as CSV (Excel) files and Covidence, facilitate automatic import of extracted data into RevMan (Cochrane’s authoring tool). Details are available at https://documentation.cochrane.org/revman-kb/populate-study-data-260702462.html

Table 5.4.a Considerations in selecting data collection tools

5.4.3 Design of a data collection form

Regardless of whether data are collected using a paper or electronic form, or a data system, the key to successful data collection is to construct easy-to-use forms and collect sufficient and unambiguous data that faithfully represent the source in a structured and organized manner (Li et al 2015). In most cases, a document format should be developed for the form before building an electronic form or a data system. This can be distributed to others, including programmers and data analysts, and as a guide for creating an electronic form and any guidance or codebook to be used by data extractors. Review authors also should consider compatibility of any electronic form or data system with analytical software, as well as mechanisms for recording, assessing and correcting data entry errors.

Data described in multiple reports (or even within a single report) of a study may not be consistent. Review authors will need to describe how they work with multiple reports in the protocol, for example, by pre-specifying which report will be used when sources contain conflicting data that cannot be resolved by contacting the investigators. Likewise, when there is only one report identified for a study, review authors should specify the section within the report (e.g. abstract, methods, results, tables, and figures) for use in case of inconsistent information.

If review authors wish to import their extracted data into RevMan automatically, it is advisable that their data collection forms match the data extraction templates available via the RevMan Knowledge Base. Details are available at https://documentation.cochrane.org/revman-kb/data-extraction-templates-260702375.html.

A good data collection form should minimize the need to go back to the source documents. When designing a data collection form, review authors should involve all members of the team, that is, content area experts, authors with experience in systematic review methods and data collection form design, statisticians, and persons who will perform data extraction. Here are suggested steps and some tips for designing a data collection form, based on the informal collation of experiences from numerous review authors (Li et al 2015).

Step 1. Develop outlines of tables and figures expected to appear in the systematic review, considering the comparisons to be made between different interventions within the review, and the various outcomes to be measured. This step will help review authors decide the right amount of data to collect (not too much or too little). Collecting too much information can lead to forms that are longer than original study reports, and can be very wasteful of time. Collection of too little information, or omission of key data, can lead to the need to return to study reports later in the review process.

Step 2. Assemble and group data elements to facilitate form development. Review authors should consult Table 5.3.a , in which the data elements are grouped to facilitate form development and data collection. Note that it may be more efficient to group data elements in the order in which they are usually found in study reports (e.g. starting with reference information, followed by eligibility criteria, intervention description, statistical methods, baseline characteristics and results).

Step 3. Identify the optimal way of framing the data items. Much has been written about how to frame data items for developing robust data collection forms in primary research studies. We summarize a few key points and highlight issues that are pertinent to systematic reviews.

  • Ask closed-ended questions (i.e. questions that define a list of permissible responses) as much as possible. Closed-ended questions do not require post hoc coding and provide better control over data quality than open-ended questions. When setting up a closed-ended question, one must anticipate and structure possible responses and include an ‘other, specify’ category because the anticipated list may not be exhaustive. Avoid asking data extractors to summarize data into uncoded text, no matter how short it is.
  • Avoid asking a question in a way that the response may be left blank. Include ‘not applicable’, ‘not reported’ and ‘cannot tell’ options as needed. The ‘cannot tell’ option tags uncertain items that may prompt review authors to contact study authors for clarification, especially on data items critical to reaching conclusions (see the sketch of an example item after this list).
  • Remember that the form will focus on what is reported in the article rather than what was done in the study. The study report may not fully reflect how the study was actually conducted. For example, a question ‘Did the article report that the participants were masked to the intervention?’ is more appropriate than ‘Were participants masked to the intervention?’
  • Where a judgement is required, record the raw data (i.e. quote directly from the source document) used to make the judgement. It is also important to record the source of information collected, including where it was found in a report or whether information was obtained from unpublished sources or personal communications. As much as possible, questions should be asked in a way that minimizes subjective interpretation and judgement to facilitate data comparison and adjudication.
  • Incorporate flexibility to allow for variation in how data are reported. It is strongly recommended that outcome data be collected in the format in which they were reported and transformed in a subsequent step if required. Review authors also should consider the software they will use for analysis and for publishing the review (e.g. RevMan).
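
One way to operationalize these points is to define each form item with an explicit list of permitted responses, including ‘other, specify’, ‘not applicable’, ‘not reported’ and ‘cannot tell’ options, together with fields for a verbatim quote and its source. The structure below is a hypothetical sketch, not a Cochrane-specified schema.

```python
# Hypothetical definition of one closed-ended data collection item.
# The identifiers, option list and field names are illustrative only.
ALLOCATION_CONCEALMENT_ITEM = {
    "item_id": "D12",
    "question": "Did the article report that the allocation sequence was concealed?",
    "response_options": [
        "Yes - central allocation",
        "Yes - sequentially numbered, opaque, sealed envelopes",
        "Other, specify",   # the anticipated list may not be exhaustive
        "Not applicable",
        "Not reported",
        "Cannot tell",      # flags items for possible author correspondence
    ],
    "require_quote": True,   # verbatim text from the source supporting the response
    "require_source": True,  # e.g. "Methods, p. 4" or "ClinicalTrials.gov record"
    "notes": "",
}
```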

Step 4. Develop and pilot-test data collection forms, ensuring that they provide data in the right format and structure for subsequent analysis. In addition to data items described in Step 2, data collection forms should record the title of the review as well as the person who is completing the form and the date of completion. Forms occasionally need revision; forms should therefore include the version number and version date to reduce the chances of using an outdated form by mistake. Because a study may be associated with multiple reports, it is important to record the study ID as well as the report ID. Definitions and instructions helpful for answering a question should appear next to the question to improve quality and consistency across data extractors (Stock 1994). Provide space for notes, regardless of whether paper or electronic forms are used.

All data collection forms and data systems should be thoroughly pilot-tested before launch (see MECIR Box 5.4.a ). Testing should involve several people extracting data from at least a few articles. The initial testing focuses on the clarity and completeness of questions. Users of the form may provide feedback that certain coding instructions are confusing or incomplete (e.g. a list of options may not cover all situations). The testing may identify data that are missing from the form, or likely to be superfluous. After initial testing, accuracy of the extracted data should be checked against the source document or verified data to identify problematic areas. It is wise to draft entries for the table of ‘Characteristics of included studies’ and complete a risk of bias assessment ( Chapter 8 ) using these pilot reports to ensure all necessary information is collected. A consensus between review authors may be required before the form is modified to avoid any misunderstandings or later disagreements. It may be necessary to repeat the pilot testing on a new set of reports if major changes are needed after the first pilot test.

Problems with the data collection form may surface after pilot testing has been completed, and the form may need to be revised after data extraction has started. When changes are made to the form or coding instructions, it may be necessary to return to reports that have already undergone data extraction. In some situations, it may be necessary to clarify only coding instructions without modifying the actual data collection form.

5.5 Extracting data from reports

5.5.1 Introduction

In most systematic reviews, the primary source of information about each study is published reports of studies, usually in the form of journal articles. Despite recent developments in machine learning models to automate data extraction in systematic reviews (see Section 5.5.9 ), data extraction is still largely a manual process. Electronic searches for text can provide a useful aid to locating information within a report. Examples include using search facilities in PDF viewers, internet browsers and word processing software. However, text searching should not be considered a replacement for reading the report, since information may be presented using variable terminology and presented in multiple formats.

5.5.2 Who should extract data?

Data extractors should have at least a basic understanding of the topic, and have knowledge of study design, data analysis and statistics. They should pay attention to detail while following instructions on the forms. Because errors that occur at the data extraction stage are rarely detected by peer reviewers, editors, or users of systematic reviews, it is recommended that more than one person extract data from every report to minimize errors and reduce introduction of potential biases by review authors (see MECIR Box 5.5.a ). As a minimum, information that involves subjective interpretation and information that is critical to the interpretation of results (e.g. outcome data) should be extracted independently by at least two people (see MECIR Box 5.5.a ). In common with implementation of the selection process ( Chapter 4, Section 4.6 ), it is preferable that data extractors are from complementary disciplines, for example a methodologist and a topic area specialist. It is important that everyone involved in data extraction has practice using the form and, if the form was designed by someone else, receives appropriate training.

Evidence in support of duplicate data extraction comes from several indirect sources. One study observed that independent data extraction by two authors resulted in fewer errors than data extraction by a single author followed by verification by a second (Buscemi et al 2006). A high prevalence of data extraction errors (errors in 20 out of 34 reviews) has been observed (Jones et al 2005). A further study of data extraction to compute standardized mean differences found that a minimum of seven out of 27 reviews had substantial errors (Gøtzsche et al 2007).

MECIR Box 5.5.a Relevant expectations for conduct of intervention reviews

5.5.3 Training data extractors

Training of data extractors is intended to familiarize them with the review topic and methods, the data collection form or data system, and issues that may arise during data extraction. Results of the pilot testing of the form should prompt discussion among review authors and extractors of ambiguous questions or responses to establish consistency. Training should take place at the onset of the data extraction process and periodically over the course of the project (Li et al 2015). For example, when data related to a single item on the form are present in multiple locations within a report (e.g. abstract, main body of text, tables, and figures) or in several sources (e.g. publications, ClinicalTrials.gov, or CSRs), the development and documentation of instructions to follow an agreed algorithm are critical and should be reinforced during the training sessions.

Some have proposed that certain information in a report, such as the names of its authors, be concealed from review authors prior to data extraction and assessment of risk of bias (Jadad et al 1996). However, blinding of review authors to aspects of study reports generally is not recommended for Cochrane Reviews as there is little evidence that it alters the decisions made (Berlin 1997).

5.5.4 Extracting data from multiple reports of the same study

Studies frequently are reported in more than one publication or in more than one source (Tramèr et al 1997, von Elm et al 2004). A single source rarely provides complete information about a study; on the other hand, multiple sources may contain conflicting information about the same study (Mayo-Wilson et al 2017a, Mayo-Wilson et al 2017b, Mayo-Wilson et al 2018). Because the unit of interest in a systematic review is the study and not the report, information from multiple reports often needs to be collated and reconciled. It is not appropriate to discard any report of an included study without careful examination, since it may contain valuable information not included in the primary report. Review authors will need to decide between two strategies:

  • Extract data from each report separately, then combine information across multiple data collection forms.
  • Extract data from all reports directly into a single data collection form.

The choice of which strategy to use will depend on the nature of the reports and may vary across studies and across reports. For example, when a full journal article and multiple conference abstracts are available, it is likely that the majority of information will be obtained from the journal article; completing a new data collection form for each conference abstract may be a waste of time. Conversely, when there are two or more detailed journal articles, perhaps relating to different periods of follow-up, then it is likely to be easier to perform data extraction separately for these articles and collate information from the data collection forms afterwards. When data from all reports are extracted into a single data collection form, review authors should identify the ‘main’ data source for each study when sources include conflicting data and these differences cannot be resolved by contacting authors (Mayo-Wilson et al 2018). Flow diagrams such as those modified from the PRISMA statement can be particularly helpful when collating and documenting information from multiple reports (Mayo-Wilson et al 2018).

5.5.5 Reliability and reaching consensus

When more than one author extracts data from the same reports, there is potential for disagreement. After data have been extracted independently by two or more extractors, responses must be compared to assure agreement or to identify discrepancies. An explicit procedure or decision rule should be specified in the protocol for identifying and resolving disagreements. Most often, the source of the disagreement is an error by one of the extractors and is easily resolved. Thus, discussion among the authors is a sensible first step. More rarely, a disagreement may require arbitration by another person. Any disagreement that cannot be resolved should be addressed by contacting the study authors; if this is unsuccessful, the disagreement should be reported in the review.

The presence and resolution of disagreements should be carefully recorded. Maintaining a copy of the data ‘as extracted’ (in addition to the consensus data) allows assessment of reliability of coding. Examples of ways in which this can be achieved include the following:

  • Use one author’s (paper) data collection form and record changes after consensus in a different ink colour.
  • Enter consensus data onto an electronic form.
  • Record original data extracted and consensus data in separate forms (some online tools do this automatically).

Agreement of coded items before reaching consensus can be quantified, for example using kappa statistics (Orwin 1994), although this is not routinely done in Cochrane Reviews. If agreement is assessed, this should be done only for the most important data (e.g. key risk of bias assessments, or availability of key outcomes).
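
For teams that do choose to quantify agreement, Cohen’s kappa compares the observed proportion of agreement with the agreement expected by chance. The sketch below is a minimal, self-contained illustration; standard statistical packages provide equivalent functions, and the example judgements are invented.

```python
# Minimal Cohen's kappa for two extractors' codings of the same items.
# Illustrative only; the example judgements below are invented.
from collections import Counter

def cohens_kappa(codes_a: list, codes_b: list) -> float:
    assert codes_a and len(codes_a) == len(codes_b), "paired, non-empty codings required"
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two extractors' risk-of-bias judgements for ten studies:
extractor_1 = ["low", "low", "high", "some", "low", "high", "low", "some", "low", "high"]
extractor_2 = ["low", "low", "high", "low",  "low", "high", "low", "some", "low", "some"]
print(round(cohens_kappa(extractor_1, extractor_2), 2))  # 0.67
```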

Throughout the review process informal consideration should be given to the reliability of data extraction. For example, if after reaching consensus on the first few studies, the authors note a frequent disagreement for specific data, then coding instructions may need modification. Furthermore, an author’s coding strategy may change over time, as the coding rules are forgotten, indicating a need for retraining and, possibly, some recoding.

5.5.6 Extracting data from clinical study reports

Clinical study reports (CSRs) obtained for a systematic review are likely to be in PDF format. Although CSRs can be thousands of pages in length and very time-consuming to review, they typically follow the content and format required by the International Conference on Harmonisation (ICH 1995). Information in CSRs is usually presented in a structured and logical way. For example, numerical data pertaining to important demographic, efficacy, and safety variables are placed within the main text in tables and figures. Because of the clarity and completeness of information provided in CSRs, data extraction from CSRs may be clearer and conducted more confidently than from journal articles or other short reports.

To extract data from CSRs efficiently, review authors should familiarize themselves with the structure of the CSRs. In practice, review authors may want to browse or create ‘bookmarks’ within a PDF document that record section headers and subheaders, and to search for key words related to the data extraction (e.g. randomization). In addition, when additional analyses are required, it may be useful to use optical character recognition software to convert tables of data in the PDF to an analysable format, saving time and minimizing transcription errors.
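
As a minimal illustration of the latter point, the sketch below uses pdfplumber, one of several open-source Python libraries that can read text and tables from text-based (non-scanned) PDFs; scanned CSRs would still require optical character recognition first. The file name, page index and keyword are hypothetical.

```python
# Illustrative only: locating a keyword and pulling detected tables out of a
# text-based CSR PDF. File name, keyword and page choice are hypothetical.
import pdfplumber

with pdfplumber.open("csr_study_001.pdf") as pdf:
    # Pages whose text mentions the keyword of interest
    hits = [i for i, page in enumerate(pdf.pages)
            if "randomization" in (page.extract_text() or "").lower()]
    print("Pages mentioning 'randomization':", hits)

    # Any tables detected on the first matching page, as lists of rows
    tables = pdf.pages[hits[0]].extract_tables() if hits else []
    for table in tables:
        for row in table:
            print(row)
```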

CSRs may contain many outcomes and present many results for a single outcome (due to different analyses) (Mayo-Wilson et al 2017b). We recommend review authors extract results only for outcomes of interest to the review (Section 5.3.6 ). With regard to different methods of analysis, review authors should have a plan and pre-specify preferred metrics in their protocol for extracting results pertaining to different populations (e.g. ‘all randomized’, ‘all participants taking at least one dose of medication’), methods for handling missing data (e.g. ‘complete case analysis’, ‘multiple imputation’), and adjustment (e.g. unadjusted, adjusted for baseline covariates). It may be important to record the range of analysis options available, even if not all are extracted in detail. In some cases it may be preferable to use metrics that are comparable across multiple included studies, which may not be clear until data collection for all studies is complete.
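
A pre-specified preference order of this kind can be applied mechanically once the available results have been recorded. The sketch below is purely illustrative: the population labels, their ordering and the extracted numbers are hypothetical and would in practice be defined in the review protocol.

```python
# Illustrative only: choosing which of several reported analyses to extract,
# according to a preference order pre-specified in the protocol.
PREFERENCE = [
    "all randomized",
    "all participants taking at least one dose",
    "complete case analysis",
]

# Results reported in the CSR for one outcome (hypothetical values)
available_results = {
    "complete case analysis": {"mean_difference": -1.9, "se": 0.8},
    "all randomized":         {"mean_difference": -1.6, "se": 0.7},
}

for population in PREFERENCE:
    if population in available_results:
        print(f"Extracting '{population}':", available_results[population])
        break
else:
    print("None of the pre-specified analysis populations was reported.")
```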

CSRs are particularly useful for identifying outcomes assessed but not presented to the public. For efficacy outcomes and systematically collected adverse events, review authors can compare what is described in the CSRs with what is reported in published reports to assess the risk of bias due to missing outcome data ( Chapter 8, Section 8.5 ) and in selection of reported result ( Chapter 8, Section 8.7 ). Note that non-systematically collected adverse events are not amenable to such comparisons because these adverse events may not be known ahead of time and thus not pre-specified in the protocol.

5.5.7 Extracting data from regulatory reviews

Data most relevant to systematic reviews can be found in the medical and statistical review sections of a regulatory review. Both of these are substantially longer than journal articles (Turner 2013). A list of all trials on a drug usually can be found in the medical review. Because trials are referenced by a combination of numbers and letters, it may be difficult for the review authors to link the trial with other reports of the same trial (Section 5.2.1 ).

Many of the documents downloaded from the US Food and Drug Administration’s website for older drugs are scanned copies with confidential information redacted, and they are not searchable (Turner 2013). Optical character recognition software can convert most of the text. Reviews for newer drugs have been redacted electronically and remain searchable as a result.

Compared to CSRs, regulatory reviews contain less information about trial design, execution, and results. They provide limited information for assessing the risk of bias. In terms of extracting outcomes and results, review authors should follow the guidance provided for CSRs (Section 5.5.6 ).

5.5.8 Extracting data from figures with software

Sometimes numerical data needed for systematic reviews are only presented in figures. Review authors may request the data from the study investigators, or alternatively, extract the data from the figures either manually (e.g. with a ruler) or by using software. Numerous tools are available, many of which are free. Those available at the time of writing include Plot Digitizer, WebPlotDigitizer, Engauge, Dexter, ycasd, and GetData Graph Digitizer. The software works by taking an image of a figure and then digitizing the data points off the figure using the axes and scales set by the user. The numbers exported can be used for systematic reviews, although additional calculations may be needed to obtain the summary statistics, such as calculation of means and standard deviations from individual-level data points (or conversion of time-to-event data presented on Kaplan-Meier plots to hazard ratios; see Chapter 6, Section 6.8.2 ).
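
For example, if individual-level data points have been digitized from a scatter plot, the mean and sample standard deviation required for meta-analysis can be computed directly, as in the sketch below (the digitized values are hypothetical).

```python
# Illustrative only: summarizing digitized individual-level y-values exported
# from a figure into the mean and standard deviation needed for meta-analysis.
import numpy as np

digitized_values = np.array([12.1, 14.3, 11.8, 15.0, 13.2, 12.7, 14.9, 13.5])

n = digitized_values.size
mean = digitized_values.mean()
sd = digitized_values.std(ddof=1)   # sample standard deviation

print(f"n = {n}, mean = {mean:.2f}, SD = {sd:.2f}")
```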

It has been demonstrated that software is more convenient and accurate than visual estimation or use of a ruler (Gross et al 2014, Jelicic Kadic et al 2016). Review authors should consider using software for extracting numerical data from figures when the data are not available elsewhere.

5.5.9 Automating data extraction in systematic reviews

Because data extraction is time-consuming and error-prone, automating or semi-automating this step may make the extraction process more efficient and accurate. The state of science relevant to automating data extraction is summarized here (Jonnalagadda et al 2015).

  • At least 26 studies have tested various natural language processing and machine learning approaches for facilitating data extraction for systematic reviews.

  • Each tool focuses on only a limited number of data elements (ranging from one to seven). Most of the existing tools focus on the PICO information (e.g. number of participants, their age, sex, country, recruiting centres, intervention groups, outcomes, and time points). A few are able to extract study design and results (e.g. objectives, study duration, participant flow), and two extract risk of bias information (Marshall et al 2016, Millard et al 2016). To date, well over half of the data elements needed for systematic reviews have not been explored for automated extraction.

  • Most tools highlight the sentence(s) that may contain the data elements as opposed to directly recording these data elements into a data collection form or a data system.
  • There is no gold standard or common dataset to evaluate the performance of these tools, limiting our ability to interpret the significance of the reported accuracy measures.

At the time of writing, we cannot recommend a specific tool for automating data extraction for routine systematic review production. There is a need for review authors to work with experts in informatics to refine these tools and evaluate them rigorously. Such investigations should address how the tool will fit into existing workflows. For example, the automated or semi-automated data extraction approaches may first act as checks for manual data extraction before they can replace it.
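
To make the ‘highlighting’ behaviour described above concrete, the toy sketch below flags sentences that may report the number of randomized participants using a single regular expression. This is purely illustrative: the published tools cited above rely on trained natural language processing and machine learning models rather than hand-written patterns, and the example text is invented.

```python
# Illustrative only: flag candidate sentences that may report how many
# participants were randomized. Not any specific published tool.
import re

abstract = (
    "Patients with chronic pain were recruited from three centres. "
    "A total of 240 participants were randomized to drug or placebo. "
    "The primary outcome was pain intensity at 12 weeks."
)

pattern = re.compile(
    r"\b\d[\d,]*\s+(?:participants|patients)\b.*?\brandomi[sz]ed\b",
    re.IGNORECASE,
)

for sentence in re.split(r"(?<=[.!?])\s+", abstract):
    if pattern.search(sentence):
        print("CANDIDATE:", sentence)
```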

5.5.10 Suspicions of scientific misconduct

Systematic review authors can uncover suspected misconduct in the published literature. Misconduct includes fabrication or falsification of data or results, plagiarism, and research that does not adhere to ethical norms. Review authors need to be aware of scientific misconduct because the inclusion of fraudulent material could undermine the reliability of a review’s findings. Plagiarism of results data in the form of duplicated publication (either by the same or by different authors) may, if undetected, lead to study participants being double counted in a synthesis.

It is preferable to identify potential problems before, rather than after, publication of the systematic review, so that readers are not misled. However, empirical evidence indicates that the extent to which systematic review authors explore misconduct varies widely (Elia et al 2016). Text-matching software and systems such as CrossCheck may be helpful for detecting plagiarism, but they can detect only matching text, so data tables or figures need to be inspected by hand or using other systems (e.g. to detect image manipulation). Lists of data such as in a meta-analysis can be a useful means of detecting duplicated studies. Furthermore, examination of baseline data can lead to suspicions of misconduct for an individual randomized trial (Carlisle et al 2015). For example, Al-Marzouki and colleagues concluded that a trial report was fabricated or falsified on the basis of highly unlikely baseline differences between two randomized groups (Al-Marzouki et al 2005).
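
As a simple illustration of screening a meta-analysis dataset for possible duplicate publication, the sketch below flags study records that report identical summary results. The data and column names are hypothetical, and identical results are only a prompt for closer inspection, not evidence of misconduct.

```python
# Illustrative only: flag study records whose extracted summary results are
# identical, which may indicate covert duplicate publication.
import pandas as pd

data = pd.DataFrame({
    "study":      ["Smith 2010", "Jones 2012", "Lee 2015", "Jones 2013"],
    "n_treat":    [50, 120, 80, 120],
    "mean_treat": [5.2, 7.1, 6.4, 7.1],
    "sd_treat":   [1.1, 2.3, 1.8, 2.3],
})

dupes = data[data.duplicated(subset=["n_treat", "mean_treat", "sd_treat"],
                             keep=False)]
print(dupes)   # 'Jones 2012' and 'Jones 2013' share identical results
```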

Cochrane Review authors are advised to consult with Cochrane editors if cases of suspected misconduct are identified. Searching for comments, letters or retractions may uncover additional information. Sensitivity analyses can be used to determine whether the studies arousing suspicion are influential in the conclusions of the review. Guidance for editors for addressing suspected misconduct will be available from Cochrane’s Editorial Publishing and Policy Resource (see community.cochrane.org ). Further information is available from the Committee on Publication Ethics (COPE; publicationethics.org ), including a series of flowcharts on how to proceed if various types of misconduct are suspected. Cases should be followed up, typically including an approach to the editors of the journals in which suspect reports were published. It may be useful to write first to the primary investigators to request clarification of apparent inconsistencies or unusual observations.

Because investigations may take time, and institutions may not always be responsive (Wager 2011), articles suspected of being fraudulent should be classified as ‘awaiting assessment’. If a misconduct investigation indicates that the publication is unreliable, or if a publication is retracted, it should not be included in the systematic review, and the reason should be noted in the ‘excluded studies’ section.

5.5.11 Key points in planning and reporting data extraction

In summary, the methods section of both the protocol and the review should detail:

  • the data categories that are to be extracted;
  • how extracted data from each report will be verified (e.g. extraction by two review authors, independently);
  • whether data extraction is undertaken by content area experts, methodologists, or both;
  • pilot testing, training and existence of coding instructions for the data collection form;
  • how data are extracted from multiple reports from the same study; and
  • how disagreements are handled when more than one author extracts data from each report.

5.6 Extracting study results and converting to the desired format

In most cases, it is desirable to collect summary data separately for each intervention group of interest and to enter these into software in which effect estimates can be calculated, such as RevMan. Sometimes the required data may be obtained only indirectly, and the relevant results may not be obvious. Chapter 6 provides many useful tips and techniques to deal with common situations. When summary data cannot be obtained from each intervention group, or where it is important to use the results of adjusted analyses (for example, to account for correlations in crossover or cluster-randomized trials), effect estimates may be available directly from the study report and can be extracted instead.
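
To illustrate the kind of calculation involved, the sketch below computes a risk ratio and its 95% confidence interval from extracted per-group counts, using the standard large-sample formula for the log risk ratio. The counts are hypothetical, and in practice such calculations are usually handled by RevMan or other meta-analysis software.

```python
# Illustrative only: risk ratio and 95% CI from extracted per-group counts.
import math

events_treat, total_treat = 12, 100
events_ctrl,  total_ctrl  = 24, 100

rr = (events_treat / total_treat) / (events_ctrl / total_ctrl)
se_log_rr = math.sqrt(1/events_treat - 1/total_treat
                      + 1/events_ctrl - 1/total_ctrl)
ci_low  = math.exp(math.log(rr) - 1.96 * se_log_rr)
ci_high = math.exp(math.log(rr) + 1.96 * se_log_rr)

print(f"Risk ratio = {rr:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```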

5.7 Managing and sharing data

When data have been collected for each individual study, it is helpful to organize them into a comprehensive electronic format, such as a database or spreadsheet, before entering data into a meta-analysis or other synthesis. When data are collated electronically, all or a subset of them can easily be exported for cleaning, consistency checks and analysis.

Tabulation of collected information about studies can facilitate classification of studies into appropriate comparisons and subgroups. It also allows identification of comparable outcome measures and statistics across studies. It will often be necessary to perform calculations to obtain the required statistics for presentation or synthesis. It is important through this process to retain clear information on the provenance of the data, with a clear distinction between data from a source document and data obtained through calculations. Statistical conversions, for example from standard errors to standard deviations, ideally should be undertaken with a computer rather than using a hand calculator to maintain a permanent record of the original and calculated numbers as well as the actual calculations used.
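
For example, a short script can perform the standard error to standard deviation conversion (SD = SE × √n) while writing the reported value, the calculated value, and the formula used to a file, preserving provenance. The study names and numbers below are hypothetical.

```python
# Illustrative only: convert reported standard errors to standard deviations
# and keep a permanent record of both values and the calculation used.
import csv
import math

rows = [
    {"study": "Smith 2010", "n": 48, "reported_se": 0.42},
    {"study": "Lee 2015",   "n": 80, "reported_se": 0.31},
]
for row in rows:
    row["calculated_sd"] = round(row["reported_se"] * math.sqrt(row["n"]), 3)
    row["calculation"] = "SD = SE * sqrt(n)"

with open("conversions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```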

Ideally, data only need to be extracted once and should be stored in a secure and stable location for future updates of the review, regardless of whether the original review authors or a different group of authors update the review (Ip et al 2012). Standardizing and sharing data collection tools as well as data management systems among review authors working in similar topic areas can streamline systematic review production. Review authors have the opportunity to work with trialists, journal editors, funders, regulators, and other stakeholders to make study data (e.g. CSRs, IPD, and any other form of study data) publicly available, increasing the transparency of research. When legal and ethical to do so, we encourage review authors to share the data used in their systematic reviews to reduce waste and to allow verification and reanalysis because data will not have to be extracted again for future use (Mayo-Wilson et al 2018).

5.8 Chapter information

Editors: Tianjing Li, Julian PT Higgins, Jonathan J Deeks

Acknowledgements: This chapter builds on earlier versions of the Handbook. For details of previous authors and editors of the Handbook, see Preface. Andrew Herxheimer, Nicki Jackson, Yoon Loke, Deirdre Price and Helen Thomas contributed text. Stephanie Taylor and Sonja Hood contributed suggestions for designing data collection forms. We are grateful to Judith Anzures, Mike Clarke, Miranda Cumpston and Peter Gøtzsche for helpful comments.

Funding: JPTH is a member of the National Institute for Health Research (NIHR) Biomedical Research Centre at University Hospitals Bristol NHS Foundation Trust and the University of Bristol. JJD received support from the NIHR Birmingham Biomedical Research Centre at the University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham. JPTH received funding from National Institute for Health Research Senior Investigator award NF-SI-0617-10145. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

5.9 References

Al-Marzouki S, Evans S, Marshall T, Roberts I. Are these data real? Statistical methods for the detection of data fabrication in clinical trials. BMJ 2005; 331 : 267-270.

Allen EN, Mushi AK, Massawe IS, Vestergaard LS, Lemnge M, Staedke SG, Mehta U, Barnes KI, Chandler CI. How experiences become data: the process of eliciting adverse event, medical history and concomitant medication reports in antimalarial and antiretroviral interaction trials. BMC Medical Research Methodology 2013; 13 : 140.

Baudard M, Yavchitz A, Ravaud P, Perrodeau E, Boutron I. Impact of searching clinical trial registries in systematic reviews of pharmaceutical treatments: methodological systematic review and reanalysis of meta-analyses. BMJ 2017; 356 : j448.

Bent S, Padula A, Avins AL. Better ways to question patients about adverse medical events: a randomized, controlled trial. Annals of Internal Medicine 2006; 144 : 257-261.

Berlin JA. Does blinding of readers affect the results of meta-analyses? University of Pennsylvania Meta-analysis Blinding Study Group. Lancet 1997; 350 : 185-186.

Buscemi N, Hartling L, Vandermeer B, Tjosvold L, Klassen TP. Single data extraction generated more errors than double data extraction in systematic reviews. Journal of Clinical Epidemiology 2006; 59 : 697-703.

Carlisle JB, Dexter F, Pandit JJ, Shafer SL, Yentis SM. Calculating the probability of random sampling for continuous variables in submitted or published randomised controlled trials. Anaesthesia 2015; 70 : 848-858.

Carroll C, Patterson M, Wood S, Booth A, Rick J, Balain S. A conceptual framework for implementation fidelity. Implementation Science 2007; 2 : 40.

Carvajal A, Ortega PG, Sainz M, Velasco V, Salado I, Arias LHM, Eiros JM, Rubio AP, Castrodeza J. Adverse events associated with pandemic influenza vaccines: Comparison of the results of a follow-up study with those coming from spontaneous reporting. Vaccine 2011; 29 : 519-522.

Chamberlain C, O'Mara-Eves A, Porter J, Coleman T, Perlen SM, Thomas J, McKenzie JE. Psychosocial interventions for supporting women to stop smoking in pregnancy. Cochrane Database of Systematic Reviews 2017; 2 : CD001055.

Damschroder LJ, Aron DC, Keith RE, Kirsh SR, Alexander JA, Lowery JC. Fostering implementation of health services research findings into practice: a consolidated framework for advancing implementation science. Implementation Science 2009; 4 : 50.

Davis AL, Miller JD. The European Medicines Agency and publication of clinical study reports: a challenge for the US FDA. JAMA 2017; 317 : 905-906.

Denniston AK, Holland GN, Kidess A, Nussenblatt RB, Okada AA, Rosenbaum JT, Dick AD. Heterogeneity of primary outcome measures used in clinical trials of treatments for intermediate, posterior, and panuveitis. Orphanet Journal of Rare Diseases 2015; 10 : 97.

Derry S, Loke YK. Risk of gastrointestinal haemorrhage with long term use of aspirin: meta-analysis. BMJ 2000; 321 : 1183-1187.

Doshi P, Dickersin K, Healy D, Vedula SS, Jefferson T. Restoring invisible and abandoned trials: a call for people to publish the findings. BMJ 2013; 346 : f2865.

Dusenbury L, Brannigan R, Falco M, Hansen WB. A review of research on fidelity of implementation: implications for drug abuse prevention in school settings. Health Education Research 2003; 18 : 237-256.

Dwan K, Altman DG, Clarke M, Gamble C, Higgins JPT, Sterne JAC, Williamson PR, Kirkham JJ. Evidence for the selective reporting of analyses and discrepancies in clinical trials: a systematic review of cohort studies of clinical trials. PLoS Medicine 2014; 11 : e1001666.

Elia N, von Elm E, Chatagner A, Popping DM, Tramèr MR. How do authors of systematic reviews deal with research malpractice and misconduct in original studies? A cross-sectional analysis of systematic reviews and survey of their authors. BMJ Open 2016; 6 : e010442.

Gøtzsche PC. Multiple publication of reports of drug trials. European Journal of Clinical Pharmacology 1989; 36 : 429-432.

Gøtzsche PC, Hróbjartsson A, Maric K, Tendal B. Data extraction errors in meta-analyses that use standardized mean differences. JAMA 2007; 298 : 430-437.

Gross A, Schirm S, Scholz M. Ycasd - a tool for capturing and scaling data from graphical representations. BMC Bioinformatics 2014; 15 : 219.

Hoffmann TC, Glasziou PP, Boutron I, Milne R, Perera R, Moher D, Altman DG, Barbour V, Macdonald H, Johnston M, Lamb SE, Dixon-Woods M, McCulloch P, Wyatt JC, Chan AW, Michie S. Better reporting of interventions: template for intervention description and replication (TIDieR) checklist and guide. BMJ 2014; 348 : g1687.

ICH. ICH Harmonised tripartite guideline: Structure and content of clinical study reports E3. ICH; 1995. www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Efficacy/E3/E3_Guideline.pdf .

Ioannidis JPA, Mulrow CD, Goodman SN. Adverse events: The more you search, the more you find. Annals of Internal Medicine 2006; 144 : 298-300.

Ip S, Hadar N, Keefe S, Parkin C, Iovin R, Balk EM, Lau J. A web-based archive of systematic review data. Systematic Reviews 2012; 1 : 15.

Ismail R, Azuara-Blanco A, Ramsay CR. Variation of clinical outcomes used in glaucoma randomised controlled trials: a systematic review. British Journal of Ophthalmology 2014; 98 : 464-468.

Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJM, Gavaghan DJ, McQuay H. Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Controlled Clinical Trials 1996; 17 : 1-12.

Jelicic Kadic A, Vucic K, Dosenovic S, Sapunar D, Puljak L. Extracting data from figures with software was faster, with higher interrater reliability than manual extraction. Journal of Clinical Epidemiology 2016; 74 : 119-123.

Jones AP, Remmington T, Williamson PR, Ashby D, Smyth RL. High prevalence but low impact of data extraction and reporting errors were found in Cochrane systematic reviews. Journal of Clinical Epidemiology 2005; 58 : 741-742.

Jones CW, Keil LG, Holland WC, Caughey MC, Platts-Mills TF. Comparison of registered and published outcomes in randomized controlled trials: a systematic review. BMC Medicine 2015; 13 : 282.

Jonnalagadda SR, Goyal P, Huffman MD. Automating data extraction in systematic reviews: a systematic review. Systematic Reviews 2015; 4 : 78.

Lewin S, Hendry M, Chandler J, Oxman AD, Michie S, Shepperd S, Reeves BC, Tugwell P, Hannes K, Rehfuess EA, Welch V, McKenzie JE, Burford B, Petkovic J, Anderson LM, Harris J, Noyes J. Assessing the complexity of interventions within systematic reviews: development, content and use of a new tool (iCAT_SR). BMC Medical Research Methodology 2017; 17 : 76.

Li G, Abbade LPF, Nwosu I, Jin Y, Leenus A, Maaz M, Wang M, Bhatt M, Zielinski L, Sanger N, Bantoto B, Luo C, Shams I, Shahid H, Chang Y, Sun G, Mbuagbaw L, Samaan Z, Levine MAH, Adachi JD, Thabane L. A scoping review of comparisons between abstracts and full reports in primary biomedical research. BMC Medical Research Methodology 2017; 17 : 181.

Li TJ, Vedula SS, Hadar N, Parkin C, Lau J, Dickersin K. Innovations in data collection, management, and archiving for systematic reviews. Annals of Internal Medicine 2015; 162 : 287-294.

Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JPA, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLoS Medicine 2009; 6 : e1000100.

Liu ZM, Saldanha IJ, Margolis D, Dumville JC, Cullum NA. Outcomes in Cochrane systematic reviews related to wound care: an investigation into prespecification. Wound Repair and Regeneration 2017; 25 : 292-308.

Marshall IJ, Kuiper J, Wallace BC. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association 2016; 23 : 193-201.

Mayo-Wilson E, Doshi P, Dickersin K. Are manufacturers sharing data as promised? BMJ 2015; 351 : h4169.

Mayo-Wilson E, Li TJ, Fusco N, Bertizzolo L, Canner JK, Cowley T, Doshi P, Ehmsen J, Gresham G, Guo N, Haythomthwaite JA, Heyward J, Hong H, Pham D, Payne JL, Rosman L, Stuart EA, Suarez-Cuervo C, Tolbert E, Twose C, Vedula S, Dickersin K. Cherry-picking by trialists and meta-analysts can drive conclusions about intervention efficacy. Journal of Clinical Epidemiology 2017a; 91 : 95-110.

Mayo-Wilson E, Fusco N, Li TJ, Hong H, Canner JK, Dickersin K, MUDS Investigators. Multiple outcomes and analyses in clinical trials create challenges for interpretation and research synthesis. Journal of Clinical Epidemiology 2017b; 86 : 39-50.

Mayo-Wilson E, Li T, Fusco N, Dickersin K. Practical guidance for using multiple data sources in systematic reviews and meta-analyses (with examples from the MUDS study). Research Synthesis Methods 2018; 9 : 2-12.

Meade MO, Richardson WS. Selecting and appraising studies for a systematic review. Annals of Internal Medicine 1997; 127 : 531-537.

Meinert CL. Clinical trials dictionary: Terminology and usage recommendations . Hoboken (NJ): Wiley; 2012.

Millard LAC, Flach PA, Higgins JPT. Machine learning to assist risk-of-bias assessments in systematic reviews. International Journal of Epidemiology 2016; 45 : 266-277.

Moher D, Schulz KF, Altman DG. The CONSORT Statement: revised recommendations for improving the quality of reports of parallel-group randomised trials. Lancet 2001; 357 : 1191-1194.

Moher D, Hopewell S, Schulz KF, Montori V, Gøtzsche PC, Devereaux PJ, Elbourne D, Egger M, Altman DG. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ 2010; 340 : c869.

Moore GF, Audrey S, Barker M, Bond L, Bonell C, Hardeman W, Moore L, O'Cathain A, Tinati T, Wight D, Baird J. Process evaluation of complex interventions: Medical Research Council guidance. BMJ 2015; 350 : h1258.

Orwin RG. Evaluating coding decisions. In: Cooper H, Hedges LV, editors. The Handbook of Research Synthesis . New York (NY): Russell Sage Foundation; 1994. p. 139-162.

Page MJ, McKenzie JE, Kirkham J, Dwan K, Kramer S, Green S, Forbes A. Bias due to selective inclusion and reporting of outcomes and analyses in systematic reviews of randomised trials of healthcare interventions. Cochrane Database of Systematic Reviews 2014; 10 : MR000035.

Ross JS, Mulvey GK, Hines EM, Nissen SE, Krumholz HM. Trial publication after registration in ClinicalTrials.Gov: a cross-sectional analysis. PLoS Medicine 2009; 6 .

Safer DJ. Design and reporting modifications in industry-sponsored comparative psychopharmacology trials. Journal of Nervous and Mental Disease 2002; 190 : 583-592.

Saldanha IJ, Dickersin K, Wang X, Li TJ. Outcomes in Cochrane systematic reviews addressing four common eye conditions: an evaluation of completeness and comparability. PloS One 2014; 9 : e109400.

Saldanha IJ, Li T, Yang C, Ugarte-Gil C, Rutherford GW, Dickersin K. Social network analysis identified central outcomes for core outcome sets using systematic reviews of HIV/AIDS. Journal of Clinical Epidemiology 2016; 70 : 164-175.

Saldanha IJ, Lindsley K, Do DV, Chuck RS, Meyerle C, Jones LS, Coleman AL, Jampel HD, Dickersin K, Virgili G. Comparison of clinical trial and systematic review outcomes for the 4 most prevalent eye diseases. JAMA Ophthalmology 2017a; 135 : 933-940.

Saldanha IJ, Li TJ, Yang C, Owczarzak J, Williamson PR, Dickersin K. Clinical trials and systematic reviews addressing similar interventions for the same condition do not consider similar outcomes to be important: a case study in HIV/AIDS. Journal of Clinical Epidemiology 2017b; 84 : 85-94.

Stewart LA, Clarke M, Rovers M, Riley RD, Simmonds M, Stewart G, Tierney JF, PRISMA-IPD Development Group. Preferred reporting items for a systematic review and meta-analysis of individual participant data: the PRISMA-IPD statement. JAMA 2015; 313 : 1657-1665.

Stock WA. Systematic coding for research synthesis. In: Cooper H, Hedges LV, editors. The Handbook of Research Synthesis . New York (NY): Russell Sage Foundation; 1994. p. 125-138.

Tramèr MR, Reynolds DJ, Moore RA, McQuay HJ. Impact of covert duplicate publication on meta-analysis: a case study. BMJ 1997; 315 : 635-640.

Turner EH. How to access and process FDA drug approval packages for use in research. BMJ 2013; 347 .

von Elm E, Poglia G, Walder B, Tramèr MR. Different patterns of duplicate publication: an analysis of articles used in systematic reviews. JAMA 2004; 291 : 974-980.

Wager E. Coping with scientific misconduct. BMJ 2011; 343 : d6586.

Wieland LS, Rutkow L, Vedula SS, Kaufmann CN, Rosman LM, Twose C, Mahendraratnam N, Dickersin K. Who has used internal company documents for biomedical and public health research and where did they find them? PloS One 2014; 9 .

Zanchetti A, Hansson L. Risk of major gastrointestinal bleeding with aspirin (Authors' reply). Lancet 1999; 353 : 149-150.

Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC. The ClinicalTrials.gov results database: update and key issues. New England Journal of Medicine 2011; 364 : 852-860.

Zwarenstein M, Treweek S, Gagnier JJ, Altman DG, Tunis S, Haynes B, Oxman AD, Moher D. Improving the reporting of pragmatic trials: an extension of the CONSORT statement. BMJ 2008; 337 : a2390.

For permission to re-use material from the Handbook (either academic or commercial), please see here for full details.


Literature Review

  • What is a Literature Review?
  • What is NOT a Literature Review?
  • Purposes of a Literature Review
  • Types of Literature Reviews
  • Literature Reviews vs. Systematic Reviews
  • Systematic vs. Meta-Analysis

Literature Review  is a comprehensive survey of the works published in a particular field of study or line of research, usually over a specific period of time, in the form of an in-depth, critical bibliographic essay or annotated list in which attention is drawn to the most significant works.

Also, we can define a literature review as the collected body of scholarly works related to a topic:

  • Summarizes and analyzes previous research relevant to a topic
  • Includes scholarly books and articles published in academic journals
  • Can be a specific scholarly paper or a section in a research paper

The objective of a literature review is to find previously published scholarly works relevant to a specific topic. It can:

  • Help gather ideas or information
  • Keep up to date with current trends and findings
  • Help develop new questions

A literature review is important because it:

  • Explains the background of research on a topic.
  • Demonstrates why a topic is significant to a subject area.
  • Helps focus your own research questions or problems.
  • Discovers relationships between research studies/ideas.
  • Suggests unexplored ideas or populations.
  • Identifies major themes, concepts, and researchers on a topic.
  • Tests assumptions; may help counter preconceived ideas and remove unconscious bias.
  • Identifies critical gaps, points of disagreement, or potentially flawed methodology or theoretical approaches.
  • Indicates potential directions for future research.

All content in this section is from Literature Review Research from Old Dominion University 

Keep in mind that a literature review is NOT:

Not an essay 

Not an annotated bibliography  in which you summarize each article that you have reviewed.  A literature review goes beyond basic summarizing to focus on the critical analysis of the reviewed works and their relationship to your research question.

Not a research paper   where you select resources to support one side of an issue versus another.  A lit review should explain and consider all sides of an argument in order to avoid bias, and areas of agreement and disagreement should be highlighted.

A literature review serves several purposes. For example, it

  • provides thorough knowledge of previous studies; introduces seminal works.
  • helps focus one’s own research topic.
  • identifies a conceptual framework for one’s own research questions or problems; indicates potential directions for future research.
  • suggests previously unused or underused methodologies, designs, quantitative and qualitative strategies.
  • identifies gaps in previous studies; identifies flawed methodologies and/or theoretical approaches; avoids replication of mistakes.
  • helps the researcher avoid repetition of earlier research.
  • suggests unexplored populations.
  • determines whether past studies agree or disagree; identifies controversy in the literature.
  • tests assumptions; may help counter preconceived ideas and remove unconscious bias.

As Kennedy (2007) notes*, it is important to think of knowledge in a given field as consisting of three layers. First, there are the primary studies that researchers conduct and publish. Second are the reviews of those studies that summarize and offer new interpretations built from and often extending beyond the original studies. Third, there are the perceptions, conclusions, opinions, and interpretations that are shared informally and become part of the lore of the field. In composing a literature review, it is important to note that it is often this third layer of knowledge that is cited as "true" even though it often has only a loose relationship to the primary studies and secondary literature reviews.

Given this, while literature reviews are designed to provide an overview and synthesis of pertinent sources you have explored, there are several approaches to how they can be done, depending upon the type of analysis underpinning your study. Listed below are definitions of types of literature reviews:

Argumentative Review      This form examines literature selectively in order to support or refute an argument, deeply embedded assumption, or philosophical problem already established in the literature. The purpose is to develop a body of literature that establishes a contrarian viewpoint. Given the value-laden nature of some social science research [e.g., educational reform; immigration control], argumentative approaches to analyzing the literature can be a legitimate and important form of discourse. However, note that they can also introduce problems of bias when they are used to make summary claims of the sort found in systematic reviews.

Integrative Review      Considered a form of research that reviews, critiques, and synthesizes representative literature on a topic in an integrated way such that new frameworks and perspectives on the topic are generated. The body of literature includes all studies that address related or identical hypotheses. A well-done integrative review meets the same standards as primary research in regard to clarity, rigor, and replication.

Historical Review      Few things rest in isolation from historical precedent. Historical reviews are focused on examining research throughout a period of time, often starting with the first time an issue, concept, theory, or phenomenon emerged in the literature, then tracing its evolution within the scholarship of a discipline. The purpose is to place research in a historical context to show familiarity with state-of-the-art developments and to identify the likely directions for future research.

Methodological Review      A review does not always focus on what someone said [content], but on how they said it [method of analysis]. This approach provides a framework of understanding at different levels (i.e. theory, substantive fields, research approaches, and data collection and analysis techniques). It enables researchers to draw on a wide variety of knowledge, ranging from the conceptual level to practical documents for use in fieldwork, covering ontological and epistemological considerations, quantitative and qualitative integration, sampling, interviewing, data collection and data analysis. It also helps highlight the ethical issues that researchers should be aware of and consider throughout their study.

Systematic Review      This form consists of an overview of existing evidence pertinent to a clearly formulated research question, which uses pre-specified and standardized methods to identify and critically appraise relevant research, and to collect, report, and analyse data from the studies that are included in the review. Typically it focuses on a very specific empirical question, often posed in a cause-and-effect form, such as "To what extent does A contribute to B?"

Theoretical Review      The purpose of this form is to concretely examine the corpus of theory that has accumulated in regard to an issue, concept, theory, or phenomenon. The theoretical literature review helps establish what theories already exist, the relationships between them, to what degree the existing theories have been investigated, and to develop new hypotheses to be tested. Often this form is used to help establish a lack of appropriate theories or reveal that current theories are inadequate for explaining new or emerging research problems. The unit of analysis can focus on a theoretical concept or a whole theory or framework.

* Kennedy, Mary M. "Defining a Literature."  Educational Researcher  36 (April 2007): 139-147.

All content in this section is from The Literature Review created by Dr. Robert Larabee USC

Robinson, P. and Lowe, J. (2015),  Literature reviews vs systematic reviews.  Australian and New Zealand Journal of Public Health, 39: 103-103. doi: 10.1111/1753-6405.12393


What's in the name? The difference between a Systematic Review and a Literature Review, and why it matters . By Lynn Kysh from University of Southern California


Systematic review or meta-analysis?

A  systematic review  answers a defined research question by collecting and summarizing all empirical evidence that fits pre-specified eligibility criteria.

A  meta-analysis  is the use of statistical methods to summarize the results of these studies.

Systematic reviews, just like other research articles, can be of varying quality. They are a significant piece of work (the Centre for Reviews and Dissemination at York estimates that a team will take 9-24 months), and to be useful to other researchers and practitioners they should have:

  • clearly stated objectives with pre-defined eligibility criteria for studies
  • explicit, reproducible methodology
  • a systematic search that attempts to identify all studies
  • assessment of the validity of the findings of the included studies (e.g. risk of bias)
  • systematic presentation, and synthesis, of the characteristics and findings of the included studies

Not all systematic reviews contain meta-analysis. 

Meta-analysis is the use of statistical methods to summarize the results of independent studies. By combining information from all relevant studies, meta-analysis can provide more precise estimates of the effects of health care than those derived from the individual studies included within a review.  More information on meta-analyses can be found in  Cochrane Handbook, Chapter 9 .

A meta-analysis goes beyond critique and integration and conducts secondary statistical analysis on the outcomes of similar studies.  It is a systematic review that uses quantitative methods to synthesize and summarize the results.

An advantage of a meta-analysis is the ability to be completely objective in evaluating research findings.  Not all topics, however, have sufficient research evidence to allow a meta-analysis to be conducted.  In that case, an integrative review is an appropriate strategy. 

Some of the content in this section is from Systematic reviews and meta-analyses: step by step guide created by Kate McAllister.



Data Collection Methods | Step-by-Step Guide & Examples

Published on 4 May 2022 by Pritha Bhandari .

Data collection is a systematic process of gathering observations or measurements. Whether you are performing research for business, governmental, or academic purposes, data collection allows you to gain first-hand knowledge and original insights into your research problem .

While methods and aims may differ between fields, the overall process of data collection remains largely the same. Before you begin collecting data, you need to consider:

  • The  aim of the research
  • The type of data that you will collect
  • The methods and procedures you will use to collect, store, and process the data

To collect high-quality data that is relevant to your purposes, follow these four steps.

Table of contents

  • Step 1: Define the aim of your research
  • Step 2: Choose your data collection method
  • Step 3: Plan your data collection procedures
  • Step 4: Collect the data
  • Frequently asked questions about data collection

Before you start the process of data collection, you need to identify exactly what you want to achieve. You can start by writing a problem statement : what is the practical or scientific issue that you want to address, and why does it matter?

Next, formulate one or more research questions that precisely define what you want to find out. Depending on your research questions, you might need to collect quantitative or qualitative data :

  • Quantitative data is expressed in numbers and graphs and is analysed through statistical methods .
  • Qualitative data is expressed in words and analysed through interpretations and categorisations.

If your aim is to test a hypothesis , measure something precisely, or gain large-scale statistical insights, collect quantitative data. If your aim is to explore ideas, understand experiences, or gain detailed insights into a specific context, collect qualitative data.

If you have several aims, you can use a mixed methods approach that collects both types of data.

  • Your first aim is to assess whether there are significant differences in perceptions of managers across different departments and office locations.
  • Your second aim is to gather meaningful feedback from employees to explore new ideas for how managers can improve.


Based on the data you want to collect, decide which method is best suited for your research.

  • Experimental research is primarily a quantitative method.
  • Interviews , focus groups , and ethnographies are qualitative methods.
  • Surveys , observations, archival research, and secondary data collection can be quantitative or qualitative methods.

Carefully consider what method you will use to gather data that helps you directly answer your research questions.

When you know which method(s) you are using, you need to plan exactly how you will implement them. What procedures will you follow to make accurate observations or measurements of the variables you are interested in?

For instance, if you’re conducting surveys or interviews, decide what form the questions will take; if you’re conducting an experiment, make decisions about your experimental design .

Operationalisation

Sometimes your variables can be measured directly: for example, you can collect data on the average age of employees simply by asking for dates of birth. However, often you’ll be interested in collecting data on more abstract concepts or variables that can’t be directly observed.

Operationalisation means turning abstract conceptual ideas into measurable observations. When planning how you will collect data, you need to translate the conceptual definition of what you want to study into the operational definition of what you will actually measure.

  • You ask managers to rate their own leadership skills on 5-point scales assessing the ability to delegate, decisiveness, and dependability.
  • You ask their direct employees to provide anonymous feedback on the managers regarding the same topics.

You may need to develop a sampling plan to obtain data systematically. This involves defining a population , the group you want to draw conclusions about, and a sample, the group you will actually collect data from.

Your sampling method will determine how you recruit participants or obtain measurements for your study. To decide on a sampling method you will need to consider factors like the required sample size, accessibility of the sample, and time frame of the data collection.

Standardising procedures

If multiple researchers are involved, write a detailed manual to standardise data collection procedures in your study.

This means laying out specific step-by-step instructions so that everyone in your research team collects data in a consistent way – for example, by conducting experiments under the same conditions and using objective criteria to record and categorise observations.

This helps ensure the reliability of your data, and you can also use it to replicate the study in the future.

Creating a data management plan

Before beginning data collection, you should also decide how you will organise and store your data.

  • If you are collecting data from people, you will likely need to anonymise and safeguard the data to prevent leaks of sensitive information (e.g. names or identity numbers).
  • If you are collecting data via interviews or pencil-and-paper formats, you will need to perform transcriptions or data entry in systematic ways to minimise distortion.
  • You can prevent loss of data by having an organisation system that is routinely backed up.

Finally, you can implement your chosen methods to measure or observe the variables you are interested in.

The closed-ended questions ask participants to rate their manager’s leadership skills on scales from 1 to 5. The data produced is numerical and can be statistically analysed for averages and patterns.

To ensure that high-quality data is recorded in a systematic way, here are some best practices:

  • Record all relevant information as and when you obtain data. For example, note down whether or how lab equipment is recalibrated during an experimental study.
  • Double-check manual data entry for errors.
  • If you collect quantitative data, you can assess the reliability and validity to get an indication of your data quality.

Data collection is the systematic process by which observations or measurements are gathered in research. It is used in many different contexts by academics, governments, businesses, and other organisations.

When conducting research, collecting original data has significant advantages:

  • You can tailor data collection to your specific research aims (e.g., understanding the needs of your consumers or user testing your website).
  • You can control and standardise the process for high reliability and validity (e.g., choosing appropriate measurements and sampling methods ).

However, there are also some drawbacks: data collection can be time-consuming, labour-intensive, and expensive. In some cases, it’s more efficient to use secondary data that has already been collected by someone else, but the data might be less reliable.

Quantitative research deals with numbers and statistics, while qualitative research deals with words and meanings.

Quantitative methods allow you to test a hypothesis by systematically collecting and analysing data, while qualitative methods allow you to explore ideas and experiences in depth.

Reliability and validity are both about how well a method measures something:

  • Reliability refers to the  consistency of a measure (whether the results can be reproduced under the same conditions).
  • Validity   refers to the  accuracy of a measure (whether the results really do represent what they are supposed to measure).

If you are doing experimental research , you also have to consider the internal and external validity of your experiment.

In mixed methods research , you use both qualitative and quantitative data collection and analysis methods to answer your research question .

Operationalisation means turning abstract conceptual ideas into measurable observations.

For example, the concept of social anxiety isn’t directly observable, but it can be operationally defined in terms of self-rating scores, behavioural avoidance of crowded places, or physical anxiety symptoms in social situations.

Before collecting data , it’s important to consider how you will operationalise the variables that you want to measure.

Cite this Scribbr article


Bhandari, P. (2022, May 04). Data Collection Methods | Step-by-Step Guide & Examples. Scribbr. Retrieved 22 February 2024, from https://www.scribbr.co.uk/research-methods/data-collection-guide/



How to Write a Literature Review | Guide, Examples, & Templates

Published on January 2, 2023 by Shona McCombes . Revised on September 11, 2023.

What is a literature review? A literature review is a survey of scholarly sources on a specific topic. It provides an overview of current knowledge, allowing you to identify relevant theories, methods, and gaps in the existing research that you can later apply to your paper, thesis, or dissertation topic .

There are five key steps to writing a literature review:

  • Search for relevant literature
  • Evaluate sources
  • Identify themes, debates, and gaps
  • Outline the structure
  • Write your literature review

A good literature review doesn’t just summarize sources—it analyzes, synthesizes , and critically evaluates to give a clear picture of the state of knowledge on the subject.


Table of contents

  • What is the purpose of a literature review?
  • Examples of literature reviews
  • Step 1 – Search for relevant literature
  • Step 2 – Evaluate and select sources
  • Step 3 – Identify themes, debates, and gaps
  • Step 4 – Outline your literature review’s structure
  • Step 5 – Write your literature review


When you write a thesis , dissertation , or research paper , you will likely have to conduct a literature review to situate your research within existing knowledge. The literature review gives you a chance to:

  • Demonstrate your familiarity with the topic and its scholarly context
  • Develop a theoretical framework and methodology for your research
  • Position your work in relation to other researchers and theorists
  • Show how your research addresses a gap or contributes to a debate
  • Evaluate the current state of research and demonstrate your knowledge of the scholarly debates around your topic.

Writing literature reviews is a particularly important skill if you want to apply for graduate school or pursue a career in research. We’ve written a step-by-step guide that you can follow below.



Writing literature reviews can be quite challenging! A good starting point could be to look at some examples, depending on what kind of literature review you’d like to write.

  • Example literature review #1: “Why Do People Migrate? A Review of the Theoretical Literature” ( Theoretical literature review about the development of economic migration theory from the 1950s to today.)
  • Example literature review #2: “Literature review as a research methodology: An overview and guidelines” ( Methodological literature review about interdisciplinary knowledge acquisition and production.)
  • Example literature review #3: “The Use of Technology in English Language Learning: A Literature Review” ( Thematic literature review about the effects of technology on language acquisition.)
  • Example literature review #4: “Learners’ Listening Comprehension Difficulties in English Language Learning: A Literature Review” ( Chronological literature review about how the concept of listening skills has changed over time.)

You can also check out our templates with literature review examples and sample outlines.

Before you begin searching for literature, you need a clearly defined topic .

If you are writing the literature review section of a dissertation or research paper, you will search for literature related to your research problem and questions .

Make a list of keywords

Start by creating a list of keywords related to your research question. Include each of the key concepts or variables you’re interested in, and list any synonyms and related terms. You can add to this list as you discover new keywords in the process of your literature search.

  • Social media, Facebook, Instagram, Twitter, Snapchat, TikTok
  • Body image, self-perception, self-esteem, mental health
  • Generation Z, teenagers, adolescents, youth

Search for relevant sources

Use your keywords to begin searching for sources. Some useful databases to search for journals and articles include:

  • Your university’s library catalogue
  • Google Scholar
  • Project Muse (humanities and social sciences)
  • Medline (life sciences and biomedicine)
  • EconLit (economics)
  • Inspec (physics, engineering and computer science)

You can also use Boolean operators to help narrow down your search: for example, a search string such as (“social media” OR Instagram OR TikTok) AND (“body image” OR self-esteem) combines synonyms with OR and links distinct concepts with AND.

Make sure to read the abstract to find out whether an article is relevant to your question. When you find a useful book or article, you can check the bibliography to find other relevant sources.

You likely won’t be able to read absolutely everything that has been written on your topic, so it will be necessary to evaluate which sources are most relevant to your research question.

For each publication, ask yourself:

  • What question or problem is the author addressing?
  • What are the key concepts and how are they defined?
  • What are the key theories, models, and methods?
  • Does the research use established frameworks or take an innovative approach?
  • What are the results and conclusions of the study?
  • How does the publication relate to other literature in the field? Does it confirm, add to, or challenge established knowledge?
  • What are the strengths and weaknesses of the research?

Make sure the sources you use are credible , and make sure you read any landmark studies and major theories in your field of research.

You can use our template to summarize and evaluate sources you’re thinking about using.

Take notes and cite your sources

As you read, you should also begin the writing process. Take notes that you can later incorporate into the text of your literature review.

It is important to keep track of your sources with citations to avoid plagiarism . It can be helpful to make an annotated bibliography , where you compile full citation information and write a paragraph of summary and analysis for each source. This helps you remember what you read and saves time later in the process.


To begin organizing your literature review’s argument and structure, be sure you understand the connections and relationships between the sources you’ve read. Based on your reading and notes, you can look for:

  • Trends and patterns (in theory, method or results): do certain approaches become more or less popular over time?
  • Themes: what questions or concepts recur across the literature?
  • Debates, conflicts and contradictions: where do sources disagree?
  • Pivotal publications: are there any influential theories or studies that changed the direction of the field?
  • Gaps: what is missing from the literature? Are there weaknesses that need to be addressed?

This step will help you work out the structure of your literature review and (if applicable) show how your own research will contribute to existing knowledge.

  • Most research has focused on young women.
  • There is an increasing interest in the visual aspects of social media.
  • But there is still a lack of robust research on highly visual platforms like Instagram and Snapchat—this is a gap that you could address in your own research.

There are various approaches to organizing the body of a literature review. Depending on the length of your literature review, you can combine several of these strategies (for example, your overall structure might be thematic, but each theme is discussed chronologically).

Chronological

The simplest approach is to trace the development of the topic over time. However, if you choose this strategy, be careful to avoid simply listing and summarizing sources in order.

Try to analyze patterns, turning points and key debates that have shaped the direction of the field. Give your interpretation of how and why certain developments occurred.

Thematic

If you have found some recurring central themes, you can organize your literature review into subsections that address different aspects of the topic.

For example, if you are reviewing literature about inequalities in migrant health outcomes, key themes might include healthcare policy, language barriers, cultural attitudes, legal status, and economic access.

Methodological

If you draw your sources from different disciplines or fields that use a variety of research methods , you might want to compare the results and conclusions that emerge from different approaches. For example:

  • Look at what results have emerged in qualitative versus quantitative research
  • Discuss how the topic has been approached by empirical versus theoretical scholarship
  • Divide the literature into sociological, historical, and cultural sources

Theoretical

A literature review is often the foundation for a theoretical framework . You can use it to discuss various theories, models, and definitions of key concepts.

You might argue for the relevance of a specific theoretical approach, or combine various theoretical concepts to create a framework for your research.

Like any other academic text, your literature review should have an introduction, a main body, and a conclusion. What you include in each depends on the objective of your literature review.

The introduction should clearly establish the focus and purpose of the literature review.

Depending on the length of your literature review, you might want to divide the body into subsections. You can use a subheading for each theme, time period, or methodological approach.

As you write, you can follow these tips:

  • Summarize and synthesize: give an overview of the main points of each source and combine them into a coherent whole
  • Analyze and interpret: don’t just paraphrase other researchers — add your own interpretations where possible, discussing the significance of findings in relation to the literature as a whole
  • Critically evaluate: mention the strengths and weaknesses of your sources
  • Write in well-structured paragraphs: use transition words and topic sentences to draw connections, comparisons and contrasts

In the conclusion, you should summarize the key findings you have taken from the literature and emphasize their significance.

When you’ve finished writing and revising your literature review, don’t forget to proofread thoroughly before submitting.



A literature review is a survey of scholarly sources (such as books, journal articles, and theses) related to a specific topic or research question.

It is often written as part of a thesis, dissertation, or research paper, in order to situate your work in relation to existing knowledge.

There are several reasons to conduct a literature review at the beginning of a research project:

  • To familiarize yourself with the current state of knowledge on your topic
  • To ensure that you’re not just repeating what others have already done
  • To identify gaps in knowledge and unresolved problems that your research can address
  • To develop your theoretical framework and methodology
  • To provide an overview of the key findings and debates on the topic

Writing the literature review shows your reader how your work relates to existing research and what new insights it will contribute.

The literature review usually comes near the beginning of your thesis or dissertation. After the introduction, it grounds your research in a scholarly field and leads directly to your theoretical framework or methodology.

A literature review is a survey of credible sources on a topic, often used in dissertations, theses, and research papers. Literature reviews give an overview of knowledge on a subject, helping you identify relevant theories and methods, as well as gaps in existing research. Literature reviews are set up similarly to other academic texts, with an introduction, a main body, and a conclusion.

An annotated bibliography is a list of source references that has a short description (called an annotation) for each of the sources. It is often assigned as part of the research process for a paper.




Data extraction methods for systematic review (semi)automation: Update of a living systematic review

Lena Schmidt

1 NIHR Innovation Observatory, Newcastle University, Newcastle upon Tyne, NE4 5TG, UK

2 Sciome LLC, Research Triangle Park, North Carolina, 27713, USA

3 Bristol Medical School, University of Bristol, Bristol, BS8 2PS, UK

Ailbhe N. Finnerty Mutlu

4 UCL Social Research Institute, University College London, London, WC1H 0AL, UK

Rebecca Elmore

Babatunde K. Olorisade

5 Evaluate Ltd, London, SE1 2RE, UK

6 Cardiff School of Technologies, Cardiff Metropolitan University, Cardiff, CF5 2YB, UK

James Thomas

Julian P. T. Higgins

Associated data

Underlying data

Harvard Dataverse: Appendix for base review. https://doi.org/10.7910/DVN/LNGCOQ . 127

This project contains the following underlying data:

  • Appendix_A.zip (full database with all data extraction and other fields for base review data)
  • Appendix B.docx (further information about excluded publications)
  • Appendix_C.zip (code, weights, data, scores of abstract classifiers for Web of Science content)
  • Appendix_D.zip (full database with all data extraction and other fields for LSR update)
  • Supplementary_key_items.docx (overview of items extracted for each included study)
  • table 1.csv and table 1_long.csv (Table A1 in csv format, the long version includes extra data)
  • table 1_long_updated.csv (LSR update for Table A1 in csv format, the long version includes extra data)
  • included.ris and background.ris (literature references from base review)

Harvard Dataverse: Available datasets for SR automation. https://doi.org/10.7910/DVN/0XTV25 . 128

  • Datasets shared by authors of the included publications

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Extended data

Open Science Framework: Data Extraction Methods for Systematic Review (semi)Automation: A Living Review Protocol. https://doi.org/10.17605/OSF.IO/ECB3T . 15

This project contains the following extended data:

  • Review protocol
  • Additional_Fields.docx (overview of data fields of interest for text mining in clinical trials)
  • Search.docx (additional information about the searches, including full search strategies)
  • PRISMA P checklist for ‘Data extraction methods for systematic review (semi)automation: A living review protocol.’

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Reporting guidelines

Harvard Dataverse: PRISMA checklist for ‘Data extraction methods for systematic review (semi)automation: A living systematic review’ https://doi.org/10.7910/DVN/LNGCOQ . 127

Software availability

The development version of the software for automated searching is available from Github: https://github.com/mcguinlu/COVID_suicide_living .

Archived source code at time of publication: http://doi.org/10.5281/zenodo.3871366 . 17

License: MIT

Version Changes

Updated. Changes from Version 1

This version of the LSR includes 23 new papers; a change in the title indicates that the current version is an update. Ailbhe Finnerty and Rebecca Elmore joined the author team after contributing to screening and data extraction; Luke A. McGuinness contributed to the base-review but is not listed as an author in this update. The abstract and conclusions were updated to reflect changes and new research trends, such as the increased availability of datasets and source code and more papers describing relation extraction and summarisation. We updated existing figures and tables with the exception of Table 1 (pre-processing techniques), because reliance on pre-processing has decreased in recent years. Table 1 in the appendix was renamed ‘Table A1’ to avoid confusion with Table 1 in the main text. In the base-review we assessed the included publications based on a list of 17 items in the domains of reproducibility (3.4.1), transparency (3.4.2), description of testing (3.4.3), data availability (3.4.4), and internal and external validity (3.4.5). The list of items was reduced to six items for the update; more information about the removed items can be found in the Methods section of this LSR. We still include the following items:

  • 3.4.2.2 Is there a description of the dataset used and of its characteristics? 
  • 3.4.2.4 Is the source code available? 
  • 3.4.3.2 Are basic metrics reported (true/false positives and negatives)? 
  • 3.4.4.1 Can we obtain a runnable version of the software based on the information in the publication? 
  • 3.4.4.2 Persistence: Can data be retrieved based on the information given in the publication? 
  • 3.4.5.1 Does the dataset or assessment measure provide a possibility to compare to other tools in the same domain? 

Additionally, spreadsheets with all extracted data and updated figures are available as Appendix D.

Background: The reliable and usable (semi)automation of data extraction can support the field of systematic review by reducing the workload required to gather information about the conduct and results of the included studies. This living systematic review examines published approaches for data extraction from reports of clinical studies.

Methods: We systematically and continually search PubMed, ACL Anthology, arXiv, OpenAlex via EPPI-Reviewer, and the dblp computer science bibliography. Full text screening and data extraction are conducted within an open-source living systematic review application created for the purpose of this review. This living review update includes publications up to December 2022 and OpenAlex content up to March 2023.

Results: 76 publications are included in this review. Of these, 64 (84%) addressed extraction of data from abstracts, while 19 (25%) used full texts. A total of 71 (93%) publications developed classifiers for randomised controlled trials. Over 30 entities were extracted, with PICOs (population, intervention, comparator, outcome) being the most frequently extracted. Data are available from 25 (33%), and code from 30 (39%), publications. Six (8%) implemented publicly available tools.

Conclusions: This living systematic review presents an overview of the (semi)automated data-extraction literature of interest to different types of literature review. We identified a broad evidence base of publications describing data extraction for interventional reviews and a small number of publications extracting epidemiological or diagnostic accuracy data. Between review updates, trends for sharing data and code increased strongly: in the base-review, data and code were available for 13% and 19% of publications respectively; within the 23 new publications, these figures increased to 78% and 87%. Compared with the base-review, we also observed a research trend away from straightforward data extraction and towards additionally extracting relations between entities or automatic text summarisation. With this living review we aim to review the literature continually.

1. Introduction

In a systematic review, data extraction is the process of capturing key characteristics of studies in structured and standardised form based on information in journal articles and reports. It is a necessary precursor to assessing the risk of bias in individual studies and synthesising their findings. Interventional, diagnostic, or prognostic systematic reviews routinely extract information from a specific set of fields that can be predefined. 1 The most common fields for extraction in interventional reviews are defined in the PICO framework (population, intervention, comparison, outcome), and similar frameworks are available for other review types. The data extraction task can be time-consuming and repetitive when done by hand. This creates opportunities for support through intelligent software, which identifies and extracts information automatically. When applied to the field of health research, this (semi)automation sits at the interface between evidence-based medicine (EBM) and data science, and, as described in the following section, interest in its development has grown in parallel with interest in AI in other areas of computer science.

1.1. Related systematic reviews and overviews

This review is, to the best of our knowledge, the only living systematic review (LSR) of data extraction methods. We identified four previous reviews of tools and methods in the first iteration of this living review (called base-review hereafter), 2 – 5 and two documents providing overviews and guidelines relevant to our topic. 3 , 6 , 7 Between base-review and this update, we identified six more related (systematic) literature reviews that will be summarised in the following paragraphs. 8 – 13

Related reviews up to 2014: The systematic reviews from 2014 and 2015 present an overview of classical machine learning and natural language processing (NLP) methods applied to tasks such as data mining in the field of evidence-based medicine. At the time of publication of these documents, methods such as topic modelling (Latent Dirichlet Allocation) and support vector machines (SVM) were considered state-of-the-art for language models.

In 2014, Tsafnat et al. provided a broad overview on automation technologies for different stages of authoring a systematic review. 5 O’Mara-Eves et al . published a systematic review focusing on text-mining approaches in 2015. 4 It includes a summary of methods for the evaluation of systems, such as recall, accuracy, and F1 score (the harmonic mean of recall and precision, a metric frequently used in machine-learning). The reviewers focused on tasks related to PICO classification and supporting the screening process. In the same year, Jonnalagadda, Goyal and Huffman 3 described methods for data extraction, focusing on PICOs and related fields. The age of these publications means that the latest static or contextual embedding-based and neural methods are not included. These newer methods, 14 however, are used in contemporary systematic review automation software which will be reviewed in the scope of this living review.

Related reviews up to 2020: Reviews up to 2020 focus on discussions around tool development and integration in practice, and mark the starting date of the inclusion of automation methods based on neural networks. Beller et al. describe principles for development and integration of tools for systematic review automation. 6 Marshall and Wallace 7 present a guide to automation technology, with a focus on availability of tools and adoption into practice. They conclude that tools facilitating screening are widely accessible and usable, while data extraction tools are still at piloting stages or require a higher amount of human input.

A systematic review of machine-learning for systematic review automation, published in Portuguese in 2020, included 35 publications. The authors examined journals in which publications about systematic review automation are published, and conducted a term-frequency and citation analysis. They categorised papers by systematic review task, and provided a brief overview of data extraction methods. 2

Related reviews after 2020: These six reviews include and discuss end-user tools and cover different tasks across the SR workflow, including data extraction. Compared with this LSR, these reviews are broader in scope but include fewer references on the automation of data extraction. Ruiz and Duffy 10 conducted a literature and trend analysis showing that the number of published references about SR automation is steadily increasing. Sundaram and Berleant 11 analyse 29 references applying text mining to different parts of the SR process and note that 24 references describe automation in study selection, while research gaps are most prominent for data extraction, monitoring, quality assessment, and synthesis. 11 Khalil et al. 9 include 47 tools and descriptions of validation studies in a scoping review, of which 8 are available end-user tools that mostly focus on screening but also cover data extraction and risk of bias assessments. They discuss limitations of tools such as lack of generalisability, integration, funding, and limited performance or access. 9 Cierco Jimenez et al. 8 included 63 references in a mapping review of machine-learning to assist SRs during different workflow steps, of which 41 were available end-user tools for use by researchers without an informatics background. In accordance with other reviews, they describe screening as the most frequently automated step, while automated data extraction tools are lacking due to the complexity of the task. Zhang et al. 12 included 49 references on automation of data extraction fields such as diseases, outcomes, or metadata. They focussed on extraction from traditional Chinese medicine texts such as published clinical trial texts, health records, or ancient literature. 12 Schmidt et al. 13 published a narrative review of tools with a focus on living systematic review automation. They discuss tools that automate or support the continuous literature retrieval that is the hallmark of LSRs, whereas well-integrated (semi)automation of data extraction and automatic dissemination or visualisation of results between official review updates is supported by some tools but remains less common.

We aim to review published methods and tools aimed at automating or (semi) automating the process of data extraction in the context of a systematic review of medical research studies. We will do this in the form of a living systematic review, keeping information up to date and relevant to the challenges faced by systematic reviewers at any time.

Our objectives in reviewing this literature are two-fold. First, we want to examine the methods and tools from the data science perspective, seeking to reduce duplicate efforts, summarise current knowledge, and encourage comparability of published methods. Second, we seek to highlight the added value of the methods and tools from the perspective of systematic reviewers who wish to use (semi) automation for data extraction, i.e., what is the extent of automation? Is it reliable? We address these issues by summarising important caveats discussed in the literature, as well as factors that facilitate the adoption of tools in practice.

2.1. Registration/protocol

This review was conducted following a preregistered and published protocol. 15 PROSPERO was initially considered as a platform for registration, but it is limited to reviews with health-related outcomes. Any deviations from the protocol are described below.

2.2. Living review methodology

We are conducting a living review because the field of systematic review (semi) automation is evolving rapidly along with advances in language processing, machine-learning and deep-learning.

The process of updating started as described in the protocol 15 and is shown in Figure 1 . In short, we will continuously update the literature search results, using the search strategies and methods described in the section ‘Search’ below. PubMed and arXiv search results are updated daily in a completely automated fashion via APIs. Articles from the dblp, ACL, and OpenAlex via EPPI-Reviewer are added every two months. All search results are automatically imported to our living review screening and data extraction web-application, which is described in the section ‘Data collection and analysis’ below.

[Figure 1. The living review updating process. Image: f1000research-10-151999-g0000.jpg]

This image is reproduced under the terms of a Creative Commons Attribution 4.0 International license (CC-BY 4.0) from Schmidt et al. 15
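To make the automated retrieval concrete, the sketch below shows what a daily PubMed pull can look like using NCBI's public E-utilities API; the search term, date window, and retmax value are illustrative placeholders and do not reflect the review's actual search strategy.

    import requests

    EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

    def pubmed_ids_last_day(term: str) -> list[str]:
        """Return PubMed IDs for records added in the last day (illustrative parameters)."""
        params = {
            "db": "pubmed",
            "term": term,        # placeholder query, not the review's actual search strategy
            "datetype": "edat",  # filter on the Entrez (record) date
            "reldate": 1,        # records from the last day
            "retmax": 500,
            "retmode": "json",
        }
        response = requests.get(f"{EUTILS}/esearch.fcgi", params=params, timeout=30)
        response.raise_for_status()
        return response.json()["esearchresult"]["idlist"]

    new_ids = pubmed_ids_last_day('"systematic review" AND "data extraction"')
    print(f"{len(new_ids)} new PubMed records to import for screening")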

The decision for full review updates is made every six months, based on the number of new publications added to the review. For more details, please refer to the protocol or to the Cochrane living systematic review guidance. In between updates, the screening process and the current state of the data extraction are visible via the living review website.

2.3. Eligibility criteria

  • We included full text publications that describe an original NLP approach for extracting data related to systematic reviewing tasks. Data fields of interest (referred to here as entities or as sentences) were adapted from the Cochrane Handbook for Systematic Reviews of Interventions, 1 and are defined in the protocol. 15 We included the full range of NLP methods (e.g., regular expressions, rule-based systems, machine learning, and deep neural networks).
  • Publications must describe a full cycle of the implementation and evaluation of a method. For example, they must report training and at least one measure of evaluating the performance of a data extraction algorithm.
  • We included reports published from 2005 until the present day, similar to previous work. 3 We would have translated non-English reports, had we found any.
  • The data that the included publications use for mining must be texts from randomised controlled trials, comparative cohort studies, case control studies, or comparative cross-sectional studies (e.g., for diagnostic test accuracy). Data extraction methods could be applied to the full texts or to the abstracts within each eligible publication’s corpus. We included publications that extracted data from other study types, as long as at least one of our study types of interest was contained in the corpus.

We excluded publications reporting:

  • Methods and tools related solely to image processing and importing biomedical data from PDF files without any NLP approach, including data extraction from graphs.
  • Any research that focuses exclusively on protocol preparation, synthesis of already extracted data, write-up, solely the pre-processing of text, or its dissemination.
  • Methods or tools that provided no natural language processing approach and offered only organisational interfaces, document management, databases, or version control.
  • Any publications related to electronic health records or mining genetic data.

2.4. Search

Base-review: We searched five electronic databases, using the search methods previously described in our protocol. 15 In short, we searched MEDLINE via Ovid, using a search strategy developed with the help of an information specialist, and searched Web of Science Core Collection and IEEE using adaptations of this strategy, which were made by the review authors. Searches on the arXiv (computer science) and dblp were conducted on full database dumps using the search functionality described by McGuinness and Schmidt. 16 The full search results and further information about document retrieval are available in Underlying data: Appendix A and B. 127

Originally, we planned to include a full literature search from the Web of Science Core Collection. Due to the large number of publications retrieved via this search (n = 7822) we decided to first screen publications from all other sources, to train a machine-learning ensemble classifier, and to only add publications that were predicted as relevant for our living review. This reduced the Web of Science Core Collection publications to 547 abstracts, which were added to the studies in the initial screening step. The dataset, code and weights of trained models are available in Underlying data: Appendix C. 127 This includes plots of each model’s evaluation in terms of area under the curve (AUC), accuracy, F1, recall, and variance of cross-validation results for every metric.
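The classifiers, weights, and evaluation actually used for this step are archived in Appendix C; purely as a generic sketch of the approach (an ensemble trained on already-screened abstracts and applied to unscreened ones), the following illustrates a TF-IDF based voting ensemble with invented example texts.

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.ensemble import VotingClassifier

    # Labelled abstracts from the already-screened sources (1 = include, 0 = exclude);
    # the texts below are placeholders, not real screening data.
    train_texts = [
        "We present a neural model to extract PICO sentences from trial abstracts.",
        "A case report of a rare dermatological condition in one patient.",
    ]
    train_labels = [1, 0]

    screener = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),
        VotingClassifier(
            estimators=[
                ("lr", LogisticRegression(max_iter=1000)),
                ("svm", LinearSVC()),
                ("nb", MultinomialNB()),
            ],
            voting="hard",  # simple majority vote across the three classifiers
        ),
    )
    screener.fit(train_texts, train_labels)

    # Only abstracts predicted as relevant are passed on to manual screening.
    unscreened = ["An SVM-based approach to identify sample sizes in RCT reports."]
    print(screener.predict(unscreened))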

Update: As planned, we changed to the PubMed API for searching MEDLINE. This decision was made to facilitate continuous reference retrieval. We searched only for pre-print or published literature and therefore did not search sources such as GitHub or other source code repositories.

Update: We searched PubMed via its API, arXiv (computer science), ACL Anthology, dblp, and used EPPI-Reviewer to collect citations from Microsoft Academic and later OpenAlex using the ‘Bi-Citation AND Recommendations’ method.

2.5. Data collection and analysis

2.5.1 Selection of studies

Initial screening and data extraction were conducted as stated in the protocol. In short, for the base-review we screened all retrieved publications using the Abstrackr tool. All abstracts were screened by two independent reviewers. Conflicting judgements were resolved by the authors who made the initial screening decisions. Full-text screening was conducted in a similar manner to abstract screening but used our web application for LSRs, described in the following section.

For the updated review we used our living review web application to retrieve all publications, with the exception of the items retrieved by EPPI-Reviewer (these are added to the dataset separately). We further used our application to de-duplicate, screen, and extract data from all publications.

A methodological update to the screening process was the change to single screening for assessing eligibility at both abstract and full-text level, reducing dual screening to 10% of the publications.

2.5.2 Data extraction, assessment, and management

We previously developed a web application to automate reference retrieval for living review updates (see Software availability 17 ), to support both abstract and full text screening for review updates, and to manage the data extraction process throughout. 17 For future updates of this living review we will use the web application, and not Abstrackr, for screening references. This web application is already in use by another living review. 18 It automates daily reference retrieval from the included sources and has a screening and data extraction interface. All extracted data are stored in a database. Figures and tables can be exported on a daily basis, and the progress in between review updates is shared on our living review website. The full spreadsheet of items extracted from each included reference is available in the Underlying data. 127 As previously described in the protocol, quality of reporting and reproducibility were initially assessed based on a previously published checklist for reproducibility in text mining, but some of the items were removed from the scope of this review update. 19

As planned in the protocol, a single reviewer conducted data extraction, and a random 10% of the included publications were checked by a second reviewer.

2.5.3 Visualisation

The creation of all figures and interactive plots on the living review website and in this review’s ‘Results’ section was automated based on structured content from our living review database (see Appendix A and D, Underlying data 127 ). We automated the export of PDF reports for each included publication. Calculation of percentages, export of extracted text, and creation of figures were also automated.
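As a simplified illustration of this kind of automation (the table and column names below are invented, not the review's actual database schema), counts can be queried from a structured database and exported as a figure in a few lines:

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical miniature of a review database: one row per
    # (publication, architecture component) pair.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE architectures (publication_id TEXT, component TEXT)")
    conn.executemany(
        "INSERT INTO architectures VALUES (?, ?)",
        [("pub1", "BERT"), ("pub2", "BERT"), ("pub2", "rule-base"), ("pub3", "SVM")],
    )

    counts = pd.read_sql_query(
        "SELECT component, COUNT(DISTINCT publication_id) AS n_publications "
        "FROM architectures GROUP BY component ORDER BY n_publications DESC",
        conn,
    )

    ax = counts.plot.barh(x="component", y="n_publications", legend=False)
    ax.set_xlabel("Number of publications")
    plt.tight_layout()
    plt.savefig("architecture_counts.png", dpi=300)  # figure exported for the review update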

2.5.4 Accessibility of data

All data and code are free to access. A detailed list of sources is given in the ‘Data availability’ and ‘Software availability’ sections.

2.6. Changes from protocol and between updates

In the protocol we stated that data would be available via an OSF repository. Instead, the full review data are available via the Harvard Dataverse, as this repository allows us to keep an assigned DOI after updating the repository with new content for each iteration of this living review. We also stated that we would screen all publications from the Web of Science search. Instead, we describe a changed approach in the Methods section, under ‘Search’. For review updates, Web of Science was dropped and replaced with OpenAlex searches via EPPI-Reviewer.

We added a data extraction item for the type of information which a publication mines (e.g. P, IC, O) into the section of primary items of interest, and we moved the type of input and output format from primary to secondary items of interest. We grouped the secondary item of interest ‘Other reported metrics, such as impacts on systematic review processes (e.g., time saved during data extraction)’ with the primary item of interest ‘Reported performance metrics used for evaluation’.

The item ‘Persistence: is the dataset likely to be available for future use?’ was changed to: ‘Can data be retrieved based on the information given in the publication?’. We decided not to speculate if a dataset is likely to be available in the future and chose instead to record if the dataset was available at the time when we tried to access it.

The item ‘Can we obtain a runnable version of the software based on the information in the publication?’ was changed to ‘Is an app available that does the data mining, e.g. a web-app or desktop version?’.

In this current version of the review we did not yet contact the authors of the included publications. This decision was made due to time constraints; however, reaching out to authors is planned as part of the first update to this living review.

In the base-review we assessed the included publications based on a list of 17 items in the domains of reproducibility (3.4.1), transparency (3.4.2), description of testing (3.4.3), data availability (3.4.4), and internal and external validity (3.4.5). The list of items was reduced to six items for the update:

  • 3.4.2.2 Is there a description of the dataset used and of its characteristics?
  • 3.4.2.4 Is the source code available?
  • 3.4.3.2 Are basic metrics reported (true/false positives and negatives)?
  • 3.4.4.1 Can we obtain a runnable version of the software based on the information in the publication?
  • 3.4.4.2 Persistence: Can data be retrieved based on the information given in the publication?
  • 3.4.5.1 Does the dataset or assessment measure provide a possibility to compare to other tools in the same domain?

The following items were removed, although the results and discussion from the assessment of these items in the base-review remains within the review text:

  • 3.4.1.1 Are the sources for training/testing data reported?
  • 3.4.1.2 If pre-processing techniques were applied to the data, are they described?
  • 3.4.2.1 Is there a description of the algorithms used?
  • 3.4.2.3 Is there a description of the hardware used?
  • 3.4.3.1 Is there a justification/an explanation of the model assessment?
  • 3.4.3.3 Does the assessment include any information about trade-offs between recall or precision (also known as sensitivity and positive predictive value)?
  • 3.4.4.3 Is the use of third-party frameworks reported and are they accessible?
  • 3.4.5.2 Are explanations for the influence of both visible and hidden variables in the dataset given?
  • 3.4.5.3 Is the process of avoiding overfitting or underfitting described?
  • 3.4.5.4 Is the process of splitting training from validation data described?
  • 3.4.5.5 Is the model’s adaptability to different formats and/or environments beyond training and testing data described?

3.1. Results of the search

Our database searches identified 10,107 publications after duplicates were removed (see Figure 2 ). We identified one more publication manually.

[Figure 2. Flow of publications through the review. Image: f1000research-10-151999-g0001.jpg]

This iteration of the living review includes 76 publications, summarised in Table A1 in Underlying data. 127

3.1.1 Excluded publications

Across the base-review and the update, 216 publications were excluded at the full text screening stage, with the most common reason for exclusion being that they did not fit target entities or target data. In most cases, this was due to the text types mined in the publications. Electronic health records and non-trial data were common, and we created a list of datasets that would be excluded in this category (see more information in Underlying data: Appendix B 127 ). Some publications addressed the right kind of text but were excluded for not mining data of interest to this review. For example, Norman, Leeflang and Névéol 23 performed data extraction for diagnostic test accuracy reviews, but focused on extracting the results and data for statistical analyses. Millard, Flach and Higgins 24 and Marshall, Kuiper and Wallace 25 looked at risk of bias classification, which is beyond the scope of this review. Boudin, Nie and Dawes 26 developed a weighting scheme based on an analysis of PICO element locations, leaving the detection of single PICO elements for future work. Luo et al. 27 extracted data from clinical trial registrations but focused on parsing inclusion criteria into event or temporal entities to aid participant selection for randomised controlled trials (RCTs).

The second most common reason for exclusion was that publications had ‘no original data extraction approach’. Rathbone et al., 28 for example, used hand-crafted Boolean searches specific to a systematic review’s PICO criteria to support the screening process of a review within EndNote. We classified this article as not having any original data extraction approach because it does not create any structured outputs specific to P, IC, or O. Malheiros et al. 29 performed visual text mining, supporting systematic review authors by document clustering and text highlighting. Similarly, Fabbri et al. 30 implemented a tool that supports the whole systematic review workflow, from protocol to data extraction, performing clustering and identification of similar publications. Other systematic reviewing tasks that can benefit from automation but were excluded from this review are listed in Underlying data: Appendix B. 127

3.2. Results from the data extraction: Primary items of interest

3.2.1 Automation approaches used

Figure 3 shows aspects of the system architectures implemented in the included publications. A short summary of these for each publication is provided in Table A1 in Underlying data. 127 Where possible, we tried to break down larger system architectures into smaller components. For example, an architecture combining a word embedding + long short-term memory (LSTM) network would have been broken down into the two respective sub-components. We grouped binary classifiers, such as naïve Bayes and logistic regression. Although SVM is also a binary classifier, it was assigned a separate category due to its popularity. The final categories are a mixture of non-machine-learning automation (application programming interface (API) and metadata retrieval, PDF extraction, rule-base), classic machine-learning (naïve Bayes, decision trees, SVM, or other binary classifiers) and neural or deep-learning approaches (convolutional neural network (CNN), LSTM, transformers, or word embeddings). This figure shows that there is no obvious choice of system architecture for this task. For the LSR update, the strongest trend was the increasing application of BERT (Bidirectional Encoder Representations from Transformers). BERT was published in 2018, and other architecturally identical versions of it tailored to scientific text, such as SciBERT, are summarised under the same category in this review. 14 , 31 In the base-review, BERT was used three times, whilst it now appears 21 times. Other transformer-based architectures, such as the bio-pretrained version of ELECTRA, are also gaining attention, 32 , 33 as well as FLAIR-based models. 34 – 36

[Figure 3. System architecture components used in the included publications. Image: f1000research-10-151999-g0002.jpg]

Results are divided into different categories of machine-learning and natural language processing approaches and coloured by the year of publication. More than one architecture component per publication is possible. API, application programming interface; BERT, bidirectional encoder representations from Transformers; CNN, convolutional neural network; CRF, conditional random fields; LSTM, long short-term memory; PICO, population, intervention, comparison, outcome; RNN, recurrent neural networks; SVM, support vector machines.
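To illustrate how such transformer-based extraction is typically applied (this is not the implementation of any included publication), a fine-tuned BERT-style token classifier can be run over an abstract with the Hugging Face transformers library; the model name below is a placeholder for whichever PICO-fine-tuned checkpoint is available.

    from transformers import pipeline

    # "some-org/pico-ner" is a placeholder; a real PICO-fine-tuned checkpoint would go here.
    pico_tagger = pipeline(
        "token-classification",
        model="some-org/pico-ner",
        aggregation_strategy="simple",  # merge word pieces back into whole-word entity spans
    )

    abstract = ("120 adults with type 2 diabetes were randomised to metformin "
                "or placebo; the primary outcome was HbA1c at 12 weeks.")

    for entity in pico_tagger(abstract):
        # Each prediction carries an entity group (e.g. population/intervention/outcome),
        # the matched text span, and a confidence score.
        print(entity["entity_group"], entity["word"], round(entity["score"], 2))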

Rule-bases, including approaches using heuristics, wordlists, and regular expressions, were one of the earliest techniques used for data extraction in the EBM literature, and they remain one of the most frequently used approaches to automation. Nine publications (12%) use rule-bases alone, while the rest of the publications use them in combination with other classifiers (data shown in Underlying data: Appendix A and D 127 ). Although used more frequently in the past, 11 publications published between 2017 and now use this approach alongside other architectures such as BERT, 33 , 37 – 39 conditional random fields (CRF), 40 SVM, 41 or other binary classifiers. 42 In practice, these systems use rule-bases in the form of hand-crafted lists to identify candidate phrases for amount entities such as sample size, 42 , 43 or to refine a result obtained by a machine-learning classifier on the entity level (e.g., instances where a specific intervention or outcome is extracted from a sentence). 40
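As a concrete, simplified illustration of such a rule (the patterns are invented for this example and not drawn from any included publication), hand-crafted regular expressions can propose candidate sample-size mentions for a downstream classifier or human reviewer to confirm:

    import re

    # Illustrative patterns for sample-size ("N") candidates in trial abstracts
    SAMPLE_SIZE_PATTERNS = [
        re.compile(r"\bn\s*=\s*(\d{1,6})\b", re.IGNORECASE),
        re.compile(r"\b(\d{1,6})\s+(?:patients|participants|subjects|adults|women|men)\s+"
                   r"were\s+(?:randomi[sz]ed|enrolled|recruited)", re.IGNORECASE),
    ]

    def sample_size_candidates(text: str) -> list[int]:
        """Return every number matched by one of the hand-crafted sample-size rules."""
        hits = []
        for pattern in SAMPLE_SIZE_PATTERNS:
            hits.extend(int(match.group(1)) for match in pattern.finditer(text))
        return hits

    print(sample_size_candidates(
        "A total of 240 patients were randomised (n = 120 per arm)."
    ))  # -> [120, 240]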

Binary classifiers, most notably naïve Bayes and SVMs, are also common system components in the data extraction literature. They appear in studies published between 2005 and now, but their usage started declining with the advent of neural models.

Embedding and neural architectures have increasingly been used in the literature over the past seven years. Recurrent neural networks (RNN), CNN, and LSTM networks require larger amounts of training data; by using transformer-based embeddings with pre-training algorithms based on unlabelled data, they have become increasingly attractive in fields such as data extraction for EBM, where high-quality training data are difficult and expensive to obtain.

In the ‘Other’ category, tools mentioned were mostly other classifiers such as maximum entropy classifiers (n = 3), kLog, J48, and various position or document-length classification algorithms. We also added methods such as supervised distant supervision (n = 3, see Ref. 44 ) and novel training approaches to existing neural architectures in this category.

3.2.2 Reported performance metrics used for evaluation

Precision (i.e., positive predictive value), recall (i.e., sensitivity), and F1 score (harmonic mean of precision and recall) are the most widely used metrics for evaluating classifiers. This is reflected in Figure 4 , which shows that at least one of these metrics was used in the majority of the included publications. Accuracy and area under the curve - receiver operator characteristics (AUC-ROC) were less frequently used.

[Figure 4. Performance metrics used for evaluation in the included publications. Image: f1000research-10-151999-g0003.jpg]

More than one metric per publication is possible, which means that the total number of included publications (n = 76) is lower than the sum of counts of the bars within this figure. AUC-ROC, area under the curve - receiver operator characteristics; F1, harmonic mean of precision and recall.
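For reference, these metrics follow directly from the confusion-matrix counts; a minimal helper using the standard definitions:

    def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN),
        F1 = harmonic mean of precision and recall."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Example: 80 correctly extracted entities, 20 spurious, 40 missed
    print(precision_recall_f1(tp=80, fp=20, fn=40))  # (0.8, 0.666..., 0.727...)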

There were several approaches to, and justifications for, using macro- or micro-averaged precision, recall, or F1 scores in the included publications. Micro or macro scores are computed in multi-class cases, and the final scores can differ whenever the classes in a dataset are imbalanced (as is the case in most datasets used for automating data extraction in SR automation).
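A small invented example shows how the two averages diverge on imbalanced classes, here using scikit-learn on a toy label set dominated by one class:

    from sklearn.metrics import f1_score

    # Imbalanced three-class toy example: the "O" (outcome) class dominates
    y_true = ["O", "O", "O", "O", "O", "O", "P", "P", "I", "I"]
    y_pred = ["O", "O", "O", "O", "O", "O", "P", "O", "I", "O"]

    # Micro-averaging pools all decisions, so the majority class dominates the score;
    # macro-averaging gives each class equal weight regardless of its frequency.
    print("micro F1:", f1_score(y_true, y_pred, average="micro"))
    print("macro F1:", f1_score(y_true, y_pred, average="macro"))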

Both micro and macro scores were reported by Singh et al. (2021), 45 Kilicoglu et al. (2021), 38 Kiritchenko et al. (2010), 46 and Fiszman et al. (2007), 47 whereas Karystianis et al. (2014, 2017) 48 , 49 reported micro scores across documents and macro scores across the classes.

Macro-scores were used in one publication. 37

Micro scores were used by Fiszman et al. 47 for class-level results. In one publication, the harmonic mean was used for precision and recall, while micro-scoring was used for F1. 50 Micro scores were the most widely used overall, including by Al-Hussaini et al. (2022), 32 Sanchez-Graillet et al. (2022), 51 Kim et al. (2011), 52 Verbeke et al. (2012), 53 and Jin and Szolovits (2020), 54 and they were used in the evaluation script of Nye et al. (2018). 55

In the category ‘Other’ we added several instances where a relaxation of a metric was introduced, e.g., precision using the top-n classified sentences 44 , 46 , 56 or mean average precision and the metric ‘precision @rank 10’ for sentence ranking exercises. 57 , 58 Another type of relaxation for standard metrics is a distance relaxation when normalising entities into concepts in Medical Subject Headings (MeSH) or the Unified Medical Language System (UMLS), allowing N hops between predicted and target concepts. 59
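As an illustration of one such relaxed metric, precision over the top-n ranked sentences only scores the n highest-ranked predictions; a minimal sketch with invented sentence identifiers:

    def precision_at_n(ranked_sentence_ids: list[str], relevant_ids: set[str], n: int) -> float:
        """Fraction of the top-n ranked sentences that are truly relevant."""
        top_n = ranked_sentence_ids[:n]
        return sum(1 for sid in top_n if sid in relevant_ids) / n

    # Hypothetical ranking of candidate outcome sentences for one abstract
    ranking = ["s4", "s1", "s7", "s2", "s9"]
    gold = {"s1", "s7"}
    print(precision_at_n(ranking, gold, n=3))  # 2 of the top 3 are relevant -> 0.667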

The LSR update showed an increasing trend towards text summarisation and relation extraction algorithms. ROUGE, ∆EI, and Jaccard similarity were used as metrics for summarisation. 60 , 61 For relation extraction, F1, precision, and recall remained the most common metrics. 62 , 63

Other metrics were kappa, 58 random shuffling 64 or binomial proportion test 65 to test statistical significance, given with confidence intervals. 41 Further metrics included under ‘Other’ were odds ratios, 66 normalised discounted cumulative gain, 44 , 67 ‘sentences needed to screen per article’ in order to find one relevant sentence, 68 McNemar test, 65 C-statistic (with 95% CI) and Brier score (with 95% CI). 69 Barnett (2022) 70 extracted sample sizes and reported the mean difference between true and extracted numbers.

Real-life evaluations, such as the percentage of outputs needing human correction, or time saved per article, were reported by two publications, 32 , 46 and an evaluation as part of a wider screening system was done in another. 71

3.2.3 Type of data

3.2.3.1 Scope and data

Most data extraction is carried out on abstracts (see Table A1 in Underlying data, 127 and the supplementary table giving an overview of all included publications). Abstracts are the most practical choice, due to the possibility of exporting them along with literature search results from databases such as MEDLINE. In total, 84% (N=64) of the included publications directly reported using abstracts. Within the 19 references (25%) that reported usage of full texts, eight specifically mentioned that this also included abstracts, but it is unclear if all full texts included abstract text. Described benefits of using full texts for data extraction include access to a more complete dataset, while the benefits of using titles (N=4, 5%) include lower complexity for the data extraction task. 43 Xu et al. (2010) 72 exclusively used titles, while the other three publications that specifically mentioned titles also used abstracts in their datasets. 43 , 73 , 74

Figure 5 shows that RCTs are the most common study design texts used for data extraction in the included publications (see also extended Table A1 in Underlying data 127 ). This is not surprising, because systematic reviews of interventions are the most common type of systematic review, and they usually focus on evidence from RCTs. Therefore, the literature on automation of data extraction focuses on RCTs and their related PICO elements. Systematic reviews of diagnostic test accuracy are less frequent, and only one included publication specifically focused on text and entities related to these studies, 75 while two mentioned diagnostic procedures among other fields of interest. 35 , 76 Eight publications focused on extracting data specifically from epidemiology research, non-randomised interventional studies, or included text from cohort studies as well as RCT text. 48 , 49 , 61 , 72 – 74 , 76 , 77 More publications mining data from surveys, animal RCTs, or case series might have been found if our search and review had concentrated on these types of texts.

[Figure 5. Study design text types used for data extraction in the included publications. Image: f1000research-10-151999-g0004.jpg]

Randomised controlled trial (RCT) text was commonly at least one of the target text types used in the included publications.

3.2.3.2 Data extraction targets

Mining P, IC, and O elements is the most common task performed in the literature of systematic review (semi-)automation (see Table A1 in Underlying data, 127 and Figure 6 ). In the base-review, P was the most common entity. After the LSR update, O (n=52, 68%) has become the most popular, due to the emerging trend of relation-extraction models that focus on the relationship between O and I entities and therefore may omit the automatic extraction of P. Some of the less frequent data extraction targets in the literature can be categorised as sub-classes of a PICO, 55 for example by annotating hierarchically multiple entity types such as health condition, age, and gender under the P class. The entity type ‘P (Condition and disease)’ was the most common entity closely related to the P class, appearing in twelve included publications, of which four were published in 2021 or later. 35 , 36 , 51 , 55 , 63 , 71 , 75 , 76 , 78 – 81

[Figure 6. Data extraction targets (PICO elements and other entities) in the included publications. Image: f1000research-10-151999-g0005.jpg]

More than one entity type per publication is common, which means that the total number of included publications (n = 76) is lower than the sum of counts within this figure. P, population; I, intervention; C, comparison; O, outcome.

Notably, eleven publications annotated or worked with datasets that differentiated between intervention and control arms; four of these were published after 2020, reflecting a trend towards relation extraction and summarisation tasks that require this type of data. 46 , 47 , 51 , 56 , 62 , 63 , 66 , 82 – 84 Usually, I and C are merged (n=47). Most data extraction approaches focused on recognising instances of entity or sentence classes, and a small number of publications went one step further, normalising to actual concepts using data sources such as the UMLS (Unified Medical Language System). 35 , 39 , 59 , 73 , 85

The ‘Other’ category includes some more detailed drug annotations 65 or information such as confounders 49 and other entity types (see the full dataset in Underlying data: Appendix A and D for more information 127 ).

3.3. Results from the data extraction: Secondary items of interest

3.3.1 Granularity of data extraction

A total of 54 publications (71%) extracted at least one type of information at the entity level, while 46 publications (60%) used sentence level (see Table A1 extended version in Underlying data 127 ). We defined the entity level as any number of words that is shorter than a whole sentence, e.g., noun-phrases or other chunked text. Data types such as P, IC, or O commonly appeared to be extracted on both entity and sentence level, whereas ‘N’, the number of people participating in a study, was commonly extracted on entity level only.

3.3.2 Type of input

The majority of publications and benchmark corpora mentioned MEDLINE, via PubMed, as the data source for text. Text files (n = 64) are the most common format of the data downloaded from these sources, followed by XML (n = 8) and HTML (n = 3). Therefore, most systems described using, or were assumed to use, text files as input data. Eight included publications described using PDF files as input. 44 , 46 , 59 , 68 , 75 , 81 , 86 , 87

3.3.3 Type of output

A limited number of publications described structured summaries as output of their extracted data (n = 14, increasing trend between LSR updates). Alternatives to exporting structured summaries were JSON (n = 4), XML, and HTML (n = 2 each). Two publications mentioned structured data outputs in the form of an ontology. 51 , 88 Most publications mentioned only classification scores without specifying an output type. In these cases, we assumed that the output would be saved as text files, for example as entity span annotations or lists of sentences (n = 55).
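To make the contrast between output types concrete, a structured JSON export of extracted entities might look like the following; the field names and values are purely illustrative and do not follow any particular included tool:

    import json

    # Hypothetical structured output for one abstract (field names are illustrative)
    extracted = {
        "pmid": "12345678",
        "population": ["adults with type 2 diabetes"],
        "intervention": ["metformin"],
        "comparator": ["placebo"],
        "outcomes": ["HbA1c at 12 weeks"],
        "sample_size": 240,
    }

    print(json.dumps(extracted, indent=2))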

3.4. Assessment of the quality of reporting

In the base-review we used a list of 17 items to investigate reproducibility, transparency, description of testing, data availability, and internal and external validity of the approaches in each publication. The maximum and minimum number of items that were positively rated were 16 and 1, respectively, with a median of 10 (see Table A1 in Underlying data 127 ). Scores were added up and calculated based on the data provided in Appendix A and D (see Underlying data 127 ), using the sum and median functions integrated in Excel. Publications from recent years up to 2021 showed a trend towards more complete and clear reporting.

3.4.1 Reproducibility

3.4.1.1 Are the sources for training/testing data reported?

Of the included publications in the base-review, 50 out of 53 (94%) clearly stated the sources of their data used for training and evaluation. MEDLINE was the most popular source of data, with abstracts usually described as being retrieved via searches on PubMed, or full texts from PubMed Central. A small number of publications described using text from specific journals such as PLoS Clinical Trials, New England Journal of Medicine, The Lancet, or BMJ. 56 , 83 Texts and metadata from Cochrane, either provided in full or retrieved via PubMed, were used in five publications. 57 , 59 , 68 , 75 , 86 Corpora such as the ebm-nlp dataset, 55 or PubMed-PICO 54 are available for direct download. Publications published in recent years are increasingly reporting that they are using these benchmark datasets rather than creating and annotating their own corpora (see 4 for more details).

3.4.1.2 If pre-processing techniques were applied to the data, are they described?

Of the included publications in the base-review, 47 out of 53 (89%) reported processing the textual data before applying/training algorithms for data extraction. Different types of pre-processing, with representative examples for usage and implementation, are listed in Table 1 below.

After the publication of the base-review, transformer models such as BERT became dominant in the literature (see Figure 3 ). With their word-piece vocabulary, contextual embeddings, and self-supervised pre-training on large unlabelled corpora, these models have essentially removed the need for most pre-processing beyond automatically applied lower-casing. 14 , 31 We are therefore not going to update this table in this or any future iteration of this LSR. We leave it for reference to publications that may still use these methods in the future.
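For illustration, this word-piece handling happens inside the model's own tokeniser, so raw text can be passed in with essentially no manual pre-processing; the example below uses the publicly available bert-base-uncased tokeniser:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # The tokeniser lower-cases the text itself and splits rarer words into
    # word pieces marked with '##', so no stemming, stop-word removal, or other
    # manual pre-processing step is required before the model sees the text.
    print(tokenizer.tokenize("Participants were Randomised to receive metformin."))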

3.4.2 Transparency of methods

3.4.2.1 Is there a description of the algorithms used?

Figure 7 shows that 43 out of 53 publications in the base-review (81%) provided descriptions of their data extraction algorithm. In the case of machine learning and neural networks, we looked for a description of hyperparameters and feature generation, and for the details of implementation (e.g. the machine-learning framework). Hyperparameters were rarely described in full, but if the framework (e.g., Scikit-learn, Mallet, or Weka) was given, in addition to a description of implementation and important parameters for each classifier, then we rated the algorithm as fully described. For rule-based methods we looked for a description of how rules were derived, and for a list of full or representative rules given as examples. Where multiple data extraction approaches were described, we gave a positive rating if the best-performing approach was described.

[Figure 7. Description of the data extraction algorithms in the base-review publications. Image: f1000research-10-151999-g0006.jpg]

3.4.2.2 Is there a description of the dataset used and of its characteristics?

Of the included publications in the review update, 73 out of 76 (97%) provided descriptions of their dataset and its characteristics.

Most publications provided descriptions of the dataset(s) used for training and evaluation. The size of each dataset, as well as the frequencies of classes within the data, were transparent and described for most included publications. All dataset citations, along with a short description and availability of the data, are shown in Table 4 .

RCT, randomized controlled trials; IR, information retrieval; PICO, population, intervention, comparison, outcome; UMLS, unified medical language system.

3.4.2.3 Is there a description of the hardware used?

Most included publications in the base-review did not report their hardware specifications, though five publications (9%) did. One, for example, applied their system to new, unlabelled data and reported that classifying the whole of PubMed takes around 20 hours using a graphics processing unit (GPU). 69 In another example, the authors reported using Google Colab GPUs, along with estimates of computing time for different training settings. 95

3.4.2.4 Is the source code available?

Figure 8 shows that most of the included publications did not provide any source code, although there is a very strong trend towards better code availability in the publications from the review update (n = 19; 83% of the new publications provided code). Publications that did provide the source code were exclusively published or last updated in the last seven years. GitHub is the most popular platform for making code accessible. Some publications also provided links to notebooks on Google Colab, which is a cloud-based platform to develop and execute code online. Two publications provided access to parts of the code, or access was restricted. A full list of code repositories from the included publications is available in Table 2.

[Figure 8. Availability of source code in the included publications. Image: f1000research-10-151999-g0007.jpg]

3.4.3 Testing

3.4.3.1 Is there a justification/an explanation of the model assessment?

Of the included publications in the base-review, 47 out of 53 (89%) gave a detailed assessment of their data extraction algorithms. We rated this item as negative if only the performance scores were given, i.e., if no error analysis was performed and no explanations or examples were given to illustrate model performance. A brief error analysis was common, for example discussions of representative examples of false negatives and false positives, 47 of major error sources, 90 or of errors with respect to every entity class. 76 Both Refs. 52 , 53 used structured and unstructured abstracts, and therefore discussed the implications of unstructured text data for classification scores.

A small number of publications did a real-life assessment, where the data extraction algorithm was applied to different, unlabelled, and often much larger datasets or tested while conducting actual systematic reviews. 46 , 58 , 63 , 69 , 48 , 95 , 101 , 102

3.4.3.2 Are basic metrics reported (true/false positives and negatives)?

Figure 9 shows the extent to which all raw basic metrics, such as true-positives, were reported in the included publications in the LSR update. In most publications (n = 62) these basic metrics are not reported, and there is a trend between base-review and this update towards not reporting these. However, basic metrics could be obtained since the majority of new included publications made source code available and used publicly available datasets. When dealing with entity-level data extraction it can be challenging to define the quantity of true negative entities. This is true especially if entities are labelled and extracted based on text chunks, because there can be many combinations of phrases and tokens that constitute an entity. 47 This problem was solved in more recent publications by conducting a token-based evaluation that computes scores across every single token, hence gaining the ability to score partial matches for multi-word entities. 55

[Figure 9] Reporting of basic metrics for each included paper. More than one selection is possible per paper, so the sum of counts within this figure exceeds the total number of included publications (n=76).
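To make the token-based evaluation described above concrete, the following minimal sketch (not taken from any included publication; the token spans are invented) shows how token-level true positives, false positives, and false negatives, and the precision, recall, and F1 derived from them, can be computed so that partial overlaps between multi-word entities still earn credit.

    # Minimal sketch: token-level evaluation of entity predictions (illustrative data).
    # Gold and predicted annotations are sets of token indices for one entity class;
    # partial overlaps between multi-word entities are credited token by token.
    def token_level_scores(gold_tokens, pred_tokens):
        """Return (precision, recall, f1) for one entity class."""
        tp = len(gold_tokens & pred_tokens)   # tokens labelled in both
        fp = len(pred_tokens - gold_tokens)   # predicted but not in the gold standard
        fn = len(gold_tokens - pred_tokens)   # in the gold standard but missed
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    # A gold 'Population' entity covering tokens 4-9 and a prediction covering 6-11
    # produce a partial match rather than a complete miss.
    print(token_level_scores(set(range(4, 10)), set(range(6, 12))))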

3.4.3.3 Does the assessment include any information about trade-offs between recall or precision (also known as sensitivity and positive predictive value)?

Of the included publications in the base-review, 17 out of 53 (32%) described trade-offs or provided plots or tables showing the development of evaluation scores if certain parameters were altered or relaxed. Recall (i.e., sensitivity) is often described as the most important metric for systematic review automation tasks, as it is a methodological demand that systematic reviews do not exclude any eligible data.

References 56 and 76 showed how the decision to extract the top two or top-N predictions impacts evaluation scores such as precision or recall. Reference 102 shows precision-recall plots for different classification thresholds. Reference 72 shows four cut-offs, whereas Ref. 95 shows different probability thresholds for their classifier and describes the impact of this on precision, recall, and F1 curves.

Some machine-learning architectures need to convert text into features before performing classification. A feature can be, for example, the number of times that a certain word occurs, or the length of an abstract. The number of features used (e.g. for CRF algorithms) was given in multiple publications, 92 together with a discussion of which classifiers should be used when high recall is needed. References 42 and 103 show ROC curves quantifying the amount of training data and its impact on the scores.
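As a hedged illustration of how such precision/recall trade-off curves can be produced (none of the included systems is reproduced here; the dataset and classifier below are synthetic), predicted probabilities can be passed to scikit-learn's precision_recall_curve and read off at different thresholds:

    # Sketch: precision/recall trade-off of a classifier across probability
    # thresholds, using synthetic data (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probs = clf.predict_proba(X_te)[:, 1]

    precision, recall, thresholds = precision_recall_curve(y_te, probs)
    for p, r, t in list(zip(precision, recall, thresholds))[::25]:
        print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")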

3.4.4 Availability of the final model or tool

3.4.4.1 Can we obtain a runnable version of the software based on the information in the publication?

Compiling and testing code from every publication is outside the scope of this review. Instead, in Figure 10 and Table 3 we recorded the publications where a (web) interface or finished application was available. Counting RobotReviewer and Trialstreamer as separate projects, 12% of the included publications had an application associated with them, but only 5 (6%) are available and directly usable via web-apps. Applications were available as open-source software, completely free, or as free basic versions with optional features that can be purchased or subscribed to.


3.4.4.2 Persistence: Can data be retrieved based on the information given in the publication?

We observed an increasing trend of dataset availability and publications re-using benchmark corpora within the LSR update. Only seven of the included publications in the base-review (13%) made their datasets publicly available, out of the 36 unique corpora found then.

After the LSR update we accumulated 55 publications that describe unique new corpora. Of these, 23 corpora were available online and a total of 40 publications mentioned using one of these public benchmarking sets. Table 4 shows a summary of the corpora, their size, classes, links to the datasets, and cross-references to known publications re-using each dataset. For the base-review we collected the corpora and provided a central link to all datasets, and we will add datasets as they become available during the life span of this living review (see Underlying data 127 , 128 below). Due to the increased number of available corpora we stopped downloading the data and provide links instead. When a dataset is made freely available without barriers (i.e., direct downloads of text and labels), any researcher can re-use the data and publish results from different models, which then become comparable to one another. Copyright issues surrounding data sharing were noted by Ref. 75 ; they therefore shared the gold-standard annotations used as training or evaluation data, together with information on how to obtain the texts.

3.4.4.3 Is the use of third-party frameworks reported and are they accessible?

Of the included publications in the base-review, 47 out of 53 (88%) described using at least one third-party framework for their data extraction systems. The following list is likely to be incomplete, due to non-available code and incomplete reporting in the included publications. Most commonly, machine-learning toolkits were described (Mallet, N = 12; Weka, N = 6; TensorFlow, N = 5; scikit-learn, N = 3). Natural language processing toolkits such as the Stanford parser/CoreNLP (N = 12) or NLTK (N = 3) were also commonly reported for the pre-processing and dependency-parsing steps. The MetaMap tool was used in nine publications, and the GENIA tagger in four. For the complete list of frameworks please see Appendix A and D in Underlying data. 127
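These frameworks are combined in many different ways across the included systems. Purely as an illustrative sketch (not any particular included system; the toy sentences and labels are invented), two of the commonly reported toolkits, NLTK for tokenisation and scikit-learn for feature extraction and classification, can be chained into a minimal sentence classifier as follows.

    # Illustrative only: a tiny sentence classifier built from two commonly
    # reported third-party frameworks (NLTK for tokenisation, scikit-learn for
    # feature extraction and classification). Toy data, far too small for real use.
    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    nltk.download("punkt", quiet=True)  # tokeniser models used by word_tokenize

    sentences = [
        "Participants were 120 adults with type 2 diabetes.",
        "The primary outcome was HbA1c at 12 weeks.",
    ]
    labels = ["P", "O"]

    pipeline = make_pipeline(
        TfidfVectorizer(tokenizer=nltk.word_tokenize, lowercase=True),
        LogisticRegression(max_iter=1000),
    )
    pipeline.fit(sentences, labels)
    print(pipeline.predict(["Outcomes included pain scores at 6 months."]))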

3.4.5 Internal and external validity of the model

3.4.5.1 Does the dataset or assessment measure provide a possibility to compare to other tools in the same domain?

With this item we assessed whether the evaluation results of a model are comparable with the results of other models. Ideally, a publication would report the results of another classification model on the same dataset, either by re-implementing the model 96 or by describing the results of other models on benchmark datasets. 64 This was rarely the case for the publications in the base-review, as most datasets were curated and used in single publications only. However, the re-use of benchmark corpora increased with the publications in the LSR update, where we found 40 publications that report results on one of the previously published benchmark datasets (see Table 4 ).

Additionally, in the base-review, 40 publications (75%) described their data well and used common entities and common assessment metrics, such as precision, recall, and F1 scores, giving their results a limited degree of comparability. Comparability remains limited in these cases because the publications used different datasets, which can influence the difficulty of the data extraction task and lead to better results on, for example, structured or topic-specific datasets.

3.4.5.2 Are explanations for the influence of both visible and hidden variables in the dataset given?

This item relates only to publications using machine learning or neural networks. It is not applicable to rule-based classification systems (N = 8, 15% reported a rule base as their sole approach), because the rules leading to decisions are intentionally chosen by the creators of the system and are therefore always visible.

Ten publications in the base-review (19%) discussed hidden variables. Reference 83 found that identification of the treatment-group entity yielded the best results, but that the system had problems identifying the entity when neither the word ‘group’ nor ‘arm’ was present in the text. ‘Trigger tokens’ 104 and the influence of common phrases were also described by Ref. 68 ; the latter showed that their system was able to yield some positive classifications in the absence of common phrases. Reference 103 went a step further and provided a table of the words that had the most impact on the prediction of each class. Reference 57 describes removing sentence headings in structured abstracts to avoid creating a system biased towards common terms, while Ref. 90 discussed abbreviations and grammar as factors influencing the results. Length of input text 59 and position of a sentence within a paragraph or abstract (e.g. up to 10% lower classification scores for certain sentence combinations in unstructured abstracts) were shown to matter in several publications. 46 , 66 , 102
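As a sketch of one common way to surface such influential words in a linear bag-of-words classifier (this is not the method of Ref. 103; the data and labels below are invented), the learned coefficients can simply be sorted:

    # Sketch: listing the words that push a linear bag-of-words classifier towards
    # a class, one simple way to inspect 'trigger tokens' (toy data, illustrative only).
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    texts = [
        "patients were randomised to the intervention group",
        "the control arm received placebo",
        "mean age was 54 years",
        "baseline characteristics are shown in table 1",
    ]
    labels = [1, 1, 0, 0]  # 1 = sentence mentions a trial arm (toy labels)

    vec = CountVectorizer()
    X = vec.fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)

    vocab = np.array(vec.get_feature_names_out())
    top = np.argsort(clf.coef_[0])[::-1][:5]          # largest positive weights
    print(list(zip(vocab[top], clf.coef_[0][top].round(2))))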

3.4.5.3 Is the process of avoiding overfitting or underfitting described?

‘Overfitted’ is a term used to describe a system that shows particularly good evaluation results on a specific dataset because it has learned to classify noise and other intrinsic variations in the data as part of its model. 105

Of the included publications in the base-review, 33 out of 53 (62%) reported that they used methods to avoid overfitting. Eight publications (15%) reported rule-based classification as their only approach and are therefore not susceptible to overfitting in the machine-learning sense.

Furthermore, 28 publications reported cross-validation to avoid overfitting, mostly for classifiers in the classical machine-learning domain such as SVMs. Most commonly, 10 folds were used (N = 15), but depending on the size of the evaluation corpora, 3, 5, 6, or 15 folds were also described. Two publications 55 , 85 cautioned that cross-validation with a high number of folds (e.g. 10) causes high variance in evaluation results when using small datasets such as NICTA-PIBOSO. One publication 104 stratified folds by class in order to avoid the variance caused by a sparsity of positive instances within a fold.
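A minimal sketch of stratified cross-validation (assuming scikit-learn; the small, imbalanced dataset below is synthetic and purely illustrative) shows how stratification keeps the sparse positive class represented in every fold:

    # Sketch: stratified 10-fold cross-validation on a small, imbalanced synthetic
    # dataset; stratification keeps the sparse positive class present in each fold.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=50,
                               weights=[0.9, 0.1],  # sparse positive class
                               random_state=0)

    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv, scoring="f1")
    print(scores.round(2), "mean F1:", scores.mean().round(2))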

Publications in the neural and deep-learning domain described approaches such as early stopping, dropout, L2-regularisation, or weight decay. 59 , 96 , 106 Some publications did not specifically discuss overfitting in the text, but their open-source code indicated that the latter techniques were used. 55 , 75
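The following sketch is not taken from any included publication; it only illustrates, in a generic Keras model trained on invented random data, what the three techniques named above (dropout, L2 regularisation/weight decay, and early stopping) look like in code.

    # Sketch of the overfitting-mitigation techniques named above in a generic
    # Keras model; architecture, parameters, and data are illustrative only.
    import numpy as np
    import tensorflow as tf

    X_train = np.random.rand(200, 20); y_train = np.random.randint(0, 2, 200)
    X_val = np.random.rand(50, 20); y_val = np.random.randint(0, 2, 50)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
        tf.keras.layers.Dropout(0.5),  # randomly deactivates units during training
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                                  restore_best_weights=True)
    model.fit(X_train, y_train, validation_data=(X_val, y_val),
              epochs=100, callbacks=[early_stop], verbose=0)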

3.4.5.4 Is the process of splitting training from validation data described?

Random allocation to treatment groups is an important item when assessing bias in RCTs, because selective allocation can lead to baseline differences. 1 Similarly, the process of splitting a dataset randomly, or in a stratified manner, into training (or rule-crafting) and test data is important when constructing classifiers and intelligent systems. 117

All included publications in the base-review gave an indication of how the training and evaluation datasets were obtained. Most commonly there was a single dataset, and the reported splitting ratio indicated that splits were random; this information was provided in 36 publications (68%).

For publications mentioning cross-validation (N = 28, 53%) we assumed that splits were random. The splitting ratio (e.g. 80:20 for training and test data) was either implicit in the cross-validation set-up or described in the remaining publications.

It was also common for publications to use completely different datasets, or multiple iterations of splitting, training, and testing (N = 13, 24%). For example, Ref. 56 used cross-validation to train and evaluate their model, and then used an additional corpus after the cross-validation process. Similarly, Ref. 59 used 60:40 train/test splits, but then created an additional corpus of 88 documents to further validate the model’s performance on previously unseen data.
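As a minimal sketch (the ratio, seed, and placeholder data are illustrative, not taken from any included publication), a reproducible, stratified 80:20 split can be obtained with a fixed random seed, leaving any later-collected corpus as an additional external validation set:

    # Sketch: a reproducible, stratified 80:20 train/test split with a fixed seed.
    from sklearn.model_selection import train_test_split

    documents = [f"abstract {i}" for i in range(100)]  # placeholder texts
    labels = [i % 2 for i in range(100)]               # placeholder labels

    train_docs, test_docs, train_y, test_y = train_test_split(
        documents, labels, test_size=0.2, stratify=labels, random_state=42)

    print(len(train_docs), "training /", len(test_docs), "test documents")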

3.4.5.5 Is the model’s adaptability to different formats and/or environments beyond training and testing data described?

For this item we aimed to find out how many of the included publications in the base-review tested their data extraction algorithms on different datasets. A limitation often noted in the literature was that gold-standard annotators have varying styles and preferences, and that datasets were small and limited to a specific literature search. Evaluating a model on multiple independent datasets makes it possible to quantify how well data can be extracted across domains and how flexible a model is in real-life application with completely new datasets. Of the included publications, 19 (36%) discussed how their model performed on datasets with characteristics different from those used for training and testing. In some instances, however, this evaluation was qualitative, with the models applied to large, unlabelled, real-life datasets. 46 , 58 , 69 , 48 , 95 , 101 , 102

3.4.6 Other

3.4.6.1 Caveats

Caveats were extracted as free text. Included publications (N = 64, 86%) reported a variety of caveats. After extraction we structured them into six different domains:

  • 1. Label-quality and inter-annotator disagreements
  • 2. Variations in text
  • 3. Domain adaptation and comparability
  • 4. Computational or system architecture implications
  • 5. Missing information in text or knowledge base
  • 6. Practical implications

These are further discussed in the ‘Discussion’ section of this living review.

3.4.6.2 Sources of funding and conflict of interest

Figure 11 shows that most of the included publications in the base review did not declare any conflict of interest. This is true for most publications published before 2010, and about 50% of the literature published in more recent years. However, sources of funding were declared more commonly, with 69% of all publications including statements for this item. This reflects a trend of more complete reporting in more recent years.


4. Discussion

4.1. Summary of key findings

4.1.1 System architectures

Systems described within the included publications are changing over time. Non-machine-learning data extraction via rule bases and APIs is one of the earliest and most frequently used approaches. Various classical machine-learning classifiers, such as naïve Bayes and SVMs, are very common in the literature published between 2005 and 2018. Up until 2020 there was a trend towards word embeddings and neural networks such as LSTMs. Between 2020 and 2022 we observed a trend towards transformers, especially the BERT, RoBERTa and ELECTRA architectures pre-trained on biomedical or scientific text.

4.1.2 Evaluation

We found that precision, recall, and F1 were used as evaluation metrics in most publications, although sometimes these metrics were adapted or relaxed in order to account for partial or similar matches.

4.1.3 Scope

Most of the included publications focused on extracting data from abstracts. Reasons for this include the availability and ease of access of abstracts, their high coverage of key information, and the fact that structured abstracts can be used to derive labelled training data automatically. A much smaller number of the included publications (n=19, 25%) extracted data from full texts; half of these were published within the last seven years. In systematic review practice, manually extracting data from abstracts is quicker and easier than manually extracting data from full texts, so the potential time-saving from automating full-text data extraction is much higher, and it more closely reflects the work systematic reviewers actually do. However, the data extraction literature on full text is still sparse, and extraction from abstracts alone may be of limited value to reviewers in practice because it carries the risk of missing information. Whenever a publication reported full-text extraction we tried to find out whether this also included abstract text, in which case we counted the publication in both categories; however, this information was not always clearly reported.

4.1.4 Target texts

Reports of randomised controlled trials were the most common texts used for data extraction. Evidence concerning data extraction from other study types was rare and is discussed further in the following sections.

4.2. Assessment of the quality of reporting

We assessed the full set of quality-of-reporting items only in the base-review, and assessed selected items during the review update. The quality of reporting in the studies included in the base-review is improving over time. We assessed the included publications against a list of 17 items in the domains of reproducibility, transparency, description of testing, data availability, and internal and external validity.

Base-review: Reproducibility was high throughout, with information about sources of training and evaluation data reported in 94% of all publications and pre-processing described in 89%.

Base-review: In terms of transparency, 81% of the publications provided a clear description of their algorithm, 94% described the characteristics of their datasets, but only 9% mentioned hardware specifications or feasibility of using their algorithm on large real-world datasets such as PubMed.

Update: Availability of source code was high in the publications added in the LSR update (N=19, 83%). Before the update, 15% of all included publications had made their code available. Overall, 39% (N=30) now have their code available and all links to code repositories are shown in Table 2 .

Base-review: Testing of the systems was generally well described: 89% gave a detailed assessment of their algorithms. Trade-offs between precision and recall were discussed in 32%.

Update: Basic metrics were reported in only 19% (N=14) of the included publications, which is a downward trend from 24% in the base-review. However, more complete reporting of source-code and public datasets still leads to increased transparency and comparability.

Update: Availability of the final models as end-user tools was very poor. Only 12% of the included publications had an application associated with them, and only 5 (6%) are available and directly usable via web-apps (see Table 3 for links). Furthermore, it is unclear how many of the other tools described in the literature are used in practice, even if only internally within their authors' research groups. There was a surprisingly strong trend towards sharing and re-using already published corpora in the LSR update. Earlier, labelled training and evaluation data were available from 13% of the publications, and only a further 32% of all publications reported using one of these available datasets. Within the LSR update, 22 corpora were available online and at least 40 other included publications mention using them. Table 4 provides the sources of all corpora and the publications using them. For named-entity recognition, EBM-NLP 55 is the most popular dataset, used by at least 10 other publications and adapted and used by another four. For sentence classification the NICTA gold-standard 52 is used by eight others, and the automatically labelled corpus by Jin and Szolovits 96 is used by five others and was adapted once. For relation extraction the EvidenceInference 2.0 corpus is gaining attention, being used in at least three other publications.

Base-review: A total of 88% of the publications described using at least one accessible third-party framework for their data extraction system. Internal and external validity of each model was assessed based on its comparability to other tools (75%), assessment of visible and hidden variables in the data (19%), avoiding overfitting (62%; not applicable to non-machine-learning systems), descriptions of splitting training from validation data (100%), and adaptability and external testing on datasets with different characteristics (36%). These items, together with caveats and limitations noted in the included publications, are discussed in the following section.

4.3. Caveats and challenges for systematic review (semi)automation

In the following section we discuss caveats and challenges highlighted by the authors of the included publications. We found a variety of topics discussed in these publications and summarised them under seven different domains. Due to the increasing trend of relation-extraction and text summarisation models we now summarise any challenges or caveats related to these within the updated text at the end of each applicable domain.

4.3.1 Label-quality and inter-annotator disagreements

The quality of labels in annotated datasets was identified as a problem by several authors. The length of the entity being annotated, for example O or P entities, often caused disagreements between annotators. 46 , 48 , 58 , 69 , 95 , 101 , 102 We created an example in Figure 12 , which shows two potentially correct, but nevertheless different annotations on the same sentence.

[Figure 12] Example of two potentially correct, but different, annotations of the same sentence. P, population; I, intervention; C, comparison; O, outcome.

Similar disagreements, 65 , 85 , 104 along with missed annotations, 72 are time-intensive to reconcile 97 and make the scores less reliable. 95 As examples of this, two publications observed that their system performed worse on classes with high disagreement. 75 , 104 There are different possible explanations for worse performance in these cases: it may be harder for models to learn from labelled data containing systematic differences; the model may learn predictions based on one annotation style and then produce artificial errors when evaluated against differently labelled data; or the annotation task itself may simply be harder where inter-annotator disagreement is high, so lower model performance is to be expected. An overview of the included publications discussing this, together with their inter-annotator disagreement scores, is given in Table 5 .

Please see each included publication for further details on corpus quality.

To mitigate these problems, careful training of, and guides for, expert annotators are needed. 58 , 77 For example, information should be provided on whether multiple short entities or one longer entity annotation is preferred. 85 Crowd-sourced annotations can contain noisy or incorrect information and have low inter-rater reliability, but they can be aggregated to improve quality. 55 In recent publications, scoring partial entity matches (i.e., token-wise evaluation) was generally favoured over requiring complete entity detection, which helps to mitigate the impact of this problem on final evaluation scores. 55 , 83
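As a hedged sketch of the aggregation idea (a simple majority vote; published corpora often use more sophisticated aggregation models, and the labels below are invented), token labels from several annotators can be combined as follows:

    # Sketch: aggregating noisy token labels from several annotators by majority vote.
    from collections import Counter

    # Each inner list is one annotator's labels for the same token sequence.
    annotations = [
        ["O", "P", "P", "O", "I", "I"],
        ["O", "P", "P", "P", "I", "O"],
        ["O", "O", "P", "P", "I", "I"],
    ]

    aggregated = [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]
    print(aggregated)  # ['O', 'P', 'P', 'P', 'I', 'I']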

For automatically labelled or distantly supervised data, label quality is generally lower. This is primarily caused by incomplete annotation due to missing headings, or by ambiguity in sentence data, which is discussed as part of the next domain. 44 , 57 , 103

4.3.2 Ambiguity

The most common source of ambiguity in labels described in the included publications is associated with automatically labelled sentence-level data. Examples of this are sentences that could belong to multiple categories, e.g., those that should have both ‘P’ and an ‘I’ label, or sentences that were assigned to the class ‘other’ while containing PICO information (Refs. 54 , 95 , 96 , among others). Ambiguity was also discussed with respect to intervention terms 76 or when distinguishing between ‘control’ and ‘intervention’ arms. 46 When using, or mapping to UMLS concepts, ambiguity was discussed in Refs. 41 , 52 , 72 .

At the text level, ambiguity around the meaning of specific wordings was discussed as a challenge, e.g., the word 'concentration' can be a quantitative measure or a mental concept. 41 Numbers were also described as challenging due to ambiguity, because they can refer to the total number of participants, number per arm of a trial, or can just refer to an outcome-related number. 84 , 113 When classifying participants, the P entity or sentence is often overloaded because it includes too much information on different, smaller, entities within it, such as age, gender, or diagnosis. 89

Ambiguity in relation-extraction can include cases where interventions and comparators are classified separately in a trial with more than two arms, thus leading to an increased complexity in correctly grouping and extracting data for each separate comparison.

4.3.3 Variations in text

Variations in natural language, wording, or grammar were identified as challenges in many references that looked more closely at the texts within their corpora. Such variation may arise when describing entities or sentences (e.g., Refs. 48 , 79 , 97 ) or may reflect idiosyncrasies specific to one data source, e.g., the position of entities in a specific journal. 46 In particular, different styles or expressions were noted as caveats in rule-based systems. 42 , 48 , 80

There is considerable variation in how an entity is reported, for example between intervention types (drugs, therapies, routes of application) 56 or in outcome measures. 46 In particular, variations in style between structured and unstructured abstracts 65 , 78 and the description lengths and detail 59 , 79 can cause inconsistent results in the data extraction, for example by not detecting information correctly or extracting unexpected information. Complex sentence structure was mentioned as a caveat especially for rule-based systems. 80 An example of a complex structure is when more than one entity is described (e.g., Refs. 93 , 102 ) or when entities such as ‘I’ and ‘O’ are mentioned close to each other. 57 Finally, different names for the same entity within an abstract are a potential source of problems. 84 When using non-English texts, such as Spanish articles, it was noted that mandatory translation of titles can lead to spelling mistakes and translation errors. 35

Another common variation in text was implied information. For example, rather than stating dosage specifically, a trial text might report dosages of ‘10 or 20 mg’, where the ‘mg’ unit is implied for the number 10, making it a ‘dosage’ entity. 46 , 48 , 90
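A single illustrative rule for this implied-unit pattern, written as a regular expression, is sketched below; real rule-based systems in the included publications use far larger and more carefully derived rule sets, and the example text is invented.

    # Sketch: one rule for the implied-unit problem ('10 or 20 mg'), where the unit
    # of the first number is only implied. Illustrative only; not a complete rule set.
    import re

    pattern = re.compile(r"(\d+(?:\.\d+)?)\s*(?:or|to|-)\s*(\d+(?:\.\d+)?)\s*(mg|ml|g)\b")

    text = "Patients received 10 or 20 mg of the study drug daily."
    match = pattern.search(text)
    if match:
        low, high, unit = match.groups()
        print([f"{low} {unit}", f"{high} {unit}"])  # ['10 mg', '20 mg']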

Implied information was also mentioned as a problem in the field of relation-extraction, with Nye et al. (2021) 63 discussing the importance of correctly matching and resolving intervention arm names that only imply which intervention was used. Examples are using ‘Group 1’ instead of referring to the actual intervention name, or implying effects across a group of outcomes, such as all adverse events. 63

4.3.4 Domain adaptation and comparability

Because of the wide variation across medical domains, there is no guarantee that a data extraction system developed on one dataset will automatically produce reliable results on datasets from other domains. The hyperparameter configuration or rule base used to build a system may not yield comparable results in a different medical domain. 40 , 68 Scores might therefore not be similar between different datasets, especially for rule-based classifiers, 80 when datasets are small, 35 , 49 when the structure and distribution of the class of interest varies, 40 or when annotation guidelines vary. 85 A model for outcome detection, for example, might learn to be biased towards outcomes frequently appearing in a certain domain, such as chemotherapy-related outcomes in cancer literature, or it might favour outcomes that are more frequent in older trial texts if the underlying training data are old or outdated. 73 Another caveat mentioned by Refs. 59 , 85 is that the size of the label space must be considered when comparing scores, as models that normalise to specific concepts rather than detecting entities tend to have lower precision, recall, and F1 scores.

Comparability between models might be further decreased by comparing results between publications that use relaxed vs. strict evaluation approaches for token-based evaluation, 34 or publications that use the same dataset but with different random seeds to split training and testing data. 33 , 118

Therefore, several publications argue that a larger number of benchmarking datasets with standardised train, development, and evaluation splits, together with standardised evaluation scripts, could increase the comparability between published systems. 46 , 92 , 114

4.3.5 Computational or system architecture implications

Computational cost and scalability were described in two publications. 53 , 114 Problems within the system, e.g., encoding 97 or PDF extraction errors, 75 lead to problems downstream and ultimately result in bias, favouring articles from big publishers with better-formatted data. 75 Similarly, grammar and parsing errors, such as part-of-speech tagging and/or chunking errors (Refs. 76 , 80 , 90 , among others) or faulty parse trees, 78 can reduce a system’s performance if it relies on access to correct grammatical structure. In terms of system evaluation, 10-fold cross-validation causes high variance in results when using small datasets such as NICTA-PIBOSO. 54 , 85 Reference 104 described that this problem needs to be addressed through stratification of the positive instances of each class within folds.

4.3.6 Missing information in text or knowledge base

Information in text can be incomplete. 114 For example, the number of patients in a study might not be explicitly reported, 76 or abstracts may lack information about study design and methods, especially unstructured abstracts and older trial texts. 91 , 96 In some cases, abstracts can be missing entirely. These problems can sometimes be solved by using full texts as input. 71 , 87

Where a model relies on features from, e.g., MetaMap, missing UMLS coverage causes errors. 72 , 76 This also applies to models, such as CNNs, that assign specific concepts, where unseen entities are not defined in the output label space. 59

In terms of automatic summarisation and relation extraction it was also cautioned that relying on abstracts will lead to a low sensitivity of retrieved information, as not all information of interest may be reported in sufficient detail to allow comprehensive summaries or statements about relationships between interventions and outcomes to be made. 60 , 63

4.3.7 Practical and other implications

In contrast to the problem of missing information, too much information can also have practical implications. For instance, there are often multiple sentences with a given label, of which only one is ‘key’: descriptions of inclusion and exclusion criteria often span multiple sentences, and it can be challenging for a data extraction system to work out which sentence is the key one. The same problem applies to methods that select and rank the top-n sentences for each data extraction target, where a system risks returning too many, or too few, results depending on the number of sentences that are kept. 46
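A minimal sketch of this top-n selection strategy follows (the sentence texts and probabilities are invented; a real system would obtain the probabilities from a trained classifier):

    # Sketch: keep only the n sentences ranked highest by a classifier's predicted
    # probability for one data extraction target; n controls the trade-off between
    # including too much text and missing the key sentence.
    def top_n_sentences(sentences, probabilities, n=2):
        ranked = sorted(zip(sentences, probabilities), key=lambda pair: pair[1], reverse=True)
        return [sentence for sentence, _ in ranked[:n]]

    sentences = ["Adults aged 18-65 were eligible.",
                 "Exclusion criteria included pregnancy.",
                 "The trial was conducted in 2019."]
    probabilities = [0.91, 0.74, 0.12]  # P(sentence describes eligibility), invented
    print(top_n_sentences(sentences, probabilities, n=2))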

Low recall is an important practical implication, 53 especially for entities that appear infrequently in the training data and are therefore not well represented in the training process of the classification system. 48 In other words, an entity such as ‘Race’ might not be labelled very often in a training corpus, and is then systematically missed or wrongly classified when the data extraction system is used on new texts. Therefore, human involvement is needed, 86 and scores need to be improved. 41 It is challenging to find the best set of hyperparameters 106 and to adjust precision and recall trade-offs to maximise the utility of a system while being transparent about the number of data points that might be missed when increasing system precision to save work for a human reviewer. 69 , 95 , 101

For relation extraction or normalisation tasks, error propagation was noted as a practical issue in joint models. 63 , 67 To extract relations, a model first needs to identify entities, and another model is then applied in a pipeline to classify the relationships. Neither human nor machine can instantly perform perfect data extraction or labelling, 37 and errors made in earlier classification steps can therefore be carried forward and accumulate.

For relation extraction and summarisation, the importance of qualitative real-world evaluation was discussed. This is because it is unclear how well summarisation metrics relate to the actual usefulness or completeness of a summary, and because challenges such as contradictions or negations within and between trial texts need to be evaluated within the context of a review and not just a single trial. 61 , 63

A separate practical caveat with relation-extraction models is longer dependencies, i.e. bigger gaps between the salient pieces of information in text that lead to a conclusion. These increase the complexity of the task and thus reduce performance. 99

In their statement on ethical concerns, DeYoung et al. (2021) 61 mention that these complex relation and summarisation models can produce correct-looking but factually incorrect statements, and are therefore risky to apply in practice without extra caution.

4.4. Explainability and interpretability of data extraction systems

The neural networks or machine-learning models in the publications included in this review learn to classify and extract data by adjusting numerical weights and by applying mathematical functions to these sets of weights. The decision-making process behind the classification of a sentence or an entity is therefore comparable to a black box, because it is very hard to comprehend how or why a model made its predictions. A recent comment published in Nature has called for a more in-depth analysis and explanation of the decision-making process within neural networks. 117 Ultimately, hidden tendencies in the training data can influence the decision-making processes of a data extraction model in a non-transparent way. Many of the examples discussed in the comment are related to healthcare, where machine learning and neural networks are broadly applied in practice despite a very limited understanding of their inherent biases. 117

A deeper understanding of what occurs between data entry and the point of prediction can benefit the general performance of a system, because it uncovers shortcomings in the training process. These shortcomings can be related to the composition of the training data (e.g. overrepresentation or underrepresentation of groups), the general system architecture, or other unintended tendencies in a system’s predictions. 119 A small number of included publications in the base-review (N = 10) discussed issues related to hidden variables as part of an extensive error analysis (see Section 3.4.5.2). The composition of training and testing data was described in most publications, but we found no publication that specifically addressed interpretability or explainability.

4.5. Availability of corpora, and copyright issues

There are several corpora described in the literature, many with manual gold-standard labels (see Table 4 ). There are still publications with custom, unshared datasets. Possible reasons for this are concerns over copyright, or malfunctioning download links from websites mentioned in older publications. Ideally, data extraction algorithms should be evaluated on different datasets in order to detect over-fitting, to test how the systems react to data from different domains and different annotators, and to enable the comparison of systems in a reliable way. As a supplement to this manuscript, we have collected links to datasets in Table 4 and encourage researchers to share their automatically or manually annotated labels and texts so that other researchers may use them for development and evaluation of new data extraction systems.

4.6. Latest developments and upcoming research

This is a timely LSR update, since it has a cut-off just before the arrival of a new generation of tools: generative ‘Large Language Models’ (LLMs), such as ChatGPT from OpenAI, based on the GPT-3.5 model [ 1 ]. 120 As such, it may mark the state of the field at the end of a challenging period of investigation, in which the limitations of recent machine-learning approaches were apparent and the automation of data extraction remained quite limited.

The arrival of transformer-based methods in 2018 marked the last big change in the field, as documented by this LSR. The methods in our included papers only rarely progressed beyond the original BERT architecture, 14 varying mostly in the datasets used for pre-training. A few used models only marginally different from BERT, such as RoBERTa with its altered pre-training strategy. 121 However, Figure 13 (reproduced from Yang et al. (2023) 122 ) shows that there has been a vast amount of NLP research and whole families of new methods that have not yet been tested on our target task of data extraction. For example, within the new GPT-4 technical report, OpenAI describe increased performance, predictability, and closer adherence to the expected behaviour of their model, 123 and some other (open-source) LLMs shown in Figure 13 may have similar potential.


Early evaluations of LLMs suggest that these models may produce a step-change in both the accuracy and the efficiency of automated information extraction, while in parallel reducing the need for expensive labelled training data: a pre-print by Shaib et al. 124 describes a new dataset [ 2 ] and an evaluation of GPT-3-produced RCT summaries; 124 Wadhwa, DeYoung, et al. 125 use the Evidence Inference dataset and its annotations of RCT intervention-comparator-outcome triplets to train and evaluate BRAN, DyGIE++, ELI, BART, T5-base, and several FLAN models in a pre-print; 125 and in a separate pre-print Wadhwa, Amir, et al. 126 used the Flan-T5 and GPT-3 models to extract and predict relations between drugs and adverse events. 126 In the near future we expect the number of studies in this review to grow, as more evaluations of LLMs move into pre-print or published literature.
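As a purely speculative sketch of the prompt-based extraction these pre-prints explore (no specific provider client is assumed; call_llm is a hypothetical placeholder and the abstract text is invented), a zero-shot PICO extraction prompt might look like this:

    # Speculative sketch: zero-shot PICO extraction with a generative LLM.
    # 'call_llm' is a hypothetical placeholder for whichever provider client is used;
    # it is not an API from any specific library.
    import json

    ABSTRACT = "We randomised 220 adults with chronic migraine to drug A or placebo..."

    PROMPT = (
        "Extract the population, interventions, comparators and outcomes from the "
        "trial abstract below. Answer with a JSON object using the keys "
        '"population", "interventions", "comparators" and "outcomes".\n\n'
        f"Abstract: {ABSTRACT}"
    )

    def call_llm(prompt):
        """Hypothetical wrapper around a provider-specific LLM client."""
        raise NotImplementedError("plug in the provider-specific client call here")

    # response = call_llm(PROMPT)
    # print(json.loads(response))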

4.6.1 Limitations of this living review

This review focused on data extraction from reports of clinical trials and epidemiological research. It mostly covers data extraction from reports of randomised controlled trials, where interventions and comparators are usually extracted jointly, and only a very small fraction of the evidence addresses other important study types (e.g., diagnostic accuracy studies). During screening we excluded all publications related to clinical data (such as electronic health records) and publications extracting disease, population, or intervention data from genetic and biological research. There is a wealth of evidence and potential training and evaluation data in these publications, but it was not feasible to include them in this living review.

5. Conclusion

This LSR presents an overview of the data-extraction literature of interest to different types of systematic review. We included a broad evidence base of publications describing data extraction for interventional systematic reviews (focusing on P, IC, and O classes and RCT data), and a very small number of publications extracting epidemiological and diagnostic accuracy data. Within the LSR update we identified research trends such as the emergence of relation-extraction methods, the current dominance of transformer neural networks, and increased code and dataset availability between 2020 and 2022. However, the number of accessible tools that can help systematic reviewers with data extraction is still very low. Currently, only around one in ten publications is linked to a usable tool or describes an ongoing implementation.

The data extraction algorithms, and the characteristics of the data they were trained and evaluated on, were well reported. Around three in ten publications made their datasets available to the public, and more than half of all included publications reported training or evaluating on these datasets. Unfortunately, the use of different evaluation scripts, different methods for averaging results, and custom adaptations to datasets still makes it difficult to draw conclusions about which system performs best. Additionally, data extraction is a very hard task: when done manually it usually requires conflict resolution between expert systematic reviewers, which in turn complicates the creation of the gold standards used for training and evaluating the algorithms in this review.

We listed many ongoing challenges in the field of data extraction for systematic review (semi) automation, including ambiguity in clinical trial texts, incomplete data, and previously unseen data. With this living review we aim to review the literature continuously as it becomes available. Therefore, the most current review version, along with the number of abstracts screened and included after the publication of this review iteration, is available on our website.

Data availability

Author contributions.

LS: Conceptualization, Investigation, Methodology, Software, Visualization, Writing – Original Draft Preparation

ANFM: Data Curation, Investigation, Writing – Review & Editing

RE: Data Curation, Investigation, Writing – Review & Editing

BKO: Conceptualization, Investigation, Methodology, Software, Writing – Review & Editing

JT: Conceptualization, Investigation, Methodology, Writing – Review & Editing

JPTH: Conceptualization, Funding Acquisition, Investigation, Methodology, Writing – Review & Editing

Acknowledgements

We thank Luke McGuinness for his contribution to the base-review, specifically the LSR web-app programming, screening, conflict-resolution, and his feedback to the base-review manuscript.

We thank Patrick O’Driscoll for his help with checking data, counts, and wording in the manuscript and the appendix.

We thank Sarah Dawson for developing and evaluating the search strategy, and for providing advice on databases to search for this review. Many thanks also to Alexandra McAleenan and Vincent Cheng for providing valuable feedback on this review and its protocol.

[version 2; peer review: 3 approved]

Funding Statement

We acknowledge funding from NIHR (LAM through NIHR Doctoral Research Fellowship (DRF-2018-11-ST2-048), and LS through NIHR Systematic Reviews Fellowship (RM-SR-2017-09-028)). LAM is a member of the MRC Integrative Epidemiology Unit at the University of Bristol. The views expressed in this article are those of the authors and do not necessarily represent those of the NHS, the NIHR, MRC, or the Department of Health and Social Care.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

1 https://openai.com/blog/chatgpt (last accessed 22/05/2023).

2 https://github.com/cshaib/summarizing-medical-evidence (last accessed 22/05/2022).


Reviewer response for version 1

Carmen Amezcua-Prieto

1 Department of Preventive Medicine and Public Health, University of Granada, Granada, Spain

Data extraction in a systematic review is a hard and time-consuming task. The (semi) automation of data extraction in systematic reviews is an advantage for researchers and ultimately for evidence-based clinical practice. This living systematic review examines published approaches for data extraction from reports of clinical studies published up to a cut-off date of 22 April 2020. The authors included more than 50 publications in this version of their review that addressed extraction of data from abstracts, while fewer (26%) used full texts. They identified more publications describing data extraction for interventional reviews. Publications extracting epidemiological or diagnostic accuracy data were limited.

Main important issues have been addressed in the systematic review:

  • This living systematic review has been justified. The field of systematic review (semi) automation is evolving rapidly along with advances in language processing, machine learning, and deep learning.
  • Searching and update schedules have been clearly defined, shown in Figure 1.
  • There are sufficient details of the methods and analysis provided to allow replication.
  • Conclusions are drawn adequately supported by the results presented in the review.

A minor consideration is suggested:

  •  An incomplete sentence in Methods: ‘We included reports published from 2005 until the present day, similar to’.

Are the rationale for, and objectives of, the Systematic Review clearly stated?

Is the statistical analysis and its interpretation appropriate?

Not applicable

Have the search and update schedule been clearly defined and justified?

Is the living method justified?

Are sufficient details of the methods and analysis provided to allow replication by others?

Are the conclusions drawn adequately supported by the results presented in the review?

Reviewer Expertise:

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Kathryn A. Kaiser

1 Department of Health Behavior, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, USA

The authors have undertaken and documented the steps taken to monitor an area of research methods that is important to many around the world by use of a “living systematic review”. The specific focus is on automated or semi-automated data extraction around the PICO structure often used in biomedicine, whether it be to summarize a body of literature narratively or using meta-analysis techniques. A significant irony about the body of papers included in this review is that there is a large amount of missingness related to the performance of such methods. Those who conduct systematic reviews know well the degree of missing information sought to summarize a group of studies.

Readers who will be most interested in this ongoing work can maintain an eye on the authors’ progress in identifying activities in this space. It is not clear, however, how long the funding will support this effort or how long the authors will remain engaged in advancing this project. The data represented in this paper does not give readers confidence that the community is approaching acceptable methods that are superior to other, less automated methods (the latter of which are not well-discussed).

Some aspects of the paper would benefit from additional detail (in no particular order of importance):

  • The end game for the tracking of this area of literature is not explicitly described in the abstract, nor is it discussed to a great extent at the end of the paper. Much of the results presented do not paint a bright future for this area of research as conditions presently are. While the aim is laid out well in section 1.2, the large amount of missing performance data (reported to be 87%) is unable to address the “Is it reliable?” question. One might suspect that if particularly stellar performance were demonstrated by a project, those data would be prominently advertised. Thus, the yet-to-be-done contacting of authors step would be enlightening if either performance data can be obtained, or if authors remain silent on that request. This follow-up task will be a major point of interest for many who will follow updates to this paper. It is likely that the particular research context (e.g. see Pham  et al ., 2021 1 ) will have a large degree of influence on the performance metrics to be had if they can be determined.
  • The description of how the 17 “Key items of interest” were determined and if there is a plan to put these forth as methodological guidelines or a reporting checklist would be helpful. Either of these would help to advance the field further.
  • On Page 5, the exclusions listed have the use of pre-processing of text, yet the results discuss the many papers that appear to have used that in their methods. Perhaps this is a deviation from the original protocol after the review began (an understandable decision)?
  • In section 2.4 about searching Pubmed, can the authors clarify that the Pubmed 2.0 API or GUI will be used to access candidate literature?
  • Also relevant to section 2.4 on searching, since GITHUB is so popular, might this also be a fruitful place to routinely search?
  • Clarification of the ability to obtain cited software packages (whether for no cost or at some cost) would be helpful.
  • Figure 3 explanation of PICO is a typo – “PCIO”.
  • Table 5 is shown before Table 1. Please check and correct flow and references to table numbers (5,1,4,2,3 is the flow now).
  • One of the major limitations to be noted is the unfortunate issue of the lack of specific data in abstracts about interventions and comparators.

Systematic reviews in biomedicine topics, issues with time and effort required to complete reviews with generally available tools.

Emma McFarlane

1 Centre for Guidelines, National Institute for Health and Care Excellence, London, UK

This is a living systematic review of published methods and tools aimed at automating or semi-automating the process of data extraction in the context of a systematic review. Automating data extraction is an area of interest in evidence-based medicine.

The methods are sufficiently described to be replicated, but further details of analysis to determine the items of interest would be helpful to link into the results. Additionally, the authors may want to consider commenting on the topic areas covered by the included studies and whether that has an impact on any of the metrics measured. 

In the discussion section, it's interesting that fewer studies extracted data from the full text. Could the authors comment on the implications of this in terms of using tools in a live review as it's not common to manually only extract data from an abstract.

Evidence-based medicine, systematic reviews, automation techniques.


Extracting information from selected studies

Steps for collecting and combining data include:

  • Plan out your synthesis methods
  • List all your data elements to be extracted
  • Develop your data collection methods
  • Develop your form and collect your data
  • Complete your synthesis and explain your findings; tell your story

Three examples of data extraction forms are below:

  • Data Extraction Form Example (suitable for small-scale literature review of a few dozen studies) This example was used to gather data for a poster reporting a literature review of studies of interventions to increase Emergency Department throughput (view poster (PDF) ).
  • Data Extraction Form for the Cochrane Review Group (uncoded & used to extract fine detail/many variables) (PDF) This is one example of a form, illustrating the thoroughness of the Cochrane research methodology. You could devise a simpler one-page data extraction form for a more basic literature review.
  • Coded data extraction form (fillable form fields that can be computerized for data analysis) See Table 1 of Brown, Upchurch & Acton (2013)

Combining data in a table

There are various potential data collection tools that could be used to combine your data.




Data Collection Methods

This InfoGuide assists students starting their research proposal and literature review.


Quantitative and qualitative data can be collected using various methods. It is important to use a data collection method to help answer your research question(s).

Many data collection methods can be either qualitative or quantitative. For example, in surveys, observational studies or case studies, your data can be represented as numbers (e.g., using rating scales or counting frequencies) or as words (e.g., with open-ended questions or descriptions of what you observe).

However, some methods are more commonly used in one type or the other.

Quantitative & Qualitative Data Collection Methods




3.6 Data Collection Methods

A mixed set of data collection methods was planned and developed to ensure that the required data were collected for what needed to be measured and known. This naturally led to how such information would be extracted (the methods) and what would be done with it (the analysis). Over and above this, the validity and reliability of the entire process needed to be sound. A mixture of quantitative and qualitative instruments was employed depending on the medium and nature of the data being collected. The proposed methods for this empirical research study were the following five data collection (DC) instruments:

DC 1 – Pre-test using a survey tool for data collection;

This quantitative instrument was designed to extract information about the participants prior to their exposure to the proposed system. The survey tool itself was adopted and adapted from the validated Technology Acceptance Model (TAM) instrument (Davis, 1993), whereby the attitudes and level of technology acceptance of the participants were captured. This model was chosen because of its popularity and the frequency of its use in such situations (Ma & Liu, 2004; Kim & Chang, 2007; Yarbrough & Smith, 2007). The technology acceptance model is intention-based and was developed specifically for explaining user acceptance of computer technology. Masrom (2007) makes extensive use of the TAM within an e-learning environment to investigate the effects of user acceptance and attitudes on the use of e-learning within an application.

The pre-test survey employed, shown in Appendix E, contained twenty-four (24) items which are subdivided into eight (8) sections. Apart from the basic personal information, qualifications and work related details, the sections included personal use of technology, and the participants’ views about e-learning courses, e-learning design and online assessment. The data collected in this pre-test were employed as a baseline to create a realistic contrast with the post-test together with additional data that was collected.

DC 2 – Intermediate participant opinion using dichotomous questions;

Quick participant opinions were recorded at different intervals during the delivery mode under investigation. Simple questions, similar to the ones shown in Appendix G, were purposely designed to minimise interruption of the flow of instruction while gathering brief yet frequent input from the participants. This approach was adapted from, and is similar to, the momentary time sampling methodology (Meany-Daboul, Roscoe, Bourret, & Ahearn, 2007). The data collected were meant to record the participants' evolving sentiments and opinions that could not be captured with the other data collection methods adopted.
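A small Python sketch of how such dichotomous, interval-based responses might be aggregated per time point; the participant IDs, intervals, and answers are invented for illustration and do not reproduce the Appendix G prompts.

```python
# Sketch only: aggregating dichotomous prompts recorded at fixed intervals
# during a session. Participant IDs, intervals, and answers are invented.

from collections import defaultdict

# (participant_id, interval_in_minutes, answered_yes)
records = [
    ("P01", 10, True), ("P01", 20, True), ("P01", 30, False),
    ("P02", 10, False), ("P02", 20, True), ("P02", 30, True),
]

counts = defaultdict(lambda: [0, 0])  # interval -> [yes, total]
for _, interval, answered_yes in records:
    counts[interval][1] += 1
    if answered_yes:
        counts[interval][0] += 1

for interval in sorted(counts):
    yes, total = counts[interval]
    print(f"{interval} min: {yes}/{total} positive ({100 * yes / total:.0f}%)")
```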

DC 3 – Intermediate assessment using a questionnaire as an evaluation tool;

A series of assessments, administered after the completion of each part of the course, was employed to collect participants' scores on their understanding of the presented content. This was in no way meant to measure the ability or academic achievement of the participants, but merely to complement and support the results from the other methods employed. Similar studies (Neuhauser, 2010; Joy & Garcia, 2000; Domenic, 2010) have employed this instrument to assist in measuring learning effectiveness. In this study the participants' scores from the various assessments were used to shed additional light on the hypothesis stated in Chapter 1. The questionnaire, shown in Appendix H, was entirely based on the content; it was distributed in printed format in the face-to-face mode of delivery, while in the other two modalities it was made available as a soft copy at the end of the static and dynamic sessions.

DC 4 – Final experience evaluation using a number of focus group sessions;

The final data collection method was employed at the very end of the empirical study, in the form of focus group sessions. The purpose of these sessions was to understand further the participants' perceptions of, and attitudes towards, the proposed dynamic learning environment. A semi-structured focus group tool (Appendix I) was used with randomly selected participants in three (3) groups of 8 to 10 participants. The structure and content of these sessions were adopted and adapted from Wilkinson (2012) and were meant to mainly discuss the following questions:

  • Q1: Which modality was most effective and functional?
  • Q2: Were the personal interests effective, and did they add value to the experience?
  • Q3: Which mode or combination of modes would you prefer/recommend?

DC 5 – Post-test using a survey tool for data collection;

This final quantitative instrument was designed in tandem with the DC 1 pre-test survey to extract information from the participants following their experience of, and exposure to, the intelligent personal learning environment. The TAM, introduced earlier, was again adapted and employed as an instrument to design and develop this data collection survey tool. The post-test survey, shown in Appendix F, contains thirty (30) items concentrated within five (5) sections. The first section covered basic participant information, while the remaining sections tackled the main issues under investigation, namely the effectiveness of the proposed medium in comparison with the other two modes and any changes in views about e-learning, its design, and online assessment.
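A minimal Python sketch of pairing pre-test (DC 1) and post-test (DC 5) scores by participant to inspect change; the participant IDs and scores are hypothetical and not taken from the study.

```python
# Illustrative sketch: pairing pre-test and post-test scores by participant
# and reporting the change. IDs and scores are invented.

pre = {"P01": 3.4, "P02": 4.1, "P03": 2.9}
post = {"P01": 4.2, "P02": 4.0, "P03": 3.6}

changes = {pid: post[pid] - pre[pid] for pid in pre if pid in post}
for pid, delta in changes.items():
    print(f"{pid}: change = {delta:+.1f}")
print(f"Mean change: {sum(changes.values()) / len(changes):+.2f}")
```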

Table 3.2 – Data collection legend & summary (data collection method: what is being measured)

DC1 – Pre-test survey: technology use, e-learning familiarity
DC2 – Intermediate opinion: transitional participant attitudes
DC3 – Intermediate questionnaire: assessment of content acquisition
DC4 – Final focus groups: overall attitudes of experience

The data collection methods have been tabulated in Table 3.2 above together with a short summary of each to serve as a legend. The same DC methods can be seen within the overall data collection plan in Figure 3.1 below. This helps to visualise the administration of the different methods during the empirical study in chronological order.


German primary care data collection projects: a scoping review (BMJ Open, Volume 14, Issue 2)

Konstantin Moser (1, 2) (http://orcid.org/0000-0001-8645-4350), Janka Massag (2), Thomas Frese (1), Rafael Mikolajczyk (2), Jan Christoph (3), Joshi Pushpa (1), Johanna Straube (1), Susanne Unverzagt (1) (http://orcid.org/0000-0002-0108-0415)

1 Medical Faculty of the Martin Luther University Halle-Wittenberg, Institute of General Practice and Family Medicine, Halle, Germany
2 Medical Faculty of the Martin Luther University Halle-Wittenberg, Institute of Medical Epidemiology, Biometrics, and Informatics, Halle, Germany
3 Medical Faculty of the Martin Luther University Halle-Wittenberg, Junior Research Group (Bio-)Medical Data Science, Halle, Germany

Correspondence to Konstantin Moser; konstantin_moser{at}web.de

Background The widespread use of electronic health records (EHRs) has led to a growing number of large routine primary care data collection projects globally, making these records a valuable resource for health services and epidemiological and clinical research. This scoping review aims to comprehensively assess and compare strengths and limitations of all German primary care data collection projects and relevant research publications that extract data directly from practice management systems (PMS).

Methods A literature search was conducted in electronic databases in May 2021 and in June 2022. The search string included terms related to general practice, routine data, and Germany. The retrieved studies were classified as applied or methodological studies and categorised by type of research, subject area, sample of publications, disease category, or main medication analysed.

Results A total of 962 references were identified, with 241 studies included from six German projects in which databases are populated by EHRs from PMS. The projects exhibited significant heterogeneity in terms of size, data collection methods, and variables collected. The majority of the applied studies (n = 205, 85%) originated from one database with a primary focus on pharmacoepidemiological topics (n = 127, 52%) including prescription patterns (n = 68, 28%) and studies about treatment outcomes, compliance, and treatment effectiveness (n = 34, 14%). Epidemiological studies (n = 77, 32%) mainly focused on incidence and prevalence studies (n = 41, 17%) and risk and comorbidity analysis studies (n = 31, 12%). Only 10% (n = 23) of studies were in the field of health services research, such as hospitalisation.

Conclusion The development and durability of primary care data collection projects in Germany is hindered by insufficient public funding, technical issues of data extraction, and strict data protection regulations. There is a need for further research and collaboration to improve the usability of EHRs for health services and research.

  • Primary Care
  • GENERAL MEDICINE (see Internal Medicine)
  • Health informatics

Data availability statement

All data relevant to the study are included in the article or uploaded as supplementary information.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:  http://creativecommons.org/licenses/by-nc/4.0/ .

https://doi.org/10.1136/bmjopen-2023-074566


Strengths and limitations of this study

This scoping review is the first in the literature to conduct a comprehensive literature search in electronic databases, spanning two time points (May 2021 and June 2022). It ensures a thorough overview of primary care data collection projects and research publications in Germany dedicated to extracting data from practice management systems.

The inclusion of 241 studies from six German projects enabled a detailed analysis, revealing significant heterogeneity in terms of project size, data collection methods, and variables collected. This provided valuable insights into the diversity of approaches.

The study effectively identifies and discusses key challenges in primary care data collection projects in Germany, such as the extraction of data from diverse practice management systems, the lack of standardised interfaces, and issues related to data quality.

A limitation of the study is the development of an independent classification system due to the absence of a common method in the literature. This poses a challenge as some publications may have been excluded or misclassified, impacting the accuracy of the analysis.

Introduction

Electronic health records (EHRs) serve as a comprehensive record of a patient’s health information, capturing crucial details from each medical visit. 1 While originally created for clinical purposes, EHRs are now widely utilised in epidemiological and clinical research, as well as for improving healthcare services. 2 3 Currently, about 36 large routine primary care data collection projects exist globally, in which EHRs are directly collected from practice management systems (PMS). These projects, which allow millions of patients to anonymously contribute data for health sciences, are mainly carried out in English-speaking (UK, USA, and Canada) and European countries. The success and longevity of these projects is influenced by factors such as strong academic and governmental support as well as the use of comprehensive technical facilities for data extraction and analysis. 4

In Germany, the analysis of EHRs in primary care is largely based on health insurance data rather than primary care data collection projects. 5 However, health insurance data are primarily recorded for accounting purposes and lack valuable information such as clinical input data, reasons for encounters, or diagnostic procedures. 6 Additionally, privately insured patients, which account for approximately 13% of the German population, are often not included in such health insurance databases, potentially leading to selection bias. 7

Primary care in Germany is predominantly delivered by general practitioners (GPs), but may also encompass any outpatient physician accessible without a referral, irrespective of their specialty. 8 Between 2002 and 2010, the Federal Ministry of Education and Research (Bundesministerium für Bildung und Forschung (BMBF)) recognised the importance of family medicine in the improvement of healthcare services and research. 9 During this time, the ministry also funded two primary care data collection projects, MedVip (Medizinische Versorgung in Praxen) and CONTENT (CONTinuous morbidity registration Epidemiologic NeTwork). 10 However, these projects ended due to limited funding and technical challenges, and a standardised interface for extracting EHRs is still lacking, even though there are over 132 different PMS available on the German market. 11–13 Despite these challenges, the use of EHRs in outpatient care continues to grow due to the vast amount of data available. In 2020, for example, approximately 688 million outpatient cases were treated by 161 400 outpatient physicians in Germany, representing a ‘real-world data treasure’. 14

EHRs have evolved from their initial purpose of billing to becoming a valuable tool for epidemiological and clinical research. 2 3 The increasing functionality and quality of EHRs have made them an affordable and accessible data source. 15 In clinical research, for example, EHRs can facilitate patient identification and recruitment, assess study feasibility, and streamline data collection at baseline and follow-up. 15–17

The aim of this scoping review is to identify and describe all primary care data collection projects and research publications in Germany dedicated to extracting data from PMS. This might facilitate further research by describing the methodological problems, amplifying possible solutions, and proposing the potential of the projects to inform health policy and practice. To this end, we chose to conduct a scoping review, since our goal is to identify and map study characteristics and not to answer a clinically meaningful question. 18

Search strategy

This scoping review follows the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist. 19 In order to identify studies relevant for our research question, we explored two electronic databases, Medline (via OVID) and LIVIVO, the latter of which is a German database for life sciences. The search was conducted in May 2021 and updated in June 2022, searching for all records until this time point without any time restrictions. The search string combined the terms ‘general practice’ with synonyms like ‘family physician’ as well as ‘routine data’. Other terms such as ‘electronic health record’ or ‘Germany’ were included to cover all relevant aspects of our research questions. For each keyword, relevant Medical Subject Headings terms were identified for the Medline exploration. The LIVIVO search was conducted in German with the equivalent terms. When relevant projects were identified, the project names were added to the search string to find further publications. In addition, we searched the project websites and contacted the project’s principal investigators (PIs) using a comprehensive checklist that included a list of publications retrieved by the search to identify any missing project information that was not publicly available. With encouragement from the PI of the IQVIA disease analyser (DA), we also conducted a search on PubMed (National Library of Medicine) using the keywords ‘DA’ and ‘Germany’ to gather all relevant publications from this database, since a considerable number of publications were identified through the PubMed search which were not previously found through the Ovid Medline search. The complete search strategy can be found in the online supplemental table S1 .
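As a rough illustration of how such concept blocks can be combined into a Boolean query, the Python sketch below assembles a search string from the terms named above; the term lists are deliberately abbreviated, and the authors' complete strategy remains the one in online supplemental table S1.

```python
# Sketch: combining concept blocks into a Boolean search string.
# The term lists are abbreviated examples, not the full strategy.

concept_blocks = {
    "general_practice": ["general practice", "family physician"],
    "routine_data": ["routine data", "electronic health record"],
    "setting": ["Germany"],
}

# OR the synonyms within each concept, then AND the concepts together.
query = " AND ".join(
    "(" + " OR ".join(f'"{term}"' for term in terms) + ")"
    for terms in concept_blocks.values()
)
print(query)
# ("general practice" OR "family physician") AND ("routine data" OR ...) AND ("Germany")
```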


Inclusion/exclusion criteria

Abstracts, titles, and subsequently full texts were reviewed independently by three researchers (KM, JM, and JS) and checked for eligibility. All disagreements were resolved through consensus; if no consensus was reached, a fourth researcher was consulted (SU). We used two online tools for the screening process: Rayyan (https://www.rayyan.ai/) for title and abstract screening and Covidence (https://www.covidence.org/) for full-text screening. Both tools allow each reviewer to mark a text as included, excluded, or undecided and to add a reason for this decision. Decisions were blinded until both reviewers had completed the screening, after which they could see whether they agreed or disagreed on the inclusion of a text.
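For illustration only (this is not the Rayyan or Covidence workflow itself), a short Python sketch of comparing two reviewers' unblinded decisions, flagging disagreements for consensus discussion and reporting simple percentage agreement; the record IDs and decisions are hypothetical.

```python
# Sketch only: comparing two reviewers' screening decisions after unblinding.
# Record IDs and decisions are invented.

reviewer_a = {"rec1": "include", "rec2": "exclude", "rec3": "include", "rec4": "exclude"}
reviewer_b = {"rec1": "include", "rec2": "include", "rec3": "include", "rec4": "exclude"}

conflicts = [rid for rid in reviewer_a if reviewer_a[rid] != reviewer_b[rid]]
agreement = 1 - len(conflicts) / len(reviewer_a)

print(f"Percentage agreement: {agreement:.0%}")
print("To resolve by consensus (or a fourth reviewer):", conflicts)
```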

Studies were eligible if they met the following inclusion criteria: (1) the study population consisted of patients who received treatment from primary care physicians but could also include patients who received care from other specialists who were not considered primary care physicians; (2) use of EHR data that were initially entered into the PMS independently of primary or secondary purpose; (3) EHR data were extracted from PMS and transferred to a database; (4) studies utilising data collected as part of routine clinical practice; and (5) full-text publications in English or German language. The following were excluded: (1) health research studies using primary data, health insurance data, and data from disease registries; (2) conference contributions and publications in languages other than English or German; and (3) studies collecting supplementary data beyond usual care.

Data management

The identified references were downloaded into the reference manager EndNote V.X7.8 where potential duplicates were identified with the respective tool. Duplicates that were not identified by the automated tool due to different spelling were removed manually during the review process.
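A minimal Python sketch of the kind of title-normalisation check that can catch duplicates an automated tool misses because of spelling or punctuation differences; the records are invented and the approach is illustrative rather than the reviewers' actual procedure.

```python
# Sketch: flagging near-duplicate records by normalised title.
# The records below are invented examples.

import re

def normalise(title):
    """Lower-case and strip punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

records = [
    {"id": 1, "title": "German primary care data collection projects: a scoping review"},
    {"id": 2, "title": "German Primary Care Data-Collection Projects - A Scoping Review"},
    {"id": 3, "title": "Routine data in general practice"},
]

seen = {}
for rec in records:
    key = normalise(rec["title"])
    if key in seen:
        print(f"Record {rec['id']} looks like a duplicate of record {seen[key]}")
    else:
        seen[key] = rec["id"]
```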

Data extraction

Information from the retrieved publications was extracted by KM, JM, and JS. JM and JS each reviewed the included publications using a standardised data extraction template created with Microsoft Word. The data were double-checked by KM and entered in online supplemental table S2. We extracted information on the German primary care data collection projects, including general information, data collection methods, data evaluation, and recruitment strategies; we classified studies as applied or methodological and categorised the type of research by subject area, sample of publications, disease category, or main medication analysed.

Results

We identified 962 references, screened 291 of those as potentially eligible studies, and included 241 studies conducted with data from six German projects in which databases are filled with EHRs from PMS (see figure 1).


Figure 1: PRISMA 2020 flow diagram for new systematic reviews which included searches of databases only. PRISMA, Preferred Reporting Items for Systematic reviews and Meta-Analyses.

Database characteristics

Four out of six primary healthcare data collection projects are currently active and two have been completed ( table 1 ). This overview is sorted by the year in which data collection began.


Table 1: Overview of German primary care data collection projects

The IQVIA DA is the only project of the six identified by this review that is exclusively funded by the pharmaceutical sector. It specialises in pharmacoepidemiological research and is used as an information system for federal health monitoring. 20 Currently, it includes patient records from around 2815 practices, mostly general practices but also other specialties such as cardiology, dermatology, and paediatrics, which are not linked across practices. 21 With approximately 34 million cases included, it is the largest German primary care data collection database and is considered nationally representative. 22

The other five primary care data collection databases are publicly funded and organised by local academic research groups. The main financiers are the BMBF and the German Research Foundation. The MedVip project aimed to realise the first solutions for using routine data documentation in the general practice setting. At its peak, data sets from approximately 153 000 patients in a total of 165 practices were extracted from 21 different PMS providers. The CONTENT project was based on the International Classification of Primary Care (ICPC) of episodes of care as the primary classification system. 23 24 Up to 23 practices provided data, including approximately 200 000 cases. The project ended because of very high costs and organisational demands. BeoNet (Beobachtungspraxen-Netzwerk)-Hannover was integrated within the German Centre for Lung Research with an initial focus on lung diseases and collects data from approximately 16 practices; currently, the database includes 343 796 cases. 25 RADARplus (Routine Anonymised Data for Advanced Health Services Research plus) aims to develop the necessary infrastructure and technologies, including electronic consent management required by German data protection regulations, and collects data from seven practices, including 100 pseudonymised cases. 21 BeoNet-Halle is the most recent database and includes anonymised as well as linked pseudonymised data sets from general practices and other types of practices in Germany. 26 The database includes 71 911 anonymised and 471 pseudonymised data sets from five practices in the Saxony-Anhalt region.

The frequency of data collection by the projects ranges from weekly (BeoNet-Hannover), monthly (DA and BeoNet-Halle), and quarterly (CONTENT), to time points without a fixed interval (MedVip and RADARplus). It is crucial to note that in principle the data export interval can be configured to any desired value, including very short intervals.

Data collection methods

Anonymised data are exclusively collected by the DA and BeoNet-Halle, whereas all other projects except for the DA obtain pseudonymised data. In order to collect pseudonymised data, BeoNet-Hannover, RADARplus, and BeoNet-Halle have instituted informed consent procedures ( table 2 ). RADARplus and BeoNet-Halle employ an adapted version of the modular Broad Consent, as per the template provided by the Medical Informatics Initiative (MII), allowing for the transfer of identifiable data in compliance with data protection regulations. 27 Using Broad Consent, patients have the option to provide consent for various modules, encompassing data collection, processing, scientific utilisation of their patient data, as well as the transfer and scientific use of their health insurance data, along with the possibility for further contact. BeoNet-Hannover has introduced a study-specific consent procedure. The projects exhibit significant heterogeneity in their workflows related to data collection, transfer, and storage, including the integration of trust offices in the cases of RADARplus and BeoNet-Halle.

Three projects (MedVip, BeoNet-Hannover, and RADARplus) extract data using a universal interface, the Behandlungsdatentransfer (BDT). BDT was implemented by the central institute for statutory healthcare to support data exchange between different PMS. The MedVip project showed the feasibility of data extraction using BDT across its various implementations by different software providers. However, its use partly requires PMS providers to assist onsite in extracting the requested data. Despite several updates to the BDT interface, it may still cause inadequate data quality when extracting data from different PMS. Since June 2021, an 'archive and exchange interface', which is intended to replace BDT, has been mandatory in PMS. It is based on the interoperability standard HL7 FHIR (Health Level Seven International Fast Healthcare Interoperability Resources), which has gained widespread adoption in the healthcare industry and facilitates interoperability.
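To illustrate the kind of standardised access FHIR enables, here is a generic Python sketch of querying a FHIR REST endpoint for Patient resources; the base URL is a placeholder and this is not any project's actual export pipeline.

```python
# A generic sketch of reading resources over a FHIR REST interface.
# The base URL is a hypothetical placeholder.

import requests

FHIR_BASE = "https://example.org/fhir"  # placeholder endpoint

def fetch_patients(max_results=5):
    """Search the Patient endpoint and return the resources from the Bundle."""
    response = requests.get(
        f"{FHIR_BASE}/Patient",
        params={"_count": max_results},
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    )
    response.raise_for_status()
    bundle = response.json()
    return [entry["resource"] for entry in bundle.get("entry", [])]

if __name__ == "__main__":
    for patient in fetch_patients():
        print(patient.get("id"), patient.get("birthDate"))
```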

The other projects (DA, CONTENT, and BeoNet-Halle) developed their own software solutions to extract predefined data sets. The CONTENT project developed a tailored data extraction software and a modular ICPC software. For BeoNet-Halle, specific exporting modules allow anonymised or pseudonymised data extraction depending on a patient’s consent status.

Some projects (DA, CONTENT, BeoNet-Hannover, and BeoNet-Halle) provide training on how to use the software, and others provide onsite support to extract data (MedVip and RADARplus). For most projects, data can be uploaded manually by the physician or the research team. Some projects (BeoNet-Hannover and BeoNet-Halle) have also implemented automatic upload to a secure network within the database location. Data validation and integrity checks are run in all projects before data are uploaded to the database and subsequently to an analysis server that can be accessed by researchers. This process is generally facilitated by a database administrator.

Anonymisation and pseudonymisation processes

We could not find publications on specific details of the anonymisation process by the DA. In the case of MedVip, a custom Java programme in doctors' offices removes identifiable BDT fields, except for the patient ID, and encrypts BDT files. For CONTENT, the patient's name is replaced with a unique case number before export. BeoNet-Hannover generates automatic pseudonyms from patient IDs for studies, and data are pseudonymised again before leaving the practice, with data processing managed by the data manager. RADARplus follows a privacy-by-design approach, manually documenting consented patients and separating identifiable and medical data. Identifiable data are encrypted and replaced by a pseudonym provided by a trusted third party. For anonymised data, BeoNet-Halle assigns unique 35-character keys to patients, created from the patient ID, which change from export to export. For pseudonymised data, it creates temporary pseudonyms for consenting patients that are sent to a trusted third party for generating permanent pseudonyms, allowing data linkage across multiple sources.
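For illustration, a short Python sketch of two generic building blocks behind such schemes: per-export keys that change whenever a new salt is drawn (preventing linkage), and stable pseudonyms derived with a secret held by a trusted third party (allowing linkage). This is an assumption-laden sketch, not the implementation used by any of the projects described.

```python
# Illustrative only: generic keyed-hash approaches to anonymisation and
# pseudonymisation, not the projects' actual implementations.

import hashlib
import hmac
import secrets

def per_export_key(patient_id, export_salt):
    """Key changes whenever a new export salt is drawn, preventing linkage."""
    return hmac.new(export_salt, patient_id.encode(), hashlib.sha256).hexdigest()[:35]

def stable_pseudonym(patient_id, ttp_secret):
    """Same input and secret always yield the same pseudonym, allowing linkage."""
    return hmac.new(ttp_secret, patient_id.encode(), hashlib.sha256).hexdigest()

export_salt = secrets.token_bytes(32)  # drawn anew for each export
ttp_secret = b"held-only-by-the-trusted-third-party"  # illustrative constant

print(per_export_key("patient-0042", export_salt))
print(stable_pseudonym("patient-0042", ttp_secret))
```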

Collected variables and data quality

Most projects collect data that are part of health insurance records, encompassing basic patient demographics, diagnoses, drug prescriptions, and billing codes ( online supplemental table S3 ). 28

Laboratory tests, such as HbA1c, and health utilisation variables such as referrals or hospitalisations are documented by most projects. Additionally, the majority of ongoing projects (DA, MedVip, BeoNet-Hannover, and BeoNet-Halle) capture essential vital signs, including blood pressure, height, weight, and Body Mass Index, as well as lifestyle-related factors such as smoking status and allergies (DA, BeoNet-Hannover, and BeoNet-Halle). Sociodemographic variables (eg, education and income), number of children, and substance abuse are not systematically recorded in German PMS; such variables may be entered into structured or free-text fields. To fill this information gap, some projects (BeoNet-Hannover and BeoNet-Halle) distribute standardised questionnaires to patients who have consented.

As for the extraction of free-text data, limited information is available, except for BeoNet-Halle, which extracts pseudonymised free text. The MedVip project has partially extracted free-text data due to the absence of data protection regulations during that period.

The CONTENT project can be considered the only project that attempted to improve data quality at the point of data entry. Several quality circles were implemented, and proposed solutions were discussed on a regular basis, including training on ICPC-2 coding.

Recruitment strategies

Strategies to recruit GPs and other specialists comprise various financial and non-financial incentives (online supplemental table S4). The DA provides financial incentives of an undisclosed amount, supports practices in using the exporting software, and provides quarterly feedback reports. Its popularity also appears to contribute to its recruitment success.

Publicly funded projects use only some of these recruitment strategies along their project trajectories. Snowball recruitment is usually implemented at the start of the project to get it running. There have been some ‘cold’ acquisition attempts (MedVip and RADARplus) including the distribution of circulars, but they were associated with low recruitment rates. Some projects use regular or one-time financial incentives (MedVip, BeoNet-Halle, and CONTENT), while others claim to support practices with establishing a research infrastructure (BeoNet-Hannover, BeoNet-Halle, and CONTENT). Regular feedback reports are provided by some projects (DA, MedVip, CONTENT, and BeoNet-Halle). CONTENT particularly targeted practices with long-term commitment and willingness to code with ICPC. It is also the only project that developed a protected access area where the patients’ own data could be accessed. BeoNet-Halle and RADARplus favour practices that integrate consent management.

Applications of the databases

A total of 241 publications were identified ( online supplemental table S2 ). Most articles described applied studies (n=230, 95%) and 5% (n=11) of the articles described methods ( figure 2 ). Methodological studies mainly deal with project-specific issues, such as project descriptions or data collection issues; 30% (n=72) of the studies were industry-funded, while only 9% (n=21) of the publications used data from more than one database. The mean time of recruitment varied from study to study. However, the overall mean time of recruitment across all studies was 7 years in the DA, 4.75 years in MedVip, and 3 years in CONTENT.

Figure 2: Flow diagram of the extracted articles and their arrangement.

Of the 241 publications included, 85% (n=205) were contributed by the DA ( figure 2 and online supplemental table S2 ).

In total, 52% (n=127) of the studies deal with pharmacoepidemiological topics including prescription patterns (n=68, 28%) and studies on treatment outcomes, compliance, and treatment effectiveness (n=34, 14%). Epidemiological studies (n=77, 32%) mainly focused on incidence and prevalence (n=41, 17%) along with risk and comorbidity analysis (n=31, 12%). A small proportion included health services research studies (n=23, 10%) with topics such as hospitalisation.

Discussion

The findings presented in the results section shed light on the landscape of primary care data collection projects in Germany, where databases are populated with EHRs from PMS. In this discussion, we delve into the implications of these findings, drawing comparisons with other countries and addressing key challenges and potential avenues for improvement.

In Germany, one notable challenge arises from the extraction of data from more than 132 different PMS, which currently hinders the uniform consolidation of data for research purposes. 13 29 Despite the existence of mandatory exchange interfaces, such as BDT or the ‘archive and exchange’ interface, no discernible improvements in the ambulatory sector have manifested in this regard. In contrast, the hospital sector boasts well-established standardised interfaces for research. 11 The development of standardised interfaces has proven to be a complex and collaborative effort, engaging various stakeholders, including patients, PMS vendors, standards organisations, and academic institutions. 3 30 Further complicating the situation is the resistance of PMS vendors to external software modifications. 31

One challenge associated with extracting data from diverse PMS lies in the limited control over the data collection process, thereby compromising the assurance of data quality. 32 To illustrate, data may be gathered as part of routine patient care, encompassing information inputted by physicians for primary purposes such as patient care, billing processes, or documentation requirements. Alternatively, data may be collected supplementary to routine care, serving secondary purposes like research, quality improvement, or public health initiatives. The differentiation between these purposes becomes challenging due to the integration of data collected through a complex array of modules and interfaces from various PMS. This complexity is particularly pronounced in cases involving industrial funding, which was evident in a significant proportion of studies (n=72, 30%). It underscores the critical need for transparency and rigour in such studies to maintain scientific integrity, particularly in light of the increasing use of real-world evidence in early benefit assessments of novel therapies. 33

Another challenge in data quality is a predominance of free-text entries in PMS, making complete anonymisation a complex task. 34 EHRs encompass structured data, which is organised, quantifiable, and easily analysable due to its mostly standardised format, and unstructured data, including free text and images. A comprehensive understanding of a patient's health history necessitates the integration of both types. 3 Collaboration with the MII has introduced a Broad Consent concept that allows patients to agree to the scientific use of their data, potentially easing the extraction of free-text information in the future. 27 Informed consent therefore emerges as a vital component for advancing EHR-based research.

The limited progress and short duration of publicly funded projects, as observed in this review, may be attributed to insufficient funding and inadequate government support. Recent projects have received notably meagre funding, especially when compared with government-supported initiatives in other nations. 4 The initial projects highlighted in this review enjoyed comparatively substantial public funding, indicating the need for sustained investment in healthcare research. 9 The private funding of the DA by pharmaceutical companies appears to be a contributing factor to its success.

The results indicate that Germany ranks 16th out of 20 analysed countries in terms of EHR implementation. This ranking places Germany behind countries like Sweden, Estonia, and the UK, which have emerged as pioneers in EHR adoption and integration. 35 36 Therefore, we conclude that the rapid digitalisation of healthcare systems has significantly influenced the development of primary care data collection initiatives. 4 It is crucial to examine the reasons behind this disparity in EHR adoption and its impact on healthcare research.

Sweden, for example, has efficiently collected and managed patient data through an integrated system including a unique personal identity number, focusing on patient consent and supporting research and quality enhancement. 37 Estonia adopted a comprehensive eHealth strategy in 2008, utilising incentives and penalties to establish a cohesive eHealth infrastructure. 38 The UK’s Clinical Practice Research Datalink stands out as a prominent real-world research service that has contributed data to over 3000 publications, surpassing all German projects combined by more than 12-fold. 39 The success of these initiatives can be attributed to factors like opt-out regulations, data quality improvements, and the engagement of healthcare providers. 40

Our findings, as presented in the results section, also hold implications for the use of databases filled with EHR in healthcare and epidemiological research. The results highlight the versatility of such databases in addressing a wide range of healthcare-related questions, such as evaluating prescription patterns, treatment outcomes, and analysing incidence, prevalence, and comorbidities.

Limitations

One major limitation of this scoping review is incomplete information about some projects. Some information, especially from the DA, is not publicly available for reasons of company confidentiality. A second limitation was mainly identified during the phase of classifying the publications: we developed our own classification system, as we were not able to identify a common classification method in the literature. Some publications listed on the projects' homepages were not included in our final analysis, because we were not able to verify that they included data from PMS. Of the 241 included publications, we retrieved full texts for 210 papers and extracted information from the abstracts for the remaining 31. Many studies did not describe their study design in detail and might have been classified incorrectly. Finally, we used only three literature databases for our investigation, including one database (LIVIVO) that also covers grey literature.

Conclusion

The development and sustainability of German primary care data collection projects face several challenges, including limited funding, technical issues related to data extraction, and stringent data protection regulations. Interfaces for data exchange and research remain inadequately implemented. Furthermore, questions regarding data quality and the broad utilisation of ambulatory EHRs for research persist, largely due to the significant amount of information entered in free-text fields. This data can only be partially extracted with patients' informed consent, thereby constraining the range of research publications, primarily focusing on (pharmaco)epidemiological topics derived from a privately funded database. As a result, Germany has yet to fully realise the potential for research made possible by EHRs.

Ethics statements

Patient consent for publication

Not required.

Ethics approval

Acknowledgments

For proofreading we acknowledge Dawn M Bielawski, PhD.

  • The Office of the National Coordinator for Health Information Technology (ONC)
  • Levaux HP ,
  • Becnel LB , et al
  • Gentil M-L ,
  • Fiquet L , et al
  • Hoffmann F , et al
  • Schubert I ,
  • Küpper-Nybelen J , et al
  • Akmatov MK ,
  • Holstiege J ,
  • Steffen A , et al
  • Spranger A ,
  • Achstetter K , et al
  • Gemeinsamer B
  • Bundesministerium für Forschung und Bildung (BMBF)
  • Kassenärztliche B
  • Schwartz BS ,
  • Stewart WF , et al
  • Thomas H , et al
  • Blomster JI ,
  • Curtis LH , et al
  • Peters MDJ ,
  • Stern C , et al
  • Tricco AC ,
  • Zarin W , et al
  • Gesundheitsberichterstattung des B
  • Heinemann S , et al
  • Schröder-Bernhardi D
  • Koerner T ,
  • Rosemann T , et al
  • World Organization of Family Doctors’ (WONCA)
  • Lingner H ,
  • Wacker M , et al
  • Mikolajczyk R ,
  • Bauer A , et al
  • Medizininformatikinitiative
  • Alzahrani MY , et al
  • Geyer S , et al
  • Thronicke A
  • Martin-Sanchez FJ ,
  • Aguiar-Pulido V ,
  • Lopez-Campos GH , et al
  • Bertram N ,
  • Püschner F ,
  • Gonçalves ASO , et al
  • Amelung VBS ,
  • Chase DP , et al
  • Kajbjer K ,
  • Nordberg R ,
  • World Health Organization
  • Clinical Practice Research Datalink (CPRD)

Supplementary materials

Supplementary data.

This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

  • Data supplement 1
  • Data supplement 2

Contributors KM, JM, and SU developed the methodological concept. KM, JM, and JS screened study titles and abstracts and examined the full texts for inclusion. KM, JM, JS, JC, TF, and JP developed the figures and tables. KM, JM, SU, TF, RM, JP, and JC participated in reading and approving the final manuscript. KM assumes responsibility as the guarantor for overseeing the entirety of the study's content.

Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests None declared.

Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

Provenance and peer review Not commissioned; externally peer reviewed.

Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.


