Case-Control Study

Study Design

Now that you have thoroughly assessed the situation, you have enough information to generate some hypotheses. The two suspected causal agents of the outbreak of Susser Syndrome are Quench-It and EnduroBrick. Use the case-control method to design a study that will allow you to compare the exposures to these products among your cases of Susser Syndrome and healthy controls of your choice. From all of your class work, you know that you want your hypotheses to be as explicit and detailed as possible.

1. Based on the information you gathered, which of the following hypotheses is the most appropriate for your case-control study?

  • (1) Those who consumed EnduroBrick are more likely to be diagnosed with Susser Syndrome than those who did not; (2) Those who consumed Quench-It are more likely to be diagnosed with Susser Syndrome than those who did not consume Quench-It.
  • Individuals diagnosed with Susser Syndrome are more likely to have been members of the Superfit Fitness Center than individuals without Susser Syndrome.
  • Individuals diagnosed with Susser Syndrome are likely to be exposed to a variety of different exposures than are individuals not diagnosed with Susser Syndrome

Now that you have hypotheses, the next step is to prepare the case definition. This requires an understanding of how Susser Syndrome is diagnosed. The more certain you are about the diagnosis, the less error you will introduce into your study by incorrectly specifying cases. Based on information from the EDOH website, you decide that your case definition will be based on a clinical diagnosis of Susser Syndrome.

After you establish your case definition, you need to decide on the population from which the cases for your study will be obtained. Since the majority of cases from the recent outbreak were active members of the Superfit Fitness Center, you decide to base your study on this population.

Next you need to decide how you will classify your cases and controls based on exposure status. Remember, we are actually operating under two hypotheses here, each with its own unique exposure variable. Scientists working on the possible causal connection between consumption of EnduroBrick or Quench-It and the development of Susser Syndrome suggest that both exposures may have an induction time of at least 6 months. Under this hypothesis, any cases of Susser Syndrome that occurred within 6 months of initial consumption of either EnduroBrick or Quench-It could not plausibly have been caused by the exposure. Thus, you stipulate that at least 6 months must have elapsed since the initial exposure before an individual is considered "exposed".
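As a rough illustration of this classification rule (not part of the module), a sketch might look like the following; the six-month cut-off expressed in days and the date fields are assumptions made for the example.

```python
# Hypothetical sketch of the exposure classification described above: a subject
# counts as "exposed" only if at least 6 months elapsed between initial
# consumption of EnduroBrick or Quench-It and the reference date
# (diagnosis date for cases, interview date for controls).
from datetime import date
from typing import Optional

INDUCTION_DAYS = 183  # roughly six months

def is_exposed(first_consumption: Optional[date], reference_date: date) -> bool:
    """Return True if the subject is classified as exposed."""
    if first_consumption is None:        # never consumed the product
        return False
    return (reference_date - first_consumption).days >= INDUCTION_DAYS

# Example: first drank Quench-It on 2003-01-15, diagnosed on 2003-10-01 -> exposed
print(is_exposed(date(2003, 1, 15), date(2003, 10, 1)))  # True
```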

Once all of these decisions have been made, it is time to create appropriate eligibility criteria for your cases and controls.

2. Which of the following do you think are the best eligibility criteria for the cases? [Aschengrau & Seage, pp. 239-243]

  • Cases should have been members of the Superfit Center in the last two years for at least 6 months (total) and consumed either EnduroBrick or Quench-It.
  • Cases should be correctly diagnosed with Susser Syndrome and be employed at Glop Industries.
  • Cases should be correctly diagnosed with Susser Syndrome and have been members of the Superfit Fitness Center for at least 6 months in the last two years.

Now you need to decide who is eligible to be a control.

You recall from your wonderful learning experience in P6400 that valid controls in a case-control study are individuals that, had they acquired the disease under investigation, would have ended up as cases in your study. The best way to ensure this is to sample controls from the same population that gave rise to the cases. To ensure that the controls accurately represent a sample of the distribution of exposure in the population giving rise to the cases, they should be sampled independently of exposure status.

3. Which of the following do you think are the best eligibility criteria for the controls?

  • Controls should be residents of Epiville who have not been diagnosed with Susser Syndrome.
  • Controls should be members of the Superfit Fitness Center who have been diagnosed with Susser Syndrome but have not consumed either EnduroBrick or Quench-It.
  • Controls should have been members of the Superfit Center for at least 6 months in the last 2 years and should not have been diagnosed with Susser Syndrome at the time of data collection.

Now that the eligibility criteria have been set, you must determine the specifics of the case-control study design.

How many cases and controls should you recruit?

The answer to this question obviously depends on your time and resources. However, an equally important consideration is how much power you want the study to have. Conventionally, we want a study to have at least 80 percent power to detect a significant difference between the groups. Generally, if the study has less than 80 percent power, we say it is underpowered. This does not mean our results are incorrect; but if we observe a non-significant result in an underpowered study, we may not be able to tell whether this is because there truly is no association or because the study lacked the power to detect one.


After crunching the numbers, you determine that the study will require the following sample size to achieve the desired power of 80 percent:

  • Number of cases: 112
  • Number of controls: 224
  • Total number of subjects: 336
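These numbers come from a power calculation. The following is a minimal sketch of such a calculation (not the module's own calculator), using the standard normal-approximation formula for comparing two proportions with r controls per case; the exposure prevalences used below (40 percent among cases, 25 percent among controls) are assumptions for illustration, since the module does not state the inputs it used.

```python
# Sketch of a sample-size calculation for a case-control study: number of cases
# needed to detect a difference in exposure prevalence between cases and
# controls, with r controls per case, a two-sided alpha and a target power.
from math import ceil, sqrt
from statistics import NormalDist

def n_cases_required(p_cases: float, p_controls: float, r: float = 2.0,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value, e.g. 1.96
    z_b = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    p_bar = (p_cases + r * p_controls) / (1 + r)
    numerator = (z_a * sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
                 + z_b * sqrt(p_cases * (1 - p_cases)
                              + p_controls * (1 - p_controls) / r)) ** 2
    return ceil(numerator / (p_cases - p_controls) ** 2)

n1 = n_cases_required(0.40, 0.25, r=2.0)  # assumed exposure prevalences
print(n1, int(2 * n1))                    # number of cases and controls
```

With these assumed inputs the formula happens to return 112 cases and 224 controls, consistent with the figures above; different assumed prevalences would of course give different numbers.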

Bear in mind that the study is voluntary. Subjects, even when eligible, are in no way required to participate. Furthermore, subjects may drop out of the study before completion, further decreasing your sample size. Study participation depends in large part on the methods of recruitment. In-person recruitment is generally regarded as the most effective, followed by telephone interviews, and then mail invitations. The participation rate that you expect to achieve, given your method of recruitment, will help you to calculate approximately how many individuals you will need to contact in order to meet your sample size.

Should you recruit cases and controls simultaneously, or all cases first and then all controls?

Website URL: http://epiville.ccnmtl.columbia.edu/

  • Open access
  • Published: 07 January 2022

Identification of causal effects in case-control studies

  • Bas B. L. Penning de Vries & Rolf H. H. Groenwold

BMC Medical Research Methodology, volume 22, Article number: 7 (2022)


Case-control designs are an important yet commonly misunderstood tool in the epidemiologist’s arsenal for causal inference. We reconsider classical concepts, assumptions and principles and explore when the results of case-control studies can be endowed a causal interpretation.

We establish how, and under which conditions, various causal estimands relating to intention-to-treat or per-protocol effects can be identified based on the data that are collected under popular sampling schemes (case-base, survivor, and risk-set sampling, with or without matching). We present a concise summary of our identification results that link the estimands to the (distribution of the) available data and articulate under which conditions these links hold.

The modern epidemiologist’s arsenal for causal inference is well-suited to make transparent for case-control designs what assumptions are necessary or sufficient to endow the respective study results with a causal interpretation and, in turn, help resolve or prevent misunderstanding. Our approach may inform future research on different estimands, other variations of the case-control design or settings with additional complexities.


Introduction

In causal inference, it is important that the causal question of interest is unambiguously articulated [ 1 ]. The causal question should dictate, and therefore be at the start of, investigation. When the target causal quantity, the estimand, is made explicit, one can start to question how it relates to the available data distribution and, as such, form a basis for estimation with finite samples from this distribution.

The counterfactual framework offers a language rich enough to articulate a wide variety of causal claims that can be expressed as what-if statements [ 1 ]. Another, albeit closely related, approach to causal inference is target trial emulation, an explicit effort to mitigate departures from a study (the ‘target trial’) that, if carried out, would enable one to readily answer the causal what-if question of interest [ 2 ]. While it may be too impractical or unethical to implement, making explicit what a target trial looks like has particular value in communicating the inferential goal and offers a reference against which to compare studies that have been or are to be conducted.

The counterfactual framework and emulation approach have become increasingly popular in observational cohort studies. Case-control studies, however, have not yet enjoyed this trend. A notable exception is given by Dickerman et al. [ 3 ], who recently outlined an application of trial emulation with case-control designs to statin use and colorectal cancer.

In this paper, we give an overview of how observational data obtained with case-control designs can be used to identify a number of causal estimands and, in doing so, recast historical case-control concepts, assumptions and principles in a modern and formal framework.

Preliminaries

Identification versus estimation

An estimand is said to be identifiable if the distribution of the available data is compatible with exactly one value of the estimand, or therefore, if the estimand can be expressed as a functional of the available data distribution. Identifiability is a relative notion as it depends on which data are available as well as on the assumptions one is willing to make. Identification forms a basis for estimation with finite samples from the available data distribution [ 4 ]. Once the estimand has been made explicit and an identifying functional established, estimation is a purely statistical problem. While the identifying functional will often naturally translate into a plug-in estimator, there is, however, generally more than one way to translate an identifiability result into an estimator and different estimators may have important differences in their statistical properties. Moreover, while the estimand may be identifiable, there need not exist an estimator with the desired properties (see e.g. [ 5 ]). Here, our focus is on identification, so that the purely statistical issues of the next step in causal inference, estimation, can be momentarily put aside.

Case-control study nested in cohort study

To facilitate understanding, it is useful to consider every case-control study as being “nested” within a cohort study. A case-control study could be considered as a cohort study with missingness governed by the control sampling scheme. Therefore, when the observed data distribution of a case-control study is compatible with exactly one value of a given estimand, then so is the available or observed data distribution of the underlying cohort study. In other words, identifiability of an estimand with a case-control study implies identifiability of the estimand with the cohort study within which it is nested (conceptually). The converse is not evident and in fact may not be true. In this paper, the focus is on sets of conditions or assumptions that are sufficient for identifiability in case-control studies.

Set-up of underlying cohort study

Consider a time-varying exposure \(A_k\) that can take one of two levels, 0 or 1, at \(K\) successive time points \(t_k\) (\(k=0,1,\dots,K-1\)), where \(t_0\) denotes baseline (cohort entry or time zero). Study participants are followed over time until they sustain the event of interest or the administrative study end \(t_K\), whichever comes first. We denote by \(T\) the time elapsed from baseline until the event of interest and let \(Y_k = I(T < t_k)\) indicate whether the event has occurred by \(t_k\). The lengths between the time points are typically fixed at a constant (e.g., one day, week, or month). Figure 1 depicts twelve equally spaced time points over, say, twelve months with several possible courses of follow-up of an individual. As the figure illustrates, individuals can switch between exposure levels during follow-up, as in any truly observational study. Apart from exposure and outcome data, we also consider a (vector of) covariate(s) \(L_k\), which describes time-fixed individual characteristics or time-varying characteristics typically relating to a time window just before exposure or non-exposure at \(t_k\), \(k=0,1,\dots,K-1\).

Figure 1. Illustration of possible courses of follow-up of an individual for a study with baseline \(t_0\) and administrative study end \(t_{12}\). Solid bullets indicate ‘exposed’; empty bullets indicate ‘not exposed’. The incident event of interest is represented by a cross.
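The discrete-time structure described above can be made concrete with a purely illustrative simulation; the covariate distribution, exposure-switching probabilities and per-interval event hazard below are arbitrary choices, not taken from the paper.

```python
# Illustrative simulation of the data structure: per subject, a baseline
# covariate L0, exposures A_0,...,A_{K-1}, and event indicators Y_0,...,Y_K
# with Y_k = I(T < t_k) (monotone: once the event occurs, it stays occurred).
import random

K = 12  # e.g. twelve monthly time points

def simulate_subject(rng: random.Random) -> dict:
    L0 = rng.gauss(0.0, 1.0)                    # baseline covariate
    A = [int(rng.random() < 0.5)]               # baseline exposure A_0
    Y = [0]                                     # event-free at baseline, Y_0 = 0
    for k in range(1, K + 1):
        event = Y[-1] == 1 or rng.random() < 0.05   # small per-interval hazard
        Y.append(int(event))
        if k < K:
            # exposure may switch between time points, as in Fig. 1
            stay_exposed = 0.7 if A[-1] == 1 else 0.3
            A.append(int(rng.random() < stay_exposed))
    return {"L0": L0, "A": A, "Y": Y}           # len(A) == K, len(Y) == K + 1

cohort = [simulate_subject(random.Random(seed)) for seed in range(1000)]
```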

Causal contrasts

Although there are many possible contrasts, particularly with time-varying exposures, for simplicity we consider only two pairs of mutually exclusive interventions: (1) setting baseline exposure \(A_0\) to 1 versus 0; and (2) setting all of \(A_0, A_1, \dots, A_{K-1}\) to 1 (‘always exposed’) versus all to 0 (‘never exposed’). For \(a=0,1\), we let the counterfactual outcome \(Y_k(a)\) indicate whether the event has occurred by \(t_k\) under the baseline-only intervention that sets \(A_0\) to \(a\). By convention, we write \(\overline{1}=(1,1,\dots,1)\) and \(\overline{0}=(0,0,\dots,0)\), and let \(Y_k(\overline{1})\) and \(Y_k(\overline{0})\) indicate whether the event has occurred by \(t_k\) under the intervention that sets all elements of \((A_0, A_1, \dots, A_{K-1})\) to 1 and all to 0, respectively. Further details about the notation and set-up are given in Supplementary Appendix A.

Case-control sampling

The fact that each time-specific exposure variable can take only one value per time point means that at most one counterfactual outcome can be observed per individual. This type of missingness is common to all studies. Relative to the cohort studies within which they are nested, case-control studies have additional missingness, which is governed by the control sampling scheme. In this paper, we focus on three well-known sampling schemes: case-base sampling, survivor sampling, and risk-set sampling. The next sections give an overview of conditions under which intention-to-treat and always-versus-never-exposed per-protocol effects can be identified with the data that are observed under these sampling schemes.

Case-control studies without matching

Table  1 summarises a number of identification results for case-control studies without matching. Each result consists of one of the three aforementioned sampling schemes, an estimand, a set of assumptions, and an identification strategy. Under the conditions of the “Sampling scheme” and “Assumptions” columns, an identifying functional of the estimand of the “Estimand” column is obtained by following the steps of the “Identification strategy” column. More formal statements and proofs are given in Supplementary Appendix B.

In all case-control studies that we consider in this section, cases are compared with controls with regard to their exposure status via an odds ratio, even when an effect measure other than the odds ratio is targeted. An individual qualifies as a case if and only if they sustain the event of interest by the administrative study end (i.e., \(Y_K=1\)) and adhered to one of the protocols of interest until the time of the incident event. In Fig. 1, the individual represented by row 1 is therefore regarded as a case (an exposed case in particular) in our investigation of intention-to-treat effects but not in that of per-protocol effects. Whether an individual (also) serves as a control depends on the control sampling scheme.

Case-base sampling

The first result in Table 1 describes how to identify the intention-to-treat effect as quantified by the marginal risk ratio

\[ \frac{\Pr(Y_K(1)=1)}{\Pr(Y_K(0)=1)} \]

under case-base sampling. (For identification of a conditional risk ratio, see Theorem 2 of Supplementary Appendix B.) Case-base sampling, also known as case-cohort sampling, means that no individual who is at risk at baseline of sustaining the event of interest is precluded from selection as a control. Selection as a control, \(S\), is further assumed independent of baseline covariate \(L_0\) and exposure \(A_0\). Selecting controls from survivors only (e.g., rows 4, 5, 7 and 9 in Fig. 1) violates this assumption when survival depends on \(L_0\) or \(A_0\).

To account for baseline confounding, inverse probability weights could be derived from control data according to

\[ W = \frac{A_0}{\Pr(A_0=1 \mid L_0, S=1)} + \frac{1-A_0}{1-\Pr(A_0=1 \mid L_0, S=1)}. \qquad (1) \]

We then compute the odds of baseline exposure among cases and among controls in the pseudopopulation that is obtained by weighting everyone by subject-specific values of \(W\). The ratio of these odds coincides with the target risk ratio under the three key identifiability conditions of consistency, baseline conditional exchangeability and positivity [1]. Consistency here means that for \(a=0,1\), \(Y_K(a)=Y_K\) if \(A_0=a\); baseline conditional exchangeability that for \(a=0,1\), \(A_0\) is independent of \(Y_K(a)\) conditional on \(L_0\); and positivity that \(0<\Pr(A_0=1 \mid L_0, S=1)<1\).

The identification result for case-base sampling suggests a plug-in estimator: replace all functionals of the theoretical data distribution with sample analogues. For example, to obtain the weight for an individual with baseline covariate level \(l_0\), replace the theoretical propensity score \(\Pr(A_0=1 \mid L_0=l_0, S=1)\) with an estimate \(\widehat{\Pr}(A_0=1 \mid L_0=l_0, S=1)\) derived from a fitted model (e.g., a logistic regression model) that imposes parametric constraints on the distribution of \(A_0\) given \(L_0\) among the controls.
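As a minimal sketch of this plug-in strategy (an illustration under the stated assumptions, not the authors' code), suppose the sampled data sit in a pandas DataFrame with hypothetical columns 'case' (\(Y_K\)), 'control' (the sampled-control indicator \(S\)), 'A0' and a single covariate 'L0':

```python
# Plug-in estimator for case-base sampling: fit the propensity score among
# controls, form the weights of Eq. 1, and take the ratio of the IP-weighted
# baseline-exposure odds of cases versus controls.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def case_base_risk_ratio(df: pd.DataFrame) -> float:
    controls = df[df["control"] == 1]
    cases = df[df["case"] == 1]

    # Pr(A0 = 1 | L0, S = 1), estimated among the sampled controls only
    ps_model = LogisticRegression().fit(controls[["L0"]], controls["A0"])

    def weights(sub: pd.DataFrame) -> np.ndarray:
        ps = ps_model.predict_proba(sub[["L0"]])[:, 1]
        # W = A0 / ps + (1 - A0) / (1 - ps), i.e. Eq. 1
        return np.where(sub["A0"].to_numpy() == 1, 1.0 / ps, 1.0 / (1.0 - ps))

    def weighted_exposure_odds(sub: pd.DataFrame) -> float:
        w = weights(sub)
        exposed = w[sub["A0"].to_numpy() == 1].sum()
        unexposed = w[sub["A0"].to_numpy() == 0].sum()
        return exposed / unexposed

    # Ratio of IP-weighted baseline-exposure odds, cases versus controls
    return weighted_exposure_odds(cases) / weighted_exposure_odds(controls)
```

Under the identifiability conditions above, this exposure odds ratio targets the marginal risk ratio; in practice, confidence intervals would typically be obtained by bootstrapping the whole procedure.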

Survivor sampling

With survivor (cumulative incidence or exclusive) sampling, a subject is eligible for selection as a control only if they reach the administrative study end event-free. To identify the conditional odds ratio of baseline exposure versus baseline non-exposure given \(L_0\),

\[ \frac{\Pr(Y_K(1)=1 \mid L_0)\,/\,\Pr(Y_K(1)=0 \mid L_0)}{\Pr(Y_K(0)=1 \mid L_0)\,/\,\Pr(Y_K(0)=0 \mid L_0)}, \]

selection as a control, \(S\), is assumed independent of baseline exposure \(A_0\) given \(L_0\) and survival until the end of study (i.e., \(Y_K=0\)).

As is shown in Supplementary Appendix B, Theorem 3, the above odds ratio is identified by the ratio of the baseline exposure odds given \(L_0\) among the cases versus the controls, provided the key identifiability conditions of consistency, baseline conditional exchangeability, and positivity are met.

All estimands in Table 1 describe a marginal effect, except for the odds ratio, which is conditional on baseline covariates \(L_0\). The corresponding marginal odds ratio

\[ \frac{\Pr(Y_K(1)=1)\,/\,\Pr(Y_K(1)=0)}{\Pr(Y_K(0)=1)\,/\,\Pr(Y_K(0)=0)} \]

is not identifiable from the available data distribution under the stated assumptions (see remark to Theorem 3, Supplementary Appendix B). However, approximate identifiability can be achieved by invoking the rare event assumption (or rare disease assumption), in which case the marginal odds ratio approximates the marginal risk ratio.

Risk-set sampling for intention-to-treat effect

With risk-set (or incidence density) sampling, for all time windows \([t_k, t_{k+1})\), \(k=0,\dots,K-1\), every subject who is event-free at \(t_k\) is eligible for selection as a control for the period \([t_k, t_{k+1})\). This means that study participants may be selected as a control more than once.

Consider the intention-to-treat effect quantified by the marginal (discrete-time) hazard ratio (or rate ratio)

\[ \frac{\Pr(Y_{k+1}(1)=1 \mid Y_k(1)=0)}{\Pr(Y_{k+1}(0)=1 \mid Y_k(0)=0)}. \]

(For identification of a conditional hazard ratio, see Theorem 5, Supplementary Appendix B.) For identification of the above marginal hazard ratio under risk-set sampling, it is assumed that selection as a control between \(t_k\) and \(t_{k+1}\), \(S_k\), is independent of the baseline covariates and exposure given eligibility at \(t_k\) (i.e., \(Y_k=0\)). It is also assumed that the sampling probability among those eligible, \(\Pr(S_k=1 \mid Y_k=0)\), is constant across time windows \(k=0,\dots,K-1\). To this end, it suffices that the marginal hazard \(\Pr(Y_{k+1}=1 \mid Y_k=0)\) remains constant across time windows and that every \(k\)th sampling fraction \(\Pr(S_k=1)\) is equal, up to a proportionality constant, to the probability \(\Pr(Y_{k+1}=1, Y_k=0)\) of an incident case in the \(k\)th window (see remark to Theorem 4, Supplementary Appendix B). For practical purposes, this suggests sampling a fixed number of controls for every case from among the set of eligible individuals. To illustrate, consider Fig. 1 and note first of all that the individual represented by row 1 trivially qualifies as a case, because the individual survived until the event occurred. Because the event was sustained between \(t_5\) and \(t_6\), the proposed sampling suggests selecting a fixed number of controls from among those who are eligible at \(t_5\). Thus, rows (and only rows) 4 through 9, as well as row 1 itself, in Fig. 1 qualify for selection as a control for this case. Even though the individual of row 1 is a case, the individual may also be selected as a control when the individuals of rows 2, 3 and 6 (but not 8) sustain the event.
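The control-selection step just described can be sketched as follows; this is purely illustrative (the data structure and the two-controls-per-case default are assumptions, not prescriptions from the paper).

```python
# Risk-set sampling sketch: for each incident case, draw a fixed number of
# controls from everyone still event-free at the start of the case's window.
# `event_window` maps subject id -> index k of the window [t_k, t_{k+1}) in
# which the event occurred, or None if no event by the administrative end.
import random
from typing import Dict, List, Optional, Tuple

def risk_set_sample(event_window: Dict[int, Optional[int]],
                    controls_per_case: int = 2,
                    seed: int = 1) -> List[Tuple[int, List[int]]]:
    rng = random.Random(seed)
    sampled = []
    for case_id, k in sorted(event_window.items()):
        if k is None:
            continue  # never a case
        # Risk set at t_k: subjects whose event (if any) occurs in window k or
        # later, which includes the case itself and later cases.
        risk_set = [i for i, kk in event_window.items() if kk is None or kk >= k]
        controls = rng.sample(risk_set, min(controls_per_case, len(risk_set)))
        sampled.append((case_id, controls))
    return sampled
```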

Once cases and controls are selected, we can start to derive inverse probability weights \(W\) according to Eq. 1, with \(S\) replaced by \(S_0\). We then compute the odds of baseline exposure among cases in the pseudopopulation that is obtained by weighting everyone by \(W\), and the odds of baseline exposure among controls weighted by \(W\) multiplied by the number of times the individual was selected as a control. The ratio of these odds coincides with the target hazard ratio under the three key identifiability conditions of consistency, baseline conditional exchangeability and positivity, together with the assumption that the hazards in the numerator and denominator of the causal hazard ratio are constant across the time windows.

The consistency and exchangeability conditions are here slightly stronger than those of the previous subsections. Specifically, Theorem 4 (Supplementary Appendix B) requires consistency of the form: for all \(k=1,\dots,K\) and \(a=0,1\), \(Y_k(a)=Y_k\) if \(A_0=a\). The exchangeability condition requires, for \(a=0,1\), that conditional on \(L_0\), the counterfactual outcomes \(Y_1(a),\dots,Y_K(a)\) are jointly independent of \(A_0\). The positivity condition takes the same form as in the previous subsections (i.e., \(0<\Pr(A_0=a \mid L_0, S_0=1)<1\)).

Risk-set sampling for per-protocol effect

For the per-protocol effect quantified by the (discrete-time) hazard ratio (or rate ratio)

\[ \frac{\Pr(Y_{k+1}(\overline{1})=1 \mid Y_k(\overline{1})=0)}{\Pr(Y_{k+1}(\overline{0})=1 \mid Y_k(\overline{0})=0)}, \]

eligibility for selection as a control for the period \([t_k, t_{k+1})\) again requires that the respective subject is event-free at \(t_k\) (i.e., \(Y_k=0\)). Selection as a control between \(t_k\) and \(t_{k+1}\), \(S_k\), is further assumed independent of covariate and exposure history up to \(t_k\) given eligibility at \(t_k\) (but see Supplementary Appendix B for a slightly weaker assumption). As for the intention-to-treat effect, it is also assumed that the probability of being selected as a control, \(S_k\), given eligibility is constant across time windows. This assumption is guaranteed to hold if the marginal hazard \(\Pr(Y_{k+1}=1 \mid Y_k=0)\) remains constant across time windows and every \(k\)th sampling fraction \(\Pr(S_k=1)\) is equal, up to a proportionality constant, to the probability of an incident case in the \(k\)th window. Figure 1 shows five incident events, yet only three qualify as a case (rows 2, 3 and 8) when it concerns per-protocol effects. When the first case emerges (row 2), all rows meet the eligibility criterion for selection as a control. When the second emerges, the individual of row 2, who fails to survive event-free until \(t_4\), is precluded as a control. When the case of row 8 emerges, only the individuals of rows 4, 5, 7 and 9 are eligible as controls.

Once cases and controls are selected, we can start to derive time-varying inverse probability weights \(W_k\) from the control data, analogous to Eq. 1 but conditional on covariate and exposure history up to \(t_k\). It is important to note that the weights are derived from control information but are nonetheless used to weight both cases and controls [6]. The denominators of the weights describe the propensity to switch exposure level. However, once the weights are derived, every subject is censored from the time that they fail to adhere to one of the protocols of interest for all downstream analyses. The uncensored exposure levels are therefore constant over time. We then compute the baseline exposure odds among cases, weighted by the weights \(W_k\) corresponding to the interval \([t_k, t_{k+1})\) of the incident event (i.e., \(Y_k=0\), \(Y_{k+1}=1\)), as well as the baseline exposure odds among controls, weighted by \(\sum_{k=0}^{K-1} W_k S_k\), the weighted number of times selected as a control. The ratio of these odds equals the target hazard ratio under the three key identifiability conditions of consistency, sequential conditional exchangeability, and positivity, together with the assumption that the hazards in the numerator and denominator of the causal hazard ratio for the per-protocol effect are constant across the time windows. The consistency, exchangeability and positivity conditions take a somewhat different (stronger) form than in the previous subsections; we refer the reader to Supplementary Appendix A for further details.

Case-control studies with matching

Table 2 gives an overview of identification results for case-control studies with exact pair matching. Formal statements and proofs are given in Supplementary Appendix C, which also includes a generalisation of the results of Table 2 to exact 1-to-\(M\) matching. While the focus in this section is on exact covariate matching, for partial matching we refer the reader to Supplementary Appendix D, where we consider parametric identification by way of conditional logistic regression.

Pair matching involves assigning a single control exposure level, which we denote by \(A'\), to every case. As for case-control studies without matching, in a case-control study with matching an individual qualifies as a case if and only if they sustain the event of interest by the administrative study end (i.e., \(Y_K=1\)) and adhered to one of the protocols of interest until the time of the incident event. How a matched control exposure is assigned is encoded in the sampling scheme and the assumptions of Table 2. For example, for identification of the causal marginal risk ratio under case-base sampling, \(A'\) is sampled from all study participants whose baseline covariate value matches that of the case, independently of the participants' baseline exposure value and whether they survive until the end of study. The matching is exact in the sense that the control exposure information is derived from an individual who has the same value for the baseline covariate as the case.

The identification strategy is the same for all results listed in Table 2. Only the case-control pairs \((A_0, A')\) with discordant exposure values (i.e., \((1,0)\) or \((0,1)\)) are used. Under the stated sampling schemes and assumptions, the respective estimands are identified by the ratio of discordant pairs.
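A minimal sketch of this discordant-pair strategy (illustrative only; the pair representation is an assumption):

```python
# Discordant-pair estimator: only matched pairs (A0, A') with discordant
# exposure values contribute; the estimand is identified by their ratio.
from typing import Iterable, Tuple

def discordant_pair_ratio(pairs: Iterable[Tuple[int, int]]) -> float:
    pairs = list(pairs)
    case_exposed_only = sum(1 for a0, a_prime in pairs if (a0, a_prime) == (1, 0))
    control_exposed_only = sum(1 for a0, a_prime in pairs if (a0, a_prime) == (0, 1))
    # Concordant pairs, (1, 1) and (0, 0), are discarded.
    return case_exposed_only / control_exposed_only

# Example: 40 pairs with only the case exposed and 20 with only the matched
# control exposed give a ratio of 2.0; concordant pairs do not change it.
print(discordant_pair_ratio([(1, 0)] * 40 + [(0, 1)] * 20 + [(1, 1)] * 15))
```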

This paper gives a formal account of how and when causal effects can be identified in case-control studies and, as such, underpins the case-control application of Dickerman et al. [ 3 ]. Like Dickerman et al., we believe that case-control studies should generally be regarded as being nested within cohort studies. This view emphasises that the threats to the validity of cohort studies should also be considered in case-control studies. For example, in case-control applications with risk-set sampling, researchers often consider the covariate and exposure status only at, or just before, the time of the event (for cases) or the time of sampling (for controls). However, where a cohort study would require information on baseline levels or the complete treatment and covariate history of participants, one should suspect that this holds for the nested case-control study too. To gain clarity, we encourage researchers to move away from using person-years, -weeks, or -days (rather than individuals) as the default units of inference [ 7 ], and to realise that inadequately addressed deviations from a target trial may lead to bias (or departure from identifiability), regardless of whether the study that attempts to emulate it is a case-control or a cohort study [ 3 ].

What is meant by a cohort study differs between authors and contexts [ 8 ]. The term ‘cohort’ may refer to either a ‘dynamic population’, or a ‘fixed cohort’, whose “membership is defined in a permanent fashion” and “determined by a single defining event and so becomes permanent” [ 9 ]. While it may sometimes be of interest to ask what would have happened with a dynamic cohort (e.g., the residents of a country) had it been subjected to one treatment protocol versus another, the results in this paper relate to fixed cohorts.

Like the cohort studies within which they are (at least conceptually) nested, case-control studies require an explicit definition of time zero, the time at which a choice is to be made between treatment strategies or protocols of interest [ 3 ]. Given a fixed cohort, time zero is generally determined by the defining event of the cohort (e.g., first diagnosis of a particular disease or having survived one year since diagnosis). This event may occur at different calendar times for different individuals. However, while a fixed cohort may be ‘open’ to new members relative to calendar time, it is always ‘closed’ along the time axis on which all subject-specific time zeros are aligned.

In this paper, time was regarded as discrete. Since we considered arbitrary intervals between time points and because, in real-world studies, time is never measured in a truly continuous fashion, this does not represent an important limitation for practical purposes. It is however important to note that the intervals between interventions and outcome assessments (in a target trial) are an intrinsic part of the estimand that lies at the start of investigation. Careful consideration of time intervals in the design of the conceptual target trial and of the actual cohort or case-control study is therefore warranted.

We emphasize that identification and estimation are distinct steps in causal inference. Although our focus was on the former, identifying functionals often naturally translate into estimators. The task of finding the estimator with the most appealing statistical properties is not necessarily straightforward, however, and is beyond the scope of this paper.

We specifically studied two causal contrasts (i.e., pairs of interventions), one corresponding to intention-to-treat effects and the other to always-versus-never per-protocol effects of a time-varying exposure. There are of course many more causal contrasts, treatment regimes and estimands conceivable that could be of interest. We argue that also for these estimands, researchers should seek to establish identifiability before they select an estimator.

The conditions under which identifiability is to be sought for practical purposes may well include more constraints or obstacles to causal inference, such as additional missingness (e.g., outcome censoring) and measurement error, than we have considered here. While some of our results assume that hazards or hazard ratios remain constant over time, in many cases these are likely time-varying [ 10 , 11 ]. There are also more case-control designs (e.g., the case-crossover design) to consider. These additional complexities and designs are beyond the scope of this paper and represent an interesting direction for future research.

The case-control family of study designs is an important yet often misunderstood tool for identifying causal relations [ 12 – 15 ]. Although there is much to be learned, we believe that the modern arsenal for causal inference, which includes counterfactual thinking, is well-suited to make transparent for these classical epidemiological study designs what assumptions are sufficient or necessary to endow the study results with a causal interpretation and, in turn, help resolve or prevent misunderstanding.

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

References

1. Hernán M, Robins J. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC; 2020.
2. Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. Am J Epidemiol. 2016;183(8):758-64.
3. Dickerman BA, García-Albéniz X, Logan RW, Denaxas S, Hernán MA. Emulating a target trial in case-control designs: an application to statins and colorectal cancer. Int J Epidemiol. 2020;49(5):1637-46.
4. Petersen ML, Van der Laan MJ. Causal models and learning from data: integrating causal modeling and statistical estimation. Epidemiology (Cambridge, Mass). 2014;25(3):418.
5. Maclaren OJ, Nicholson R. Models, identifiability, and estimability in causal inference. In: 38th International Conference on Machine Learning, Workshop on the Neglected Assumptions in Causal Inference. ICML; 2021. https://sites.google.com/view/naci2021/home
6. Robins JM. [Choice as an alternative to control in observational studies]: comment. Stat Sci. 1999;14(3):281-93.
7. Hernán MA. Counterpoint: epidemiology to guide decision-making: moving away from practice-free research. Am J Epidemiol. 2015;182(10):834-39.
8. Vandenbroucke JP, Pearce N. Incidence rates in dynamic populations. Int J Epidemiol. 2012;41(5):1472-79.
9. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology, Third edition. Philadelphia: Lippincott Williams & Wilkins; 2008.
10. Lefebvre G, Angers J-F, Blais L. Estimation of time-dependent rate ratios in case-control studies: comparison of two approaches for exposure assessment. Pharmacoepidemiol Drug Saf. 2006;15(5):304-16.
11. Guess HA. Exposure-time-varying hazard function ratios in case-control studies of drug effects. Pharmacoepidemiol Drug Saf. 2006;15(2):81-92.
12. Knol MJ, Vandenbroucke JP, Scott P, Egger M. What do case-control studies estimate? Survey of methods and assumptions in published case-control research. Am J Epidemiol. 2008;168(9):1073-81.
13. Pearce N. Analysis of matched case-control studies. BMJ. 2016;352:i969.
14. Mansournia MA, Jewell NP, Greenland S. Case-control matching: effects, misconceptions, and recommendations. Eur J Epidemiol. 2018;33(1):5-14.
15. Labrecque JA, Hunink MM, Ikram MA, Ikram MK. Do case-control studies always estimate odds ratios? Am J Epidemiol. 2021;190(2):318-21.


Acknowledgments

None declared.

Funding

RHHG was funded by the Netherlands Organization for Scientific Research (NWO-Vidi project 917.16.430). The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding body.

Author information

Authors and affiliations.

Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, PO Box 9600, 2300 RC, The Netherlands

Bas B. L. Penning de Vries & Rolf H. H. Groenwold

Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands

Rolf H. H. Groenwold


Contributions

BBLPdV devised the project and wrote the manuscript and supplementary material with substantial input from RHHG, who supervised the project. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Bas B. L. Penning de Vries .

Ethics declarations

Ethics approval and consent to participate.

Not applicable.

Consent for publication

Competing interests.

The authors declare that they have no competing interests.


Supplementary Information

Additional file 1.

Supplementary material to ‘Identification of causal effects in case-control studies’.


Cite this article: Penning de Vries, B.B.L., Groenwold, R.H.H. Identification of causal effects in case-control studies. BMC Med Res Methodol 22, 7 (2022). https://doi.org/10.1186/s12874-021-01484-7


Received: 26 August 2021. Accepted: 29 November 2021. Published: 07 January 2022.

DOI: https://doi.org/10.1186/s12874-021-01484-7


Keywords: Causal inference; Case-control designs; Identifiability


Case-Control Studies

  • First Online: 17 December 2023

  • Qian Wu

The purpose of a case-control study is to evaluate the relationship between a disease and the exposure factors suspected of causing it. Both cohort and case-control studies are analytical studies; their main difference lies in the selection of the study population. In a cohort study, the subjects do not have the disease when entering the study and are classified according to their exposure to putative risk factors. In contrast, subjects in case-control studies are grouped according to the presence or absence of the disease of interest. Case-control studies are relatively easy to conduct and are increasingly being applied to explore the causes of disease, especially rare diseases. Case-control studies are used to estimate the relative risk of disease caused by a specific factor. When the disease is rare, a case-control study may be the only feasible research method.


Author information

Authors and affiliations.

School of Public Health, Xi’an Jiaotong University, Xi’an, China


Corresponding author

Correspondence to Qian Wu .

Editor information

Editors and affiliations.

Department of Epidemiology and Biostatistics, College of Public Health, Zhengzhou University, Zhengzhou, Henan, China

Chongjian Wang

Department of Epidemiology and Health Statistics, School of Public Health, Capital Medical University, Beijing, China


Copyright information

© 2023 Zhengzhou University Press

About this chapter

Wu, Q. (2023). Case-Control Studies. In: Wang, C., Liu, F. (eds) Textbook of Clinical Epidemiology. Springer, Singapore. https://doi.org/10.1007/978-981-99-3622-9_5


DOI : https://doi.org/10.1007/978-981-99-3622-9_5

Published : 17 December 2023

Publisher Name : Springer, Singapore

Print ISBN : 978-981-99-3621-2

Online ISBN : 978-981-99-3622-9




Introduction to study designs - case-control studies

Introduction


Learning objectives: You will learn the basics of case-control studies, their analysis and the interpretation of outcomes. Case-control studies are one of the most frequently used study designs because they are relatively easy to carry out compared with other designs. This section introduces basic concepts, applications and strengths of the case-control study. It also covers:

1. Issues in the design of case-control studies
2. Common sources of bias in a case-control study
3. Analysis of case-control studies
4. Strengths and weaknesses of case-control studies
5. Nested case-control studies

Read the resource text below.

Resource text

Case-control studies start with the identification of a group of cases (individuals with a particular health outcome) in a given population and a group of controls (individuals without the health outcome) to be included in the study.


In a case-control study the prevalence of exposure to a potential risk factor(s) is compared between cases and controls. If the prevalence of exposure is more common among cases than controls, it may be a risk factor for the outcome under investigation. A major characteristic of case-control studies is that data on potential risk factors are collected retrospectively and as a result may give rise to bias. This is a particular problem associated with case-control studies and therefore needs to be carefully considered during the design and conduct of the study.

1. Issues in the design of case-control studies

Formulation of a clearly defined hypothesis
As with all epidemiological investigations, a case-control study should begin with the formulation of a clearly defined hypothesis.

Case definition
It is essential that the case definition is clearly specified at the outset of the investigation, to ensure that all cases included in the study are based on the same diagnostic criteria.

Source of cases
The source of cases needs to be clearly defined.

Selection of cases
Case-control studies may use incident or prevalent cases.

Incident cases comprise cases newly diagnosed during a defined time period. The use of incident cases is considered preferable, as the recall of past exposure(s) may be more accurate among newly diagnosed cases. In addition, the temporal sequence of exposure and disease is easier to assess among incident cases.

Prevalent cases comprise individuals who have had the outcome under investigation for some time. The use of prevalent cases may give rise to recall bias, as prevalent cases may be less likely to accurately report past exposure(s). As a result, the interpretation of results based on prevalent cases may prove more problematic, as it may be more difficult to ensure that reported events relate to a time before the development of disease rather than to the consequences of the disease process itself. For example, individuals may modify their exposure following the onset of disease. In addition, unless the effect of exposure on duration of illness is known, it will not be possible to determine the extent to which a particular characteristic is related to the prognosis of the disease once it develops rather than to its cause.

Source of cases
Cases may be recruited from a number of sources; for example, they may be recruited from a hospital, clinic or GP register, or they may be population based. Population-based case-control studies are generally more expensive and difficult to conduct.

Selection of controls
A particular problem inherent in case-control studies is the selection of a comparable control group. Controls are used to estimate the prevalence of exposure in the population which gave rise to the cases. Therefore, the ideal control group would comprise a random sample from the general population that gave rise to the cases. However, this is not always possible in practice. The goal is to select individuals in whom the distribution of exposure status would be the same as that of the cases in the absence of an exposure-disease association. That is, if there is no true association between exposure and disease, the cases and controls should have the same distribution of exposure. The source of controls is dependent on the source of cases. In order to minimize bias, controls should be selected to be a representative sample of the population which produced the cases. For example, if cases are selected from a defined population such as a GP register, then controls should comprise a sample from the same GP register.


In case-control studies where cases are hospital based, it is common to recruit controls from the hospital population. However, the choice of controls from a hospital setting should not include individuals with an outcome related to the exposure(s) being studied. For example, in a case-control study of the association between smoking and lung cancer the inclusion of controls being treated for a condition related to smoking (e.g. chronic bronchitis) may result in an underestimate of the strength of the association between exposure (smoking) and outcome. Recruiting more than one control per case may improve the statistical power of the study, though including more than 4 controls per case is generally considered to be no more efficient.

Measuring exposure status
Exposure status is measured to assess the presence, or level, of exposure for each individual for the period of time prior to the onset of the disease or condition under investigation, when the exposure would have acted as a causal factor. Note that in case-control studies the measurement of exposure is established after the development of disease and as a result is prone to both recall and observer bias. Various methods can be used to ascertain exposure status. These include:

  • Standardized questionnaires
  • Biological samples
  • Interviews with the subject
  • Interviews with spouse or other family members
  • Medical records
  • Employment records
  • Pharmacy records

The procedures used for the collection of exposure data should be the same for cases and controls.

2. Common sources of bias in case-control studies

Due to the retrospective nature of case-control studies, they are particularly susceptible to the effects of bias, which may be introduced as a result of a poor study design or during the collection of exposure and outcome data. Because the disease and exposure have already occurred at the outset of a case-control study, there may be differential reporting of exposure information between cases and controls based on their disease status. For example, cases and controls may recall past exposure differently (recall bias). Similarly, the recording of exposure information may vary depending on the investigator's knowledge of an individual's disease status (interviewer/observer bias). Therefore, the design and conduct of the study must be carefully considered, as there are limited options for the control of bias during the analysis.

Selection bias in case-control studies
Selection bias is a particular problem inherent in case-control studies, where it gives rise to non-comparability between cases and controls. Selection bias in case-control studies may occur when 'cases (or controls) are included in (or excluded from) a study because of some characteristic they exhibit which is related to exposure to the risk factor under evaluation' [1]. The aim of a case-control study is to select study controls who are representative of the population which produced the cases. Controls are used to provide an estimate of the exposure rate in the population. Therefore, selection bias may occur when those individuals selected as controls are unrepresentative of the population that produced the cases.


The potential for selection bias in case-control studies is a particular problem when cases and controls are recruited exclusively from hospitals or clinics. Hospital patients tend to have different characteristics from the general population; for example, they may have higher levels of alcohol consumption or cigarette smoking. If these characteristics are related to the exposures under investigation, then estimates of the exposure among controls may differ from those in the reference population, which may result in a biased estimate of the association between exposure and disease. Berksonian bias is a bias introduced in hospital-based case-control studies due to varying rates of hospital admission. As the potential for selection bias is likely to be less of a problem in population-based case-control studies, neighbourhood controls may be a preferable choice when using cases from a hospital or clinic setting. Alternatively, the potential for selection bias may be minimized by selecting controls from more than one source, such as by using both hospital and neighbourhood controls. Selection bias may also be introduced when exposed cases are more likely to be selected than unexposed cases.

3. Analysis of case-control studies

The odds ratio (OR) is used in case-control studies to estimate the strength of the association between exposure and outcome. Note that it is not possible to estimate the incidence of disease from a case control study unless the study is population based and all cases in a defined population are obtained.

The results of a case-control study can be presented in a 2x2 table as follows:

              Cases    Controls
Exposed         a         b
Unexposed       c         d

The odds ratio is a measure of the odds of disease in the exposed compared to the odds of disease in the unexposed (controls) and is calculated as:

OR = (a x d) / (b x c)

Example: Calculation of the OR from a hypothetical case-control study of smoking and cancer of the pancreas among 100 cases and 400 controls.

Table 1. Hypothetical case-control study of smoking and cancer of the pancreas.

              Cases    Controls
Smokers         60        100
Non-smokers     40        300
Total          100        400

OR = (60 x 300) / (100 x 40) = 4.5

The OR calculated from the hypothetical data in Table 1 estimates that smokers are 4.5 times more likely to develop cancer of the pancreas than non-smokers. NB: The odds ratio of smoking and cancer of the pancreas has been calculated without adjusting for potential confounders. Further analysis of the data would involve stratifying by levels of potential confounders such as age. The 2x2 table can then be extended to allow stratum-specific rates of the confounding variable(s) to be calculated and, where appropriate, an overall summary measure adjusted for the effects of confounding, together with a statistical test of significance. In addition, confidence intervals for the odds ratio would also be presented.
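As a quick illustration (not part of the original resource text), the odds ratio and an approximate 95% confidence interval can be computed directly from the four cell counts; the interval below uses the standard log-based (Woolf) approximation.

```python
# Odds ratio and Woolf 95% confidence interval from 2x2 counts:
# a, b = exposed cases, exposed controls; c, d = unexposed cases, unexposed controls.
import math

def odds_ratio_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    or_ = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = math.exp(math.log(or_) - z * se_log_or)
    upper = math.exp(math.log(or_) + z * se_log_or)
    return or_, (lower, upper)

# Table 1 counts: 60 exposed cases, 100 exposed controls, 40 unexposed cases,
# 300 unexposed controls -> OR = 4.5 with its 95% CI.
print(odds_ratio_with_ci(60, 100, 40, 300))
```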

4. Strengths and weaknesses of case-control studies

Strengths:

  • Cost effective relative to other analytical studies such as cohort studies.
  • Case-control studies are retrospective, and cases are identified at the beginning of the study; therefore there is no long follow-up period (as compared to cohort studies).
  • Efficient for the study of diseases with long latency periods.
  • Efficient for the study of rare diseases.
  • Good for examining multiple exposures.

Weaknesses:

  • Particularly prone to bias, especially selection, recall and observer bias.
  • Limited to examining one outcome.
  • Unable to estimate incidence rates of disease (unless the study is population based).
  • Poor choice for the study of rare exposures.
  • The temporal sequence between exposure and disease may be difficult to determine.

References

1. Hennekens CH, Buring JE. Epidemiology in Medicine. Lippincott Williams & Wilkins; 1987.

What Is A Case Control Study?

Julia Simkus (author); Saul Mcleod, PhD, and Olivia Guy-Evans, MSc (editors)

A case-control study is a research method where two groups of people are compared – those with the condition (cases) and those without (controls). By looking at their past, researchers try to identify what factors might have contributed to the condition in the ‘case’ group.

Explanation

A case-control study looks at people who already have a certain condition (cases) and people who don’t (controls). By comparing these two groups, researchers try to figure out what might have caused the condition. They look into the past to find clues, like habits or experiences, that are different between the two groups.

The “cases” are the individuals with the disease or condition under study, and the “controls” are similar individuals without the disease or condition of interest.

The controls should have characteristics similar to those of the cases (e.g., age, sex, demographics, health status) to mitigate the effects of confounding variables.

Case-control studies identify any associations between an exposure and an outcome and help researchers form hypotheses about a particular population.

Researchers will first identify the two groups, and then look back in time to investigate which subjects in each group were exposed to the suspected risk factor.

If the exposure is found more commonly in the cases than the controls, the researcher can hypothesize that the exposure may be linked to the outcome of interest.

Figure: Schematic diagram of case-control study design. Kenneth F. Schulz and David A. Grimes (2002) Case-control studies: research in reverse. The Lancet, Volume 359, Issue 9304, 431-434.

Quick, inexpensive, and simple

Because these studies use already existing data and do not require any follow-up with subjects, they tend to be quicker and cheaper than other types of research. Case-control studies also do not require large sample sizes.

Beneficial for studying rare diseases

Researchers in case-control studies start with a population of people known to have the target disease instead of following a population and waiting to see who develops it. This enables researchers to identify current cases and enroll a sufficient number of patients with a particular rare disease.

Useful for preliminary research

Case-control studies are beneficial for an initial investigation of a suspected risk factor for a condition. The information obtained from such studies then enables researchers to conduct further analyses to explore any relationships in more depth.

Limitations

Subject to recall bias

Participants might be unable to remember when they were exposed or might omit other details that are important for the study. In addition, those with the outcome tend to recall and report exposures more completely than those without the outcome.

Difficulty finding a suitable control group

It is important that the case group and the control group have almost the same characteristics, such as age, gender, demographics, and health status.

Forming an accurate control group can be challenging, so sometimes researchers enroll multiple control groups to bolster the strength of the case-control study.

Do not demonstrate causation

Case-control studies can identify an association between exposures and outcomes, but they cannot demonstrate causation.

A case-control study is an observational study in which researchers analyze two groups of people (cases and controls) to look at factors associated with particular diseases or outcomes.

Below are some examples of case-control studies:
  • Investigating the impact of exposure to daylight on the health of office workers (Boubekri et al., 2014).
  • Comparing serum vitamin D levels in individuals who experience migraine headaches with their matched controls (Togha et al., 2018).
  • Analyzing correlations between parental smoking and childhood asthma (Strachan and Cook, 1998).
  • Studying the relationship between elevated concentrations of homocysteine and an increased risk of vascular diseases (Ford et al., 2002).
  • Assessing the magnitude of the association between Helicobacter pylori and the incidence of gastric cancer (Helicobacter and Cancer Collaborative Group, 2001).
  • Evaluating the association between breast cancer risk and saturated fat intake in postmenopausal women (Howe et al., 1990).

Frequently asked questions

1. What's the difference between a case-control study and a cross-sectional study?

Case-control studies are different from cross-sectional studies in that case-control studies compare groups retrospectively while cross-sectional studies analyze information about a population at a specific point in time.

In  cross-sectional studies , researchers are simply examining a group of participants and depicting what already exists in the population.

2. What’s the difference between a case-control study and a longitudinal study?

Case-control studies compare groups retrospectively, while longitudinal studies can compare groups either retrospectively or prospectively.

In a  longitudinal study , researchers monitor a population over an extended period of time, and they can be used to study developmental shifts and understand how certain things change as we age.

In addition, case-control studies typically involve a single round of retrospective data collection, whereas longitudinal studies follow the same group of subjects across repeated observations over time.

3. What’s the difference between a case-control study and a retrospective cohort study?

Case-control studies are retrospective as researchers begin with an outcome and trace backward to investigate exposure; however, they differ from retrospective cohort studies.

In a  retrospective cohort study , researchers examine a group before any of the subjects have developed the disease, then examine any factors that differed between the individuals who developed the condition and those who did not.

Thus, in a retrospective cohort study the investigator starts from exposure status and looks forward to the outcome, whereas in a case-control study the investigator starts from the outcome and looks back at prior exposure.

Boubekri, M., Cheung, I., Reid, K., Wang, C., & Zee, P. (2014). Impact of windows and daylight exposure on overall health and sleep quality of office workers: a case-control pilot study. Journal of Clinical Sleep Medicine: JCSM: Official Publication of the American Academy of Sleep Medicine, 10 (6), 603-611.

Ford, E. S., Smith, S. J., Stroup, D. F., Steinberg, K. K., Mueller, P. W., & Thacker, S. B. (2002). Homocyst(e)ine and cardiovascular disease: a systematic review of the evidence with special emphasis on case-control studies and nested case-control studies. International Journal of Epidemiology, 31(1), 59-70.

Helicobacter and Cancer Collaborative Group. (2001). Gastric cancer and Helicobacter pylori: a combined analysis of 12 case control studies nested within prospective cohorts. Gut, 49 (3), 347-353.

Howe, G. R., Hirohata, T., Hislop, T. G., Iscovich, J. M., Yuan, J. M., Katsouyanni, K., … & Shunzhang, Y. (1990). Dietary factors and risk of breast cancer: combined analysis of 12 case-control studies. JNCI: Journal of the National Cancer Institute, 82(7), 561-569.

Lewallen, S., & Courtright, P. (1998). Epidemiology in practice: case-control studies. Community eye health, 11 (28), 57–58.

Strachan, D. P., & Cook, D. G. (1998). Parental smoking and childhood asthma: longitudinal and case-control studies. Thorax, 53 (3), 204-212.

Tenny, S., Kerndt, C. C., & Hoffman, M. R. (2021). Case Control Studies. In StatPearls . StatPearls Publishing.

Togha, M., Razeghi Jahromi, S., Ghorbani, Z., Martami, F., & Seifishahpar, M. (2018). Serum Vitamin D Status in a Group of Migraine Patients Compared With Healthy Controls: A Case-Control Study. Headache, 58 (10), 1530-1540.

Further Information

  • Schulz, K. F., & Grimes, D. A. (2002). Case-control studies: research in reverse. The Lancet, 359(9304), 431-434.
  • What is a case-control study?


Step 4: Test Hypotheses

Once investigators have narrowed down the likely source of the outbreak to a few possible foods, they test the hypotheses. Investigators can use many different methods to test their hypotheses, but most methods entail studies that compare how often (frequency) sick people in the outbreak ate certain foods to how often people not part of the outbreak ate those foods.

If eating a particular food is associated with getting sick in the outbreak, it provides evidence that the food is the likely source. Investigators can describe the strength of the association between food and illness by using statistical tests or measures, such as odds ratios and confidence intervals.
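As an illustration of the kind of calculation involved, the following R sketch computes an odds ratio and a 95% confidence interval from a made-up 2 × 2 table of food exposure by illness status (the counts and the exposure are hypothetical, not from any actual investigation):

# Hypothetical counts: rows = cases/controls, columns = ate the suspect food (yes/no)
exposure <- matrix(c(30, 10,    # cases:    ate food, did not eat food
                     20, 40),   # controls: ate food, did not eat food
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("case", "control"), c("ate_food", "did_not_eat")))

# Fisher's exact test reports the odds ratio and its 95% confidence interval
fisher.test(exposure)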

Illness clusters

An illness cluster occurs when two or more people who do not live in the same household report eating at the same restaurant location, attending a common event, or shopping at the same location of a grocery store before getting sick. Investigating illness clusters can help test hypotheses about the source of an outbreak because an illness cluster suggests that the contaminated food item was served or sold at the cluster location.

Conducting epidemiologic studies within illness cluster locations can be an effective way to identify foods that are associated with illness. Case-control and cohort studies can both be used in illness cluster investigations and are especially useful when they assess associations between illness and specific food ingredients.

In some multistate outbreaks, investigators identify numerous illness clusters. In those situations, looking for common ingredients that people ate across all the illness clusters can help investigators test hypotheses, even in the absence of an epidemiologic study.

Surveys of healthy people

Investigators often compare the frequency of foods reported by sick people in a multistate outbreak to data that already exist about healthy people. The most common source for data about how often healthy people eat certain foods is the FoodNet Population Survey, a periodic survey of randomly selected residents in the FoodNet surveillance area. The most recent FoodNet Population Survey was conducted during 2018–2019 and included interviews from 38,743 adults and children. In addition to information on food exposures, the survey also includes questions on demographic characteristics, such as age, gender, race, and ethnicity. Investigators use statistical tests to determine if people in an outbreak report eating any of the suspected foods significantly more often than people in the survey. Comparing the frequency of foods reported by sick people to existing data is often faster than conducting a formal epidemiologic study.

Epidemiologic studies

If one or more of the suspected foods under consideration are not included on the FoodNet Population Survey, investigators might need to do an epidemiologic study to determine whether consuming the food is associated with being ill. Several types of studies can be conducted during multistate foodborne outbreaks:

  • Case-control studies: Investigators collect information from sick people (cases) and people who are not sick (controls) to see whether cases ate certain foods significantly more often than controls.
  • Case-case studies: Investigators compare sick people in the outbreak to other sick people who are not part of the outbreak.
  • Cohort studies: Investigators gather data from all the people who attended an event or ate at the same restaurant and compare the frequency of illness between people who did and did not eat specific foods. If people who ate a certain food got sick significantly more often than people who did not, that provides evidence that the food is the source of the outbreak.

Challenges of hypothesis testing

There are several reasons why hypothesis testing might not identify the likely source of an outbreak.

  • The initial investigation did not lead to a strong hypothesis to test.
  • There were too few illnesses to statistically analyze differences between sick people and people who were not part of the outbreak.
  • Sick people in the outbreak could not be reached to ask about their food exposures.
  • Certain ingredients were commonly consumed together in dishes, such as tomatoes, onions, and peppers in a salsa.

Even if investigators do not find a statistical association between a food and illness, the outbreak could still be foodborne. If the outbreak has ended, the source of the outbreak is considered unknown. If people are still getting sick, investigators keep gathering information to find the food that is causing the illnesses.



Basic statistical analysis in genetic case-control studies

Geraldine M Clarke

1 Genetic and Genomic Epidemiology Unit, Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK.

Carl A Anderson

2 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK.

Fredrik H Pettersson

Lon R Cardon

3 GlaxoSmithKline, King of Prussia, Pennsylvania, USA.

Andrew P Morris

Krina T Zondervan

This protocol describes how to perform basic statistical analysis in a population-based genetic association case-control study. The steps described involve the (i) appropriate selection of measures of association and relevance of disease models; (ii) appropriate selection of tests of association; (iii) visualization and interpretation of results; (iv) consideration of appropriate methods to control for multiple testing; and (v) replication strategies. Assuming no previous experience with software such as PLINK, R or Haploview, we describe how to use these popular tools for handling single-nucleotide polymorphism data in order to carry out tests of association and visualize and interpret results. This protocol assumes that data quality assessment and control has been performed, as described in a previous protocol, so that samples and markers deemed to have the potential to introduce bias to the study have been identified and removed. Study design, marker selection and quality control of case-control studies have also been discussed in earlier protocols. The protocol should take ~1 h to complete.

INTRODUCTION

A genetic association case-control study compares the frequency of alleles or genotypes at genetic marker loci, usually single-nucleotide polymorphisms (SNPs) (see Box 1 for a glossary of terms), in individuals from a given population—with and without a given disease trait—in order to determine whether a statistical association exists between the disease trait and the genetic marker. Although individuals can be sampled from families (‘family-based’ association study), the most common design involves the analysis of unrelated individuals sampled from a particular outbred population (‘population-based association study’). Although disease-related traits are usually the main trait of interest, the methods described here are generally applicable to any binary trait.

Admixture

The result of interbreeding between individuals from different populations.

Cochran-Armitage trend test

Statistical test for analysis of categorical data when categories are ordered. It is used to test for association in a 2 × k contingency table ( k > 2). In genetic association studies, because the underlying genetic model is unknown, the additive version of this test is most commonly used.

Confounding

A type of bias in statistical analysis that occurs when a factor exists that is causally associated with the outcome under study (e.g., case-control status) independently of the exposure of primary interest (e.g., the genotype at a given locus) and is associated with the exposure variable but is not a consequence of the exposure variable.

Covariate

Any variable other than the main exposure of interest that is possibly predictive of the outcome under study; covariates include confounding variables that, in addition to predicting the outcome variable, are associated with exposure.

False discovery rate

The proportion of non-causal or false positive significant SNPs in a genetic association study.

False positive

Occurs when the null hypothesis of no effect of exposure on disease is rejected for a given variant when in fact the null hypothesis is true.

Family-wise error rate

The probability of one or more false positives in a set of tests. For genetic association studies, family-wise error rates reflect false positive findings of associations between allele/genotype and disease.

Hardy-Weinberg equilibrium (HWE)

Given a minor allele frequency of p, the probabilities of the three possible unordered genotypes (a/a, A/a, A/A) at a biallelic locus with minor allele A and major allele a are (1 − p)², 2p(1 − p) and p², respectively (see the worked example after this glossary). In a large, randomly mating, homogenous population, these probabilities should be stable from generation to generation.

Linkage disequilibrium (LD)

The population correlation between two (usually nearby) allelic variants on the same chromosome; they are in LD if they are inherited together more often than expected by chance.

r² (r-squared)

A measure of LD between two markers calculated according to the correlation between marker alleles.

Odds ratio (OR)

A measure of association derived from case-control studies; it is the ratio of the odds of disease in the exposed group compared with the non-exposed group.

Penetrance

The risk of disease in a given individual. Genotype-specific penetrances reflect the risk of disease with respect to genotype.

Population allele frequency

The frequency of a particular allelic variant in a general population of specified origin.

Population stratification

The presence of two or more groups with distinct genetic ancestry.

Relative risk

The risk of disease or of an event occurring in one group relative to another.

Single-nucleotide polymorphism (SNP)

A genetic variant that consists of a single DNA base-pair change, usually resulting in two possible allelic identities at that position.
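As a small worked example of the Hardy-Weinberg proportions defined above, take an illustrative minor allele frequency of p = 0.2 (a made-up value, not one taken from the protocol):

\[
P(a/a) = (1-p)^2 = 0.64, \qquad P(A/a) = 2p(1-p) = 0.32, \qquad P(A/A) = p^2 = 0.04,
\]
\[
0.64 + 0.32 + 0.04 = 1.
\]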

Following previous protocols on study design, marker selection and data quality control 1 – 3 , this protocol considers basic statistical analysis methods and techniques for the analysis of genetic SNP data from population-based genome-wide and candidate-gene (CG) case-control studies. We describe disease models, measures of association and testing at genotypic (individual) versus allelic (gamete) level, single-locus versus multilocus methods of association testing, methods for controlling for multiple testing and strategies for replication. Statistical methods discussed relate to the analysis of common variants, i.e., alleles with a minor allele frequency (MAF) > 1%; different analytical techniques are required for the analysis of rare variants 4 . All methods described are proven and used routinely in our research group 5 , 6 .

Conceptual basis for statistical analysis

The success of a genetic association study depends on directly or indirectly genotyping a causal polymorphism. Direct genotyping occurs when an actual causal polymorphism is typed. Indirect genotyping occurs when nearby genetic markers that are highly correlated with the causal polymorphism are typed. Correlation, or non-random association, between alleles at two or more genetic loci is referred to as linkage disequilibrium (LD). LD is generated as a consequence of a number of factors and results in the shared ancestry of a population of chromosomes at nearby loci. The shared ancestry means that alleles at flanking loci tend to be inherited together on the same chromosome, with specific combinations of alleles known as haplotypes. In genome-wide association (GWA) studies, common SNPs are typically typed at such high density across the genome that, although any single SNP is unlikely to have direct causal relevance, some are likely to be in LD with any underlying common causative variants. Indeed, most recent GWA arrays containing up to 1 million SNPs use known patterns of genomic LD from sources such as HapMap 7 to provide the highest possible coverage of common genomic variation 8 . CG studies usually focus on genotyping a smaller but denser set of SNPs, including functional polymorphisms with a potentially higher previous probability of direct causal relevance 2 .

A fundamental assumption of the case-control study is that the individuals selected in case and control groups provide unbiased allele frequency estimates of the true underlying distribution in affected and unaffected members of the population of interest. If not, association findings will merely reflect biases resulting from the study design 1 .

Models and measures of association

Consider a genetic marker consisting of a single biallelic locus with alleles a and A (i.e., a SNP). Unordered possible genotypes are then a/a , a/A and A/A . The risk factor for case versus control status (disease outcome) is the genotype or allele at a specific marker. The disease penetrance associated with a given genotype is the risk of disease in individuals carrying that genotype. Standard models for disease penetrance that imply a specific relationship between genotype and phenotype include multiplicative, additive, common recessive and common dominant models. Assuming a genetic penetrance parameter γ (γ > 1), a multiplicative model indicates that the risk of disease is increased γ-fold with each additional A allele; an additive model indicates that risk of disease is increased γ-fold for genotype a/A and by 2γ-fold for genotype A/A ; a common recessive model indicates that two copies of allele A are required for a γ-fold increase in disease risk, and a common dominant model indicates that either one or two copies of allele A are required for a γ-fold increase in disease risk. A commonly used and intuitive measure of the strength of an association is the relative risk (RR), which compares the disease penetrances between individuals exposed to different genotypes. Special relationships exist between the RRs for these common models 9 (see Table 1 ).

Disease penetrance functions and associated relative risks.

Shown are disease penetrance functions for genotypes a/a, A/a and A/A, and associated relative risks for genotypes A/a and A/A compared with the baseline genotype a/a, for standard disease models in which the baseline disease penetrance associated with genotype a/a is f₀ and the genetic penetrance parameter is γ > 1 (ref. 9).
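Because the body of Table 1 is not reproduced in this extract, the following display is a sketch of its content derived from the model definitions given above; the relative risks are simply the genotype penetrances divided by the baseline penetrance f₀:

\[
\begin{array}{lccc}
\text{Model} & f_{a/a} & f_{A/a} & f_{A/A} \\
\text{Multiplicative} & f_0 & \gamma f_0 & \gamma^2 f_0 \\
\text{Additive} & f_0 & \gamma f_0 & 2\gamma f_0 \\
\text{Common recessive} & f_0 & f_0 & \gamma f_0 \\
\text{Common dominant} & f_0 & \gamma f_0 & \gamma f_0
\end{array}
\qquad
\mathrm{RR}_{A/a} = \frac{f_{A/a}}{f_{a/a}}, \quad \mathrm{RR}_{A/A} = \frac{f_{A/A}}{f_{a/a}}.
\]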

RR estimates based on penetrances can only be derived directly from prospective cohort studies, in which a group of exposed and unexposed individuals from the same population are followed up to assess who develops disease. In a case-control study, in which the ratio of cases to controls is controlled by the investigator, it is not possible to make direct estimates of disease penetrance, and hence of RRs. In this type of study, the strength of an association is measured by the odds ratio (OR). In a case-control study, the OR of interest is the odds of disease (the probability that the disease is present compared with the probability that it is absent) in exposed versus non-exposed individuals. Because of selected sampling, odds of disease are not directly measurable. However, conveniently, the disease OR is mathematically equivalent to the exposure OR (the odds of exposure in cases versus controls), which we can calculate directly from exposure frequencies 10 . The allelic OR describes the association between disease and allele by comparing the odds of disease in an individual carrying allele A to the odds of disease in an individual carrying allele a . The genotypic ORs describe the association between disease and genotype by comparing the odds of disease in an individual carrying one genotype to the odds of disease in an individual carrying another genotype. Hence, there are usually two genotypic ORs, one comparing the odds of disease between individuals carrying genotype A/A and those carrying a/a and the other comparing the odds of disease between individuals carrying genotype a/A and those carrying genotype a/a. Beneficially, when disease penetrance is small, there is little difference between RRs and ORs (i.e., RR ≈ OR). Moreover, the OR is amenable to analysis by multivariate statistical techniques that allow extension to incorporate further SNPs, risk factors and clinical variables. Such techniques include logistic regression and other types of log-linear models 11 .

To work with observations made at the allelic (gamete) rather than the genotypic (individual) level, it is necessary to assume (i) that there is Hardy-Weinberg equilibrium (HWE) in the population, (ii) that the disease has a low prevalence ( < 10%) and (iii) that the disease risks are multiplicative. Under the null hypothesis of no association with disease, the first condition ensures that there is HWE in both controls and cases. Under the alternative hypothesis, the second condition further ensures that controls will be in HWE and the third condition further ensures that cases will also be in HWE. Under these assumptions, allelic frequencies in affected and unaffected individuals can be estimated from case-control studies. The OR comparing the odds of allele A between cases and controls is called the allelic RR (γ*). It can be shown that the genetic penetrance parameter in a multiplicative model of penetrance is closely approximated by the allelic RR, i.e., γ ≈ γ* ( ref. 10 ).

Tests for association

Tests of genetic association are usually performed separately for each individual SNP. The data for each SNP with minor allele a and major allele A can be represented as a contingency table of counts of disease status by either genotype count (e.g., a/a , A/a and A/A ) or allele count (e.g., a and A ) (see Box 2 ). Under the null hypothesis of no association with the disease, we expect the relative allele or genotype frequencies to be the same in case and control groups. A test of association is thus given by a simple χ 2 test for independence of the rows and columns of the contingency table.

CONTINGENCY TABLES AND ASSOCIATED TESTS

The risk factor for case versus control status (disease outcome) is the genotype or allele at a specific marker. The data for each SNP with minor allele a and major allele A in case and control groups comprising n individuals can be written as a 2 × k contingency table of disease status by either allele ( k = 2) or genotype ( k = 3) count.

Allele count

  • The allelic odds ratio is estimated by $\mathrm{OR}_A = \frac{m_{12}\,m_{21}}{m_{11}\,m_{22}}$.
  • If the disease prevalence in a control individual carrying an a allele can be estimated and is denoted as P₀, then the relative risk of disease in individuals with an A allele compared with an a allele is estimated by $\mathrm{RR}_A = \frac{\mathrm{OR}_A}{1 - P_0 + P_0\,\mathrm{OR}_A}$.

An allelic association test is based on a simple χ² test for independence of rows and columns, $X^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(m_{ij} - E[m_{ij}])^2}{E[m_{ij}]}$, where $E[m_{ij}] = \frac{m_{i\bullet}\,m_{\bullet j}}{2n}$. X² has a χ² distribution with 1 d.f. under the null hypothesis of no association.
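As a numerical illustration of the allele-count calculations above, here is a minimal R sketch; the allele counts, their layout (rows: cases, controls; columns: allele a, allele A) and the assumed prevalence P0 are all hypothetical:

# Hypothetical allele counts m[i, j]: rows = cases/controls, columns = a/A
m <- matrix(c(800, 400,   # cases:    a, A
              900, 300),  # controls: a, A
            nrow = 2, byrow = TRUE,
            dimnames = list(c("case", "control"), c("a", "A")))

# Allelic odds ratio OR_A = (m12 * m21) / (m11 * m22)
or_a <- (m[1, 2] * m[2, 1]) / (m[1, 1] * m[2, 2])

# Approximate relative risk, assuming an illustrative prevalence P0 in a-allele carriers
p0   <- 0.05
rr_a <- or_a / (1 - p0 + p0 * or_a)

# 1-d.f. chi-squared test for independence of rows and columns (no continuity correction)
chisq.test(m, correct = FALSE)

c(OR = or_a, RR = rr_a)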

Genotype count

  • The genotypic odds ratio for genotype A/A relative to genotype a/a is estimated by $\mathrm{OR}_{AA} = \frac{n_{13}\,n_{21}}{n_{11}\,n_{23}}$. The genotypic odds ratio for genotype A/a relative to genotype a/a is estimated by $\mathrm{OR}_{Aa} = \frac{n_{12}\,n_{21}}{n_{11}\,n_{22}}$.
  • If the disease prevalence in a control individual carrying an a/a genotype can be estimated and is denoted as P₀, then the relative risk of disease in individuals with an A/A [A/a] genotype compared with an a/a genotype is estimated by $\mathrm{RR}_{AA} = \frac{\mathrm{OR}_{AA}}{1 - P_0 + P_0\,\mathrm{OR}_{AA}}$ [$\mathrm{RR}_{Aa} = \frac{\mathrm{OR}_{Aa}}{1 - P_0 + P_0\,\mathrm{OR}_{Aa}}$].
  • A genotypic association test is based on a simple χ² test for independence of rows and columns, $X^2 = \sum_{i=1}^{2}\sum_{j=1}^{3}\frac{(n_{ij} - E[n_{ij}])^2}{E[n_{ij}]}$, where $E[n_{ij}] = \frac{n_{i\bullet}\,n_{\bullet j}}{n}$. X² has a χ² distribution with 2 d.f. under the null hypothesis of no association. To test for a dominant (recessive) effect of allele A, counts for genotypes a/A and A/A (a/a and A/a) can be combined and the usual 1 d.f. χ² test for independence of rows and columns can be applied to the summarized 2 × 2 table.
  • A Cochran-Armitage trend test of association between disease and marker is given by $T^2 = \frac{\left[\sum_{i=1}^{3} w_i\,(n_{1i}\,n_{2\bullet} - n_{2i}\,n_{1\bullet})\right]^2}{\frac{n_{1\bullet}\,n_{2\bullet}}{n}\left[\sum_{i=1}^{3} w_i^2\,n_{\bullet i}(n - n_{\bullet i}) - 2\sum_{i=1}^{2}\sum_{j=i+1}^{3} w_i w_j\,n_{\bullet i}\,n_{\bullet j}\right]}$, where w = (w₁, w₂, w₃) are weights chosen to detect particular types of association. For example, to test whether allele A is dominant over allele a, w = (0,1,1) is optimal; to test whether allele A is recessive to allele a, the optimal choice is w = (0,0,1). In genetic association studies, w = (0,1,2) is most often used to test for an additive effect of allele A. T² has a χ² distribution with 1 d.f. under the null hypothesis of no association. A small R illustration of these genotype-based calculations follows this box.
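A minimal R sketch of the genotype-count analyses, again with made-up counts; base R's prop.trend.test gives the Cochran-Armitage trend test with the additive weights w = (0,1,2) used below:

# Hypothetical genotype counts n[i, j]: rows = cases/controls, columns = a/a, A/a, A/A
n <- matrix(c(300, 450, 250,   # cases
              400, 420, 180),  # controls
            nrow = 2, byrow = TRUE,
            dimnames = list(c("case", "control"), c("aa", "Aa", "AA")))

# Genotypic 2-d.f. test for independence of rows and columns
chisq.test(n)

# Cochran-Armitage trend test: cases per genotype vs. genotype column totals
prop.trend.test(x = n["case", ], n = colSums(n), score = c(0, 1, 2))

# Genotypic odds ratios relative to the a/a baseline
or_AA <- (n[1, 3] * n[2, 1]) / (n[1, 1] * n[2, 3])
or_Aa <- (n[1, 2] * n[2, 1]) / (n[1, 1] * n[2, 2])
c(OR_AA = or_AA, OR_Aa = or_Aa)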

In a conventional χ 2 test for association based on a 2 × 3 contingency table of case-control genotype counts, there is no sense of genotype ordering or trend: each of the genotypes is assumed to have an independent association with disease and the resulting genotypic association test has 2 degrees of freedom (d.f.). Contingency table analysis methods allow alternative models of penetrance by summarizing the counts in different ways. For example, to test for a dominant model of penetrance, in which any number of copies of allele A increase the risk of disease, the contingency table can be summarized as a 2 × 2 table of genotype counts of A/A versus both a/A and a/a combined. To test for a recessive model of penetrance, in which two copies of allele A are required for any increased risk, the contingency table is summarized into genotype counts of a/a versus a combined count of both a/A and A/A genotypes. To test for a multiplicative model of penetrance using contingency table methods, it is necessary to analyze by gamete rather than individual: a χ 2 test applied to the 2 × 2 table of case-control allele counts is the widely used allelic association test. The allelic association test with 1 d.f. will be more powerful than the genotypic test with 2 d.f., as long as the penetrance of the heterozygote genotype is between the penetrances of the two homozygote genotypes. Conversely, if there is extreme deviation from the multiplicative model, the genotypic test will be more powerful. In the absence of HWE in controls, the allelic association test is not suitable and alternative methods must be used to test for multiplicative models. See the earlier protocol on data quality assessment and control for a discussion of criteria for retaining SNPs showing deviation from HWE 3 . Alternatively, any penetrance model specifying some kind of trend in risk with increasing numbers of A alleles, of which additive, dominant and recessive models are all examples, can be examined using the Cochran-Armitage trend test 12 , 13 . The Cochran-Armitage trend test is a method of directing χ 2 tests toward these narrower alternatives. Power is very often improved as long as the disease risks associated with the a/A genotype are intermediate to those associated with the a/a and A/A genotypes. In genetic association studies in which the underlying genetic model is unknown, the additive version of this test is most commonly used. Table 2 summarizes the various tests of association that use contingency table methods. Box 2 outlines contingency tables and associated tests in statistical detail.

Tests of association using contingency table methods.

d.f. for tests of association based on contingency tables along with associated PLINK keyword are shown for allele and genotype counts in case and control groups, comprising N individuals at a bi-allelic locus with alleles a and A .

Tests of association can also be conducted with likelihood ratio (LR) methods in which inference is based on the likelihood of the genotyped data given disease status. The likelihood of the observed data under the proposed model of disease association is compared with the likelihood of the observed data under the null model of no association; a high LR value tends to discredit the null hypothesis. All disease models can be tested using LR methods. In large samples, the χ 2 and LR methods can be shown to be equivalent under the null hypothesis 14 .

More complicated logistic regression models of association are used when there is a need to include additional covariates to handle complex traits. Examples of this are situations in which we expect disease risk to be modified by environmental effects such as epidemiological risk factors (e.g., smoking and gender), clinical variables (e.g., disease severity and age at onset) and population stratification (e.g., principal components capturing variation due to differential ancestry 3 ), or by the interactive and joint effects of other marker loci. In logistic regression models, the logarithm of the odds of disease is the response variable, with linear (additive) combinations of the explanatory variables (genotype variables and any covariates) entering into the model as its predictors. For suitable linear predictors, the regression coefficients fitted in the logistic regression represent the log of the ORs for disease gene association described above. Linear predictors for genotype variables in a selection of standard disease models are shown in Table 3 .

Linear predictors for genotype variables in a selection of standard disease models.
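Since the body of Table 3 is not shown in this extract, the following R sketch illustrates one common way of coding the genotype predictor and fitting such a logistic regression; the data are simulated purely for demonstration, and the coefficient values are arbitrary:

set.seed(1)
n_ind    <- 1000
genotype <- sample(0:2, n_ind, replace = TRUE, prob = c(0.49, 0.42, 0.09))  # copies of allele A
sex      <- rbinom(n_ind, 1, 0.5)

# Simulate case-control status with a log-additive genotype effect (illustrative only)
status <- rbinom(n_ind, 1, plogis(-1 + 0.4 * genotype + 0.2 * sex))

# Additive (per-allele) coding used in the model; dominant and recessive codings for comparison
dominant  <- as.integer(genotype >= 1)
recessive <- as.integer(genotype == 2)

fit <- glm(status ~ genotype + sex, family = binomial)
summary(fit)
exp(coef(fit))             # per-allele odds ratio and covariate odds ratios
exp(confint.default(fit))  # Wald 95% confidence intervals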

Multiple testing

Controlling for multiple testing to accurately estimate significance thresholds is a very important aspect of studies involving many genetic markers, particularly GWA studies. The type I error, also called the significance level or false-positive rate, is the probability of rejecting the null hypothesis when it is true. The significance level indicates the proportion of false positives that an investigator is willing to tolerate in his or her study. The family-wise error rate (FWER) is the probability of making one or more type I errors in a set of tests. Lower FWERs restrict the proportion of false positives at the expense of reducing the power to detect association when it truly exists. A suitable FWER should be specified at the design stage of the analysis 1 . It is then important to keep track of the number of statistical comparisons performed and correct the individual SNP-based significance thresholds for multiple testing to maintain the overall FWER. For association tests applied at each of n SNPs, per-test significance levels of α* for a given FWER of α can be simply approximated using Bonferroni (α* = α/n) or Sidak 15,16 (α* = 1 − (1 − α)^(1/n)) adjustments. When tests are independent, the Sidak correction is exact; however, in GWA studies comprising dense sets of markers, this is unlikely to be true and both corrections are then very conservative. A similar but slightly less-stringent alternative to the Bonferroni correction is given by Holm 17 . Alternatives to the FWER approach include false discovery rate (FDR) procedures 18,19 , which control for the expected proportion of false positives among those SNPs declared significant. However, dependence between markers and the small number of expected true positives make FDR procedures problematic for GWA studies. Alternatively, permutation approaches aim to render the null hypothesis correct by randomization: essentially, the original P value is compared with the empirical distribution of P values obtained by repeating the original tests while randomly permuting the case-control labels 20 . Although Bonferroni and Sidak corrections provide a simple way to adjust for multiple testing by assuming independence between markers, permutation testing is considered to be the ‘gold standard’ for accurate correction 20 . Permutation procedures are computationally intensive in the setting of GWA studies and, moreover, apply only to the current genotyped data set; therefore, unless the entire genome is sequenced, they cannot generate truly genome-wide significance thresholds. Bayes factors have also been proposed for the measurement of significance 6 . For GWA studies of dense SNPs and resequence data, a standard genome-wide significance threshold of 7.2 × 10⁻⁸ for the UK Caucasian population has been proposed by Dudbridge and Gusnanto 21 . Other thresholds for contemporary populations, based on sample size and proposed FWER, have been proposed by Hoggart et al. 22 . Informally, some journals have accepted a genome-wide significance threshold of 5 × 10⁻⁷ as strong evidence for association 6 ; however, most recently, the accepted standard is 5 × 10⁻⁸ (ref. 23). Further, graphical techniques for assessing whether observed P values are consistent with expected values include log quantile-quantile P value plots that highlight loci that deviate from the null hypothesis 24 .
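A minimal R sketch of these corrections, using simulated P values (p.adjust is part of base R; the single-step Sidak adjustment is computed by hand):

# Toy vector of association P values: mostly null, with a few strong signals
set.seed(2)
p <- c(runif(9995), runif(5, 0, 1e-6))
n <- length(p)
alpha <- 0.05

p_bonf  <- p.adjust(p, method = "bonferroni")
p_holm  <- p.adjust(p, method = "holm")
p_bh    <- p.adjust(p, method = "BH")   # Benjamini-Hochberg FDR
p_sidak <- 1 - (1 - p)^n                # single-step Sidak adjustment

# Per-test significance thresholds for a FWER of alpha
c(bonferroni = alpha / n, sidak = 1 - (1 - alpha)^(1 / n))

# Number of SNPs declared significant under each correction
sapply(list(bonferroni = p_bonf, holm = p_holm, bh = p_bh, sidak = p_sidak),
       function(q) sum(q < alpha))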

Interpretation of results

A significant result in an association test rarely implies that a SNP is directly influencing disease risk; population association can be direct, indirect or spurious. A direct, or causal, association occurs when different alleles at the marker locus are directly involved in the etiology of the disease through a biological pathway. Such associations are typically only found during follow-up genotyping phases of initial GWA studies, or in focused CG studies in which particular functional polymorphisms are targeted. An indirect, or non-causal, association occurs when the alleles at the marker locus are correlated (in LD) with alleles at a nearby causal locus but do not directly influence disease risk. When a significant finding in a genetic association study is true, it is most likely to be indirect. Spurious associations can occur as a consequence of data quality issues or statistical sampling, or because of confounding by population stratification or admixture. Population stratification occurs when cases and controls are sampled disproportionately from different populations with distinct genetic ancestry. Admixture occurs when there has been genetic mixing of two or more groups in the recent past. For example, genetic admixture is seen in Native American populations in which there has been recent genetic mixing of individuals with both American Indian and Caucasian ancestry 25 . Confounding occurs when a factor exists that is associated with both the exposure (genotype) and the disease but is not a consequence of the exposure. As allele frequencies and disease frequencies are known to vary among populations of different genetic ancestry, population stratification or admixture can confound the association between the disease trait and the genetic marker; it can bias the observed association, or indeed can cause a spurious association. Principal component analyses or multidimensional scaling methods are commonly used to identify and remove individuals exhibiting divergent ancestry before association testing. These techniques are described in detail in an earlier protocol 3 . To adjust for any residual population structure during association testing, the principal components from principal component analyses or multidimensional scaling methods can be included as covariates in a logistic regression. In addition, the technique of genomic control 26 can be used to detect and compensate for the presence of fine-scale or within-population stratification during association testing. Under genomic control, population stratification is treated as a random effect that causes the distribution of the χ 2 association test statistics to have an inflated variance and a higher median than would otherwise be observed. The test statistics are assumed to be uniformly affected by an inflation factor λ, the magnitude of which is estimated from a set of selected markers by comparing the median of their observed test statistics with the median of their expected test statistics under an assumption of no population stratification. Under genomic control, if λ > 1, then population stratification is assumed to exist and a correction is applied by dividing the actual association test χ 2 statistic values by λ. As λ scales with sample size, λ 1,000 , the inflation factor for an equivalent study of 1,000 cases and 1,000 controls calculated by rescaling λ, is often reported 27 . In a CG study, λ can only be determined if an additional set of markers specifically designed to indicate population stratification are genotyped. 
In a GWA study, an unbiased estimation of λ can be determined using all of the genotyped markers; the effect on the inflation factor of potential causal SNPs in such a large set of genomic control markers is assumed to be negligible.
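The genomic control calculation itself is short; the sketch below assumes chisq_stats is a vector of 1-d.f. association χ² statistics (simulated here) and uses one commonly reported rescaling for the λ1000 quantity mentioned above, with placeholder sample sizes:

# Observed 1-d.f. chi-squared statistics (in practice, e.g., the CHISQ column of PLINK output)
set.seed(3)
chisq_stats <- rchisq(100000, df = 1)

lambda <- median(chisq_stats) / qchisq(0.5, df = 1)   # genomic inflation factor
lambda

# If lambda > 1, deflate the statistics before computing P values
corrected <- chisq_stats / max(lambda, 1)
p_gc <- pchisq(corrected, df = 1, lower.tail = FALSE)

# Rescaling to an equivalent study of 1,000 cases and 1,000 controls
# (n_cases and n_controls are placeholders for the actual sample sizes)
n_cases <- 2000; n_controls <- 2000
lambda_1000 <- 1 + (lambda - 1) * (1 / n_cases + 1 / n_controls) / (1 / 1000 + 1 / 1000)
lambda_1000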

Replication

Replication occurs when a positive association from an initial study is confirmed in a subsequent study involving an independent sample drawn from the same population as the initial study. It is the process by which genetic association results are validated. In theory, a repeated significant association between the same trait and allele in an independent sample is the benchmark for replication. However, in practice, so-called replication studies often comprise findings of association between the same trait and nearby variants in the same gene as the original SNP, or between the same SNP and different high-risk traits. A precise definition of what constitutes replication for any given study is therefore important and should be clearly stated 28 .

In practice, replication studies often involve different investigators with different samples and study designs aiming to independently verify reports of positive association and obtain accurate effect-size estimates, regardless of the designs used to detect effects in the primary study. Two commonly used strategies in such cases are an exact strategy, in which only marker loci indicating a positive association are subsequently genotyped in the replicate sample, and a local strategy, in which additional variants are also included, thus combining replication with fine-mapping objectives. In general, the exact strategy is more balanced in power and efficiency; however, depending on local patterns of LD and the strength of primary association signals, a local strategy can be beneficial 28 .

In the past, multistage designs have been proposed as cost-efficient approaches to allow the possibility of replication within a single overall study. The first stage of a standard two-stage design involves genotyping a large number of markers on a proportion of available samples to identify potential signals of association using a nominal P value threshold. In stage two, the top signals are then followed up by genotyping them on the remaining samples while a joint analysis of data from both stages is conducted 29 , 30 . Significant signals are subsequently tested for replication in a second data set. With the ever-decreasing costs of GWA genotyping, two-stage studies have become less common.

Standard statistical software (such as R ( ref. 31 ) or SPSS) can be used to conduct and visualize all the analyses outlined above. However, many researchers choose to use custom-built GWA software. In this protocol we use PLINK 32 , Haploview 33 and the customized R package car 34 . PLINK is a popular and computationally efficient software program that offers a comprehensive and well-documented set of automated GWA quality control and analysis tools. It is a freely available open source software written in C++, which can be installed on Windows, Mac and Unix machines ( http://pngu.mgh.harvard.edu/~purcell/plink/index.shtml ). Haploview ( http://www.broadinstitute.org/haploview/haploview ) is a convenient tool for visualizing LD; it interfaces directly with PLINK to produce a standard visualization of PLINK association results. Haploview is most easily run through a graphical user interface, which offers many advantages in terms of display functions and ease of use. car ( http://socserv.socsci.mcmaster.ca/jfox/ ) is an R package that contains a variety of functions for graphical diagnostic methods.

The next section describes protocols for the analysis of SNP data and is illustrated by the use of simulated data sets from CG and GWA studies (available as gzipped files from http://www.well.ox.ac.uk/ggeu/NPanalysis/ or .zip files as Supplementary Data 1 and Supplementary Data 2 ). We assume that SNP data for a CG study, typically comprising on the order of thousands of markers, will be available in a standard PED and MAP file format (for an explanation of these file formats, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#ped ) and that SNP data for a GWA study, typically comprising on the order of hundreds of thousands of markers, will be available in a standard binary file format (for an explanation of the binary file format, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#bed ). In general, SNP data for either type of study may be available in either format. The statistical analysis described here is for the analysis of one SNP at a time; therefore, apart from the requirement to take potentially differing input file formats into account, it does not differ between CG and GWA studies.

Computer workstation with Unix/Linux operating system and web browser

  • PLINK 32 software for association analysis ( http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml ).
  • Unzipping tool such as WinZip ( http://www.winzip.com ) or gunzip ( http://www.gzip.org )
  • Statistical software for data analysis and graphing such as R ( http://cran.r-project.org/ ) and Haploview 33 ( http://www.broadinstitute.org/haploview/haploview ).
  • SNPSpD 35 (Program to calculate the effective number of independent SNPs among a collection of SNPs in LD with each other; http://genepi.qimr.edu.au/general/daleN/SNPSpD/ )
  • Files: genome-wide and candidate-gene SNP data (available as gzipped files from http://www.well.ox.ac.uk/ggeu/NPanalysis/ or .zip files as Supplementary Data 1 and Supplementary Data 2 )

Identify file formats ● TIMING ~5 min

1 | For SNP data available in standard PED and MAP file formats, as in our CG study, follow option A. For SNP data available in standard binary file format, as in our GWA study, follow option B. The instructions provided here are for unpacking the sample data provided as gzipped files at http://www.well.ox.ac.uk/ggeu/NPanalysis/ . If using the .zip files provided as supplementary Data 1 or supplementary Data 2 , please proceed directly to step 2.

▲ CRITICAL STEP The format in which genotype data are returned to investigators varies according to genome-wide SNP platforms and genotyping centers. We assume that genotypes have been called by the genotyping center, undergone appropriate quality control filters as described in a previous protocol 3 and returned as clean data in a standard file format.

  • Download the file ‘cg-data.tgz’.

▲ CRITICAL STEP The simulated data used here have passed standard quality control filters: all individuals have a missing data rate of < 20%, and SNPs with a missing rate of > 5%, a MAF < 1% or an HWE P value < 1 × 10⁻⁴ have already been excluded. These filters were selected in accordance with procedures described elsewhere 3 to minimize the influence of genotype-calling artifacts in a CG study.

  • Download the file ‘gwa-data.tgz’.

▲ CRITICAL STEP We assume that covariate files are available in a standard file format. For an explanation of the standard format for covariate files, see http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar .

▲ CRITICAL STEP Optimized binary BED files contain the genotype information and the corresponding BIM/FAM files contain the map and pedigree information. The binary BED file is a compressed file that allows faster processing in PLINK and takes less storage space, thus facilitating the analysis of large-scale data sets 32 .

▲ CRITICAL STEP The simulated data used here have passed standard quality control: all individuals have a missing data rate of < 10%. SNPs with a missing rate > 10%, a MAF < 1% or an HWE P value < 1 × 10⁻⁵ have already been excluded. These filters were selected in accordance with procedures described elsewhere 3 to minimize the influence of genotype-calling artifacts in a GWA study.

? TROUBLESHOOTING

Basic descriptive summary ● TIMING ~5 min

2 | To obtain a summary of MAFs in case and control populations and an estimate of the OR for association between the minor allele (based on the whole sample) and disease in the CG study, type ‘plink --file cg --assoc --out data’. In any of the PLINK commands in this protocol, replace the ‘--file cg’ option with the ‘--bfile gwa’ option to use the binary file format of the GWA data rather than the PED and MAP file format of the CG data.

▲ CRITICAL STEP PLINK always creates a log file called ‘data.log’, which includes details of the implemented commands, the number of cases and controls in the input files, any excluded data and the genotyping rate in the remaining data. This file is very useful for checking the software is successfully completing commands.

▲ CRITICAL STEP The options in a PLINK command can be specified in any order.

3 | Open the output file ‘data.assoc’. It has one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the frequency of the minor allele in the cases [F_A] and controls [F_U], the major allele [A2] and statistical data for an allelic association test including the χ 2 -test statistic [CHISQ], the asymptotic P value [ P ] and the estimated OR for association between the minor allele and disease [OR].
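To inspect these results outside PLINK, a short R sketch such as the following can be used (the file name matches the --out prefix from Step 2; the 5 × 10⁻⁸ threshold is the genome-wide standard discussed in the INTRODUCTION):

# Load the PLINK allelic association results and list the top hits
assoc <- read.table("data.assoc", header = TRUE)
head(assoc[order(assoc$P), c("CHR", "SNP", "BP", "F_A", "F_U", "P", "OR")], 10)

# SNPs passing a genome-wide significance threshold of 5e-8
subset(assoc, P < 5e-8)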

Single SNP tests of association ● TIMING ~5 min

4 | When there are no covariates to consider, carry out simple χ 2 tests of association by following option A. For inclusion of multiple covariates and covariate interactions, follow option B.

▲ CRITICAL STEP Genotypic, dominant and recessive tests will not be conducted if any one of the cells in the table of case control by genotype counts contains less than five observations. This is because the χ 2 approximation may not be reliable when cell counts are small. For SNPs with MAFs < 5%, a sample of more than 2,000 cases and controls would be required to meet this threshold and more than 50,000 would be required for SNPs with MAF < 1%. To change the threshold, use the ‘--cell’ option. For example, we could lower the threshold to 3 and repeat the χ 2 tests of association by typing ‘plink --file cg --model --cell 3 --out data’.

  • Open the output file ‘data.model’. It contains five rows per SNP, one for each of the association tests described in Table 2 . Each row contains the chromosome [CHR], the SNP identifier [SNP], the minor allele [A1], the major allele [A2], the test performed [TEST: GENO (genotypic association); TREND (Cochran-Armitage trend); ALLELIC (allelic association); DOM (dominant model); and REC (recessive model)], the cell frequency counts for cases [AFF] and controls [UNAFF], the χ 2 test statistic [CHISQ], the degrees of freedom for the test [DF] and the asymptotic P value [ P ].

▲ CRITICAL STEP To specify a genotypic, dominant or recessive model in place of a multiplicative model, include the model option --genotypic, --dominant or --recessive, respectively. To include sex as a covariate, include the option --sex. To specify interactions between covariates, and between SNPs and covariates, include the option --interaction.

Open the output file ‘data.assoc.logistic’. If no model option is specified, the first row for each SNP corresponds to results for a multiplicative test of association. If the ‘--genotypic’ option has been selected, the first row will correspond to a test for additivity and the subsequent row to a separate test for deviation from additivity. If the ‘--dominant’ or ‘--recessive’ model options have been selected, then the first row will correspond to tests for a dominant or recessive model of association, respectively. If covariates have been included, each of these P values is adjusted for the effect of the covariates. The C ≥ 0 subsequent rows for each SNP correspond to separate tests of significance for each of the C covariates included in the regression model. Finally, if the ‘--genotypic’ model option has been selected, there is a final row per SNP corresponding to a 2 d.f. LR test of whether both the additive and the deviation from additivity components of the regression model are significant. Each row contains the chromosome [CHR], the SNP identifier [SNP], the base-pair location [BP], the minor allele [A1], the test performed [TEST: ADD (multiplicative model or genotypic model testing additivity), GENO_2DF (genotypic model), DOMDEV (genotypic model testing deviation from additivity), DOM (dominant model) or REC (recessive model)], the number of missing individuals included [NMISS], the OR, the coefficient z-statistic [STAT] and the asymptotic P value [P].

▲ CRITICAL STEP ORs for main effects cannot be interpreted directly when interactions are included in the model; their interpretation depends on the exact combination of variables included in the model. Refer to a standard text on logistic regression for more details 36 .

Data visualization ● TIMING ~5 min

5 | To create quantile-quantile plots to compare the observed association test statistics with their expected values under the null hypothesis of no association and so assess the number, magnitude and quality of true associations, follow option A. Note that quantile-quantile plots are only suitable for GWA studies comprising hundreds of thousands of markers. To create a Manhattan plot to display the association test P values as a function of chromosomal location and thus provide a visual summary of association test results that draws immediate attention to any regions of significance, follow option B. To visualize the LD between sets of markers in an LD plot, follow option C. Manhattan and LD plots are suitable for both GWA and CG studies comprising any number of markers. Otherwise, create customized graphics for the visualization of association test output using simple R 31 commands 37 (not detailed here).

  • Start R software.
  • Create a quantile-quantile plot ‘chisq.qq.plot.pdf’ with a 95% confidence interval based on output from the simple χ² tests of association described in Step 4A for trend, allelic, dominant or recessive models, wherein statistics have a χ² distribution with 1 d.f. under the null hypothesis of no association. Create the plot by typing ‘data <- read.table("[path_to]/data.model", header = TRUE); pdf("[path_to]/chisq.qq.plot.pdf"); library(car); obs <- data[data$TEST == "[model]",]$CHISQ; qqPlot(obs, distribution = "chisq", df = 1, xlab = "Expected chi-squared values", ylab = "Observed test statistic", grid = FALSE); dev.off()’, where [path_to] is the appropriate directory path and [model] identifies the association test output to be displayed, and where [model] can be TREND (Cochran-Armitage trend); ALLELIC (allelic association); DOM (dominant model); or REC (recessive model). For simple χ² tests of association based on a genotypic model, in which test statistics have a χ² distribution with 2 d.f. under the null hypothesis of no association, use df = 2 and [model] = GENO.
  • Create a quantile-quantile plot ‘pvalue.qq.plot.pdf’ based on −log10 P values from tests of association using logistic regression described in Step 4B by typing ‘data <- read.table("[path_to]/data.assoc.logistic", header = TRUE); pdf("[path_to]/pvalue.qq.plot.pdf"); obs <- -log10(sort(data[data$TEST == "[model]",]$P)); exp <- -log10(c(1:length(obs))/(length(obs) + 1)); plot(exp, obs, ylab = "Observed (-logP)", xlab = "Expected (-logP)", ylim = c(0,20), xlim = c(0,7)); lines(c(0,7), c(0,7), col = 1, lwd = 2); dev.off()’, where [path_to] is the appropriate directory path and [model] identifies the association test output to be displayed and where [model] is ADD (multiplicative model); GENO_2DF (genotypic model); DOMDEV (genotypic model testing deviation from additivity); DOM (dominant model); or REC (recessive model).
  • Start Haploview. In the ‘Welcome to Haploview’ window, select the ‘PLINK Format’ tab. Click the ‘browse’ button and select the SNP association output file created in Step 4. We select our GWA study χ 2 tests of association output file ‘data.model’. Select the corresponding MAP file, which will be the ‘.map’ file for the pedigree file format or the ‘.bim’ file for the binary file format. We select our GWA study file ‘gwa.bim’. Leave other options as they are (ignore pairwise comparison of markers > 500 kb apart and exclude individuals with > 50% missing genotypes). Click ‘OK’.
  • Select the association results relevant to the test of interest by selecting ‘TEST’ in the dropdown tab to the right of ‘Filter:’, ‘ = ’ in the dropdown menu to the right of that and the PLINK keyword corresponding to the test of interest in the window to the right of that. We select PLINK keyword ‘ALLELIC’ to visualize results for allelic tests of association in our GWA study. Click the gray ‘Filter’ button. Click the gray ‘Plot’ button. Leave all options as they are so that ‘Chromosomes’ is selected as the ‘X-Axis’. Choose ‘P’ from the drop-down menu for the ‘Y-Axis’ and ‘−log10′ from the corresponding dropdown menu for ‘Scale:’. Click ‘OK’ to display the Manhattan plot.
  • To save the plot as a scalable vector graphics file, click the button ‘Export to scalable vector graphics:’ and then click the ‘Browse’ button (immediately to the right) to select the appropriate title and directory.
  • Using the standard MAP file, create the locus information file required by Haploview for the CG data by typing ‘cg.map <- read.table("[path_to]/cg.map"); write.table(cg.map[, c(2,4)], "[path_to]/cg.hmap", col.names = FALSE, row.names = FALSE, quote = FALSE)’, where [path_to] is the appropriate directory path.
  • Start Haploview. In the ‘Welcome to Haploview’ window, select the ‘LINKAGE Format’ tab. Click the ‘browse’ button to enter the ‘Data File’ and select the PED file ‘cg.ped’. Click the ‘browse’ button to enter the ‘Locus Information File’ and select the file ‘cg.hmap’. Leave other options as they are (ignore pairwise comparison of markers > 500 kb apart and exclude individuals with > 50% missing genotypes). Click ‘OK’. Select the ‘LD Plot’ tab.

Adjustment for multiple testing ● TIMING ~5 min

6 | For CG studies, control for multiple testing using Bonferroni’s adjustment (follow option A); the Holm, Sidak or FDR methods (follow option B); or permutation (follow option C). Although the Bonferroni, Holm, Sidak and FDR adjustments are simple to implement, permutation testing is widely recommended for accurately correcting for multiple testing and should be used when computationally feasible. For GWA studies, typically comprising hundreds of thousands of markers, select an appropriate genome-wide significance threshold (follow option D).

▲ CRITICAL STEP If some of the SNPs are in LD, so that there are fewer than 40 independent tests, the Bonferroni correction will be too conservative. Use LD information from HapMap and SNPSpD (http://genepi.qimr.edu.au/general/daleN/SNPSpD/; ref. 35) to estimate the effective number of independent SNPs (ref. 1). Derive the per-test significance threshold α* by dividing α by the effective number of independent SNPs.
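
As a small illustration of this calculation, the sketch below derives α* in R from hypothetical numbers; the total of 40 CG SNPs and the effective number of 28 independent SNPs are assumed values for illustration, not output from this study:

alpha  <- 0.05   # desired family-wise error rate
n.snps <- 40     # assumed total number of CG SNPs genotyped
m.eff  <- 28     # assumed effective number of independent SNPs reported by SNPSpD

alpha.bonferroni <- alpha / n.snps   # naive Bonferroni threshold, too conservative under LD
alpha.star       <- alpha / m.eff    # per-test significance threshold alpha*
c(bonferroni = alpha.bonferroni, alpha.star = alpha.star)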

  • To obtain significance values adjusted for multiple testing for trend, dominant and recessive tests of association, include the --adjust option along with the model specification option --model-[x] (where [x] is ‘trend’, ‘rec’ or ‘dom’ to indicate whether trend, recessive or dominant test P values, respectively, are to be adjusted) in any of the PLINK commands described in Step 4A. For example, adjusted significance values for a Cochran-Armitage trend test of association in the CG data are obtained by typing ‘plink --file cg --adjust --model-trend --out data’. Obtain adjusted significance values for an allelic test of association by typing ‘plink --file cg --assoc --adjust --out data’.
  • Open the output file ‘data.model.[x].adjusted’ for adjusted trend, dominant or recessive test P values or ‘data.assoc.adjusted’ for adjusted allelic test P values. These files have one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the unadjusted P value [UNADJ] identical to that found in the original association output file, the genomic-control-adjusted P value [GC], the Bonferroni-adjusted P value [BONF], the Holm step-down-adjusted P value [HOLM], the Sidak single-step-adjusted P value [SIDAK_SS], the Sidak step-down-adjusted P value [SIDAK_SD], the Benjamini and Hochberg FDR control [FDR_BH] and the Benjamini and Yekutieli FDR control [FDR_BY]. To maintain a FWER or FDR of α = 0.05, only SNPs with adjusted P values less than α are declared significant; a short R cross-check of these adjustments is sketched after this list.
  • To generate permuted P values, include the --mperm option along with the number of permutations to be performed and the model specification option --model-[x] (where [x] is ‘gen’, ‘trend’, ‘rec’ or ‘dom’ to indicate whether genotypic, trend, recessive or dominant test P values are to be permuted) in any of the PLINK commands described in Step 4A. For example, permuted P values based on 1,000 replicates for a Cochran-Armitage trend test of association are obtained by typing ‘plink --file cg --model --mperm 1000 --model-trend --out data’ and permuted P values based on 1,000 replicates for an allelic test of association are obtained by typing ‘plink --file cg --assoc --mperm 1000 --out data’.
  • Open the output file ‘data.model.[x].mperm’ for permuted P values for genotypic, trend, dominant or recessive association tests or ‘data.assoc.mperm’ for permuted P values for allelic tests of association. These files have one row per SNP containing the chromosome [CHR], the SNP identifier [SNP], the point-wise estimate of the SNP’s significance [EMP1] and the family-wise estimate of the SNP’s significance [EMP2]. To maintain a FWER of α = 0.05, only SNPs with family-wise estimated significance of less than α are declared significant.
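
The adjusted P values reported by PLINK above can be cross-checked in R. The sketch below is a minimal example, assuming the unadjusted allelic-test results are in ‘data.assoc’ (columns SNP and P) in the working directory; the Sidak and genomic-control adjustments are omitted because base R’s p.adjust() does not provide them:

assoc <- read.table("data.assoc", header = TRUE)
p <- assoc$P

## recompute the Bonferroni, Holm and FDR adjustments reported by PLINK
adj <- data.frame(SNP    = assoc$SNP,
                  UNADJ  = p,
                  BONF   = p.adjust(p, method = "bonferroni"),
                  HOLM   = p.adjust(p, method = "holm"),
                  FDR_BH = p.adjust(p, method = "BH"),
                  FDR_BY = p.adjust(p, method = "BY"))

## SNPs significant at a family-wise error rate of 0.05 under Bonferroni
subset(adj, BONF < 0.05)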

Population stratification ● TIMING ~5 min

7 | For CG studies, calculate the inflation factor λ using an additional set of null marker loci (follow option A). For GWA studies, typically comprising hundreds of thousands of markers, obtain an unbiased evaluation of the inflation factor λ by using all tested SNPs (follow option B).

▲ CRITICAL STEP To assess the inflation factor in CG studies, an additional set of null marker loci, which are common SNPs not associated with the disease and not in LD with CG SNPs, must be available. We do not have any null loci data files available for our CG study.

  • To obtain the inflation factor, include the --adjust option in any of the PLINK commands described in Step 4B. For example, the inflation factor based on logistic regression tests of association for all SNPs, assuming multiplicative or genotypic models in the GWA study, is obtained by typing ‘plink --bfile gwa --genotypic --logistic --covar gwa.covar --adjust --out data’.

  • Open the PLINK log file ‘data.log’, which records the inflation factor.

▲ CRITICAL STEP When the sample size is large, the inflation factor λ1000 for an equivalent study of 1,000 cases and 1,000 controls can be calculated by rescaling λ according to the following formula: λ1000 = 1 + (λ − 1) × (1/n_cases + 1/n_controls)/(1/1,000 + 1/1,000), where n_cases and n_controls are the numbers of cases and controls in the study.
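
A minimal sketch of this calculation in R, assuming the 1-d.f. χ2 association statistics are in the CHISQ column of the PLINK output file ‘data.assoc’ and using hypothetical sample sizes of 1,500 cases and 2,000 controls:

assoc <- read.table("data.assoc", header = TRUE)
chisq <- assoc$CHISQ

## inflation factor: ratio of the observed median statistic to the median
## of the chi-squared distribution with 1 d.f. (approximately 0.455)
lambda <- median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)

n.cases    <- 1500   # assumed
n.controls <- 2000   # assumed
lambda.1000 <- 1 + (lambda - 1) * (1/n.cases + 1/n.controls) / (1/1000 + 1/1000)

c(lambda = lambda, lambda.1000 = lambda.1000)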

TROUBLESHOOTING

For general help on the programs and websites used in this protocol, refer to the relevant websites.

Step 1: If genotypes are not available in standard PED and MAP or binary file formats, both Goldsurfer2 (Gs2; see refs. 38,39) and PLINK have the functionality to read other file formats (e.g., HapMap, HapMart, Affymetrix, transposed file sets and long-format file sets) and convert these into PED and MAP or binary file formats.

Steps 2–6: The default missing genotype character is ‘0’. PLINK can recognize a different character as the missing genotype by using the ‘--missing-genotype’ option. For example, specify a missing genotype character of ‘N’ instead of ‘0’ in Step 2 by typing ‘plink --file cg --assoc --missing-genotype N --out data’.

● TIMING

None of the programs used take longer than a few minutes to run. Displaying and interpreting the relevant information are the rate-limiting steps.

ANTICIPATED RESULTS

Summary of results.

Table 4 shows the unadjusted P value for an allelic test of association in the CG region, as well as the corresponding adjusted P values, for SNPs with significant P values. Here we have defined a P value as significant if at least one of the adjusted values is smaller than the threshold required to maintain a FWER of 0.05. The top four SNPs are significant according to every method of adjustment for multiple testing. The last SNP is significant only according to the FDR method of Benjamini and Hochberg, so statements about its significance should be made with some caution.

Table 4 | SNPs in the CG study showing the strongest association signals. Shown are adjusted and unadjusted P values for those SNPs with significant P values in an allelic test of association according to at least one method of adjustment for multiple testing. Chr, chromosome; FDR, false discovery rate; BH, Benjamini and Hochberg; BY, Benjamini and Yekutieli.

Figure 1 shows an LD plot based on the CG data. Numbers within diamonds indicate r2 values. SNPs with significant P values (P < 0.05 and listed in Table 4) in the CG study are shown in white boxes. Six haplotype blocks of LD across the region have been identified and are marked in black. The LD plot shows that the five significant SNPs belong to three different haplotype blocks within the region studied: three of the five significantly associated SNPs are located in Block 2, a 52-kb block of high LD (r2 > 0.34). The two remaining significant SNPs are each located in separate blocks, Block 3 and Block 5. These results indicate possible allelic heterogeneity (the presence of multiple independent risk-associated variants). Further fine mapping would be required to locate the precise causal variants.

Figure 1 | LD plot showing LD patterns among the 37 SNPs genotyped in the CG study. The LD between SNPs is measured as r2 and shown (×100) in the diamond at the intersection of the diagonals from each pair of SNPs. r2 = 0 is shown as white, 0 < r2 < 1 in gray and r2 = 1 in black. The analysis track at the top shows the SNPs according to chromosomal location. Six haplotype blocks (outlined in bold black lines), indicating markers in high LD, are shown. At the top, the markers with the strongest evidence for association (listed in Table 4) are boxed in white.

Quantile-quantile plot

Figure 2 shows the quantile-quantile plots for two different tests of association in the GWA data, one based on χ2 statistics from a test of allelic association and the other based on −log10 P values from a logistic regression under a multiplicative model of association. These plots show only minor deviations from the null distribution, except in the upper tail, which corresponds to the SNPs with the strongest evidence for association. Because the majority of the results follow the null distribution and only a handful deviate from it, the plots suggest that there is no population structure unaccounted for in the analysis. They thus give confidence in the quality of the data and the robustness of the analysis. Both plots are included here for illustration only; typically only one (corresponding to the particular test of association) is required.

Figure 2 | Quantile-quantile plots of the results from the GWA study of (a) a simple χ2 allelic test of association and (b) a multiplicative test of association based on logistic regression for all 306,102 SNPs that passed the standard quality control filters. The solid line indicates the middle of the first and third quartiles of the expected distribution of the test statistics. The dashed lines mark the 95% confidence interval of the expected distribution of the test statistics. Both plots show deviation from the null distribution only in the upper tails, which correspond to SNPs with the strongest evidence for association.

Manhattan plot

Figure 3 shows a Manhattan plot for the allelic test of association in the GWA study. SNPs with significant P values are easy to distinguish, corresponding to those with large −log10 P values. Three black ellipses mark regions on chromosomes 3, 8 and 16 that reach genome-wide significance (P < 5 × 10−8). Markers in these regions would require further scrutiny, through replication in an independent sample, for confirmation of a true association.
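
A Manhattan plot similar to Figure 3 can also be drawn directly in R rather than in Haploview. The sketch below is a minimal example under the assumption that the allelic-test results are in ‘data.assoc’ with numeric CHR, BP and P columns, as produced by ‘plink --assoc’:

assoc <- read.table("data.assoc", header = TRUE)
assoc <- assoc[!is.na(assoc$P), ]
assoc <- assoc[order(assoc$CHR, assoc$BP), ]

## lay the chromosomes end to end along the x axis
chr.max <- tapply(assoc$BP, assoc$CHR, max)
offset  <- c(0, cumsum(as.numeric(chr.max)))[match(assoc$CHR, sort(unique(assoc$CHR)))]
assoc$pos <- assoc$BP + offset

plot(assoc$pos, -log10(assoc$P),
     col = assoc$CHR %% 2 + 1, pch = 20,
     xlab = "Chromosomal location", ylab = "-log10 P")
abline(h = -log10(5e-08), lty = 2)   # genome-wide significance threshold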

Figure 3 | Manhattan plot of simple χ2 allelic test of association P values from the GWA study. The plot shows −log10 P values for each SNP against chromosomal location. Values for each chromosome (Chr) are shown in different colors for visual effect. Three regions are highlighted where markers have reached genome-wide significance (P < 5 × 10−8).

Supplementary Material

Acknowledgments.

G.M.C. is funded by the Wellcome Trust. F.H.P. is funded by the Wellcome Trust. C.A.A. is funded by the Wellcome Trust (WT91745/Z/10/Z). A.P.M. is supported by a Wellcome Trust Senior Research Fellowship. K.T.Z. is supported by a Wellcome Trust Research Career Development Fellowship.

Note: Supplementary information is available in the HTML version of this article.

COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.

Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/ .

