Study Design 101

Case Control Study

A study that compares patients who have a disease or outcome of interest (cases) with patients who do not have the disease or outcome (controls), and looks back retrospectively to compare how frequently the exposure to a risk factor is present in each group to determine the relationship between the risk factor and the disease.

Case control studies are observational because no intervention is attempted and no attempt is made to alter the course of the disease. The goal is to retrospectively determine the exposure to the risk factor of interest from each of the two groups of individuals: cases and controls. These studies are designed to estimate odds.

Case control studies are also known as "retrospective studies" and "case-referent studies."

Advantages

  • Good for studying rare conditions or diseases
  • Less time needed to conduct the study because the condition or disease has already occurred
  • Lets you simultaneously look at multiple risk factors
  • Useful as initial studies to establish an association
  • Can answer questions that could not be answered through other study designs


Design pitfalls to look out for

Care should be taken to avoid confounding, which arises when an exposure and an outcome are both strongly associated with a third variable. Controls should be subjects who might have been cases in the study but are selected independent of the exposure. Cases and controls should also not be "over-matched."

Is the control group appropriate for the population? Does the study use matching or pairing appropriately to avoid the effects of a confounding variable? Does it use appropriate inclusion and exclusion criteria?

Fictitious Example

There is a suspicion that zinc oxide, the white non-absorbent sunscreen traditionally worn by lifeguards, is more effective at preventing sunburns that lead to skin cancer than absorbent sunscreen lotions. A case-control study was conducted to investigate whether exposure to zinc oxide is a more effective skin cancer prevention measure. The study compared a group of former lifeguards who had developed cancer on their cheeks and noses (cases) with a group of lifeguards without this type of cancer (controls) and assessed their prior exposure to zinc oxide or absorbent sunscreen lotions.

This study would be retrospective in that the former lifeguards would be asked to recall which type of sunscreen they used on their face and approximately how often. This could be either a matched or unmatched study, but efforts would need to be made to ensure that the former lifeguards are of the same average age, and lifeguarded for a similar number of seasons and amount of time per season.

Real-life Examples

Boubekri, M., Cheung, I., Reid, K., Wang, C., & Zee, P. (2014). Impact of windows and daylight exposure on overall health and sleep quality of office workers: a case-control pilot study. Journal of Clinical Sleep Medicine: JCSM: Official Publication of the American Academy of Sleep Medicine, 10(6), 603-611.

This pilot study explored the impact of exposure to daylight on the health of office workers (measuring well-being and sleep quality subjectively, and light exposure, activity level and sleep-wake patterns via actigraphy). Individuals with windows in their workplaces had more light exposure, longer sleep duration, and more physical activity. They also reported better scores in the areas of vitality and role limitations due to physical problems, better sleep quality and fewer sleep disturbances.

Togha, M., Razeghi Jahromi, S., Ghorbani, Z., Martami, F., & Seifishahpar, M. (2018). Serum Vitamin D Status in a Group of Migraine Patients Compared With Healthy Controls: A Case-Control Study. Headache, 58(10), 1530-1540.

This case-control study compared serum vitamin D levels in individuals who experience migraine headaches with those of their matched controls. Over a study period of thirty days, higher levels of serum vitamin D were associated with lower odds of migraine headache.

Related Terms

Case

A patient with the disease or outcome of interest.

Confounding

When an exposure and an outcome are both strongly associated with a third variable.

Control

A patient who does not have the disease or outcome.

Matched Design

Each case is matched individually with a control according to certain characteristics such as age and gender. It is important to remember that the concordant pairs (pairs in which the case and control are either both exposed or both not exposed) tell us nothing about the risk of exposure separately for cases or controls.
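The point about concordant pairs can be illustrated with a short sketch. It assumes the standard matched-pair estimate of the odds ratio, in which only the discordant pairs contribute: the number of pairs where only the case was exposed divided by the number where only the control was exposed. The function name and the pair counts are hypothetical and chosen purely for illustration.

```python
def matched_pair_odds_ratio(pairs):
    """Estimate the odds ratio from 1:1 matched pairs.

    pairs: iterable of (case_exposed, control_exposed) booleans, one per pair.
    Only discordant pairs enter the estimate.
    """
    pairs = list(pairs)
    case_only = sum(1 for case, control in pairs if case and not control)
    control_only = sum(1 for case, control in pairs if control and not case)
    if control_only == 0:
        raise ValueError("no pairs with only the control exposed; ratio undefined")
    return case_only / control_only


# Illustrative (made-up) data: 30 pairs with only the case exposed,
# 10 with only the control exposed, 40 both exposed, 20 neither exposed.
pairs = (
    [(True, False)] * 30
    + [(False, True)] * 10
    + [(True, True)] * 40
    + [(False, False)] * 20
)

or_all_pairs = matched_pair_odds_ratio(pairs)
# Dropping the 60 concordant pairs leaves the estimate unchanged.
or_discordant_only = matched_pair_odds_ratio(pairs[:40])
print(or_all_pairs, or_discordant_only)
```

Adding or removing concordant pairs changes the number of pairs in the study but not this estimate, which is why they carry no information about the association.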

Observed Assignment

The method of assignment of individuals to study and control groups in observational studies when the investigator does not intervene to perform the assignment.

Unmatched Design

The controls are a sample from a suitable non-affected population.

Now test yourself!

1. Case Control Studies are prospective in that they follow the cases and controls over time and observe what occurs.

a) True
b) False

2. Which of the following is an advantage of Case Control Studies?

a) They can simultaneously look at multiple risk factors.
b) They are useful to initially establish an association between a risk factor and a disease or outcome.
c) They take less time to complete because the condition or disease has already occurred.
d) b and c only
e) a, b, and c

© 2011-2019, The Himmelfarb Health Sciences Library

Epidemiology in Practice: Case-Control Studies


A case-control study is designed to help determine if an exposure is associated with an outcome (i.e., disease or condition of interest). In theory, the case-control study can be described simply. First, identify the cases (a group known to have the outcome) and the controls (a group known to be free of the outcome). Then, look back in time to learn which subjects in each group had the exposure(s), comparing the frequency of the exposure in the case group to the control group.

By definition, a case-control study is always retrospective because it starts with an outcome then traces back to investigate exposures. When the subjects are enrolled in their respective groups, the outcome of each subject is already known by the investigator. This, and not the fact that the investigator usually makes use of previously collected data, is what makes case-control studies ‘retrospective’.

Advantages of Case-Control Studies

Case-control studies have specific advantages compared to other study designs. They are comparatively quick, inexpensive, and easy. They are particularly appropriate for (1) investigating outbreaks, and (2) studying rare diseases or outcomes. An example of (1) would be a study of endophthalmitis following ocular surgery. When an outbreak is in progress, answers must be obtained quickly. An example of (2) would be a study of risk factors for uveal melanoma, or corneal ulcers. Since case-control studies start with people known to have the outcome (rather than starting with a population free of disease and waiting to see who develops it) it is possible to enroll a sufficient number of patients with a rare disease. The practical value of producing rapid results or investigating rare outcomes may outweigh the limitations of case-control studies. Because of their efficiency, they may also be ideal for preliminary investigation of a suspected risk factor for a common condition; conclusions may be used to justify a more costly and time-consuming longitudinal study later.

Consider a situation in which a large number of cases of post-operative endophthalmitis have occurred in a few weeks. The case group would consist of all those patients at the hospital who developed post-operative endophthalmitis during a pre-defined period.

The definition of a case needs to be very specific:

There are not necessarily any ‘right’ answers to these questions but they must be answered before the study begins. At the end of the study, the conclusions will be valid only for patients who have the same sort of ‘endophthalmitis’ as in the case definition.

Controls should be chosen who are similar in many ways to the cases. The factors (e.g., age, sex, time of hospitalisation) chosen to define how controls are to be similar to the cases are the ‘matching criteria’. The selected control group must be at similar risk of developing the outcome; it would not be appropriate to compare a group of controls who had traumatic corneal lacerations with cases who underwent elective intraocular surgery. In our example, controls could be defined as patients who underwent elective intraocular surgery during the same period of time.

Matching Cases and Controls

Although controls must be like the cases in many ways, it is possible to over-match. Over-matching can make it difficult to find enough controls. Also, once a matching variable has been selected, it is not possible to analyse it as a risk factor. Matching for type of intraocular surgery (e.g., secondary IOL implantation) would mean including the same percentage of controls as cases who had surgery to implant a secondary IOL; if this were done, it would not be possible to analyse secondary IOL implantation as a potential risk factor for endophthalmitis.

An important technique for adding power to a study is to enroll more than one control for every case. For statistical reasons, however, there is little gained by including more than two controls per case.

Collecting Data

After clearly defining cases and controls, decide on data to be collected; the same data must be collected in the same way from both groups. Care must be taken to be objective in the search for past risk factors, especially since the outcome is already known, or the study may suffer from researcher bias. Although it may not always be possible, it is important to try to mask the outcome from the person who is collecting risk factor information or interviewing patients. Sometimes it will be necessary to interview patients about potential factors (such as history of smoking, diet, use of traditional eye medicines, etc.) in their past. It may be difficult for some people to recall all these details accurately. Furthermore, patients who have the outcome (cases) are likely to scrutinize the past, remembering details of negative exposures more clearly than controls. This is known as recall bias. Anything the researcher can do to minimize this type of bias will strengthen the study.

Analysis; Odds Ratios and Confidence Intervals

In the analysis stage, calculate the frequency of each of the measured variables in each of the two groups. As a measure of the strength of the association between an exposure and the outcome, case-control studies yield the odds ratio. An odds ratio is the ratio of the odds of an exposure in the case group to the odds of an exposure in the control group. It is important to calculate a confidence interval for each odds ratio. A confidence interval that includes 1.0 means that the association between the exposure and outcome could have been found by chance alone and that the association is not statistically significant. An odds ratio without a confidence interval is not very meaningful. These calculations are usually made with computer programmes (e.g., Epi-Info). Case-control studies cannot provide any information about the incidence or prevalence of a disease because no measurements are made in a population based sample.
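As a rough illustration of the calculation described above, the sketch below computes an odds ratio and an approximate 95% confidence interval. It assumes the log-based (Woolf) approximation for the interval, which is one common choice and not necessarily what Epi-Info or any other package reports. The function name is hypothetical, and the counts are the hypothetical smoking and pancreatic cancer figures that appear later in this text.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio with an approximate (Woolf, log-based) confidence interval."""
    odds_in_cases = exposed_cases / unexposed_cases
    odds_in_controls = exposed_controls / unexposed_controls
    odds_ratio = odds_in_cases / odds_in_controls
    # Standard error of log(OR): square root of the sum of reciprocal cell counts.
    se_log_or = math.sqrt(
        1 / exposed_cases + 1 / unexposed_cases
        + 1 / exposed_controls + 1 / unexposed_controls
    )
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 60 of 100 cases and 100 of 400 controls were exposed.
odds_ratio, lower, upper = odds_ratio_ci(60, 40, 100, 300)
interval_includes_one = lower <= 1.0 <= upper
print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print("Interval includes 1.0:", interval_includes_one)
```

If the interval includes 1.0, the association is not statistically significant at the chosen level, as described above.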

Risk Factors and Sampling

Another use for case-control studies is investigating risk factors for a rare disease, such as uveal melanoma. In this example, cases might be recruited by using hospital records. Patients who present to hospital, however, may not be representative of the population who get melanoma. If, for example, women present less commonly at hospital, bias might occur in the selection of cases.

The selection of a proper control group may pose problems. A frequent source of controls is patients from the same hospital who do not have the outcome. However, hospitalised patients often do not represent the general population; they are likely to suffer health problems and they have access to the health care system. An alternative may be to enroll community controls, people from the same neighborhoods as the cases. Care must be taken with sampling to ensure that the controls represent a ‘normal’ risk profile. Sometimes researchers enroll multiple control groups. These could include a set of community controls and a set of hospital controls.


Matching controls to cases will mitigate the effects of confounders. A confounding variable is one which is associated with the exposure and is a cause of the outcome. If exposure to toxin ‘X’ is associated with melanoma, but exposure to toxin ‘X’ is also associated with exposure to sunlight (assuming that sunlight is a risk factor for melanoma), then sunlight is a potential confounder of the association between toxin ‘X’ and melanoma.

Case-control studies may prove an association but they do not demonstrate causation. Consider a case-control study intended to establish an association between the use of traditional eye medicines (TEM) and corneal ulcers. TEM might cause corneal ulcers but it is also possible that the presence of a corneal ulcer leads some people to use TEM. The temporal relationship between the supposed cause and effect cannot be determined by a case-control study.

Be aware that the term ‘case-control study’ is frequently misused. Not all studies which contain ‘cases’ and ‘controls’ are case-control studies. One may start with a group of people with a known exposure and a comparison group (‘control group’) without the exposure and follow them through time to see what outcomes result, but this does not constitute a case-control study.

Case-control studies are sometimes less valued for being retrospective. However, they can be a very efficient way of identifying an association between an exposure and an outcome. Sometimes they are the only ethical way to investigate an association. If care is taken with definitions, selection of controls, and reducing the potential for bias, case-control studies can generate valuable information.



The UK Faculty of Public Health has recently taken ownership of the Health Knowledge resource. This new, advert-free website is still under development and there may be some issues accessing content. Additionally, the content has not been audited or verified by the Faculty of Public Health as part of an ongoing quality assurance process and as such certain material included may be out of date. If you have any concerns regarding content you should seek to independently verify this.

Introduction to study designs - case-control studies



Learning objectives: You will learn the basics of case-control studies, their analysis and the interpretation of their outcomes.

Case-control studies are frequently used because they are relatively easy to apply compared with other study designs. This section introduces the basic concepts, application and strengths of the case-control study. It also covers:

1. Issues in the design of case-control studies
2. Common sources of bias in a case-control study
3. Analysis of case-control studies
4. Strengths and weaknesses of case-control studies
5. Nested case-control studies

Read the resource text below.

Resource text

Case-control studies start with the identification of a group of cases (individuals with a particular health outcome) in a given population and a group of controls (individuals without the health outcome) to be included in the study.


In a case-control study the prevalence of exposure to one or more potential risk factors is compared between cases and controls. If exposure is more common among cases than controls, it may be a risk factor for the outcome under investigation. A major characteristic of case-control studies is that data on potential risk factors are collected retrospectively and as a result may give rise to bias. This is a particular problem associated with case-control studies and therefore needs to be carefully considered during the design and conduct of the study.

1. Issues in the design of case-control studies

Formulation of a clearly defined hypothesis
As with all epidemiological investigations, a case-control study should begin with the formulation of a clearly defined hypothesis.

Case definition
It is essential that the case definition is clearly established at the outset of the investigation to ensure that all cases included in the study are based on the same diagnostic criteria.

Source of cases
The source of cases needs to be clearly defined.

Selection of cases
Case-control studies may use incident or prevalent cases.

Incident cases comprise cases newly diagnosed during a defined time period. The use of incident cases is considered preferable, as the recall of past exposure(s) may be more accurate among newly diagnosed cases. In addition, the temporal sequence of exposure and disease is easier to assess among incident cases.

Prevalent cases comprise individuals who have had the outcome under investigation for some time. The use of prevalent cases may give rise to recall bias as prevalent cases may be less likely to accurately report past exposure(s). As a result, the interpretation of results based on prevalent cases may prove more problematic, as it may be more difficult to ensure that reported events relate to a time before the development of disease rather than to the consequence of the disease process itself. For example, individuals may modify their exposure following the onset of disease. In addition, unless the effect of exposure on duration of illness is known, it will not be possible to determine the extent to which a particular characteristic is related to the prognosis of the disease once it develops rather than to its cause.

Source of cases
Cases may be recruited from a number of sources; for example, from a hospital, clinic or GP register, or they may be population based. Population-based case-control studies are generally more expensive and difficult to conduct.

Selection of controls
A particular problem inherent in case-control studies is the selection of a comparable control group. Controls are used to estimate the prevalence of exposure in the population which gave rise to the cases. Therefore, the ideal control group would comprise a random sample from the general population that gave rise to the cases. However, this is not always possible in practice. The goal is to select individuals in whom the distribution of exposure status would be the same as that of the cases in the absence of an exposure-disease association. That is, if there is no true association between exposure and disease, the cases and controls should have the same distribution of exposure.

The source of controls is dependent on the source of cases. In order to minimize bias, controls should be selected to be a representative sample of the population which produced the cases. For example, if cases are selected from a defined population such as a GP register, then controls should comprise a sample from the same GP register.


In case-control studies where cases are hospital based, it is common to recruit controls from the hospital population. However, the choice of controls from a hospital setting should not include individuals with an outcome related to the exposure(s) being studied. For example, in a case-control study of the association between smoking and lung cancer the inclusion of controls being treated for a condition related to smoking (e.g. chronic bronchitis) may result in an underestimate of the strength of the association between exposure (smoking) and outcome. Recruiting more than one control per case may improve the statistical power of the study, though including more than 4 controls per case is generally considered to be no more efficient.
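The diminishing return from extra controls can be sketched numerically. The snippet below assumes the textbook approximation that, when there is no true association, the statistical efficiency of r controls per case relative to an unlimited number of controls is r / (r + 1). This approximation is brought in for illustration and is not stated in the text above; the function name is hypothetical.

```python
def relative_efficiency(controls_per_case):
    """Approximate efficiency of r controls per case relative to unlimited controls."""
    return controls_per_case / (controls_per_case + 1)


# Each additional control per case adds less than the one before it.
for r in (1, 2, 3, 4, 5, 10):
    gain = relative_efficiency(r + 1) - relative_efficiency(r)
    print(f"{r} control(s) per case: efficiency {relative_efficiency(r):.2f}, "
          f"gain from adding one more {gain:.3f}")
```

Under this approximation, most of the achievable gain has been realised by about four controls per case, which is consistent with the rule of thumb quoted above.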

Measuring exposure status
Exposure status is measured to assess the presence or level of exposure for each individual for the period of time prior to the onset of the disease or condition under investigation when the exposure would have acted as a causal factor. Note that in case-control studies the measurement of exposure is established after the development of disease and as a result is prone to both recall and observer bias. Various methods can be used to ascertain exposure status. These include:

The procedures used for the collection of exposure data should be the same for cases and controls.

2. Common sources of bias in case-control studies

Due to the retrospective nature of case-control studies, they are particularly susceptible to the effects of bias, which may be introduced as a result of a poor study design or during the collection of exposure and outcome data. Because the disease and exposure have already occurred at the outset of a case-control study, there may be differential reporting of exposure information between cases and controls based on their disease status. For example, cases and controls may recall past exposure differently (recall bias). Similarly, the recording of exposure information may vary depending on the investigator's knowledge of an individual's disease status (interviewer/observer bias). Therefore, the design and conduct of the study must be carefully considered, as there are limited options for the control of bias during the analysis.

Selection bias in case-control studies
Selection bias is a particular problem inherent in case-control studies, where it gives rise to non-comparability between cases and controls. Selection bias in case-control studies may occur when: 'cases (or controls) are included in (or excluded from) a study because of some characteristic they exhibit which is related to exposure to the risk factor under evaluation' [1]. The aim of a case-control study is to select study controls who are representative of the population which produced the cases. Controls are used to provide an estimate of the exposure rate in the population. Therefore, selection bias may occur when those individuals selected as controls are unrepresentative of the population that produced the cases.


The potential for selection bias in case-control studies is a particular problem when cases and controls are recruited exclusively from hospitals or clinics. Hospital patients tend to have different characteristics from the general population; for example, they may have higher levels of alcohol consumption or cigarette smoking. If these characteristics are related to the exposures under investigation, then estimates of the exposure among controls may be different from that in the reference population, which may result in a biased estimate of the association between exposure and disease. Berksonian bias is a bias introduced in hospital-based case-control studies, due to varying rates of hospital admissions.

As the potential for selection bias is likely to be less of a problem in population-based case-control studies, neighbourhood controls may be a preferable choice when using cases from a hospital or clinic setting. Alternatively, the potential for selection bias may be minimized by selecting controls from more than one source, such as by using both hospital and neighbourhood controls. Selection bias may also be introduced in case-control studies when exposed cases are more likely to be selected than unexposed cases.

3. Analysis of case-control studies

The odds ratio (OR) is used in case-control studies to estimate the strength of the association between exposure and outcome. Note that it is not possible to estimate the incidence of disease from a case control study unless the study is population based and all cases in a defined population are obtained.

The results of a case-control study can be presented in a 2x2 table as follows:

                 Cases    Controls
Exposed            a          b
Unexposed          c          d

The odds ratio is a measure of the odds of disease in the exposed compared to the odds of disease in the unexposed and is calculated as:

OR = (a x d) / (b x c)

Example: Calculation of the OR from a hypothetical case-control study of smoking and cancer of the pancreas among 100 cases and 400 controls.

Table 1. Hypothetical case-control study of smoking and cancer of the pancreas.

                 Cases    Controls
Smokers            60        100
Non-smokers        40        300
Total             100        400

OR = (60 x 300) / (100 x 40) = 4.5

The OR calculated from the hypothetical data in table 1 estimates that the odds of cancer of the pancreas among smokers are 4.5 times those among non-smokers.

NB: The odds ratio of smoking and cancer of the pancreas has been calculated without adjusting for potential confounders. Further analysis of the data would involve stratifying by levels of potential confounders such as age. The 2x2 table can then be extended to allow stratum-specific rates of the confounding variable(s) to be calculated and, where appropriate, an overall summary measure, adjusted for the effects of confounding, and a statistical test of significance can also be calculated. In addition, confidence intervals for the odds ratio would also be presented.

4. Strengths and weaknesses of case-control studies

References

1. Hennekens CH, Buring JE. Epidemiology in Medicine. Lippincott Williams & Wilkins, 1987.


Case Control Studies


Five steps in conducting a case-control study

1. Define a study population (source of cases and controls)

Controls must have as similar a background as possible to the cases, except that they do not have the outcome in question. They should come from the same population as the cases. Their selection should be independent of the exposures of interest. Objective measures of the presence of risk factors are best, ideally carried out in a 'blind' assessment (i.e. the assessor does not know who is a case and who is a control) or before the cases and controls are identified.

2. Define and select cases

Cases can be identified from the general population, using health registers and data, or from a particular medical setting. The criteria for diagnosis of a case should be defined, as well as the eligibility criteria used for selection. The diagnostic criteria should be sensitive and specific (i.e. strict!). Information on diseases can be obtained from death certificates, disease registers, medical records or population surveys. For rare diseases, cases may have to be sought from large areas or over many years.

3. Define and select controls

This is a very important step. Get this wrong and you introduce bias into the study. Controls should represent the population that the cases come from (i.e. they should be at risk of becoming new cases). The ratio of controls to cases is usually 1:1. If cases are limited, you can have up to 4 controls per case. Some time will be needed to consider how the cases and controls that make up the study will be chosen. The more heterogeneous the cases, the less likely it is that a specific risk factor can be linked to disease causation. But the narrower the category of disease for inclusion as 'cases', the less generally applicable the findings will be.

Source of Controls: Hospital

People have taken controls from a hospital population because they maintain that the controls are in some way matched to the hospital cases. However, they are people with other risk factors. (For example, you could be comparing people with lung cancer with people with broken legs. People who break their legs are not the same as all those who develop lung cancer). The controls may have different diseases to the cases, which may have an effect on the results. 

Advantages of using hospital controls: 

Disadvantages of using hospital controls: 

Source of controls: General Population (Community Controls)

Controls can be taken from the community the cases are from or from a different population. The controls may be healthy or may have other diseases. It is worth bearing in mind a couple of weaknesses with community controls. Healthy controls have less reliable recall of exposure. They may also be less motivated to take part and so may have lower response rates.

4. Measure exposure

The exposure(s) must be measured in a comparable way for cases and controls. It is worth 'blinding' the data gatherers to the case or control status of participants, or at least to the main hypothesis of the study. This should help prevent measurement or researcher bias. Exposure information can come from records (though the obvious disadvantage is that records can be inaccurate or incomplete and were not originally collected for the study's purposes) or from an interview or questionnaire (this can introduce recall bias, where cases have more of a vested interest in recalling exposures than controls, and may sometimes rely on 'proxy' respondents, e.g. carers, or parents of children).

5. Estimate disease risk associated with exposure

Traditionally, data from case-control studies are set out in a 2 by 2 or fourfold table. This is unlike cohort studies, where the study population is the denominator, an incidence rate can be calculated for the disease as people are affected, and a relative risk can be calculated. Because there are no population-based data in case-control studies, results are best expressed as an odds ratio (the ratio of exposed to non-exposed in the case group divided by the same ratio in the control group). When the number with disease is small compared with the number unaffected, the odds ratio is close in value to the relative risk, which is a population-based estimate derived from cohort studies.
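The relationship between the odds ratio and the relative risk can be checked with a few lines of arithmetic. The sketch below assumes we somehow knew the true risk of disease in exposed and unexposed people (something a case-control study does not measure directly) and compares the two measures for a rare and a common outcome. The function names and the risk values are hypothetical.

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Ratio of the probability of disease in exposed versus unexposed people."""
    return risk_exposed / risk_unexposed


def odds_ratio_from_risks(risk_exposed, risk_unexposed):
    """Ratio of the odds of disease in exposed versus unexposed people."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical risks: a rare outcome (0.2% vs 0.1%) and a common one (40% vs 20%).
rare = (0.002, 0.001)
common = (0.4, 0.2)

rr_rare, or_rare = relative_risk(*rare), odds_ratio_from_risks(*rare)
rr_common, or_common = relative_risk(*common), odds_ratio_from_risks(*common)

print(f"Rare outcome:   RR = {rr_rare:.3f}, OR = {or_rare:.3f}")
print(f"Common outcome: RR = {rr_common:.3f}, OR = {or_common:.3f}")
```

For the rare outcome the two measures are almost identical, while for the common outcome the odds ratio is noticeably further from 1 than the relative risk.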

Things to watch out for!

Confounding factors.

Confounding factors or variables are variables other than the risk factor on which the case and control groups differ; age is the most common example. The possibility of unknown confounding variables makes it difficult to state categorically that a factor is a cause. Confounding should be addressed either in the design stage or with analytical techniques. In the design stage, confounding can be controlled for by restriction or matching. Many researchers prefer to handle it in the analysis phase with analytical techniques like logistic regression or by stratification with Mantel-Haenszel approaches.
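A minimal sketch of stratification with the Mantel-Haenszel summary odds ratio follows. Each stratum is a 2x2 table given as (exposed cases, exposed controls, unexposed cases, unexposed controls). The two age strata and their counts are invented to show how a crude odds ratio can suggest an association that disappears once the confounder is taken into account. The function names are hypothetical.

```python
def crude_odds_ratio(strata):
    """Odds ratio from the single 2x2 table obtained by pooling all strata."""
    a = sum(s[0] for s in strata)  # exposed cases
    b = sum(s[1] for s in strata)  # exposed controls
    c = sum(s[2] for s in strata)  # unexposed cases
    d = sum(s[3] for s in strata)  # unexposed controls
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel summary OR: sum(a*d/n) / sum(b*c/n) across strata."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Invented data: exposure is rare in the younger stratum and common in the
# older one, and disease is more common in the older stratum.
strata = [
    (10, 20, 40, 80),  # younger: stratum-specific OR = 1.0
    (80, 40, 20, 10),  # older:   stratum-specific OR = 1.0
]

crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"Crude OR = {crude:.2f}, Mantel-Haenszel OR = {adjusted:.2f}")
```

Here the crude odds ratio is well above 1 even though the odds ratio within each age stratum is exactly 1, so the apparent association is entirely due to age in this made-up example.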

Matching of cases and controls can eliminate the matched parameter as a cause of difference. Controls are matched to cases on the basis of certain characteristics, which are also known to be present in the cases. The purpose is to eliminate confounding variables (factors in addition to the risk factor that influence whether disease occurs). If such confounding factors are unevenly distributed between study groups, they can distort comparisons and the conclusions being made. Age is a common confounder (standardisation can be used). Matching should be used sparingly. The tendency is to match in the analysis of results rather than in the design stage. Overmatching occurs if a matched variable could, in fact, be an intermediate on a causal pathway. This would mask a disease association.

Bias is a systematic error in the estimate of an association between cause and effect. It may result from poor diagnosis or diagnostic criteria, poor choice of cases, poor choice of controls, or variation in the way risk exposure is measured in cases and controls.

Advantages

1. Case-control studies are quicker and cheaper, and require less effort, than cohort studies

2. Case-control studies can study rare diseases

3. Case-control studies can study multiple risk factors/exposures

4. They are useful for studying outcomes (diseases) that take a long time to develop, e.g. cancer


Disadvantages

1. Case-control studies are prone to selection and recall bias (i.e. better recollection of exposure among cases than among controls)

2. They are inefficient for examining rare exposures

3. It may be difficult to establish temporality (when the person was actually exposed to the disease/risk factor)

4. It can be difficult to choose an appropriate control group

5. Unlike cohort studies, case-control studies cannot calculate incidence rates, relative risks or attributable risks. Instead, the odds ratio is the measure of association used (when the outcome is uncommon, e.g. most cancers, it can be a good proxy for the true relative risk)

Calculating Sample Size for Case Control Studies

1. Size of 'effect' to be detected

2. Statistical significance level

3. Power of study (usually 0.8 or 0.9)

4. Ratio of 1 group to the other (exposed Vs unexposed; cases Vs controls)
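The four inputs above can be combined in a standard normal-approximation formula for comparing two proportions with unequal group sizes. The sketch below is one common textbook form, not necessarily the method intended by the original author; the function name, the parameter names and the example values (20% exposure among controls, target odds ratio of 2) are assumptions made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_needed(p0: float, odds_ratio: float, alpha: float = 0.05,
                 power: float = 0.8, controls_per_case: float = 1.0) -> int:
    """Approximate number of cases for an unmatched case-control study.

    p0 is the assumed exposure prevalence among controls; odds_ratio is the
    size of effect to be detected.
    """
    # Exposure prevalence among cases implied by p0 and the target odds ratio.
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    r = controls_per_case
    p_bar = (p1 + r * p0) / (1 + r)  # pooled exposure prevalence

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)

    term_null = z_alpha * sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
    term_alt = z_beta * sqrt(p1 * (1 - p1) + p0 * (1 - p0) / r)
    return ceil((term_null + term_alt) ** 2 / (p1 - p0) ** 2)


# Hypothetical planning scenario.
for ratio in (1, 2, 4):
    n = cases_needed(p0=0.20, odds_ratio=2.0, controls_per_case=ratio)
    print(f"{ratio} control(s) per case: about {n} cases, {n * ratio} controls")
```

Raising the power, lowering the significance level, or shrinking the effect to be detected all increase the required number of cases; adding controls per case reduces it.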

Further Reading

Schulz, K.F. & Grimes, D.A. (2002) "Case-control studies: research in reverse". The Lancet, 359, 431-34.

Pearce, N. (2012) "Classification of Epidemiological Studies". Int J Epidemiol, 41(2), 393-397. (This article argues that there are really only four types of epidemiological study: incidence studies, prevalence studies, incidence case-control studies and prevalence case-control studies. They differ in the outcome measured and in whether or not sampling is based on the outcome.)

Case-Control Studies



Cohort studies have an intuitive logic to them, but they can be very problematic when one is investigating outcomes that only occur in a small fraction of exposed and unexposed individuals. They can also be problematic when it is expensive or very difficult to obtain exposure information from a cohort. In these situations a case-control design offers an alternative that is much more efficient. The goal of a case-control study is the same as that of cohort studies, i.e., to estimate the magnitude of association between an exposure and an outcome. However, case-control studies employ a different sampling strategy that gives them greater efficiency.

Learning Objectives

After completing this module, the student will be able to:

Overview of Case-Control Design

In the module entitled Overview of Analytic Studies it was noted that Rothman describes the case-control strategy as follows:

"Case-control studies are best understood by considering as the starting point a source population, which represents a hypothetical study population in which a cohort study might have been conducted. The source population is the population that gives rise to the cases included in the study. If a cohort study were undertaken, we would define the exposed and unexposed cohorts (or several cohorts) and from these populations obtain denominators for the incidence rates or risks that would be calculated for each cohort. We would then identify the number of cases occurring in each cohort and calculate the risk or incidence rate for each. In a case-control study the same cases are identified and classified as to whether they belong to the exposed or unexposed cohort. Instead of obtaining the denominators for the rates or risks, however, a control group is sampled from the entire source population that gives rise to the cases. Individuals in the control group are then classified into exposed and unexposed categories. The purpose of the control group is to determine the relative size of the exposed and unexposed components of the source population. Because the control group is used to estimate the distribution of exposure in the source population, the cardinal requirement of control selection is that the controls be sampled independently of exposure status."

To illustrate this consider the following hypothetical scenario in which the source population is the state of Massachusetts. Diseased individuals are red, and non-diseased individuals are blue. Exposed individuals are indicated by a whitish midsection. Note the following aspects of the depicted scenario:

Map of Massachusetts with thousands of icon people overlaid. A very small percentage of them are identified as having a rare disease.

If we somehow had exposure and outcome information on all of the subjects in the source population and looked at the association using a cohort design, we might find the data summarized in the contingency table below.

In this hypothetical example, we have data on all 6,000,000 people in the source population, and we could compute the probability of disease (i.e., the risk or incidence) in both the exposed group and the non-exposed group, because we have the denominators for both the exposed and non-exposed groups.

The table above summarizes all of the necessary information regarding exposure and outcome status for the population and enables us to compute a risk ratio as a measure of the strength of the association. Intuitively, we compute the probability of disease (the risk) in each exposure group and then compute the risk ratio as follows:
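The calculation can be sketched with the counts quoted later in this section: 1,300 cases, of whom 700 were exposed and 600 unexposed. The exposed and unexposed population sizes used here (1,000,000 and 5,000,000) are inferred from the stated 1,000 : 5,000 exposure split of a 6,000,000 population, so treat them as an assumption.

```python
# Cases by exposure status, as quoted in the text.
exposed_cases = 700
unexposed_cases = 600

# Assumed denominators: a 1:5 exposed-to-unexposed split of 6,000,000 people.
exposed_total = 1_000_000
unexposed_total = 5_000_000

# Risk = probability of disease within each exposure group.
risk_exposed = exposed_cases / exposed_total
risk_unexposed = unexposed_cases / unexposed_total

# Risk ratio = risk in the exposed divided by risk in the unexposed.
risk_ratio = risk_exposed / risk_unexposed

print(f"risk in exposed   = {risk_exposed:.5f}")
print(f"risk in unexposed = {risk_unexposed:.5f}")
print(f"risk ratio        = {risk_ratio:.2f}")
```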

The problem, of course, is that we usually don't have the resources to get the data on all subjects in the population. If we took a random sample of even 5-10% of the population, we would have few diseased people in our sample, certainly not enough to produce a reasonably precise measure of association. Moreover, we would expend an inordinate amount of effort and money collecting exposure and outcome data on a large number of people who would not develop the outcome.

We need a method that allows us to retain all the people in the numerator of disease frequency (diseased people or "cases") but allows us to collect information from only a small proportion of the people that make up the denominator (population, or "controls"), most of whom do not have the disease of interest. The case-control design allows us to accomplish this. We identify and collect exposure information on all the cases, but identify and collect exposure information on only a sample of the population. Once we have the exposure information, we can assign subjects to the numerator and denominator of the exposed and unexposed groups. This is what Rothman means when he says,

"The purpose of the control group is to determine the relative size of the exposed and unexposed components of the source population."

In the above example, we would have identified all 1,300 cases, determined their exposure status, and ended up categorizing 700 as exposed and 600 as unexposed. We might have randomly sampled 6,000 members of the population (instead of 6 million) in order to determine the exposure distribution in the total population. If our sampling method was random, we would expect that about 1,000 would be exposed and 5,000 unexposed (the same ratio as in the overall population). We calculate a measure similar to the risk ratio above, but substitute a sample of the population ("controls") for the whole population in the denominator:

Note that when we take a sample of the population, we no longer have a measure of disease frequency, because the denominator no longer represents the population. Therefore, we can no longer compute the probability or rate of disease incidence in each exposure group. We also can't calculate a risk or rate difference measure for the same reason. However, as we have seen, we can compute the relative probability of disease in the exposed vs. unexposed group. The term generally used for this measure is an odds ratio, described in more detail later in the module.
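Using the numbers given above (700 exposed and 600 unexposed cases; an expected 1,000 exposed and 5,000 unexposed among 6,000 sampled controls), the measure can be worked through as follows. This is a minimal sketch and the variable names are illustrative.

```python
# All cases, classified by exposure.
exposed_cases, unexposed_cases = 700, 600

# Expected exposure split in a random sample of 6,000 controls.
exposed_controls, unexposed_controls = 1_000, 5_000

# These quotients are NOT risks: the 6,000 controls are only a sample, so the
# "denominators" carry information about relative group size and nothing more.
pseudo_rate_exposed = exposed_cases / exposed_controls
pseudo_rate_unexposed = unexposed_cases / unexposed_controls

# Their ratio is the odds ratio.
odds_ratio = pseudo_rate_exposed / pseudo_rate_unexposed
print(f"odds ratio = {odds_ratio:.2f}")
```

Because the controls reproduce the 1:5 exposure split of the source population, the sampling fraction cancels out of the ratio, which is why only the relative measure survives.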

Consequently, when the outcome is uncommon, as in this case, the risk ratio can be estimated much more efficiently by using a case-control design. One would focus first on finding an adequate number of cases in order to determine the ratio of exposed to unexposed cases. Then, one only needs to take a sample of the population in order to estimate the relative size of the exposed and unexposed components of the source population. Note that if one can identify all of the cases that were reported to a registry or other database within a defined period of time, then it is possible to compute an estimate of the incidence of disease if the size of the population is known from census data. While this is conceptually possible, it is rarely done, and we will not discuss it further in this course.


A Nested Case-Control Study

Suppose a prospective cohort study were conducted among almost 90,000 women for the purpose of studying the determinants of cancer and cardiovascular disease. After enrollment, the women provide baseline information on a host of exposures, and they also provide baseline blood and urine samples that are frozen for possible future use. The women are then followed, and, after about eight years, the investigators want to test the hypothesis that past exposure to pesticides such as DDT is a risk factor for breast cancer. Eight years have passed since the beginning of the study, and 1,439 women in the cohort have developed breast cancer. Since they froze blood samples at baseline, they have the option of analyzing all of the blood samples in order to ascertain exposure to DDT at the beginning of the study before any cancers occurred. The problem is that there are almost 90,000 women and it would cost $20 to analyze each of the blood samples. If the investigators could have analyzed all 90,000 samples, they would have found the results in the table below.

Table of Breast Cancer Occurrence Among Women With or Without DDT Exposure

If they had been able to afford analyzing all of the baseline blood specimens in order to categorize the women as having had DDT exposure or not, they would have found a risk ratio = 1.87 (95% confidence interval: 1.66-2.10). The problem is that this would have cost almost $1.8 million, and the investigators did not have the funding to do this.

While 1,439 breast cancers is a disturbing number, it is only 1.6% of the entire cohort, so the outcome is relatively rare, and it is costing a lot of money to analyze the blood specimens obtained from all of the non-diseased women. There is, however, another more efficient alternative, i.e., to use a case-control sampling strategy. One could analyze all of the blood samples from women who had developed breast cancer, but only a sample of the whole cohort in order to estimate the exposure distribution in the population that produced the cases.

If one were to analyze the blood samples of 2,878 of the non-diseased women (twice as many as the number of cases), one would obtain results that would look something like those in the next table.

Odds of Exposure: 360/1079 in the cases versus 432/2,446 in the non-diseased controls.

Total samples analyzed = 1,439 + 2,878 = 4,317

Total cost = 4,317 x $20 = $86,340

With this approach a similar estimate of risk was obtained after analyzing blood samples from only a small sample of the entire population at a fraction of the cost with hardly any loss in precision. In essence, a case-control strategy was used, but it was conducted within the context of a prospective cohort study. This is referred to as a case-control study "nested" within a cohort study.
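The "similar estimate" can be checked directly from the odds of exposure quoted above (360/1,079 in the cases and 432/2,446 in the controls). A minimal sketch with illustrative variable names:

```python
# Nested case-control counts quoted in the text.
exposed_cases, unexposed_cases = 360, 1_079
exposed_controls, unexposed_controls = 432, 2_446

odds_exposure_cases = exposed_cases / unexposed_cases
odds_exposure_controls = exposed_controls / unexposed_controls
odds_ratio = odds_exposure_cases / odds_exposure_controls

# Risk ratio the text reports for the full cohort analysis.
full_cohort_risk_ratio = 1.87

print(f"odds of exposure, cases    = {odds_exposure_cases:.3f}")
print(f"odds of exposure, controls = {odds_exposure_controls:.3f}")
print(f"odds ratio                 = {odds_ratio:.2f}")
print(f"full-cohort risk ratio     = {full_cohort_risk_ratio:.2f}")
```

The odds ratio from the sampled controls lands close to the risk ratio of 1.87 that analyzing every specimen would have given.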

Rothman states that one should look upon all case-control studies as being "nested" within a cohort. In other words the cohort represents the source population that gave rise to the cases. With a case-control sampling strategy one simply takes a sample of the population in order to obtain an estimate of the exposure distribution within the population that gave rise to the cases. Obviously, this is a much more efficient design.

It is important to note that, unlike cohort studies, case-control studies do not follow subjects through time. Cases are enrolled at the time they develop disease and controls are enrolled at the same time. The exposure status of each is determined, but they are not followed into the future for further development of disease.

As with cohort studies, case-control studies can be prospective or retrospective. At the start of the study, all cases might have already occurred, and then this would be a retrospective case-control study. Alternatively, none of the cases might have already occurred, and new cases will be enrolled prospectively. Epidemiologists generally prefer the prospective approach because it has fewer biases, but it is more expensive and sometimes not possible. When conducted prospectively, or when nested in a prospective cohort study, it is straightforward to select controls from the population at risk. However, in retrospective case-control studies, it can be difficult to select from the population at risk, and controls are then selected from those in the population who didn't develop disease. Using only the non-diseased to select controls as opposed to the whole population means the denominator is not really a measure of disease frequency, but when the disease is rare, the odds ratio using the non-diseased will be very similar to the estimate obtained when the entire population is used to sample for controls. This phenomenon is known as the rare-disease assumption. When case-control studies were first developed, most were conducted retrospectively, and it is sometimes assumed that the rare-disease assumption applies to all case-control studies. However, it actually only applies to those case-control studies in which controls are sampled only from the non-diseased rather than the whole population.

The difference between sampling from the whole population and only the non-diseased is that the whole population contains people both with and without the disease of interest. This means that a sampling strategy that uses the whole population as its source must allow for the fact that people who develop the disease of interest can be selected as controls. Students often have a difficult time with this concept. It is helpful to remember that it seems natural that the population denominator includes people who develop the disease in a cohort study. If a case-control study is a more efficient way to obtain the information from a cohort study, then perhaps it is not so strange that the denominator in a case-control study also can include people who develop the disease. This topic is covered in more detail in EP813 Intermediate Epidemiology.

Retrospective and Prospective Case-Control Studies

Students usually think of case-control studies as being only retrospective, since the investigators enroll subjects who have developed the outcome of interest. However, case-control studies, like cohort studies, can be either retrospective or prospective. In a prospective case-control study, the investigator still enrolls based on outcome status, but the investigator must wait for the cases to occur.

When is a Case-Control Study Desirable?

Given the greater efficiency of case-control studies, they are particularly advantageous in the following situations:

Another advantage of their greater efficiency, of course, is that they are less time-consuming and much less costly than prospective cohort studies.

The DES Case-Control Study

A classic example of the efficiency of the case-control approach is the study (Herbst et al., N. Engl. J. Med. 1971;284:878-81) that linked in-utero exposure to diethylstilbestrol (DES) with subsequent development of vaginal cancer 15-22 years later. In the late 1960s, physicians at MGH identified a very unusual cancer cluster. Eight young women between the ages of 15 and 22 were found to have cancer of the vagina, an uncommon cancer even in elderly women. The cluster of cases in young women was initially reported as a case series, but there were no strong hypotheses about the cause.

In retrospect, the cause was in-utero exposure to DES. After World War II, DES started being prescribed for women who were having troubles with a pregnancy -- if there were signs suggesting the possibility of a miscarriage, DES was frequently prescribed. It has been estimated that between 1945-1950 DES was prescribed for about 20% of all pregnancies in the Boston area. Thus, the unborn fetus was exposed to DES in utero, and in a very small percentage of cases this resulted in development of vaginal cancer when the child was 15-22 years old (a very long latent period). There were several reasons why a case-control study was the only feasible way to identify this association: the disease was extremely rare (even in subjects who had been exposed to DES), there was a very long latent period between exposure and development of disease, and initially they had no idea what was responsible, so there were many possible exposures to consider.

In this situation, a case-control study was the only reasonable approach to identify the causative agent. Given how uncommon the outcome was, even a large prospective study would have been unlikely to have more than one or two cases, even after 15-20 years of follow-up. Similarly, a retrospective cohort study might have been successful in enrolling a large number of subjects, but the outcome of interest was so uncommon that few, if any, subjects would have had it. In contrast, a case-control study was conducted in which eight known cases and 32 age-matched controls provided information on many potential exposures. This strategy ultimately allowed the investigators to identify a highly significant association between the mother's treatment with DES during pregnancy and the eventual development of adenocarcinoma of the vagina in their daughters (in-utero at the time of exposure) 15 to 22 years later.

For more information see the DES Fact Sheet from the National Cancer Institute.

An excellent summary of this landmark study and the long-range effects of DES can be found in a Perspective article in the New England Journal of Medicine. A cohort of both mothers who took DES and their children (daughters and sons) was later formed to look for more common outcomes. Members of the faculty at BUSPH are on the team of investigators that follow this cohort for a variety of outcomes, particularly reproductive consequences and other cancers.

Selecting & Defining Cases and Controls

The "case" definition.

Careful thought should be given to the case definition to be used. If the definition is too broad or vague, it is easier to capture people with the outcome of interest, but a loose case definition will also capture people who do not have the disease. On the other hand, if an overly restrictive case definition is employed, fewer cases will be captured, and the sample size may be limited. Investigators frequently wrestle with this problem during outbreak investigations. Initially, they will often use a somewhat broad definition in order to identify potential cases. However, as an outbreak investigation progresses, there is a tendency to narrow the case definition to make it more precise and specific, for example by requiring confirmation of the diagnosis by laboratory testing. In general, investigators conducting case-control studies should thoughtfully construct a definition that is as clear and specific as possible without being overly restrictive.

Investigators studying chronic diseases generally prefer newly diagnosed cases, because they tend to be more motivated to participate, may remember relevant exposures more accurately, and because it avoids complicating factors related to selection of longer duration (i.e., prevalent) cases. However, it is sometimes impossible to have an adequate sample size if only recent cases are enrolled.

Sources of Cases

Typical sources for cases include:

Selection of the Controls

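As noted above, it is always useful to think of a case-control study as being nested within some sort of a cohort, i.e., a source population that produced the cases that were identified and enrolled. In view of this there are two key principles that should be followed in selecting controls: (1) the controls should be representative of the source population that produced the cases, and (2) the controls must be sampled independently of exposure status, so that their selection is neither more nor less likely if they have the exposure of interest.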

If either of these principles is not adhered to, selection bias can result (as discussed in detail in the module on Bias).


Note that in the earlier example of a case-control study conducted in the Massachusetts population, we specified that our sampling method was random so that exposed and unexposed members of the population had an equal chance of being selected. Therefore, we would expect that about 1,000 would be exposed and 5,000 unexposed (the same ratio as in the whole population), and we came up with an odds ratio that was the same as the hypothetical risk ratio we would have had if we had collected exposure information from the whole population of six million:

What if we had instead been more likely to sample those who were exposed, so that we instead found 1,500 exposed and 4,500 unexposed among the 6,000 controls? Then the odds ratio would have been:

This odds ratio is biased because it differs from the true odds ratio. In this case, the bias stemmed from the fact that we violated the second principle in selection of controls. Depending on which category is over- or under-sampled, this type of bias can result in either an underestimate or an overestimate of the true association.
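The two scenarios can be compared side by side with the counts given in the text (700 exposed and 600 unexposed cases in both; controls split 1,000 / 5,000 under random sampling and 1,500 / 4,500 when exposed people are over-sampled). This is a minimal sketch; the function name is illustrative.

```python
def odds_ratio(exposed_cases: int, unexposed_cases: int,
               exposed_controls: int, unexposed_controls: int) -> float:
    """Odds of exposure among cases divided by odds of exposure among controls."""
    return (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)


# Controls sampled independently of exposure (matches the population split).
unbiased_or = odds_ratio(700, 600, 1_000, 5_000)

# Controls over-sampled from the exposed group.
biased_or = odds_ratio(700, 600, 1_500, 4_500)

print(f"random control sampling:     OR = {unbiased_or:.2f}")
print(f"exposed controls oversampled: OR = {biased_or:.2f}")
```

Over-sampling exposed controls inflates the control exposure odds and pulls the odds ratio toward the null, so here the true association is underestimated.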

A hypothetical case-control study was conducted to determine whether lower socioeconomic status (the exposure) is associated with a higher risk of cervical cancer (the outcome). The "cases" consisted of 250 women with cervical cancer who were referred to Massachusetts General Hospital for treatment. They were referred from all over the state. The cases were asked a series of questions relating to socioeconomic status (household income, employment, education, etc.). The investigators identified control subjects by going door-to-door in the community around MGH from 9:00 AM to 5:00 PM. Many residents were not home, but the investigators persisted and eventually enrolled enough controls. The problem is that the controls were selected by a different mechanism than the cases, AND the selection mechanism may have tended to select individuals of different socioeconomic status, since women who were at home may have been somewhat more likely to be unemployed. In other words, the controls were more likely to be enrolled (selected) if they had the exposure of interest (lower socioeconomic status).


Sources for "Controls"

Population Controls:

A population-based case-control study is one in which the cases come from a precisely defined population, such as a fixed geographic area, and the controls are sampled directly from the same population. In this situation cases might be identified from a state cancer registry, for example, and the comparison group would logically be selected at random from the same source population. Population controls can be identified from voter registration lists, tax rolls, drivers license lists, and telephone directories or by "random digit dialing". Population controls may also be more difficult to obtain, however, because of lack of interest in participating, and there may be recall bias, since population controls are generally healthy and may remember past exposures less accurately.

Example of a Population-based Case-Control Study: Rollison et al. reported on a "Population-based Case-Control Study of Diabetes and Breast Cancer Risk in Hispanic and Non-Hispanic White Women Living in US Southwestern States" (Citation: Am J Epidemiol 2008;167:447–456).

"Briefly, a population-based case-control study of breast cancer was conducted in Colorado, New Mexico, Utah, and selected counties of Arizona. For investigation of differences in the breast cancer risk profiles of non-Hispanic Whites and Hispanics, sampling was stratified by race/ethnicity, and only women who self-reported their race as non-Hispanic White, Hispanic, or American Indian were eligible, with the exception of American Indian women living on reservations. Women diagnosed with histologically confirmed breast cancer between October 1999 and May 2004 (International Classification of Diseases for Oncology codes C50.0–C50.6 and C50.8–C50.9) were identified as cases through population-based cancer registries in each state."

"Population-based controls were frequency-matched to cases in 5-year age groups. In New Mexico and Utah, control participants under age 65 years were randomly selected from driver's license lists; in Arizona and Colorado, controls were randomly selected from commercial mailing lists, since driver's license lists were unavailable. In all states, women aged 65 years or older were randomly selected from the lists of the Centers for Medicare and Medicaid Services (Social Security lists). Of all women contacted, 68 percent of cases and 42 percent of controls participated in the study."

"Odds ratios and 95% confidence intervals were calculated using logistic regression, adjusting for age, body mass index at age 15 years, and parity. Having any type of diabetes was not associated with breast cancer overall (odds ratio = 0.94, 95% confidence interval: 0.78, 1.12). Type 2 diabetes was observed among 19% of Hispanics and 9% of non-Hispanic Whites but was not associated with breast cancer in either group."

In this example, it is clear that the controls were selected from the source population (principle 1), but less clear that they were enrolled independent of exposure status (principle 2), both because drivers' licenses were used for selection and because the participation rate among controls was low. These factors would only matter if they impacted on the estimate of the proportion of the population who had diabetes.

Hospital or Clinic Controls:


The advantages of using controls who are patients from the same facility are:

Example: Several years ago the vascular surgeons at Boston Medical Center wanted to study risk factors for severe atherosclerosis of the lower extremities. The cases were patients who were referred to the hospital for elective surgery to bypass severe atherosclerotic blockages in the arteries to the legs. The controls consisted of patients who were admitted to the same hospital for elective joint replacement of the hip or knee. The patients undergoing joint replacement were similar in age and they also were following the same referral pathways. In other words, they met the "would" criterion: if one of the joint replacement surgery patients had developed severe atherosclerosis in their leg arteries, they would have been referred to the same hospital.

Friend, Neighbor, Spouse, and Relative Controls:

Occasionally investigators will ask cases to nominate controls who are in one of these categories, because they have similar characteristics, such as genotype, socioeconomic status, or environment, i.e., factors that can cause confounding, but are hard to measure and adjust for. By matching cases and controls on these factors, confounding by these factors will be controlled. However, one must be careful that the controls satisfy the two fundamental principles. Often, they do not.

How Many Controls?

Since case-control studies are often used for uncommon outcomes, investigators often have a limited number of cases but a plentiful supply of potential controls. In this situation the statistical power of the study can be increased somewhat by enrolling more controls than cases. However, the additional power that is achieved diminishes as the ratio of controls to cases increases, and ratios greater than 4:1 have little additional impact on power. Consequently, if it is time-consuming or expensive to collect data on controls, the ratio of controls to cases should be no more than 4:1. However, if the data on controls is easily obtained, there is no reason to limit the number of controls.
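One way to see the diminishing return is to look at the standard error of the log odds ratio as the control-to-case ratio grows while the number of cases stays fixed. The sketch below uses the usual large-sample (Woolf) variance with expected cell counts; the 100 cases and the exposure proportions of 35% in cases and 20% in controls are hypothetical values chosen for illustration.

```python
from math import sqrt


def se_log_odds_ratio(n_cases: int, controls_per_case: int,
                      p_cases: float, p_controls: float) -> float:
    """Large-sample standard error of log(OR) using expected cell counts."""
    n_controls = n_cases * controls_per_case
    return sqrt(
        1 / (n_cases * p_cases) + 1 / (n_cases * (1 - p_cases))
        + 1 / (n_controls * p_controls) + 1 / (n_controls * (1 - p_controls))
    )


# Hypothetical study: 100 cases, exposure in 35% of cases and 20% of controls.
standard_errors = {
    ratio: se_log_odds_ratio(100, ratio, 0.35, 0.20) for ratio in range(1, 9)
}

for ratio, se in standard_errors.items():
    print(f"{ratio}:1 controls per case -> SE(log OR) = {se:.3f}")
```

The part of the variance contributed by the cases does not shrink however many controls are added, which is why precision levels off as the ratio climbs.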

Methods of Control Sampling

There are three strategies for selecting controls that are best explained by considering the nested case-control study described on page 3 of this module:

The Rare Outcome Assumption

It is often said that an odds ratio provides a good estimate of the risk ratio only when the outcome of interest is rare, but this is only true when survivor sampling is used. With case-base sampling or risk set sampling, the odds ratio will provide a good estimate of the risk ratio regardless of the frequency of the outcome, because the controls will provide an accurate estimate of the distribution in the source population (i.e., not just in non-diseased people).
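The contrast between survivor sampling and case-base sampling can be sketched with expected values from a hypothetical closed cohort. The cohort sizes and case counts below are invented; the function simply compares the exposure odds that controls would show, on average, under each sampling scheme. Risk set sampling is not modelled here.

```python
from typing import Tuple


def expected_ratios(exposed_n: int, exposed_cases: int,
                    unexposed_n: int, unexposed_cases: int) -> Tuple[float, float, float]:
    """Return (risk ratio, OR with survivor sampling, OR with case-base sampling)."""
    risk_ratio = (exposed_cases / exposed_n) / (unexposed_cases / unexposed_n)
    case_exposure_odds = exposed_cases / unexposed_cases

    # Survivor sampling: controls drawn only from people who stayed disease-free.
    survivor_odds = (exposed_n - exposed_cases) / (unexposed_n - unexposed_cases)

    # Case-base sampling: controls drawn from the whole cohort at baseline.
    case_base_odds = exposed_n / unexposed_n

    return (risk_ratio,
            case_exposure_odds / survivor_odds,
            case_exposure_odds / case_base_odds)


# Hypothetical cohorts: one with a common outcome, one with a rare outcome.
common = expected_ratios(1_000, 300, 1_000, 100)
rare = expected_ratios(100_000, 300, 100_000, 100)

for label, (rr, or_survivor, or_case_base) in (("common", common), ("rare", rare)):
    print(f"{label} outcome: RR = {rr:.2f}, "
          f"survivor-sampling OR = {or_survivor:.2f}, "
          f"case-base OR = {or_case_base:.2f}")
```

With the common outcome the survivor-sampling odds ratio overstates the risk ratio, while the case-base version matches it in both scenarios.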

More on Selection Bias

Always consider the source population for case-control studies, i.e. the "population" that generated the cases. The cases are always identified and enrolled by some method or a set of procedures or circumstances. For example, cases with a certain disease might be referred to a particular tertiary hospital for specialized treatment. Alternatively, if there is a database or a disease registry for a geographic area, cases might be selected at random from the database. The key to avoiding selection bias is to select the controls by a similar, if not identical, mechanism in order to ensure that the controls provide an accurate representation of the exposure status of the source population.

Example 1: In the first example above, in which cases were randomly selected from a geographically defined database, the source population is also defined geographically, so it would make sense to select population controls by some random method. In contrast, if one enrolled controls from a particular hospital within the geographic area, one would have to at least consider whether the controls were inherently more or less likely to have the exposure of interest. If so, they would not provide an accurate estimate of the exposure distribution of the source population, and selection bias would result.

Example 2: In the second example above, the source population was defined by the patterns of referral to a particular hospital for a particular disease. In order for the controls to be representative of the "population" that produced those cases, the controls should be selected by a similar mechanism, e.g., by contacting the referring health care providers and asking them to provide the names of potential controls. By this mechanism, one can ensure that the controls are representative of the source population, because if they had had the disease of interest they would have been just as likely as the cases to have been included in the case group (thus fulfilling the "would" criterion).

Example 3: A food handler at a delicatessen who is infected with hepatitis A virus is responsible for an outbreak of hepatitis which is largely confined to the surrounding community from which most of the customers come. Many (but not all) of the infected cases are identified by passive and active surveillance. How should controls be selected? In this situation, one might guess that the likelihood of people going to the delicatessen would be heavily influenced by their proximity to it, and this would to a large extent define the source population. In a case-control study undertaken to identify the source, the delicatessen is one of the exposures being tested. Consequently, even if the cases were reported to the state-wide surveillance system, it would not be appropriate to randomly select controls from the state, the county, or even the town where the delicatessen is located. In other words, the "would" criterion doesn't work here, because anyone in the state with clinical hepatitis would end up in the surveillance system, but someone who lived far from the deli would have a much lower likelihood of having the exposure. A better approach would be to select controls who were matched to the cases by neighborhood, age, and gender. These controls would have similar access to go to the deli if they chose to, and they would therefore be more representative of the source population.

Analysis of Case-Control Studies

The computation and interpretation of the odds ratio in a case-control study have already been discussed in the modules on Overview of Analytic Studies and Measures of Association. Additionally, one can compute the confidence interval for the odds ratio, and statistical significance can be evaluated by using a chi-square test (or Fisher's Exact Test if the sample size is small) to compute a p-value. These calculations can be done using the Case-Control worksheet in the Excel file called EpiTools.XLS.

Image of the Case-Control worksheet in the Epi_Tools file
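For readers who prefer code to a spreadsheet, the calculations named above can be sketched in Python using only the standard library. This is a minimal sketch under assumptions, not a reproduction of the EpiTools worksheet: the confidence interval uses the common log-based (Woolf) method, the chi-square test is computed without a continuity correction, the two-sided Fisher p-value sums all tables no more probable than the observed one, and the 2x2 counts are invented for illustration. The worksheet may use different formulas.

```python
import math

# 2x2 table layout assumed throughout:
#              cases   controls
# exposed        a        b
# unexposed      c        d

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate 95% CI (log/Woolf method)."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def chi_square_test(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) and its p-value."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value from the hypergeometric distribution."""
    exposed, n_cases, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Probability of x exposed cases given the fixed margins.
        return (math.comb(n_cases, x) * math.comb(n - n_cases, exposed - x)
                / math.comb(n, exposed))

    x_min = max(0, exposed + n_cases - n)
    x_max = min(exposed, n_cases)
    p_observed = prob(a)
    total = sum(prob(x) for x in range(x_min, x_max + 1)
                if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, total)

# Invented counts: 30 exposed cases, 20 exposed controls,
# 70 unexposed cases, 80 unexposed controls.
a, b, c, d = 30, 20, 70, 80
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(a, b, c, d)
chi2, p_chi2 = chi_square_test(a, b, c, d)
p_fisher = fisher_exact_two_sided(a, b, c, d)
print(f"OR = {odds_ratio:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
print(f"chi-square = {chi2:.2f}, p = {p_chi2:.3f}; Fisher p = {p_fisher:.3f}")
```

With these made-up counts the confidence interval includes 1, so the association would not be statistically significant at the 0.05 level. Note that the log-based interval and the chi-square test break down when any cell is zero or very small, which is the situation where Fisher's Exact Test is preferred.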

Advantages and Disadvantages of Case-Control Studies



