Hypothesis Testing – A Deep Dive into Hypothesis Testing, The Backbone of Statistical Inference

  • September 21, 2023

Explore the intricacies of hypothesis testing, a cornerstone of statistical analysis. Dive into methods, interpretations, and applications for making data-driven decisions.


In this blog post, we will learn:

  • What is Hypothesis Testing?
  • Steps in Hypothesis Testing: set up hypotheses (null and alternative), choose a significance level (α), calculate a test statistic and p-value, make a decision
  • Example: testing a new drug
  • Example in Python

1. What is Hypothesis Testing?

In simple terms, hypothesis testing is a method used to make decisions or inferences about population parameters based on sample data. Imagine being handed a die and asked whether it’s biased. By rolling it a few times and analyzing the outcomes, you’d be engaging in the essence of hypothesis testing.

Think of hypothesis testing as the scientific method of the statistics world. Suppose you hear claims like “This new drug works wonders!” or “Our new website design boosts sales.” How do you know if these statements hold water? Enter hypothesis testing.

2. Steps in Hypothesis Testing

  • Set up Hypotheses: Begin with a null hypothesis (H0) and an alternative hypothesis (Ha).
  • Choose a Significance Level (α): Typically 0.05, this is the probability of rejecting the null hypothesis when it’s actually true. Think of it as the chance of accusing an innocent person.
  • Calculate a Test Statistic and P-Value: Gather evidence (data) and calculate a test statistic.
  • P-value: This is the probability of observing data at least as extreme as yours, given that the null hypothesis is true. A small p-value (typically ≤ 0.05) suggests the data is inconsistent with the null hypothesis.
  • Decision Rule: If the p-value is less than or equal to α, you reject the null hypothesis in favor of the alternative.
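These steps can be sketched end-to-end in Python. This is a minimal illustration, not code from the article: the sample data, the hypothesized mean, and α are assumed values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Step 1: set up hypotheses. H0: population mean = 100; Ha: mean != 100.
mu_0 = 100.0

# Step 2: choose a significance level.
alpha = 0.05

# Illustrative sample, drawn here from a population whose true mean is 108,
# so H0 is false by construction.
sample = rng.normal(loc=108.0, scale=10.0, size=50)

# Step 3: calculate a test statistic and p-value (one-sample t-test).
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu_0)

# Step 4: make a decision.
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```

The decision line is exactly the decision rule from the last bullet: compare the p-value to α and reject only when it falls at or below it.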

2.1. Set up Hypotheses: Null and Alternative

Before diving into testing, we must formulate hypotheses. The null hypothesis (H0) represents the default assumption, while the alternative hypothesis (H1) challenges it.

For instance, in drug testing: H0: “The new drug is no better than the existing one.” H1: “The new drug is superior.”

2.2. Choose a Significance Level (α)

You collect and analyze data to test the H0 and H1 hypotheses. Based on your analysis, you decide whether to reject the null hypothesis in favor of the alternative, or fail to reject it.

The significance level, often denoted by $α$, represents the probability of rejecting the null hypothesis when it is actually true.

In other words, it’s the risk you’re willing to take of making a Type I error (false positive).

Type I Error (False Positive):

  • Symbolized by the Greek letter alpha (α).
  • Occurs when you incorrectly reject a true null hypothesis . In other words, you conclude that there is an effect or difference when, in reality, there isn’t.
  • The probability of making a Type I error is denoted by the significance level of a test. Commonly, tests are conducted at the 0.05 significance level , which means there’s a 5% chance of making a Type I error .
  • Commonly used significance levels are 0.01, 0.05, and 0.10, but the choice depends on the context of the study and the level of risk one is willing to accept.

Example: If a drug is not effective (the truth), but a clinical trial incorrectly concludes that it is effective (based on the sample data), then a Type I error has occurred.
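The 5% figure can be checked by simulation: if the null hypothesis is true and we test at α = 0.05, we should reject it in roughly 5% of repeated experiments. A rough sketch (the group sizes and distributions are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_trials = 2000

false_positives = 0
for _ in range(n_trials):
    # Both groups are drawn from the SAME distribution, so H0 is true by construction.
    drug = rng.normal(loc=50.0, scale=5.0, size=30)
    placebo = rng.normal(loc=50.0, scale=5.0, size=30)
    _, p = stats.ttest_ind(drug, placebo)
    if p <= alpha:
        false_positives += 1  # a Type I error: rejecting a true H0

type_i_rate = false_positives / n_trials
print(f"Observed Type I error rate: {type_i_rate:.3f}")  # should land close to 0.05
```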

Type II Error (False Negative):

  • Symbolized by the Greek letter beta (β).
  • Occurs when you fail to reject a false null hypothesis. This means you conclude there is no effect or difference when, in reality, there is.
  • The probability of making a Type II error is denoted by β. The power of a test (1 – β) represents the probability of correctly rejecting a false null hypothesis.

Example: If a drug is effective (the truth), but a clinical trial incorrectly concludes that it is not effective (based on the sample data), then a Type II error has occurred.

Balancing the Errors:


In practice, there’s a trade-off between Type I and Type II errors. Reducing the risk of one typically increases the risk of the other. For example, if you want to decrease the probability of a Type I error (by setting a lower significance level), you might increase the probability of a Type II error unless you compensate by collecting more data or making other adjustments.

It’s essential to understand the consequences of both types of errors in any given context. In some situations, a Type I error might be more severe, while in others, a Type II error might be of greater concern. This understanding guides researchers in designing their experiments and choosing appropriate significance levels.
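The trade-off can be made concrete with a small simulation: for a fixed sample size and effect, lowering α from 0.05 to 0.01 raises β, the Type II error rate. All the numbers below (effect size, spread, sample size) are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials = 2000

def type_ii_rate(alpha, effect=3.0, n=20):
    """Estimate beta: the fraction of trials that fail to reject H0
    even though a real effect exists."""
    misses = 0
    for _ in range(n_trials):
        control = rng.normal(50.0, 5.0, n)
        treated = rng.normal(50.0 + effect, 5.0, n)  # H0 is false by construction
        _, p = stats.ttest_ind(treated, control)
        if p > alpha:
            misses += 1  # a Type II error
    return misses / n_trials

beta_05 = type_ii_rate(alpha=0.05)
beta_01 = type_ii_rate(alpha=0.01)
print(f"beta at alpha=0.05: {beta_05:.3f}")
print(f"beta at alpha=0.01: {beta_01:.3f}")  # larger: a stricter alpha misses more real effects
```

Raising the sample size `n` would shrink both β values, which is the "compensate by collecting more data" adjustment mentioned above.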

2.3. Calculate a Test Statistic and P-Value

Test statistic: A test statistic is a single number that summarizes how far our sample data is from what we’d expect under the null hypothesis (the assumption we’re testing against). Generally, the larger the test statistic’s magnitude, the more evidence we have against the null hypothesis. It helps us decide whether the differences we observe in our data are due to random chance or to an actual effect.

P-value: The p-value tells us how likely we would be to get our observed results (or something more extreme) if the null hypothesis were true. It’s a value between 0 and 1.

  • A smaller p-value (typically below 0.05) means the observation is rare under the null hypothesis, so we might reject the null hypothesis.
  • A larger p-value suggests that what we observed could easily happen by random chance, so we might not reject the null hypothesis.
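Both quantities can be traced by hand. For a one-sample t-test, the statistic is t = (x̄ − μ₀)/(s/√n), and the two-sided p-value is the tail probability of |t| under the t distribution. The data below are made up for illustration; scipy’s `ttest_1samp` reproduces the manual numbers.

```python
import numpy as np
from scipy import stats

# Illustrative data (assumed, not from the article).
sample = np.array([52.1, 48.3, 55.0, 49.8, 51.2, 53.7, 50.5, 54.1])
mu_0 = 50.0  # hypothesized population mean under H0

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation

# Test statistic: how many standard errors the sample mean lies from mu_0.
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Two-sided p-value: probability of a result at least this extreme under H0.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# scipy computes the same two quantities in one call.
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu_0)
print(t_manual, p_manual)
```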

2.4. Make a Decision

Relationship between $α$ and P-Value

When conducting a hypothesis test:

We first choose a significance level $α$ before looking at the data.

We then calculate the p-value from our sample data and the test statistic.

Finally, we compare the p-value to our chosen $α$:

  • If p-value ≤ $α$: We reject the null hypothesis in favor of the alternative hypothesis. The result is said to be statistically significant.
  • If p-value > $α$: We fail to reject the null hypothesis. There isn’t enough statistical evidence to support the alternative hypothesis.

3. Example: Testing a new drug

Imagine we are investigating whether a new drug treats headaches faster than a placebo.

Setting Up the Experiment: You gather 100 people who suffer from headaches. Half of them (50 people) are given the new drug (the ‘Drug Group’), and the other half (the ‘Placebo Group’) are given a sugar pill, which doesn’t contain any medication.

Set up Hypotheses: Before starting, you make a prediction:

  • Null Hypothesis (H0): The new drug has no effect. Any difference in healing time between the two groups is just due to random chance.
  • Alternative Hypothesis (H1): The new drug does have an effect. The difference in healing time between the two groups is significant and not just by chance.

Calculate a Test Statistic and P-Value: After the experiment, you analyze the data. The “test statistic” is a number that helps you understand the difference between the two groups in terms of standard units.

For instance, let’s say:

  • The average healing time in the Drug Group is 2 hours.
  • The average healing time in the Placebo Group is 3 hours.

The test statistic helps you understand how significant this 1-hour difference is. If the groups are large and the spread of healing times in each group is small, then this difference might be significant. But if there’s a huge variation in healing times, the 1-hour difference might not be so special.

Imagine the P-value as answering this question: “If the new drug had NO real effect, what’s the probability that I’d see a difference as extreme (or more extreme) as the one I found, just by random chance?”

For instance:

  • P-value of 0.01 means there’s a 1% chance that the observed difference (or a more extreme difference) would occur if the drug had no effect. That’s pretty rare, so we might consider the drug effective.
  • P-value of 0.5 means there’s a 50% chance you’d see this difference just by chance. That’s pretty high, so we might not be convinced the drug is doing much.
  • If the p-value is less than or equal to $α$ (0.05): the results are “statistically significant,” and we reject the null hypothesis, concluding that the new drug likely has an effect.
  • If the p-value is greater than $α$ (0.05): the results are not statistically significant, and we fail to reject the null hypothesis, remaining unsure whether the drug has a genuine effect.

4. Example in Python

For simplicity, let’s say we’re using a t-test (common for comparing means). Let’s dive into Python:
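The code itself isn’t reproduced here, so below is a minimal sketch of how such a test might look with `scipy.stats.ttest_ind`. The simulated healing times mirror the worked example above (Drug Group mean near 2 hours, Placebo Group near 3 hours); the spread of 0.8 hours is an assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated healing times (hours) for the two groups of 50 people each,
# matching the worked example: drug mean ~2 h, placebo mean ~3 h.
drug_group = rng.normal(loc=2.0, scale=0.8, size=50)
placebo_group = rng.normal(loc=3.0, scale=0.8, size=50)

alpha = 0.05
t_stat, p_value = stats.ttest_ind(drug_group, placebo_group)

print(f"t-statistic: {t_stat:.3f}")
print(f"p-value: {p_value:.4f}")

if p_value < alpha:
    print("The results are statistically significant! The drug seems to have an effect.")
else:
    print("Looks like the drug isn't as miraculous as we thought.")
```

The t-statistic is negative here because the Drug Group mean is subtracted first and is the smaller of the two.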

Making a Decision: If the p-value comes out below 0.05, we’d conclude, “The results are statistically significant! The drug seems to have an effect!” If not, we’d say, “Looks like the drug isn’t as miraculous as we thought.”

5. Conclusion

Hypothesis testing is an indispensable tool in data science, allowing us to make data-driven decisions with confidence. By understanding its principles, conducting tests properly, and considering real-world applications, you can harness the power of hypothesis testing to unlock valuable insights from your data.



© Machinelearningplus. All rights reserved.

Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans. Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H0) and alternate hypothesis (Ha or H1).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.

Table of contents

  • Step 1: State your null and alternate hypothesis
  • Step 2: Collect data
  • Step 3: Perform a statistical test
  • Step 4: Decide whether to reject or fail to reject your null hypothesis
  • Step 5: Present your findings
  • Other interesting articles
  • Frequently asked questions about hypothesis testing

After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H o ) and alternate (H a ) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.


For a statistical test to be valid , it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.
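A quick simulation illustrates both cases: the two scenarios below share the same 1-unit difference in group means but differ in within-group spread, and the p-values move accordingly. The means, spreads, and sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 40
mean_a, mean_b = 10.0, 11.0  # same between-group difference in both scenarios

# Low within-group variance: the groups barely overlap.
a_tight = rng.normal(mean_a, 0.5, n)
b_tight = rng.normal(mean_b, 0.5, n)
_, p_tight = stats.ttest_ind(a_tight, b_tight)

# High within-group variance: heavy overlap between the groups.
a_wide = rng.normal(mean_a, 5.0, n)
b_wide = rng.normal(mean_b, 5.0, n)
_, p_wide = stats.ttest_ind(a_wide, b_wide)

print(f"low within-group variance:  p = {p_tight:.2e}")
print(f"high within-group variance: p = {p_wide:.3f}")
```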

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data. For the height example above, a test comparing the two group means will give you:

  • an estimate of the difference in average height between the two groups.
  • a p-value showing how likely you are to see this difference if the null hypothesis of no difference is true.

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).


The results of hypothesis testing will be presented in the results and discussion sections of your research paper , dissertation or thesis .

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Descriptive statistics
  • Measures of central tendency
  • Correlation coefficient

Methodology

  • Cluster sampling
  • Stratified sampling
  • Types of interviews
  • Cohort study
  • Thematic analysis

Research bias

  • Implicit bias
  • Cognitive bias
  • Survivorship bias
  • Availability heuristic
  • Nonresponse bias
  • Regression to the mean

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Cite this Scribbr article

Bevans, R. (2023, June 22). Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr. Retrieved April 10, 2024, from https://www.scribbr.com/statistics/hypothesis-testing/

Business Insights

Harvard Business School Online's Business Insights Blog provides the career insights you need to achieve your goals and gain confidence in your business skills.


A Beginner’s Guide to Hypothesis Testing in Business


  • 30 Mar 2021

Becoming a more data-driven decision-maker can bring several benefits to your organization, enabling you to identify new opportunities to pursue and threats to abate. Rather than allowing subjective thinking to guide your business strategy, backing your decisions with data can empower your company to become more innovative and, ultimately, profitable.

If you’re new to data-driven decision-making, you might be wondering how data translates into business strategy. The answer lies in generating a hypothesis and verifying or rejecting it based on what various forms of data tell you.

Below is a look at hypothesis testing and the role it plays in helping businesses become more data-driven.


What Is Hypothesis Testing?

To understand what hypothesis testing is, it’s important first to understand what a hypothesis is.

A hypothesis or hypothesis statement seeks to explain why something has happened, or what might happen, under certain conditions. It can also be used to understand how different variables relate to each other. Hypotheses are often written as if-then statements; for example, “If this happens, then this will happen.”

Hypothesis testing , then, is a statistical means of testing an assumption stated in a hypothesis. While the specific methodology leveraged depends on the nature of the hypothesis and data available, hypothesis testing typically uses sample data to extrapolate insights about a larger population.

Hypothesis Testing in Business

When it comes to data-driven decision-making, there’s a certain amount of risk that can mislead a professional. This could be due to flawed thinking or observations, incomplete or inaccurate data , or the presence of unknown variables. The danger in this is that, if major strategic decisions are made based on flawed insights, it can lead to wasted resources, missed opportunities, and catastrophic outcomes.

The real value of hypothesis testing in business is that it allows professionals to test their theories and assumptions before putting them into action. This essentially allows an organization to verify its analysis is correct before committing resources to implement a broader strategy.

As one example, consider a company that wishes to launch a new marketing campaign to revitalize sales during a slow period. Doing so could be an incredibly expensive endeavor, depending on the campaign’s size and complexity. The company, therefore, may wish to test the campaign on a smaller scale to understand how it will perform.

In this example, the hypothesis that’s being tested would fall along the lines of: “If the company launches a new marketing campaign, then it will translate into an increase in sales.” It may even be possible to quantify how much of a lift in sales the company expects to see from the effort. Pending the results of the pilot campaign, the business would then know whether it makes sense to roll it out more broadly.


Key Considerations for Hypothesis Testing

1. Alternative Hypothesis and Null Hypothesis

In hypothesis testing, the hypothesis that’s being tested is known as the alternative hypothesis . Often, it’s expressed as a correlation or statistical relationship between variables. The null hypothesis , on the other hand, is a statement that’s meant to show there’s no statistical relationship between the variables being tested. It’s typically the exact opposite of whatever is stated in the alternative hypothesis.

For example, consider a company’s leadership team that historically and reliably sees $12 million in monthly revenue. They want to understand if reducing the price of their services will attract more customers and, in turn, increase revenue.

In this case, the alternative hypothesis may take the form of a statement such as: “If we reduce the price of our flagship service by five percent, then we’ll see an increase in sales and realize revenues greater than $12 million in the next month.”

The null hypothesis, on the other hand, would indicate that revenues wouldn’t increase from the base of $12 million, or might even decrease.


2. Significance Level and P-Value

Statistically speaking, if you were to run the same scenario 100 times, you’d likely receive somewhat different results each time. If you were to plot these results in a distribution plot, you’d see the most likely outcome is at the tallest point in the graph, with less likely outcomes falling to the right and left of that point.


With this in mind, imagine you’ve completed your hypothesis test and have your results, which indicate there may be a correlation between the variables you were testing. To understand the significance of your results, you’ll need to identify a p-value for the test, which indicates how confident you can be in the test results.

In statistics, the p-value depicts the probability that, assuming the null hypothesis is correct, you might still observe results that are at least as extreme as the results of your hypothesis test. The smaller the p-value, the stronger the evidence against the null hypothesis, and the greater the significance of your results.

3. One-Sided vs. Two-Sided Testing

When it’s time to test your hypothesis, it’s important to leverage the correct testing method. The two most common hypothesis testing methods are one-sided and two-sided tests , or one-tailed and two-tailed tests, respectively.

Typically, you’d leverage a one-sided test when you have a strong conviction about the direction of change you expect to see due to your hypothesis test. You’d leverage a two-sided test when you’re less confident in the direction of change.
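With scipy, this choice shows up as the `alternative` argument of the test function. A sketch under assumed data (daily sales under an old and a new design): when the observed difference points in the hypothesized direction, the one-sided p-value is half the two-sided one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

old_design = rng.normal(loc=100.0, scale=15.0, size=60)  # e.g. daily sales
new_design = rng.normal(loc=106.0, scale=15.0, size=60)  # assumed uplift

# Two-sided: "the means differ" (no conviction about direction).
_, p_two_sided = stats.ttest_ind(new_design, old_design, alternative="two-sided")

# One-sided: "the new design's mean is GREATER" (strong directional conviction).
_, p_one_sided = stats.ttest_ind(new_design, old_design, alternative="greater")

print(f"two-sided p: {p_two_sided:.4f}")
print(f"one-sided p: {p_one_sided:.4f}")
```

The one-sided test buys extra power in the hypothesized direction at the cost of having no sensitivity to a change in the opposite direction.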


4. Sampling

To perform hypothesis testing in the first place, you need to collect a sample of data to be analyzed. Depending on the question you’re seeking to answer or investigate, you might collect samples through surveys, observational studies, or experiments.

A survey involves asking a series of questions to a random population sample and recording self-reported responses.

Observational studies involve a researcher observing a sample population and collecting data as it occurs naturally, without intervention.

Finally, an experiment involves dividing a sample into multiple groups, one of which acts as the control group. For each non-control group, the variable being studied is manipulated to determine how the data collected differs from that of the control group.


Learn How to Perform Hypothesis Testing

Hypothesis testing is a multi-step process that, when performed carefully, allows an organization to effectively leverage its data and inform strategic decisions.

If you’re interested in better understanding hypothesis testing and the role it can play within your organization, one option is to complete a course that focuses on the process. Doing so can lay the statistical and analytical foundation you need to succeed.

Do you want to learn more about hypothesis testing? Explore Business Analytics —one of our online business essentials courses —and download our Beginner’s Guide to Data & Analytics .


Statistics LibreTexts

9.1: Introduction to Hypothesis Testing

Kyle Siegrist, University of Alabama in Huntsville via Random Services

Basic Theory

Preliminaries

As usual, our starting point is a random experiment with an underlying sample space and a probability measure \(\P\). In the basic statistical model, we have an observable random variable \(\bs{X}\) taking values in a set \(S\). In general, \(\bs{X}\) can have quite a complicated structure. For example, if the experiment is to sample \(n\) objects from a population and record various measurements of interest, then \[ \bs{X} = (X_1, X_2, \ldots, X_n) \] where \(X_i\) is the vector of measurements for the \(i\)th object. The most important special case occurs when \((X_1, X_2, \ldots, X_n)\) are independent and identically distributed. In this case, we have a random sample of size \(n\) from the common distribution.

The purpose of this section is to define and discuss the basic concepts of statistical hypothesis testing . Collectively, these concepts are sometimes referred to as the Neyman-Pearson framework, in honor of Jerzy Neyman and Egon Pearson, who first formalized them.

A statistical hypothesis is a statement about the distribution of \(\bs{X}\). Equivalently, a statistical hypothesis specifies a set of possible distributions of \(\bs{X}\): the set of distributions for which the statement is true. A hypothesis that specifies a single distribution for \(\bs{X}\) is called simple ; a hypothesis that specifies more than one distribution for \(\bs{X}\) is called composite .

In hypothesis testing , the goal is to see if there is sufficient statistical evidence to reject a presumed null hypothesis in favor of a conjectured alternative hypothesis . The null hypothesis is usually denoted \(H_0\) while the alternative hypothesis is usually denoted \(H_1\).

An hypothesis test is a statistical decision ; the conclusion will either be to reject the null hypothesis in favor of the alternative, or to fail to reject the null hypothesis. The decision that we make must, of course, be based on the observed value \(\bs{x}\) of the data vector \(\bs{X}\). Thus, we will find an appropriate subset \(R\) of the sample space \(S\) and reject \(H_0\) if and only if \(\bs{x} \in R\). The set \(R\) is known as the rejection region or the critical region . Note the asymmetry between the null and alternative hypotheses. This asymmetry is due to the fact that we assume the null hypothesis, in a sense, and then see if there is sufficient evidence in \(\bs{x}\) to overturn this assumption in favor of the alternative.

An hypothesis test is a statistical analogy to proof by contradiction, in a sense. Suppose for a moment that \(H_1\) is a statement in a mathematical theory and that \(H_0\) is its negation. One way that we can prove \(H_1\) is to assume \(H_0\) and work our way logically to a contradiction. In an hypothesis test, we don't prove anything of course, but there are similarities. We assume \(H_0\) and then see if the data \(\bs{x}\) are sufficiently at odds with that assumption that we feel justified in rejecting \(H_0\) in favor of \(H_1\).

Often, the critical region is defined in terms of a statistic \(w(\bs{X})\), known as a test statistic , where \(w\) is a function from \(S\) into another set \(T\). We find an appropriate rejection region \(R_T \subseteq T\) and reject \(H_0\) when the observed value \(w(\bs{x}) \in R_T\). Thus, the rejection region in \(S\) is then \(R = w^{-1}(R_T) = \left\{\bs{x} \in S: w(\bs{x}) \in R_T\right\}\). As usual, the use of a statistic often allows significant data reduction when the dimension of the test statistic is much smaller than the dimension of the data vector.

The ultimate decision may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is actually true.

Types of errors:

  • A type 1 error is rejecting the null hypothesis \(H_0\) when \(H_0\) is true.
  • A type 2 error is failing to reject the null hypothesis \(H_0\) when the alternative hypothesis \(H_1\) is true.

Similarly, there are two ways to make a correct decision: we could reject \(H_0\) when \(H_1\) is true or we could fail to reject \(H_0\) when \(H_0\) is true. The possibilities are summarized in the following table:

  • Reject \(H_0\): a type 1 error if \(H_0\) is true; a correct decision if \(H_1\) is true.
  • Fail to reject \(H_0\): a correct decision if \(H_0\) is true; a type 2 error if \(H_1\) is true.

Of course, when we observe \(\bs{X} = \bs{x}\) and make our decision, either we will have made the correct decision or we will have committed an error, and usually we will never know which of these events has occurred. Prior to gathering the data, however, we can consider the probabilities of the various errors.

If \(H_0\) is true (that is, the distribution of \(\bs{X}\) is specified by \(H_0\)), then \(\P(\bs{X} \in R)\) is the probability of a type 1 error for this distribution. If \(H_0\) is composite, then \(H_0\) specifies a variety of different distributions for \(\bs{X}\) and thus there is a set of type 1 error probabilities.

The maximum probability of a type 1 error, over the set of distributions specified by \( H_0 \), is the significance level of the test or the size of the critical region.

The significance level is often denoted by \(\alpha\). Usually, the rejection region is constructed so that the significance level is a prescribed, small value (typically 0.1, 0.05, 0.01).

If \(H_1\) is true (that is, the distribution of \(\bs{X}\) is specified by \(H_1\)), then \(\P(\bs{X} \notin R)\) is the probability of a type 2 error for this distribution. Again, if \(H_1\) is composite then \(H_1\) specifies a variety of different distributions for \(\bs{X}\), and thus there will be a set of type 2 error probabilities. Generally, there is a tradeoff between the type 1 and type 2 error probabilities. If we reduce the probability of a type 1 error, by making the rejection region \(R\) smaller, we necessarily increase the probability of a type 2 error because the complementary region \(S \setminus R\) is larger.

The extreme cases can give us some insight. First consider the decision rule in which we never reject \(H_0\), regardless of the evidence \(\bs{x}\). This corresponds to the rejection region \(R = \emptyset\). A type 1 error is impossible, so the significance level is 0. On the other hand, the probability of a type 2 error is 1 for any distribution defined by \(H_1\). At the other extreme, consider the decision rule in which we always reject \(H_0\) regardless of the evidence \(\bs{x}\). This corresponds to the rejection region \(R = S\). A type 2 error is impossible, but now the probability of a type 1 error is 1 for any distribution defined by \(H_0\). In between these two worthless tests are meaningful tests that take the evidence \(\bs{x}\) into account.

If \(H_1\) is true, so that the distribution of \(\bs{X}\) is specified by \(H_1\), then \(\P(\bs{X} \in R)\), the probability of rejecting \(H_0\) is the power of the test for that distribution.

Thus the power of the test for a distribution specified by \( H_1 \) is the probability of making the correct decision.

Suppose that we have two tests, corresponding to rejection regions \(R_1\) and \(R_2\), respectively, each having significance level \(\alpha\). The test with region \(R_1\) is uniformly more powerful than the test with region \(R_2\) if \[ \P(\bs{X} \in R_1) \ge \P(\bs{X} \in R_2) \text{ for every distribution of } \bs{X} \text{ specified by } H_1 \]

Naturally, in this case, we would prefer the first test. Often, however, two tests will not be uniformly ordered; one test will be more powerful for some distributions specified by \(H_1\) while the other test will be more powerful for other distributions specified by \(H_1\).

If a test has significance level \(\alpha\) and is uniformly more powerful than any other test with significance level \(\alpha\), then the test is said to be a uniformly most powerful test at level \(\alpha\).

Clearly a uniformly most powerful test is the best we can do.

\(P\)-value

In most cases, we have a general procedure that allows us to construct a test (that is, a rejection region \(R_\alpha\)) for any given significance level \(\alpha \in (0, 1)\). Typically, \(R_\alpha\) decreases (in the subset sense) as \(\alpha\) decreases.

The \(P\)-value of the observed value \(\bs{x}\) of \(\bs{X}\), denoted \(P(\bs{x})\), is defined to be the smallest \(\alpha\) for which \(\bs{x} \in R_\alpha\); that is, the smallest significance level for which \(H_0\) is rejected, given \(\bs{X} = \bs{x}\).

Knowing \(P(\bs{x})\) allows us to test \(H_0\) at any significance level for the given data \(\bs{x}\): If \(P(\bs{x}) \le \alpha\) then we would reject \(H_0\) at significance level \(\alpha\); if \(P(\bs{x}) \gt \alpha\) then we fail to reject \(H_0\) at significance level \(\alpha\). Note that \(P(\bs{X})\) is a statistic . Informally, \(P(\bs{x})\) can often be thought of as the probability of an outcome as or more extreme than the observed value \(\bs{x}\), where extreme is interpreted relative to the null hypothesis \(H_0\).
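A small numerical sketch of this relationship (an assumed Gaussian setting, not from the text): for testing \(H_0: \mu \le 0\) against \(H_1: \mu \gt 0\) with a \(N(\mu, 1)\) sample, rejecting exactly when the \(P\)-value is at most \(\alpha\) reproduces the level-\(\alpha\) test for every \(\alpha\) at once:

```python
# Sketch: P(x) <= alpha iff x lands in the rejection region R_alpha.
# One-sided test of H0: mu <= 0 for a N(mu, 1) sample; the p-value of an
# observed sample mean xbar is P(Z >= sqrt(n) * xbar) under H0.
import numpy as np
from scipy.stats import norm

n = 25
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=n)   # data generated under H0
p_value = norm.sf(np.sqrt(n) * x.mean())     # smallest alpha rejecting H0

# Rejecting exactly when p_value <= alpha reproduces the level-alpha test.
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha={alpha}: reject H0 -> {p_value <= alpha}")
```

One \(P\)-value thus summarizes the outcome of the whole family of tests \(\{R_\alpha\}\).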

Analogy with Justice Systems

There is a helpful analogy between statistical hypothesis testing and the criminal justice system in the US and various other countries. Consider a person charged with a crime. The presumed null hypothesis is that the person is innocent of the crime; the conjectured alternative hypothesis is that the person is guilty of the crime. The test of the hypotheses is a trial with evidence presented by both sides playing the role of the data. After considering the evidence, the jury delivers the decision as either not guilty or guilty . Note that innocent is not a possible verdict of the jury, because it is not the point of the trial to prove the person innocent. Rather, the point of the trial is to see whether there is sufficient evidence to overturn the null hypothesis that the person is innocent in favor of the alternative hypothesis that the person is guilty. A type 1 error is convicting a person who is innocent; a type 2 error is acquitting a person who is guilty. Generally, a type 1 error is considered the more serious of the two possible errors, so in an attempt to hold the chance of a type 1 error to a very low level, the standard for conviction in serious criminal cases is beyond a reasonable doubt .

Tests of an Unknown Parameter

Hypothesis testing is a very general concept, but an important special class occurs when the distribution of the data variable \(\bs{X}\) depends on a parameter \(\theta\) taking values in a parameter space \(\Theta\). The parameter may be vector-valued, so that \(\bs{\theta} = (\theta_1, \theta_2, \ldots, \theta_k)\) and \(\Theta \subseteq \R^k\) for some \(k \in \N_+\). The hypotheses generally take the form \[ H_0: \theta \in \Theta_0 \text{ versus } H_1: \theta \notin \Theta_0 \] where \(\Theta_0\) is a prescribed subset of the parameter space \(\Theta\). In this setting, the probabilities of making an error or a correct decision depend on the true value of \(\theta\). If \(R\) is the rejection region, then the power function \( Q \) is given by \[ Q(\theta) = \P_\theta(\bs{X} \in R), \quad \theta \in \Theta \] The power function gives a lot of information about the test.

The power function satisfies the following properties:

  • \(Q(\theta)\) is the probability of a type 1 error when \(\theta \in \Theta_0\).
  • \(\max\left\{Q(\theta): \theta \in \Theta_0\right\}\) is the significance level of the test.
  • \(1 - Q(\theta)\) is the probability of a type 2 error when \(\theta \notin \Theta_0\).
  • \(Q(\theta)\) is the power of the test when \(\theta \notin \Theta_0\).
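To make the power function concrete, here is a sketch (an assumed example, not from the text) for the binomial test of \(H_0: \theta = 1/2\) with \(n = 100\) tosses and rejection region \(|W - 50| \ge 11\), a region of size about 0.035:

```python
# Sketch of the power function Q(theta) = P_theta(X in R) for a binomial
# test of H0: theta = 0.5, with n = 100 and rejection region |W - 50| >= 11.
from scipy.stats import binom

n, c = 100, 11

def Q(theta):
    """Probability of rejecting H0 when the true success probability is theta."""
    return binom.cdf(n / 2 - c, n, theta) + binom.sf(n / 2 + c - 1, n, theta)

for theta in (0.5, 0.55, 0.6, 0.65, 0.7):
    print(f"theta={theta}: Q(theta) = {Q(theta):.3f}")
```

At \(\theta = 0.5\), \(Q(\theta)\) is the size (about 0.035); as \(\theta\) moves away from \(1/2\), \(Q(\theta)\) climbs toward 1, which is exactly the behavior a good test should have.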

If we have two tests, we can compare them by means of their power functions.

Suppose that we have two tests, corresponding to rejection regions \(R_1\) and \(R_2\), respectively, each having significance level \(\alpha\). The test with rejection region \(R_1\) is uniformly more powerful than the test with rejection region \(R_2\) if \( Q_1(\theta) \ge Q_2(\theta)\) for all \( \theta \notin \Theta_0 \).

Most hypothesis tests of an unknown real parameter \(\theta\) fall into three special cases:

Suppose that \( \theta \) is a real parameter and \( \theta_0 \in \Theta \) a specified value. The tests below are respectively the two-sided test , the left-tailed test , and the right-tailed test .

  • \(H_0: \theta = \theta_0\) versus \(H_1: \theta \ne \theta_0\)
  • \(H_0: \theta \ge \theta_0\) versus \(H_1: \theta \lt \theta_0\)
  • \(H_0: \theta \le \theta_0\) versus \(H_1: \theta \gt \theta_0\)

Thus the tests are named after the conjectured alternative. Of course, there may be other unknown parameters besides \(\theta\) (known as nuisance parameters ).

Equivalence Between Hypothesis Test and Confidence Sets

There is an equivalence between hypothesis tests and confidence sets for a parameter \(\theta\).

Suppose that \(C(\bs{x})\) is a \(1 - \alpha\) level confidence set for \(\theta\). The following test has significance level \(\alpha\) for the hypothesis \( H_0: \theta = \theta_0 \) versus \( H_1: \theta \ne \theta_0 \): Reject \(H_0\) if and only if \(\theta_0 \notin C(\bs{x})\)

By definition, \(\P[\theta \in C(\bs{X})] = 1 - \alpha\). Hence if \(H_0\) is true so that \(\theta = \theta_0\), then the probability of a type 1 error is \(\P[\theta \notin C(\bs{X})] = \alpha\).

Equivalently, we fail to reject \(H_0\) at significance level \(\alpha\) if and only if \(\theta_0\) is in the corresponding \(1 - \alpha\) level confidence set. In particular, this equivalence applies to interval estimates of a real parameter \(\theta\) and the common tests for \(\theta\) given above.

In each case below, the confidence interval has confidence level \(1 - \alpha\) and the test has significance level \(\alpha\).

  • Suppose that \(\left[L(\bs{X}), U(\bs{X})\right]\) is a two-sided confidence interval for \(\theta\). Reject \(H_0: \theta = \theta_0\) versus \(H_1: \theta \ne \theta_0\) if and only if \(\theta_0 \lt L(\bs{X})\) or \(\theta_0 \gt U(\bs{X})\).
  • Suppose that \(L(\bs{X})\) is a confidence lower bound for \(\theta\). Reject \(H_0: \theta \le \theta_0\) versus \(H_1: \theta \gt \theta_0\) if and only if \(\theta_0 \lt L(\bs{X})\).
  • Suppose that \(U(\bs{X})\) is a confidence upper bound for \(\theta\). Reject \(H_0: \theta \ge \theta_0\) versus \(H_1: \theta \lt \theta_0\) if and only if \(\theta_0 \gt U(\bs{X})\).
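The duality can be checked numerically. The sketch below (an assumed normal model with known \(\sigma\), not from the text) verifies that the two-sided \(z\)-test rejects \(H_0: \theta = \theta_0\) exactly when \(\theta_0\) falls outside the two-sided \(1 - \alpha\) confidence interval:

```python
# Sketch of the test/confidence-interval duality for a normal mean with
# known sigma: the level-alpha two-sided z-test rejects H0: mu = mu0
# exactly when mu0 lies outside the 1 - alpha z-interval.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma, n, alpha = 5.0, 25, 0.05
x = rng.normal(loc=202.0, scale=sigma, size=n)   # simulated sample

z_crit = norm.ppf(1 - alpha / 2)                 # about 1.96
half = z_crit * sigma / np.sqrt(n)
ci = (x.mean() - half, x.mean() + half)          # 95% confidence interval

mu0 = 200.0
z = (x.mean() - mu0) / (sigma / np.sqrt(n))
reject_by_test = abs(z) > z_crit
reject_by_ci = not (ci[0] <= mu0 <= ci[1])
print(reject_by_test == reject_by_ci)  # True: the two criteria agree
```

The agreement is algebraic, not accidental: \(|z| \gt z_{\alpha/2}\) holds precisely when \(|\bar{x} - \mu_0|\) exceeds the interval's half-width.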

Pivot Variables and Test Statistics

Recall that confidence sets of an unknown parameter \(\theta\) are often constructed through a pivot variable , that is, a random variable \(W(\bs{X}, \theta)\) that depends on the data vector \(\bs{X}\) and the parameter \(\theta\), but whose distribution does not depend on \(\theta\) and is known. In this case, a natural test statistic for the basic tests given above is \(W(\bs{X}, \theta_0)\).

Understanding Hypothesis Testing

Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously evaluating these assumptions against empirical evidence. This article sheds light on the significance of hypothesis testing and the critical steps involved in the process.

What is Hypothesis Testing?

Hypothesis testing is a statistical method used to make a statistical decision from experimental data. It starts from an assumption that we make about a population parameter, and it evaluates two mutually exclusive statements about the population to determine which statement is better supported by the sample data.

Example: you claim that the average height in the class is 30, or that a boy is taller than a girl. These are assumptions, and we need a statistical way to test them: a mathematical basis for concluding whether what we are assuming is true.

Defining Hypotheses

  • Null hypothesis (\(H_0\)): the default claim of no effect or no difference, typically a statement about a population parameter such as \(H_0: \mu = \mu_0\).
  • Alternative hypothesis (\(H_1\)): the claim we seek evidence for, such as \(H_1: \mu \neq \mu_0\).

Key Terms of Hypothesis Testing

  • Level of significance (\(\alpha\)): the probability of rejecting the null hypothesis when it is actually true, commonly set to 0.05 (a 5% risk of a Type I error).

  • P-value: The P-value, or calculated probability, is the probability of obtaining results as extreme as those observed when the null hypothesis (H0) of the study is true. If the P-value is less than the chosen significance level, you reject the null hypothesis, i.e. conclude that the sample supports the alternative hypothesis.
  • Test Statistic: The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis. It is compared to a critical value or p-value to make decisions about the statistical significance of the observed results.
  • Critical value : The critical value in statistics is a threshold or cutoff point used to determine whether to reject the null hypothesis in a hypothesis test.
  • Degrees of freedom: Degrees of freedom are associated with the variability or freedom one has in estimating a parameter. The degrees of freedom are related to the sample size and determine the shape.

Why do we use Hypothesis Testing?

Hypothesis testing is an important procedure in statistics. It evaluates two mutually exclusive statements about a population to determine which is better supported by the sample data. When we say that findings are statistically significant, it is hypothesis testing that justifies the claim.

One-Tailed and Two-Tailed Test

One tailed test focuses on one direction, either greater than or less than a specified value. We use a one-tailed test when there is a clear directional expectation based on prior knowledge or theory. The critical region is located on only one side of the distribution curve. If the sample falls into this critical region, the null hypothesis is rejected in favor of the alternative hypothesis.

One-Tailed Test

There are two types of one-tailed test:

  • Left-tailed test: the alternative lies below the hypothesized value, e.g. \(H_0: \mu \geq 50\) versus \(H_1: \mu < 50\).
  • Right-tailed test: the alternative lies above the hypothesized value, e.g. \(H_0: \mu \leq 50\) versus \(H_1: \mu > 50\).

Two-Tailed Test

A two-tailed test considers both directions, greater than and less than a specified value. We use a two-tailed test when there is no specific directional expectation and we want to detect any significant difference.

For example, \(H_0: \mu = \mu_0\) versus \(H_1: \mu \neq \mu_0\): the null hypothesis fixes the mean at a specified value, and a large enough deviation in either direction is evidence against it.

What are Type 1 and Type 2 errors in Hypothesis Testing?

In hypothesis testing, Type I and Type II errors are two possible errors that researchers can make when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions made regarding the null hypothesis and the alternative hypothesis.

  • Type I error (\(\alpha\)): rejecting the null hypothesis when it is actually true; the significance level \(\alpha\) is the probability of this error.
  • Type II error (\(\beta\)): failing to reject the null hypothesis when the alternative hypothesis is actually true.

How does Hypothesis Testing work?

Step 1: Define the null and alternative hypotheses

In this step, we state the null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_1\)). We first identify the problem about which we want to make an assumption, keeping in mind that the two hypotheses must contradict one another; here we assume normally distributed data.

Step 2: Choose a significance level

Select a significance level (\(\alpha\)), commonly 0.05, which is the probability of rejecting the null hypothesis when it is actually true (a Type I error).

Step 3: Collect and analyze data

Gather relevant data through observation or experimentation. Analyze the data using appropriate statistical methods to obtain a test statistic.

Step 4: Calculate the test statistic

In this step the data are evaluated: we compute a score based on the characteristics of the data. The choice of test statistic depends on the type of hypothesis test being conducted.

There are various hypothesis tests, each appropriate for a different goal. The statistic could come from a Z-test, Chi-square test, T-test, and so on.

  • Z-test : used when the population mean and standard deviation are known; the Z-statistic is commonly used.
  • t-test : used when the population standard deviation is unknown and the sample size is small; the t-statistic is more appropriate.
  • Chi-square test : Chi-square test is used for categorical data or for testing independence in contingency tables
  • F-test : F-test is often used in analysis of variance (ANOVA) to compare variances or test the equality of means across multiple groups.

Since we have a small dataset here, the T-test is the more appropriate choice for testing our hypothesis.

T-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

Step 5: Compare the test statistic

In this stage, we decide whether to reject the null hypothesis or fail to reject it. There are two ways to make this decision.

Method A: Using critical values

Comparing the test statistic with the tabulated critical value (for a two-tailed test, use the absolute value of the test statistic):

  • If the test statistic exceeds the critical value: reject the null hypothesis.
  • If the test statistic does not exceed the critical value: fail to reject the null hypothesis.

Note: Critical values are predetermined threshold values used to make a decision in hypothesis testing. To determine critical values, we typically refer to a statistical distribution table , such as the normal or t-distribution tables, according to the test being used.
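In practice, critical values come from the quantile function of the reference distribution rather than a printed table. A brief sketch using scipy's `ppf` (the inverse CDF):

```python
# Critical values from scipy instead of a distribution table.
from scipy.stats import norm, t

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)     # two-sided z critical value, about 1.96
t_crit = t.ppf(1 - alpha / 2, df=9)  # two-sided t critical value for df = 9, about 2.26
print(z_crit, t_crit)
```

Note that the t critical value exceeds the z value: the heavier tails of the t-distribution demand stronger evidence when the population standard deviation must be estimated.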

Method B: Using P-values

We can also come to a conclusion using the p-value:

  • If \(p \leq \alpha\): reject the null hypothesis.
  • If \(p > \alpha\): fail to reject the null hypothesis.

Note : The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. To determine the p-value, we typically refer to a statistical distribution table , such as the normal or t-distribution tables, according to the test being used.

Step 6: Interpret the results

At last, we can conclude our experiment using method A or B.

Calculating test statistic

To validate our hypothesis about a population parameter we use statistical functions . We use the z-score, p-value, and level of significance(alpha) to make evidence for our hypothesis for normally distributed data .

1. Z-statistics:

Used when the population mean and standard deviation are known:

z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}

  • \(\bar{x}\) is the sample mean,
  • \(\mu\) represents the population mean,
  • \(\sigma\) is the population standard deviation,
  • and \(n\) is the size of the sample.

2. T-Statistics

The t-test is used when the population standard deviation is unknown and the sample is small (typically \(n < 30\)).

t-statistic calculation is given by:

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

  • \(t\) = t-score,
  • \(\bar{x}\) = sample mean,
  • \(\mu\) = population mean,
  • \(s\) = standard deviation of the sample,
  • \(n\) = sample size

3. Chi-Square Test

The Chi-square test for independence applies to categorical data (which is not normally distributed) and uses the statistic:

\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

  • \(i, j\) are the row and column indices respectively,
  • \(O_{ij}\) is the observed frequency in cell \((i, j)\),
  • \(E_{ij}\) is the expected frequency in cell \((i, j)\) under the null hypothesis of independence.
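A short illustration with made-up counts (a hypothetical 2×2 table, not from the article); `scipy.stats.chi2_contingency` computes the expected frequencies \(E_{ij}\) and the statistic for us:

```python
# Chi-square test of independence on a hypothetical 2x2 contingency table.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: treatment A / B; columns: improved / not improved (made-up counts).
observed = np.array([[30, 10],
                     [20, 20]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p:.4f}")
print("expected frequencies:\n", expected)
```

For a 2×2 table, scipy applies Yates' continuity correction by default; pass `correction=False` to get the uncorrected statistic from the formula above.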

Real life Hypothesis Testing example

Let’s examine hypothesis testing using two real-life situations.

Case A: Does a New Drug Affect Blood Pressure?

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market, they need to conduct a study to assess its impact on blood pressure.

  • Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
  • After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Step 1: Define the hypotheses

  • Null hypothesis (\(H_0\)): the new drug has no effect on blood pressure.
  • Alternative hypothesis (\(H_1\)): the new drug has an effect on blood pressure.

Step 2: Define the significance level

We take the significance level to be 0.05, meaning we will reject the null hypothesis if the evidence suggests less than a 5% chance that the observed results are due to random variation alone.

Step 3: Compute the test statistic

Using a paired T-test, we analyze the data to obtain a test statistic and a p-value.

The test statistic (e.g., T-statistic) is calculated based on the differences between blood pressure measurements before and after treatment.

t = m / (s / √n)

  • m = mean of the differences \(d_i = X_{\text{after},i} - X_{\text{before},i}\),
  • s = standard deviation of the differences \(d_i\),
  • n = sample size.

For these data, m = −3.9, s ≈ 1.37 and n = 10,

so the paired t-test formula gives a T-statistic of approximately −9.

Step 4: Find the p-value

With a calculated t-statistic of −9 and df = 9 degrees of freedom, the p-value can be found using statistical software or a t-distribution table.

Thus, p-value ≈ 8.54 × 10⁻⁶.

Step 5: Result

  • If the p-value is less than or equal to 0.05, the researchers reject the null hypothesis.
  • If the p-value is greater than 0.05, they fail to reject the null hypothesis.

Conclusion: Since the p-value (≈ 8.54 × 10⁻⁶) is less than the significance level (0.05), the researchers reject the null hypothesis. There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.

Python Implementation of Hypothesis Testing

Let’s create hypothesis testing with python, where we are testing whether a new drug affects blood pressure. For this example, we will use a paired T-test. We’ll use the scipy.stats library for the T-test.

SciPy is a scientific computing library for Python, widely used for statistical and mathematical computations.

We will implement our first real-life problem in Python.
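A minimal sketch of the paired T-test described above, using `scipy.stats.ttest_rel` on the Case A blood-pressure data:

```python
# Paired t-test for the blood-pressure example (Case A).
import numpy as np
from scipy import stats

before = np.array([120, 122, 118, 130, 125, 128, 115, 121, 123, 119])
after  = np.array([115, 120, 112, 128, 122, 125, 110, 117, 119, 114])

# ttest_rel performs the paired (related-samples) t-test.
t_stat, p_value = stats.ttest_rel(after, before)
print(f"t-statistic: {t_stat:.2f}")  # about -9.0
print(f"p-value: {p_value:.3g}")     # about 8.54e-06

alpha = 0.05
if p_value <= alpha:
    print("Reject the null hypothesis: the drug affects blood pressure.")
else:
    print("Fail to reject the null hypothesis.")
```

Passing the samples in the order `(after, before)` makes the sign of the t-statistic match the mean difference m = −3.9 used in the hand calculation.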

In the above example, given the T-statistic of approximately -9 and an extremely small p-value, the results indicate a strong case to reject the null hypothesis at a significance level of 0.05. 

  • The results suggest that the new drug, treatment, or intervention has a significant effect on lowering blood pressure.
  • The negative T-statistic indicates that the mean blood pressure after treatment is significantly lower than the assumed population mean before treatment.

Case B : Cholesterol level in a population

Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.

Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.

Hypothesized population mean (\(\mu_0\)): 200 mg/dL

Population standard deviation (\(\sigma\)): 5 mg/dL (given for this problem)

Step 1: Define the hypotheses

  • Null hypothesis (\(H_0\)): the average cholesterol level in the population is 200 mg/dL.
  • Alternative hypothesis (\(H_1\)): the average cholesterol level in the population is different from 200 mg/dL.

Step 2: Define the significance level

As the direction of deviation is not specified, we use a two-tailed test at a significance level of 0.05. From the z-table, the critical values for a two-tailed test at this level are approximately −1.96 and 1.96.

Step 3: Compute the test statistic

The sample mean of the 25 measurements is 202.04 mg/dL, so the z-statistic is

z = (202.04 − 200) / (5 / √25) = 2.04

Step 4: Result

Since the absolute value of the test statistic (2.04) is greater than the critical value (1.96), we reject the null hypothesis and conclude that there is statistically significant evidence that the average cholesterol level in the population differs from 200 mg/dL.
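The z-test in Case B can be reproduced in a few lines (a sketch in the same scipy style as Case A; the sample mean works out to about 202.04 mg/dL):

```python
# One-sample two-tailed z-test for the cholesterol example (Case B),
# with the population standard deviation sigma = 5 given.
import numpy as np
from scipy.stats import norm

levels = np.array([205, 198, 210, 190, 215, 205, 200, 192, 198, 205,
                   198, 202, 208, 200, 205, 198, 205, 210, 192, 205,
                   198, 205, 210, 192, 205])
mu0, sigma = 200, 5

z = (levels.mean() - mu0) / (sigma / np.sqrt(len(levels)))
p_value = 2 * norm.sf(abs(z))  # two-tailed p-value
print(f"sample mean = {levels.mean():.2f}, z = {z:.2f}, p = {p_value:.4f}")
# z is about 2.04, beyond the +/-1.96 cutoff, so H0 is rejected at alpha = 0.05
```

Comparing `p_value` with 0.05 (Method B) gives the same conclusion as comparing `|z|` with 1.96 (Method A), illustrating that the two decision rules agree.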

Limitations of Hypothesis Testing

  • Although a useful technique, hypothesis testing does not offer a comprehensive grasp of the topic being studied. It concentrates on specific hypotheses and statistical significance without fully reflecting the complexity or whole context of the phenomenon.
  • The accuracy of hypothesis-testing results depends on the quality of the available data and the appropriateness of the statistical methods used. Inaccurate data or poorly formulated hypotheses can lead to incorrect conclusions.
  • Relying solely on hypothesis testing may cause analysts to overlook significant patterns or relationships in the data that are not captured by the specific hypotheses being tested. This limitation underscores the importance of complementing hypothesis testing with other analytical approaches.

Hypothesis testing stands as a cornerstone in statistical analysis, enabling data scientists to navigate uncertainties and draw credible inferences from sample data. By systematically defining null and alternative hypotheses, choosing significance levels, and leveraging statistical tests, researchers can assess the validity of their assumptions. The article also elucidates the critical distinction between Type I and Type II errors, providing a comprehensive understanding of the nuanced decision-making process inherent in hypothesis testing. The real-life example of testing a new drug’s effect on blood pressure using a paired T-test showcases the practical application of these principles, underscoring the importance of statistical rigor in data-driven decision-making.

Frequently Asked Questions (FAQs)

1. What are the three types of hypothesis tests?

There are three types of hypothesis tests: right-tailed, left-tailed, and two-tailed. Right-tailed tests assess whether a parameter is greater than a value, left-tailed whether it is less. Two-tailed tests check for a difference in either direction.

2. What are the 4 components of hypothesis testing?

Null hypothesis (\(H_0\)): no effect or difference exists. Alternative hypothesis (\(H_1\)): an effect or difference exists. Significance level (\(\alpha\)): the risk of rejecting the null hypothesis when it is true (Type I error). Test statistic: a numerical value representing the observed evidence against the null hypothesis.

3. What is hypothesis testing in ML?

A statistical method for evaluating the performance and validity of machine learning models. It tests specific hypotheses about model behavior, such as whether features influence predictions or whether a model generalizes well to unseen data.

4. What is the difference between Pytest and Hypothesis in Python?

Pytest is a general-purpose testing framework for Python code, while Hypothesis is a property-based testing framework for Python that generates test cases from specified properties of the code.


Logo for Kwantlen Polytechnic University


11 Hypothesis testing

The process of induction is the process of assuming the simplest law that can be made to harmonize with our experience. This process, however, has no logical foundation but only a psychological one. It is clear that there are no grounds for believing that the simplest course of events will really happen. It is an hypothesis that the sun will rise tomorrow: and this means that we do not know whether it will rise. – Ludwig Wittgenstein 157

In the last chapter, I discussed the ideas behind estimation, which is one of the two “big ideas” in inferential statistics. It’s now time to turn our attention to the other big idea, which is hypothesis testing . In its most abstract form, hypothesis testing is really a very simple idea: the researcher has some theory about the world, and wants to determine whether or not the data actually support that theory. However, the details are messy, and most people find the theory of hypothesis testing to be the most frustrating part of statistics. The structure of the chapter is as follows. Firstly, I’ll describe how hypothesis testing works, in a fair amount of detail, using a simple running example to show you how a hypothesis test is “built”. I’ll try to avoid being too dogmatic while doing so, and focus instead on the underlying logic of the testing procedure. 158 Afterwards, I’ll spend a bit of time talking about the various dogmas, rules and heresies that surround the theory of hypothesis testing.

11.1 A menagerie of hypotheses

Eventually we all succumb to madness. For me, that day will arrive once I’m finally promoted to full professor. Safely ensconced in my ivory tower, happily protected by tenure, I will finally be able to take leave of my senses (so to speak), and indulge in that most thoroughly unproductive line of psychological research: the search for extrasensory perception (ESP). 159

Let’s suppose that this glorious day has come. My first study is a simple one, in which I seek to test whether clairvoyance exists. Each participant sits down at a table, and is shown a card by an experimenter. The card is black on one side and white on the other. The experimenter takes the card away, and places it on a table in an adjacent room. The card is placed black side up or white side up completely at random, with the randomisation occurring only after the experimenter has left the room with the participant. A second experimenter comes in and asks the participant which side of the card is now facing upwards. It’s purely a one-shot experiment. Each person sees only one card, and gives only one answer; and at no stage is the participant actually in contact with someone who knows the right answer. My data set, therefore, is very simple. I have asked the question of \(N\) people, and some number \(X\) of these people have given the correct response. To make things concrete, let’s suppose that I have tested \(N = 100\) people, and \(X = 62\) of these got the answer right… a surprisingly large number, sure, but is it large enough for me to feel safe in claiming I’ve found evidence for ESP? This is the situation where hypothesis testing comes in useful. However, before we talk about how to test hypotheses, we need to be clear about what we mean by hypotheses.
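As a preview of where the chapter is heading (a sketch in Python, not part of the original text): an exact binomial test of the null hypothesis that each response is correct with probability 0.5 already hints at the answer for \(X = 62\) correct out of \(N = 100\):

```python
# Exact two-sided binomial test of H0: P(correct) = 0.5 for the ESP data.
from scipy.stats import binomtest

result = binomtest(k=62, n=100, p=0.5, alternative='two-sided')
print(f"p-value = {result.pvalue:.4f}")  # about 0.02
```

At the conventional 0.05 level, 62 out of 100 would be judged surprisingly large under pure guessing; whether that constitutes evidence for ESP is, of course, a separate question that the rest of the chapter takes up.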

11.1.1 Research hypotheses versus statistical hypotheses

The first distinction that you need to keep clear in your mind is between research hypotheses and statistical hypotheses. In my ESP study, my overall scientific goal is to demonstrate that clairvoyance exists. In this situation, I have a clear research goal: I am hoping to discover evidence for ESP. In other situations I might actually be a lot more neutral than that, so I might say that my research goal is to determine whether or not clairvoyance exists. Regardless of how I want to portray myself, the basic point that I’m trying to convey here is that a research hypothesis involves making a substantive, testable scientific claim… if you are a psychologist, then your research hypotheses are fundamentally about psychological constructs. Any of the following would count as research hypotheses:

  • Listening to music reduces your ability to pay attention to other things. This is a claim about the causal relationship between two psychologically meaningful concepts (listening to music and paying attention to things), so it’s a perfectly reasonable research hypothesis.
  • Intelligence is related to personality . Like the last one, this is a relational claim about two psychological constructs (intelligence and personality), but the claim is weaker: correlational not causal.
  • Intelligence is speed of information processing . This hypothesis has a quite different character: it’s not actually a relational claim at all. It’s an ontological claim about the fundamental character of intelligence (and I’m pretty sure it’s wrong). It’s worth expanding on this one actually: It’s usually easier to think about how to construct experiments to test research hypotheses of the form “does X affect Y?” than it is to address claims like “what is X?” And in practice, what usually happens is that you find ways of testing relational claims that follow from your ontological ones. For instance, if I believe that intelligence is speed of information processing in the brain, my experiments will often involve looking for relationships between measures of intelligence and measures of speed. As a consequence, most everyday research questions do tend to be relational in nature, but they’re almost always motivated by deeper ontological questions about the state of nature.

Notice that in practice, my research hypotheses could overlap a lot. My ultimate goal in the ESP experiment might be to test an ontological claim like “ESP exists”, but I might operationally restrict myself to a narrower hypothesis like “Some people can “see” objects in a clairvoyant fashion”. That said, there are some things that really don’t count as proper research hypotheses in any meaningful sense:

  • Love is a battlefield . This is too vague to be testable. While it’s okay for a research hypothesis to have a degree of vagueness to it, it has to be possible to operationalise your theoretical ideas. Maybe I’m just not creative enough to see it, but I can’t see how this can be converted into any concrete research design. If that’s true, then this isn’t a scientific research hypothesis, it’s a pop song. That doesn’t mean it’s not interesting – a lot of deep questions that humans have fall into this category. Maybe one day science will be able to construct testable theories of love, or to test to see if God exists, and so on; but right now we can’t, and I wouldn’t bet on ever seeing a satisfying scientific approach to either.
  • The first rule of tautology club is the first rule of tautology club . This is not a substantive claim of any kind. It’s true by definition. No conceivable state of nature could possibly be inconsistent with this claim. As such, we say that this is an unfalsifiable hypothesis, and as such it is outside the domain of science. Whatever else you do in science, your claims must have the possibility of being wrong.
  • More people in my experiment will say “yes” than “no” . This one fails as a research hypothesis because it’s a claim about the data set, not about the psychology (unless of course your actual research question is whether people have some kind of “yes” bias!). As we’ll see shortly, this hypothesis is starting to sound more like a statistical hypothesis than a research hypothesis.

As you can see, research hypotheses can be somewhat messy at times; and ultimately they are scientific claims. Statistical hypotheses are neither of these two things. Statistical hypotheses must be mathematically precise, and they must correspond to specific claims about the characteristics of the data generating mechanism (i.e., the “population”). Even so, the intent is that statistical hypotheses bear a clear relationship to the substantive research hypotheses that you care about! For instance, in my ESP study my research hypothesis is that some people are able to see through walls or whatever. What I want to do is to “map” this onto a statement about how the data were generated. So let’s think about what that statement would be. The quantity that I’m interested in within the experiment is \(P(\mbox{“correct”})\) , the true-but-unknown probability with which the participants in my experiment answer the question correctly. Let’s use the Greek letter \(\theta\) (theta) to refer to this probability. Here are four different statistical hypotheses:

  • If ESP doesn’t exist and if my experiment is well designed, then my participants are just guessing. So I should expect them to get it right half of the time and so my statistical hypothesis is that the true probability of choosing correctly is \(\theta = 0.5\) .
  • Alternatively, suppose ESP does exist and participants can see the card. If that’s true, people will perform better than chance. The statistical hypothesis would be that \(\theta > 0.5\) .
  • A third possibility is that ESP does exist, but the colours are all reversed and people don’t realise it (okay, that’s wacky, but you never know…). If that’s how it works then you’d expect people’s performance to be below chance. This would correspond to a statistical hypothesis that \(\theta < 0.5\) .
  • Finally, suppose ESP exists, but I have no idea whether people are seeing the right colour or the wrong one. In that case, the only claim I could make about the data would be that the probability of making the correct answer is not equal to 0.5. This corresponds to the statistical hypothesis that \(\theta \neq 0.5\) .

All of these are legitimate examples of a statistical hypothesis because they are statements about a population parameter and are meaningfully related to my experiment.

What this discussion makes clear, I hope, is that when attempting to construct a statistical hypothesis test the researcher actually has two quite distinct hypotheses to consider. First, he or she has a research hypothesis (a claim about psychology), and this corresponds to a statistical hypothesis (a claim about the data generating population). In my ESP example, these might be:

  • Research hypothesis: “ESP exists”.
  • Statistical hypothesis: \(\theta \neq 0.5\) .

And the key thing to recognise is this: a statistical hypothesis test is a test of the statistical hypothesis, not the research hypothesis . If your study is badly designed, then the link between your research hypothesis and your statistical hypothesis is broken. To give a silly example, suppose that my ESP study was conducted in a situation where the participant can actually see the card reflected in a window; if that happens, I would be able to find very strong evidence that \(\theta \neq 0.5\) , but this would tell us nothing about whether “ESP exists”.

11.1.2 Null hypotheses and alternative hypotheses

So far, so good. I have a research hypothesis that corresponds to what I want to believe about the world, and I can map it onto a statistical hypothesis that corresponds to what I want to believe about how the data were generated. It’s at this point that things get somewhat counterintuitive for a lot of people. Because what I’m about to do is invent a new statistical hypothesis (the “null” hypothesis, \(H_0\) ) that corresponds to the exact opposite of what I want to believe, and then focus exclusively on that, almost to the neglect of the thing I’m actually interested in (which is now called the “alternative” hypothesis, \(H_1\) ). In our ESP example, the null hypothesis is that \(\theta = 0.5\) , since that’s what we’d expect if ESP didn’t exist. My hope, of course, is that ESP is totally real, and so the alternative to this null hypothesis is \(\theta \neq 0.5\) . In essence, what we’re doing here is dividing up the possible values of \(\theta\) into two groups: those values that I really hope aren’t true (the null), and those values that I’d be happy with if they turn out to be right (the alternative). Having done so, the important thing to recognise is that the goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Most people find this pretty weird.

The best way to think about it, in my experience, is to imagine that a hypothesis test is a criminal trial… the trial of the null hypothesis . The null hypothesis is the defendant, the researcher is the prosecutor, and the statistical test itself is the judge. Just like a criminal trial, there is a presumption of innocence: the null hypothesis is deemed to be true unless you, the researcher, can prove beyond a reasonable doubt that it is false. You are free to design your experiment however you like (within reason, obviously!), and your goal when doing so is to maximise the chance that the data will yield a conviction… for the crime of being false. The catch is that the statistical test sets the rules of the trial, and those rules are designed to protect the null hypothesis – specifically to ensure that if the null hypothesis is actually true, the chances of a false conviction are guaranteed to be low. This is pretty important: after all, the null hypothesis doesn’t get a lawyer. And given that the researcher is trying desperately to prove it to be false, someone has to protect it.

11.2 Two types of errors

Before going into details about how a statistical test is constructed, it’s useful to understand the philosophy behind it. I hinted at it when pointing out the similarity between a null hypothesis test and a criminal trial, but I should now be explicit. Ideally, we would like to construct our test so that we never make any errors. Unfortunately, since the world is messy, this is never possible. Sometimes you’re just really unlucky: for instance, suppose you flip a coin 10 times in a row and it comes up heads all 10 times. That feels like very strong evidence that the coin is biased (and it is!), but of course there’s a 1 in 1024 chance that this would happen even if the coin was totally fair. In other words, in real life we always have to accept that there’s a chance that we did the wrong thing. As a consequence, the goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them.

At this point, we need to be a bit more precise about what we mean by “errors”. Firstly, let’s state the obvious: it is either the case that the null hypothesis is true, or it is false; and our test will either reject the null hypothesis or retain it. So, as the table below illustrates, after we run the test and make our choice, one of four things might have happened:

                         retain \(H_0\)        reject \(H_0\)
  \(H_0\) is true        correct decision      error (type I)
  \(H_0\) is false       error (type II)       correct decision

As a consequence there are actually two different types of error here. If we reject a null hypothesis that is actually true, then we have made a type I error . On the other hand, if we retain the null hypothesis when it is in fact false, then we have made a type II error .

Remember how I said that statistical testing was kind of like a criminal trial? Well, I meant it. A criminal trial requires that you establish “beyond a reasonable doubt” that the defendant did it. All of the evidentiary rules are (in theory, at least) designed to ensure that there’s (almost) no chance of wrongfully convicting an innocent defendant. The trial is designed to protect the rights of a defendant: as the English jurist William Blackstone famously said, it is “better that ten guilty persons escape than that one innocent suffer.” In other words, a criminal trial doesn’t treat the two types of error in the same way… punishing the innocent is deemed to be much worse than letting the guilty go free. A statistical test is pretty much the same: the single most important design principle of the test is to control the probability of a type I error, to keep it below some fixed probability. This probability, which is denoted \(\alpha\) , is called the significance level of the test (or sometimes, the size of the test). And I’ll say it again, because it is so central to the whole set-up… a hypothesis test is said to have significance level \(\alpha\) if the type I error rate is no larger than \(\alpha\) .

So, what about the type II error rate? Well, we’d also like to keep those under control too, and we denote this probability by \(\beta\) . However, it’s much more common to refer to the power of the test, which is the probability with which we reject a null hypothesis when it really is false, which is \(1-\beta\) . To help keep this straight, here’s the same table again, but with the relevant numbers added:

                         retain \(H_0\)                     reject \(H_0\)
  \(H_0\) is true        \(1-\alpha\) (correct retention)   \(\alpha\) (type I error rate)
  \(H_0\) is false       \(\beta\) (type II error rate)     \(1-\beta\) (power of the test)

A “powerful” hypothesis test is one that has a small value of \(\beta\) , while still keeping \(\alpha\) fixed at some (small) desired level. By convention, scientists make use of three different \(\alpha\) levels: \(.05\) , \(.01\) and \(.001\) . Notice the asymmetry here… the tests are designed to ensure that the \(\alpha\) level is kept small, but there’s no corresponding guarantee regarding \(\beta\) . We’d certainly like the type II error rate to be small, and we try to design tests that keep it small, but this is very much secondary to the overwhelming need to control the type I error rate. As Blackstone might have said if he were a statistician, it is “better to retain 10 false null hypotheses than to reject a single true one”. To be honest, I don’t know that I agree with this philosophy – there are situations where I think it makes sense, and situations where I think it doesn’t – but that’s neither here nor there. It’s how the tests are built.
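These definitions are easy to check numerically. Here is a sketch in Python using scipy (my own illustration, not anything from the text): it computes \(\alpha\) and power for an example two-tailed decision rule for the ESP experiment. Both the decision rule (reject when \(X \le 40\) or \(X \ge 60\)) and the “true” \(\theta = 0.6\) are treated purely as illustrative assumptions here.

```python
from scipy.stats import binom

N = 100
# Example decision rule: reject the null when X <= 40 or X >= 60.
lo, hi = 40, 60

# alpha: probability of landing in the rejection region when the
# null (theta = 0.5) is actually true.
alpha = binom.cdf(lo, N, 0.5) + binom.sf(hi - 1, N, 0.5)

# power = 1 - beta: probability of rejecting when the null is false.
# We have to pick some specific true theta -- 0.6 is arbitrary.
power = binom.cdf(lo, N, 0.6) + binom.sf(hi - 1, N, 0.6)

print(round(alpha, 3), round(power, 3))
```

Note that because \(X\) is discrete, the achieved \(\alpha\) of this particular rule comes out slightly above .05; with count data you can rarely hit the nominal level exactly.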

11.3 Test statistics and sampling distributions

At this point we need to start talking specifics about how a hypothesis test is constructed. To that end, let’s return to the ESP example. Let’s ignore the actual data that we obtained, for the moment, and think about the structure of the experiment. Regardless of what the actual numbers are, the form of the data is that \(X\) out of \(N\) people correctly identified the colour of the hidden card. Moreover, let’s suppose for the moment that the null hypothesis really is true: ESP doesn’t exist, and the true probability that anyone picks the correct colour is exactly \(\theta = 0.5\) . What would we expect the data to look like? Well, obviously, we’d expect the proportion of people who make the correct response to be pretty close to 50%. Or, to phrase this in more mathematical terms, we’d say that \(X/N\) is approximately \(0.5\) . Of course, we wouldn’t expect this fraction to be exactly 0.5: if, for example we tested \(N=100\) people, and \(X = 53\) of them got the question right, we’d probably be forced to concede that the data are quite consistent with the null hypothesis. On the other hand, if \(X = 99\) of our participants got the question right, then we’d feel pretty confident that the null hypothesis is wrong. Similarly, if only \(X=3\) people got the answer right, we’d be similarly confident that the null was wrong. Let’s be a little more technical about this: we have a quantity \(X\) that we can calculate by looking at our data; after looking at the value of \(X\) , we make a decision about whether to believe that the null hypothesis is correct, or to reject the null hypothesis in favour of the alternative. The name for this thing that we calculate to guide our choices is a test statistic .

Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause us to reject the null hypothesis, and which values would cause us to keep it. In order to do so, we need to determine what the sampling distribution of the test statistic would be if the null hypothesis were actually true (we talked about sampling distributions earlier in Section 10.3.1 ). Why do we need this? Because this distribution tells us exactly what values of \(X\) our null hypothesis would lead us to expect. And therefore, we can use this distribution as a tool for assessing how closely the null hypothesis agrees with our data.


Figure 11.1: The sampling distribution for our test statistic \(X\) when the null hypothesis is true. For our ESP scenario, this is a binomial distribution. Not surprisingly, since the null hypothesis says that the probability of a correct response is \(\theta = .5\) , the sampling distribution says that the most likely value is 50 (out of 100) correct responses. Most of the probability mass lies between 40 and 60.

How do we actually determine the sampling distribution of the test statistic? For a lot of hypothesis tests this step is actually quite complicated, and later on in the book you’ll see me being slightly evasive about it for some of the tests (some of them I don’t even understand myself). However, sometimes it’s very easy. And, fortunately for us, our ESP example provides us with one of the easiest cases. Our population parameter \(\theta\) is just the overall probability that people respond correctly when asked the question, and our test statistic \(X\) is the count of the number of people who did so, out of a sample size of \(N\) . We’ve seen a distribution like this before, in Section 9.4 : that’s exactly what the binomial distribution describes! So, to use the notation and terminology that I introduced in that section, we would say that the null hypothesis predicts that \(X\) is binomially distributed, which is written \[ X \sim \mbox{Binomial}(\theta,N) \] Since the null hypothesis states that \(\theta = 0.5\) and our experiment has \(N=100\) people, we have the sampling distribution we need. This sampling distribution is plotted in Figure 11.1 . No surprises really: the null hypothesis says that \(X=50\) is the most likely outcome, and it says that we’re almost certain to see somewhere between 40 and 60 correct responses.
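The numbers behind Figure 11.1 are easy to reproduce. A minimal sketch in Python (scipy is my choice of tool here, nothing the text prescribes):

```python
from scipy.stats import binom

N, theta = 100, 0.5
null_dist = binom(N, theta)  # X ~ Binomial(theta = 0.5, N = 100)

# The single most likely outcome under the null is X = 50 correct...
print(null_dist.pmf(50))  # about 0.08

# ...and almost all of the probability mass lies between 40 and 60.
print(null_dist.cdf(60) - null_dist.cdf(39))  # about 0.96
```

So even under the null, a "perfect" 50/50 split is itself not very probable; what the null really predicts is that the count lands somewhere in the neighbourhood of 50.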

11.4 Making decisions

Okay, we’re very close to being finished. We’ve constructed a test statistic ( \(X\) ), and we chose this test statistic in such a way that we’re pretty confident that if \(X\) is close to \(N/2\) then we should retain the null, and if not we should reject it. The question that remains is this: exactly which values of the test statistic should we associate with the null hypothesis, and exactly which values go with the alternative hypothesis? In my ESP study, for example, I’ve observed a value of \(X=62\) . What decision should I make? Should I choose to believe the null hypothesis, or the alternative hypothesis?

11.4.1 Critical regions and critical values

To answer this question, we need to introduce the concept of a critical region for the test statistic \(X\) . The critical region of the test corresponds to those values of \(X\) that would lead us to reject the null hypothesis (which is why the critical region is also sometimes called the rejection region). How do we find this critical region? Well, let’s consider what we know:

  • \(X\) should be very big or very small in order to reject the null hypothesis.
  • If the null hypothesis is true, the sampling distribution of \(X\) is Binomial \((0.5, N)\) .
  • If \(\alpha =.05\) , the critical region must cover 5% of this sampling distribution.

It’s important to make sure you understand this last point: the critical region corresponds to those values of \(X\) for which we would reject the null hypothesis, and the sampling distribution in question describes the probability that we would obtain a particular value of \(X\) if the null hypothesis were actually true. Now, let’s suppose that we chose a critical region that covers 20% of the sampling distribution, and suppose that the null hypothesis is actually true. What would be the probability of incorrectly rejecting the null? The answer is of course 20%. And therefore, we would have built a test that had an \(\alpha\) level of \(0.2\) . If we want \(\alpha = .05\) , the critical region is only allowed to cover 5% of the sampling distribution of our test statistic.


Figure 11.2: The critical region associated with the hypothesis test for the ESP study, for a hypothesis test with a significance level of \(\alpha = .05\) . The plot itself shows the sampling distribution of \(X\) under the null hypothesis: the grey bars correspond to those values of \(X\) for which we would retain the null hypothesis. The black bars show the critical region: those values of \(X\) for which we would reject the null. Because the alternative hypothesis is two sided (i.e., allows both \(\theta <.5\) and \(\theta >.5\) ), the critical region covers both tails of the distribution. To ensure an \(\alpha\) level of \(.05\) , we need to ensure that each of the two regions encompasses 2.5% of the sampling distribution.

As it turns out, those three things uniquely solve the problem: our critical region consists of the most extreme values , known as the tails of the distribution. This is illustrated in Figure 11.2 . As it turns out, if we want \(\alpha = .05\) , then our critical regions correspond to \(X \leq 40\) and \(X \geq 60\) . That is, if the number of correct responses is between 41 and 59, then we should retain the null hypothesis. If the number is between 0 and 40 or between 60 and 100, then we should reject the null hypothesis. The numbers 40 and 60 are often referred to as the critical values , since they define the edges of the critical region.
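If you want to check those critical values yourself, the binomial quantile function recovers them. A sketch in Python (again scipy, my choice of tool); because \(X\) is discrete, each tail can only approximately equal 2.5%:

```python
from scipy.stats import binom

N, theta, alpha = 100, 0.5, 0.05
null_dist = binom(N, theta)

# ppf(q) returns the smallest x whose CDF reaches q, which here
# picks out the edges of the two tails of the null distribution.
lower = int(null_dist.ppf(alpha / 2))      # -> 40
upper = int(null_dist.ppf(1 - alpha / 2))  # -> 60

# Decision rule: reject the null when X <= lower or X >= upper.
print(lower, upper)
```

These match the critical values 40 and 60 quoted in the text.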

At this point, our hypothesis test is essentially complete: (1) we choose an \(\alpha\) level (e.g., \(\alpha = .05\) ), (2) come up with some test statistic (e.g., \(X\) ) that does a good job (in some meaningful sense) of comparing \(H_0\) to \(H_1\) , (3) figure out the sampling distribution of the test statistic on the assumption that the null hypothesis is true (in this case, binomial) and then (4) calculate the critical region that produces an appropriate \(\alpha\) level (0-40 and 60-100). All that we have to do now is calculate the value of the test statistic for the real data (e.g., \(X = 62\) ) and then compare it to the critical values to make our decision. Since 62 is greater than the critical value of 60, we would reject the null hypothesis. Or, to phrase it slightly differently, we say that the test has produced a significant result.

11.4.2 A note on statistical “significance”

Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners. – Attributed to G. O. Ashley

A very brief digression is in order at this point, regarding the word “significant”. The concept of statistical significance is actually a very simple one, but has a very unfortunate name. If the data allow us to reject the null hypothesis, we say that “the result is statistically significant ”, which is often shortened to “the result is significant”. This terminology is rather old, and dates back to a time when “significant” just meant something like “indicated”, rather than its modern meaning, which is much closer to “important”. As a result, a lot of modern readers get very confused when they start learning statistics, because they think that a “significant result” must be an important one. It doesn’t mean that at all. All that “statistically significant” means is that the data allowed us to reject a null hypothesis. Whether or not the result is actually important in the real world is a very different question, and depends on all sorts of other things.

11.4.3 The difference between one sided and two sided tests

There’s one more thing I want to point out about the hypothesis test that I’ve just constructed. If we take a moment to think about the statistical hypotheses I’ve been using, \[ \begin{array}{cc} H_0 : & \theta = .5 \\ H_1 : & \theta \neq .5 \end{array} \] we notice that the alternative hypothesis covers both the possibility that \(\theta < .5\) and the possibility that \(\theta > .5\) . This makes sense if I really think that ESP could produce better-than-chance performance or worse-than-chance performance (and there are some people who think that). In statistical language, this is an example of a two-sided test . It’s called this because the alternative hypothesis covers the area on both “sides” of the null hypothesis, and as a consequence the critical region of the test covers both tails of the sampling distribution (2.5% on either side if \(\alpha =.05\) ), as illustrated earlier in Figure 11.2 .

However, that’s not the only possibility. It might be the case, for example, that I’m only willing to believe in ESP if it produces better than chance performance. If so, then my alternative hypothesis would only cover the possibility that \(\theta > .5\) , and as a consequence the null hypothesis now becomes \(\theta \leq .5\) : \[ \begin{array}{cc} H_0 : & \theta \leq .5 \\ H_1 : & \theta > .5 \end{array} \] When this happens, we have what’s called a one-sided test , and the critical region only covers one tail of the sampling distribution. This is illustrated in Figure 11.3 .


Figure 11.3: The critical region for a one sided test. In this case, the alternative hypothesis is that \(\theta > .5\) , so we would only reject the null hypothesis for large values of \(X\) . As a consequence, the critical region only covers the upper tail of the sampling distribution; specifically, the upper 5% of the distribution (contrast this with the two-sided version shown earlier in Figure 11.2).
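The practical consequence of choosing sides is easy to see numerically. As a sketch, scipy's exact binomial test can stand in for the calculation (this is my own illustration, not a function the chapter uses; `binomtest` requires scipy 1.7 or later):

```python
from scipy.stats import binomtest

k, n = 62, 100  # observed correct responses in the ESP example

two_sided = binomtest(k, n, p=0.5, alternative='two-sided').pvalue
one_sided = binomtest(k, n, p=0.5, alternative='greater').pvalue

print(round(two_sided, 3))  # about 0.021
print(round(one_sided, 3))  # about half that, because the null
                            # distribution is symmetric here
```

The one-sided test yields a smaller p-value for the same data, which is exactly why the choice between them has to be made on substantive grounds in advance, not after peeking at the results.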

11.5 The \(p\) value of a test

In one sense, our hypothesis test is complete; we’ve constructed a test statistic, figured out its sampling distribution if the null hypothesis is true, and then constructed the critical region for the test. Nevertheless, I’ve actually omitted the most important number of all: the \(p\) value . It is to this topic that we now turn. There are two somewhat different ways of interpreting a \(p\) value, one proposed by Sir Ronald Fisher and the other by Jerzy Neyman. Both versions are legitimate, though they reflect very different ways of thinking about hypothesis tests. Most introductory textbooks tend to give Fisher’s version only, but I think that’s a bit of a shame. To my mind, Neyman’s version is cleaner, and actually better reflects the logic of the null hypothesis test. You might disagree though, so I’ve included both. I’ll start with Neyman’s version…

11.5.1 A softer view of decision making

One problem with the hypothesis testing procedure that I’ve described is that it makes no distinction at all between results that are “barely significant” and those that are “highly significant”. For instance, in my ESP study the data I obtained only just fell inside the critical region – so I did get a significant effect, but it was a pretty near thing. In contrast, suppose that I’d run a study in which \(X=97\) out of my \(N=100\) participants got the answer right. This would obviously be significant too, but by a much larger margin; there’s really no ambiguity about this at all. The procedure that I described makes no distinction between the two. If I adopt the standard convention of allowing \(\alpha = .05\) as my acceptable Type I error rate, then both of these are significant results.

This is where the \(p\) value comes in handy. To understand how it works, let’s suppose that we ran lots of hypothesis tests on the same data set: but with a different value of \(\alpha\) in each case. When we do that for my original ESP data, what we’d get is something like this:

  Value of \(\alpha\)    .05    .04    .03    .02    .01
  Reject the null?       Yes    Yes    Yes    No     No

When we test the ESP data ( \(X=62\) successes out of \(N=100\) observations) using \(\alpha\) levels of .03 and above, we’d always find ourselves rejecting the null hypothesis. For \(\alpha\) levels of .02 and below, we always end up retaining the null hypothesis. Therefore, somewhere between .02 and .03 there must be a smallest value of \(\alpha\) that would allow us to reject the null hypothesis for these data. This is the \(p\) value; as it turns out the ESP data has \(p = .021\) . In short:

\(p\) is defined to be the smallest Type I error rate ( \(\alpha\) ) that you have to be willing to tolerate if you want to reject the null hypothesis.

If it turns out that \(p\) describes an error rate that you find intolerable, then you must retain the null. If you’re comfortable with an error rate equal to \(p\) , then it’s okay to reject the null hypothesis in favour of your preferred alternative.

In effect, \(p\) is a summary of all the possible hypothesis tests that you could have run, taken across all possible \(\alpha\) values. And as a consequence it has the effect of “softening” our decision process. For those tests in which \(p \leq \alpha\) you would have rejected the null hypothesis, whereas for those tests in which \(p > \alpha\) you would have retained the null. In my ESP study I obtained \(X=62\) , and as a consequence I’ve ended up with \(p = .021\) . So the error rate I have to tolerate is 2.1%. In contrast, suppose my experiment had yielded \(X=97\) . What happens to my \(p\) value now? This time it’s shrunk to \(p = 1.36 \times 10^{-25}\) , which is a tiny, tiny Type I error rate. For this second case I would be able to reject the null hypothesis with a lot more confidence, because I only have to be “willing” to tolerate a type I error rate of about 1 in 10 trillion trillion in order to justify my decision to reject.
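Neyman's reading of \(p\) as "the smallest tolerable \(\alpha\)" can be demonstrated directly. In this sketch, scipy's exact binomial test stands in for the calculation (my choice of tool, not the book's); it reproduces the \(p = .021\) figure and shows the decision flipping exactly where \(\alpha\) crosses \(p\):

```python
from scipy.stats import binomtest

# Exact two-sided binomial test for X = 62 out of N = 100.
p_value = binomtest(62, 100, p=0.5).pvalue
print(round(p_value, 3))  # 0.021

# The decision flips precisely where alpha crosses the p value.
decisions = {}
for alpha in (0.05, 0.04, 0.03, 0.02, 0.01):
    decisions[alpha] = "reject" if p_value <= alpha else "retain"
    print(f"alpha = {alpha:.2f}: {decisions[alpha]} the null")
```

Running all five "tests" at once like this is exactly the summary that the \(p\) value packages into a single number.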

11.5.2 The probability of extreme data

The second definition of the \(p\) -value comes from Sir Ronald Fisher, and it’s actually this one that you tend to see in most introductory statistics textbooks. Notice how, when I constructed the critical region, it corresponded to the tails (i.e., extreme values) of the sampling distribution? That’s not a coincidence: almost all “good” tests have this characteristic (good in the sense of minimising our type II error rate, \(\beta\) ). The reason for that is that a good critical region almost always corresponds to those values of the test statistic that are least likely to be observed if the null hypothesis is true. If this rule is true, then we can define the \(p\) -value as the probability that we would have observed a test statistic that is at least as extreme as the one we actually did get. In other words, if the data are extremely implausible according to the null hypothesis, then the null hypothesis is probably wrong.

11.5.3 A common mistake

Okay, so you can see that there are two rather different but legitimate ways to interpret the \(p\) value, one based on Neyman’s approach to hypothesis testing and the other based on Fisher’s. Unfortunately, there is a third explanation that people sometimes give, especially when they’re first learning statistics, and it is absolutely and completely wrong . This mistaken approach is to refer to the \(p\) value as “the probability that the null hypothesis is true”. It’s an intuitively appealing way to think, but it’s wrong in two key respects: (1) null hypothesis testing is a frequentist tool, and the frequentist approach to probability does not allow you to assign probabilities to the null hypothesis… according to this view of probability, the null hypothesis is either true or it is not; it cannot have a “5% chance” of being true. (2) even within the Bayesian approach, which does let you assign probabilities to hypotheses, the \(p\) value would not correspond to the probability that the null is true; this interpretation is entirely inconsistent with the mathematics of how the \(p\) value is calculated. Put bluntly, despite the intuitive appeal of thinking this way, there is no justification for interpreting a \(p\) value this way. Never do it.

11.6 Reporting the results of a hypothesis test

When writing up the results of a hypothesis test, there’s usually several pieces of information that you need to report, but it varies a fair bit from test to test. Throughout the rest of the book I’ll spend a little time talking about how to report the results of different tests (see Section 12.1.9 for a particularly detailed example), so that you can get a feel for how it’s usually done. However, regardless of what test you’re doing, the one thing that you always have to do is say something about the \(p\) value, and whether or not the outcome was significant.

The fact that you have to do this is unsurprising; it’s the whole point of doing the test. What might be surprising is the fact that there is some contention over exactly how you’re supposed to do it. Leaving aside those people who completely disagree with the entire framework underpinning null hypothesis testing, there’s a certain amount of tension that exists regarding whether or not to report the exact \(p\) value that you obtained, or if you should state only that \(p < \alpha\) for a significance level that you chose in advance (e.g., \(p<.05\) ).

11.6.1 The issue

To see why this is an issue, the key thing to recognise is that \(p\) values are terribly convenient. In practice, the fact that we can compute a \(p\) value means that we don’t actually have to specify any \(\alpha\) level at all in order to run the test. Instead, what you can do is calculate your \(p\) value and interpret it directly: if you get \(p = .062\) , then it means that you’d have to be willing to tolerate a Type I error rate of 6.2% to justify rejecting the null. If you personally find 6.2% intolerable, then you retain the null. Therefore, the argument goes, why don’t we just report the actual \(p\) value and let the reader make up their own minds about what an acceptable Type I error rate is? This approach has the big advantage of “softening” the decision making process – in fact, if you accept the Neyman definition of the \(p\) value, that’s the whole point of the \(p\) value. We no longer have a fixed significance level of \(\alpha = .05\) as a bright line separating “accept” from “reject” decisions; and this removes the rather pathological problem of being forced to treat \(p = .051\) in a fundamentally different way to \(p = .049\) .

This flexibility is both the advantage and the disadvantage to the \(p\) value. The reason why a lot of people don’t like the idea of reporting an exact \(p\) value is that it gives the researcher a bit too much freedom. In particular, it lets you change your mind about what error tolerance you’re willing to put up with after you look at the data. For instance, consider my ESP experiment. Suppose I ran my test, and ended up with a \(p\) value of .09. Should I accept or reject? Now, to be honest, I haven’t yet bothered to think about what level of Type I error I’m “really” willing to accept. I don’t have an opinion on that topic. But I do have an opinion about whether or not ESP exists, and I definitely have an opinion about whether my research should be published in a reputable scientific journal. And amazingly, now that I’ve looked at the data I’m starting to think that a 9% error rate isn’t so bad, especially when compared to how annoying it would be to have to admit to the world that my experiment has failed. So, to avoid looking like I just made it up after the fact, I now say that my \(\alpha\) is .1: a 10% type I error rate isn’t too bad, and at that level my test is significant! I win.

In other words, the worry here is that I might have the best of intentions, and be the most honest of people, but the temptation to just “shade” things a little bit here and there is really, really strong. As anyone who has ever run an experiment can attest, it’s a long and difficult process, and you often get very attached to your hypotheses. It’s hard to let go and admit the experiment didn’t find what you wanted it to find. And that’s the danger here. If we use the “raw” \(p\) -value, people will start interpreting the data in terms of what they want to believe, not what the data are actually saying… and if we allow that, well, why are we bothering to do science at all? Why not let everyone believe whatever they like about anything, regardless of what the facts are? Okay, that’s a bit extreme, but that’s where the worry comes from. According to this view, you really must specify your \(\alpha\) value in advance, and then only report whether the test was significant or not. It’s the only way to keep ourselves honest.

11.6.2 Two proposed solutions

In practice, it’s pretty rare for a researcher to specify a single \(\alpha\) level ahead of time. Instead, the convention is that scientists rely on three standard significance levels: .05, .01 and .001. When reporting your results, you indicate which (if any) of these significance levels allow you to reject the null hypothesis. This is summarised in Table 11.1 . This allows us to soften the decision rule a little bit, since \(p<.01\) implies that the data meet a stronger evidentiary standard than \(p<.05\) would. Nevertheless, since these levels are fixed in advance by convention, it does prevent people choosing their \(\alpha\) level after looking at the data.
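The tiered convention summarised in Table 11.1 is easy to mechanise. As an illustrative sketch (the function name and the exact wording of the labels are my own, not from any standard library), a small Python helper that maps a \(p\) value onto the conventional reporting tiers might look like this:

```python
def significance_label(p):
    """Map a p-value onto the conventional reporting tiers
    (.05, .01, .001). Hypothetical helper, for illustration only."""
    if p < 0.001:
        return "p < .001"
    elif p < 0.01:
        return "p < .01"
    elif p < 0.05:
        return "p < .05"
    else:
        return "not significant (p > .05)"

print(significance_label(0.062))   # not significant (p > .05)
print(significance_label(0.0004))  # p < .001
```

Because the tiers are fixed in advance by convention, a helper like this cannot be gamed after the fact in the way a freely chosen \(\alpha\) can.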

Nevertheless, quite a lot of people still prefer to report exact \(p\) values. To many people, the advantage of allowing the reader to make up their own mind about how to interpret \(p = .06\) outweighs any disadvantages. In practice, however, even among those researchers who prefer exact \(p\) values it is quite common to just write \(p<.001\) instead of reporting an exact value for small \(p\) . This is in part because a lot of software doesn’t actually print out the \(p\) value when it’s that small (e.g., SPSS just writes \(p = .000\) whenever \(p<.001\) ), and in part because a very small \(p\) value can be kind of misleading. The human mind sees a number like .0000000001 and it’s hard to suppress the gut feeling that the evidence in favour of the alternative hypothesis is a near certainty. In practice however, this is usually wrong. Life is a big, messy, complicated thing: and every statistical test ever invented relies on simplifications, approximations and assumptions. As a consequence, it’s probably not reasonable to walk away from any statistical analysis with a feeling of confidence stronger than \(p<.001\) implies. In other words, \(p<.001\) is really code for “as far as this test is concerned, the evidence is overwhelming.”

In light of all this, you might be wondering exactly what you should do. There’s a fair bit of contradictory advice on the topic, with some people arguing that you should report the exact \(p\) value, and other people arguing that you should use the tiered approach illustrated in Table 11.1 . As a result, the best advice I can give is to suggest that you look at papers/reports written in your field and see what the convention seems to be. If there doesn’t seem to be any consistent pattern, then use whichever method you prefer.

11.7 Running the hypothesis test in practice

At this point some of you might be wondering if this is a “real” hypothesis test, or just a toy example that I made up. It’s real. In the previous discussion I built the test from first principles, thinking that it was the simplest possible problem that you might ever encounter in real life. However, this test already exists: it’s called the binomial test , and it’s implemented by an R function called binom.test() . To test the null hypothesis that the response probability is one-half (p = .5), 165 using data in which x = 62 of n = 100 people made the correct response, the R command is binom.test(x = 62, n = 100, p = .5) .
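If you’re not an R user, the same test is easy to re-create from first principles. The sketch below is a plain-Python reimplementation (standard library only, names my own) of the two-sided binomial test, using the rule that binom.test() uses: sum the probability of every outcome that is no more likely, under the null, than the one actually observed.

```python
from math import comb

def binom_test_p(x, n, p0):
    """Two-sided binomial test p-value: the total probability, under
    the null, of every outcome no more likely than the observed one."""
    probs = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    return sum(pr for pr in probs if pr <= probs[x])

p = binom_test_p(x=62, n=100, p0=0.5)
print(round(p, 3))  # 0.021
```

This agrees with the \(p\) value of about .02 that the R output reports for the ESP data.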

Right now, the output of binom.test() probably looks pretty unfamiliar to you, but you can see that it’s telling you more or less the right things. Specifically, the \(p\) -value of 0.02 is less than the usual choice of \(\alpha = .05\) , so you can reject the null. We’ll talk a lot more about how to read this sort of output as we go along; and after a while you’ll hopefully find it quite easy to read and understand. For now, however, I just wanted to make the point that R contains a whole lot of functions corresponding to different kinds of hypothesis test. And while I’ll usually spend quite a lot of time explaining the logic behind how the tests are built, every time I discuss a hypothesis test the discussion will end with me showing you a fairly simple R command that you can use to run the test in practice.

11.8 Effect size, sample size and power

In previous sections I’ve emphasised the fact that the major design principle behind statistical hypothesis testing is that we try to control our Type I error rate. When we fix \(\alpha = .05\) we are attempting to ensure that only 5% of true null hypotheses are incorrectly rejected. However, this doesn’t mean that we don’t care about Type II errors. In fact, from the researcher’s perspective, the error of failing to reject the null when it is actually false is an extremely annoying one. With that in mind, a secondary goal of hypothesis testing is to try to minimise \(\beta\) , the Type II error rate, although we don’t usually talk in terms of minimising Type II errors. Instead, we talk about maximising the power of the test. Since power is defined as \(1-\beta\) , this is the same thing.

11.8.1 The power function


Figure 11.4: Sampling distribution under the alternative hypothesis, for a population parameter value of \(\theta = 0.55\) . A reasonable proportion of the distribution lies in the rejection region.

Let’s take a moment to think about what a Type II error actually is. A Type II error occurs when the alternative hypothesis is true, but we are nevertheless unable to reject the null hypothesis. Ideally, we’d be able to calculate a single number \(\beta\) that tells us the Type II error rate, in the same way that we can set \(\alpha = .05\) for the Type I error rate. Unfortunately, this is a lot trickier to do. To see this, notice that in my ESP study the alternative hypothesis actually corresponds to lots of possible values of \(\theta\) . In fact, the alternative hypothesis corresponds to every value of \(\theta\) except 0.5. Let’s suppose that the true probability of someone choosing the correct response is 55% (i.e., \(\theta = .55\) ). If so, then the true sampling distribution for \(X\) is not the same one that the null hypothesis predicts: the most likely value for \(X\) is now 55 out of 100. Not only that, the whole sampling distribution has now shifted, as shown in Figure 11.4 . The critical regions, of course, do not change: by definition, the critical regions are based on what the null hypothesis predicts. What we’re seeing in this figure is the fact that when the null hypothesis is wrong, a much larger proportion of the sampling distribution falls in the critical region. And of course that’s what should happen: the probability of rejecting the null hypothesis is larger when the null hypothesis is actually false! However, \(\theta = .55\) is not the only possibility consistent with the alternative hypothesis. Let’s instead suppose that the true value of \(\theta\) is actually 0.7. What happens to the sampling distribution when this occurs? The answer, shown in Figure 11.5 , is that almost the entirety of the sampling distribution has now moved into the critical region. Therefore, if \(\theta = 0.7\) the probability of us correctly rejecting the null hypothesis (i.e., the power of the test) is much larger than if \(\theta = 0.55\) .
In short, while \(\theta = .55\) and \(\theta = .70\) are both part of the alternative hypothesis, the Type II error rate is different.


Figure 11.5: Sampling distribution under the alternative hypothesis, for a population parameter value of \(\theta = 0.70\) . Almost all of the distribution lies in the rejection region.


Figure 11.6: The probability that we will reject the null hypothesis, plotted as a function of the true value of \(\theta\) . Obviously, the test is more powerful (greater chance of correct rejection) if the true value of \(\theta\) is very different from the value that the null hypothesis specifies (i.e., \(\theta=.5\) ). Notice that when \(\theta\) actually is equal to .5 (plotted as a black dot), the null hypothesis is in fact true: rejecting the null hypothesis in this instance would be a Type I error.

What all this means is that the power of a test (i.e., \(1-\beta\) ) depends on the true value of \(\theta\) . To illustrate this, I’ve calculated the expected probability of rejecting the null hypothesis for all values of \(\theta\) , and plotted it in Figure 11.6 . This plot describes what is usually called the power function of the test. It’s a nice summary of how good the test is, because it actually tells you the power ( \(1-\beta\) ) for all possible values of \(\theta\) . As you can see, when the true value of \(\theta\) is very close to 0.5, the power of the test drops very sharply, but when it is further away, the power is large.
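The power function is straightforward to compute directly. The sketch below is a standard-library Python re-creation of the ESP example (the symmetric two-tailed rejection region is my assumption for this sketch; the chapter's R-based figures may be built slightly differently):

```python
from math import comb

def pmf(n, p, k):
    """Binomial probability of exactly k successes out of n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_region(n, alpha=0.05, p0=0.5):
    """Widest symmetric two-tailed region {X <= lo or X >= hi}
    whose type I error rate stays at or below alpha."""
    probs = [pmf(n, p0, k) for k in range(n + 1)]
    for lo in range(n // 2, -1, -1):
        hi = n - lo
        if sum(probs[:lo + 1]) + sum(probs[hi:]) <= alpha:
            return lo, hi
    return -1, n + 1  # degenerate case: never reject

def power(theta, n=100, alpha=0.05, p0=0.5):
    """Probability of rejecting the null when the true value is theta."""
    lo, hi = critical_region(n, alpha, p0)
    return (sum(pmf(n, theta, k) for k in range(lo + 1))
            + sum(pmf(n, theta, k) for k in range(hi, n + 1)))

# Trace the power function (cf. Figure 11.6) at a few true values
for theta in (0.5, 0.55, 0.6, 0.7, 0.8):
    print(theta, round(power(theta), 3))
```

Near \(\theta = .5\) the computed rejection probability dips down toward \(\alpha\) (where rejecting would be a Type I error), and it climbs toward 1 as \(\theta\) moves away from the null value.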

11.8.2 Effect size

Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned with mice when there are tigers abroad – George Box 1976

The plot shown in Figure 11.6 captures a fairly basic point about hypothesis testing. If the true state of the world is very different from what the null hypothesis predicts, then your power will be very high; but if the true state of the world is similar to the null (but not identical) then the power of the test is going to be very low. Therefore, it’s useful to be able to have some way of quantifying how “similar” the true state of the world is to the null hypothesis. A statistic that does this is called a measure of effect size (e.g. Cohen 1988 ; Ellis 2010 ) . Effect size is defined slightly differently in different contexts 166 (and so this section just talks in general terms), but the qualitative idea that it tries to capture is always the same: how big is the difference between the true population parameters, and the parameter values that are assumed by the null hypothesis? In our ESP example, if we let \(\theta_0 = 0.5\) denote the value assumed by the null hypothesis, and let \(\theta\) denote the true value, then a simple measure of effect size could be something like the difference between the true value and the null (i.e., \(\theta - \theta_0\) ), or possibly just the magnitude of this difference, \(\mbox{abs}(\theta - \theta_0)\) .

Why calculate effect size? Let’s assume that you’ve run your experiment, collected the data, and gotten a significant effect when you ran your hypothesis test. Isn’t it enough just to say that you’ve gotten a significant effect? Surely that’s the point of hypothesis testing? Well, sort of. Yes, the point of doing a hypothesis test is to try to demonstrate that the null hypothesis is wrong, but that’s hardly the only thing we’re interested in. If the null hypothesis claimed that \(\theta = .5\) , and we show that it’s wrong, we’ve only really told half of the story. Rejecting the null hypothesis implies that we believe that \(\theta \neq .5\) , but there’s a big difference between \(\theta = .51\) and \(\theta = .8\) . If we find that \(\theta = .8\) , then not only have we found that the null hypothesis is wrong, it appears to be very wrong. On the other hand, suppose we’ve successfully rejected the null hypothesis, but it looks like the true value of \(\theta\) is only .51 (this would only be possible with a large study). Sure, the null hypothesis is wrong, but it’s not at all clear that we actually care , because the effect size is so small. In the context of my ESP study we might still care, since any demonstration of real psychic powers would actually be pretty cool, 167 but in other contexts a 1% difference isn’t very interesting, even if it is a real difference. For instance, suppose we’re looking at differences in high school exam scores between males and females, and it turns out that the female scores are 1% higher on average than the males. If I’ve got data from thousands of students, then this difference will almost certainly be statistically significant , but regardless of how small the \(p\) value is it’s just not very interesting. You’d hardly want to go around proclaiming a crisis in boys’ education on the basis of such a tiny difference, would you?
It’s for this reason that it is becoming more standard (slowly, but surely) to report some kind of standard measure of effect size along with the results of the hypothesis test. The hypothesis test itself tells you whether you should believe that the effect you have observed is real (i.e., not just due to chance); the effect size tells you whether or not you should care.

11.8.3 Increasing the power of your study

Not surprisingly, scientists are fairly obsessed with maximising the power of their experiments. We want our experiments to work, and so we want to maximise the chance of rejecting the null hypothesis if it is false (and of course we usually want to believe that it is false!) As we’ve seen, one factor that influences power is the effect size. So the first thing you can do to increase your power is to increase the effect size. In practice, what this means is that you want to design your study in such a way that the effect size gets magnified. For instance, in my ESP study I might believe that psychic powers work best in a quiet, darkened room; with fewer distractions to cloud the mind. Therefore I would try to conduct my experiments in just such an environment: if I can strengthen people’s ESP abilities somehow, then the true value of \(\theta\) will go up 168 and therefore my effect size will be larger. In short, clever experimental design is one way to boost power; because it can alter the effect size.

Unfortunately, it’s often the case that even with the best of experimental designs you may have only a small effect. Perhaps, for example, ESP really does exist, but even under the best of conditions it’s very very weak. Under those circumstances, your best bet for increasing power is to increase the sample size. In general, the more observations that you have available, the more likely it is that you can discriminate between two hypotheses. If I ran my ESP experiment with 10 participants, and 7 of them correctly guessed the colour of the hidden card, you wouldn’t be terribly impressed. But if I ran it with 10,000 participants and 7,000 of them got the answer right, you would be much more likely to think I had discovered something. In other words, power increases with the sample size. This is illustrated in Figure 11.7 , which shows the power of the test for a true parameter of \(\theta = 0.7\) , for all sample sizes \(N\) from 1 to 100, where I’m assuming that the null hypothesis predicts that \(\theta_0 = 0.5\) .
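The relationship in Figure 11.7 can be reproduced with a short standard-library Python sketch by holding \(\theta = 0.7\) fixed and varying \(N\) (the symmetric two-tailed rejection region below is my assumption for this sketch):

```python
from math import comb

def pmf(n, p, k):
    """Binomial probability of exactly k successes out of n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_region(n, alpha=0.05, p0=0.5):
    # widest symmetric two-tailed region with type I rate <= alpha
    probs = [pmf(n, p0, k) for k in range(n + 1)]
    for lo in range(n // 2, -1, -1):
        if sum(probs[:lo + 1]) + sum(probs[n - lo:]) <= alpha:
            return lo, n - lo
    return -1, n + 1  # degenerate case: never reject

def power(theta, n, alpha=0.05, p0=0.5):
    """Probability of rejecting the null when the true value is theta."""
    lo, hi = critical_region(n, alpha, p0)
    return (sum(pmf(n, theta, k) for k in range(lo + 1))
            + sum(pmf(n, theta, k) for k in range(hi, n + 1)))

# Power grows with sample size (with small zig-zags caused by the
# discreteness of the binomial distribution)
for n in (10, 25, 50, 100):
    print(n, round(power(0.7, n), 3))
```

The overall upward trend is clear even though the discreteness of the binomial means the curve is not perfectly smooth.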


Figure 11.7: The power of our test, plotted as a function of the sample size \(N\) . In this case, the true value of \(\theta\) is 0.7, but the null hypothesis is that \(\theta = 0.5\) . Overall, larger \(N\) means greater power. (The small zig-zags in this function occur because of some odd interactions between \(\theta\) , \(\alpha\) and the fact that the binomial distribution is discrete; it doesn’t matter for any serious purpose)

Because power is important, whenever you’re contemplating running an experiment it would be pretty useful to know how much power you’re likely to have. It’s never possible to know for sure, since you can’t possibly know what your effect size is. However, it’s often (well, sometimes) possible to guess how big it should be. If so, you can guess what sample size you need! This idea is called power analysis , and if it’s feasible to do it, then it’s very helpful, since it can tell you something about whether you have enough time or money to be able to run the experiment successfully. It’s increasingly common to see people arguing that power analysis should be a required part of experimental design, so it’s worth knowing about. I don’t discuss power analysis in this book, however. This is partly for a boring reason and partly for a substantive one. The boring reason is that I haven’t had time to write about power analysis yet. The substantive one is that I’m still a little suspicious of power analysis. Speaking as a researcher, I have very rarely found myself in a position to be able to do one: it’s either the case that (a) my experiment is a bit non-standard and I don’t know how to define effect size properly, or (b) I literally have so little idea about what the effect size will be that I wouldn’t know how to interpret the answers. Not only that, after extensive conversations with someone who does stats consulting for a living (my wife, as it happens), I can’t help but notice that in practice the only time anyone ever asks her for a power analysis is when she’s helping someone write a grant application. In other words, the only time any scientist ever seems to want a power analysis in real life is when they’re being forced to do it by bureaucratic process. It’s not part of anyone’s day-to-day work.
In short, I’ve always been of the view that while power is an important concept, power analysis is not as useful as people make it sound, except in the rare cases where (a) someone has figured out how to calculate power for your actual experimental design and (b) you have a pretty good idea what the effect size is likely to be. Maybe other people have had better experiences than me, but I’ve personally never been in a situation where both (a) and (b) were true. Maybe I’ll be convinced otherwise in the future, and probably a future version of this book would include a more detailed discussion of power analysis, but for now this is about as much as I’m comfortable saying about the topic.
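Whatever one thinks of power analysis in practice, its mechanics are easy to sketch for the binomial case. The standard-library Python code below searches for the smallest sample size reaching a target power; everything in it is an assumption for illustration (the guessed true value \(\theta = 0.6\), the 80% power target, the step size, and the symmetric rejection region), not a recommendation:

```python
from math import comb

def pmf(n, p, k):
    """Binomial probability of exactly k successes out of n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_region(n, alpha=0.05, p0=0.5):
    # widest symmetric two-tailed region with type I rate <= alpha
    probs = [pmf(n, p0, k) for k in range(n + 1)]
    for lo in range(n // 2, -1, -1):
        if sum(probs[:lo + 1]) + sum(probs[n - lo:]) <= alpha:
            return lo, n - lo
    return -1, n + 1

def power(theta, n, alpha=0.05, p0=0.5):
    """Probability of rejecting the null when the true value is theta."""
    lo, hi = critical_region(n, alpha, p0)
    return (sum(pmf(n, theta, k) for k in range(lo + 1))
            + sum(pmf(n, theta, k) for k in range(hi, n + 1)))

def minimum_n(theta, target=0.80, alpha=0.05, p0=0.5, step=10, max_n=400):
    """Smallest sample size (searched in steps of `step`) whose
    power reaches `target` for the guessed true value `theta`."""
    for n in range(step, max_n + 1, step):
        if power(theta, n, alpha, p0) >= target:
            return n
    return None  # target power not reachable within max_n

n_needed = minimum_n(0.6)
print(n_needed)
```

Notice that the whole exercise hinges on the guessed \(\theta\): this is exactly the step that, as argued above, researchers are often in no position to take.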

11.9 Some issues to consider

What I’ve described to you in this chapter is the orthodox framework for null hypothesis significance testing (NHST). Understanding how NHST works is an absolute necessity, since it has been the dominant approach to inferential statistics ever since it came to prominence in the early 20th century. It’s what the vast majority of working scientists rely on for their data analysis, so even if you hate it you need to know it. However, the approach is not without problems. There are a number of quirks in the framework, historical oddities in how it came to be, theoretical disputes over whether or not the framework is right, and a lot of practical traps for the unwary. I’m not going to go into a lot of detail on this topic, but I think it’s worth briefly discussing a few of these issues.

11.9.1 Neyman versus Fisher

The first thing you should be aware of is that orthodox NHST is actually a mash-up of two rather different approaches to hypothesis testing, one proposed by Sir Ronald Fisher and the other proposed by Jerzy Neyman (for a historical summary see Lehmann 2011 ) . The history is messy because Fisher and Neyman were real people whose opinions changed over time, and at no point did either of them offer “the definitive statement” of how we should interpret their work many decades later. That said, here’s a quick summary of what I take these two approaches to be.

First, let’s talk about Fisher’s approach. As far as I can tell, Fisher assumed that you only had the one hypothesis (the null), and what you want to do is find out if the null hypothesis is inconsistent with the data. From his perspective, what you should do is check to see if the data are “sufficiently unlikely” according to the null. In fact, if you remember back to our earlier discussion, that’s how Fisher defines the \(p\) -value. According to Fisher, if the null hypothesis provided a very poor account of the data, you could safely reject it. But, since you don’t have any other hypotheses to compare it to, there’s no way of “accepting the alternative” because you don’t necessarily have an explicitly stated alternative. That’s more or less all that there was to it.

In contrast, Neyman thought that the point of hypothesis testing was as a guide to action, and his approach was somewhat more formal than Fisher’s. His view was that there are multiple things that you could do (accept the null or accept the alternative) and the point of the test was to tell you which one the data support. From this perspective, it is critical to specify your alternative hypothesis properly. If you don’t know what the alternative hypothesis is, then you don’t know how powerful the test is, or even which action makes sense. His framework genuinely requires a competition between different hypotheses. For Neyman, the \(p\) value didn’t directly measure the probability of the data (or data more extreme) under the null, it was more of an abstract description about which “possible tests” were telling you to accept the null, and which “possible tests” were telling you to accept the alternative.

As you can see, what we have today is an odd mishmash of the two. We talk about having both a null hypothesis and an alternative (Neyman), but usually 169 define the \(p\) value in terms of extreme data (Fisher), but we still have \(\alpha\) values (Neyman). Some of the statistical tests have explicitly specified alternatives (Neyman) but others are quite vague about it (Fisher). And, according to some people at least, we’re not allowed to talk about accepting the alternative (Fisher). It’s a mess: but I hope this at least explains why it’s a mess.

11.9.2 Bayesians versus frequentists

Earlier on in this chapter I was quite emphatic about the fact that you cannot interpret the \(p\) value as the probability that the null hypothesis is true. NHST is fundamentally a frequentist tool (see Chapter 9 ) and as such it does not allow you to assign probabilities to hypotheses: the null hypothesis is either true or it is not. The Bayesian approach to statistics interprets probability as a degree of belief, so it’s totally okay to say that there is a 10% chance that the null hypothesis is true: that’s just a reflection of the degree of confidence that you have in this hypothesis. You aren’t allowed to do this within the frequentist approach. Remember, if you’re a frequentist, a probability can only be defined in terms of what happens after a large number of independent replications (i.e., a long run frequency). If this is your interpretation of probability, talking about the “probability” that the null hypothesis is true is complete gibberish: a null hypothesis is either true or it is false. There’s no way you can talk about a long run frequency for this statement. To talk about “the probability of the null hypothesis” is as meaningless as “the colour of freedom”. It doesn’t have one!

Most importantly, this isn’t a purely ideological matter. If you decide that you are a Bayesian and that you’re okay with making probability statements about hypotheses, you have to follow the Bayesian rules for calculating those probabilities. I’ll talk more about this in Chapter 17 , but for now what I want to point out to you is the \(p\) value is a terrible approximation to the probability that \(H_0\) is true. If what you want to know is the probability of the null, then the \(p\) value is not what you’re looking for!

11.9.3 Traps

As you can see, the theory behind hypothesis testing is a mess, and even now there are arguments in statistics about how it “should” work. However, disagreements among statisticians are not our real concern here. Our real concern is practical data analysis. And while the “orthodox” approach to null hypothesis significance testing has many drawbacks, even an unrepentant Bayesian like myself would agree that orthodox tests can be useful if used responsibly. Most of the time they give sensible answers, and you can use them to learn interesting things. Setting aside the various ideologies and historical confusions that we’ve discussed, the fact remains that the biggest danger in all of statistics is thoughtlessness . I don’t mean stupidity, here: I literally mean thoughtlessness. The rush to interpret a result without spending time thinking through what each test actually says about the data, and checking whether that’s consistent with how you’ve interpreted it. That’s where the biggest trap lies.

To give an example of this, consider the following example (see Gelman and Stern 2006 ) . Suppose I’m running my ESP study, and I’ve decided to analyse the data separately for the male participants and the female participants. Of the male participants, 33 out of 50 guessed the colour of the card correctly. This is a significant effect ( \(p = .03\) ). Of the female participants, 29 out of 50 guessed correctly. This is not a significant effect ( \(p = .32\) ). Upon observing this, it is extremely tempting for people to start wondering why there is a difference between males and females in terms of their psychic abilities. However, this is wrong. If you think about it, we haven’t actually run a test that explicitly compares males to females. All we have done is compare males to chance (binomial test was significant) and compared females to chance (binomial test was non significant). If we want to argue that there is a real difference between the males and the females, we should probably run a test of the null hypothesis that there is no difference! We can do that using a different hypothesis test, 170 but when we do that it turns out that we have no evidence that males and females are significantly different ( \(p = .54\) ). Now do you think that there’s anything fundamentally different between the two groups? Of course not. What’s happened here is that the data from both groups (male and female) are pretty borderline: by pure chance, one of them happened to end up on the magic side of the \(p = .05\) line, and the other one didn’t. That doesn’t actually imply that males and females are different. This mistake is so common that you should always be wary of it: the difference between significant and not-significant is not evidence of a real difference – if you want to say that there’s a difference between two groups, then you have to test for that difference!
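To see the arithmetic behind this trap, the sketch below re-runs the two within-group binomial tests in plain Python (the between-groups comparison itself, the p = .54 result, comes from a separate test of males against females, which this sketch does not reproduce):

```python
from math import comb

def binom_test_p(x, n, p0=0.5):
    """Two-sided binomial test p-value: total probability, under the
    null, of every outcome no more likely than the observed one."""
    probs = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    return sum(pr for pr in probs if pr <= probs[x])

p_male = binom_test_p(33, 50)    # significant: about .03
p_female = binom_test_p(29, 50)  # not significant: about .32
print(round(p_male, 2), round(p_female, 2))
```

One group lands just under the .05 line and the other just over it, yet that asymmetry is not itself evidence that the two groups differ: only a direct test of the difference can speak to that.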

The example above is just that: an example. I’ve singled it out because it’s such a common one, but the bigger picture is that data analysis can be tricky to get right. Think about what it is you want to test, why you want to test it, and whether or not the answers that your test gives could possibly make any sense in the real world.

11.10 Summary

Null hypothesis testing is one of the most ubiquitous elements to statistical theory. The vast majority of scientific papers report the results of some hypothesis test or another. As a consequence it is almost impossible to get by in science without having at least a cursory understanding of what a \(p\) -value means, making this one of the most important chapters in the book. As usual, I’ll end the chapter with a quick recap of the key ideas that we’ve talked about:

  • Research hypotheses and statistical hypotheses. Null and alternative hypotheses. (Section 11.1 ).
  • Type 1 and Type 2 errors (Section 11.2 )
  • Test statistics and sampling distributions (Section 11.3 )
  • Hypothesis testing as a decision making process (Section 11.4 )
  • \(p\) -values as “soft” decisions (Section 11.5 )
  • Writing up the results of a hypothesis test (Section 11.6 )
  • Effect size and power (Section 11.8 )
  • A few issues to consider regarding hypothesis testing (Section 11.9 )

Later in the book, in Chapter 17 , I’ll revisit the theory of null hypothesis tests from a Bayesian perspective, and introduce a number of new tools that you can use if you aren’t particularly fond of the orthodox approach. But for now, though, we’re done with the abstract statistical theory, and we can start discussing specific data analysis tools.

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Lawrence Erlbaum.

Ellis, P. D. 2010. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results. Cambridge, UK: Cambridge University Press.

Lehmann, Erich L. 2011. Fisher, Neyman, and the Creation of Classical Statistics. Springer.

Gelman, A., and H. Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60: 328–31.

  • The quote comes from Wittgenstein’s (1922) text, Tractatus Logico-Philosophicus. ↩
  • A technical note. The description below differs subtly from the standard description given in a lot of introductory texts. The orthodox theory of null hypothesis testing emerged from the work of Sir Ronald Fisher and Jerzy Neyman in the early 20th century; but Fisher and Neyman actually had very different views about how it should work. The standard treatment of hypothesis testing that most texts use is a hybrid of the two approaches. The treatment here is a little more Neyman-style than the orthodox view, especially as regards the meaning of the \(p\) value. ↩
  • My apologies to anyone who actually believes in this stuff, but on my reading of the literature on ESP, it’s just not reasonable to think this is real. To be fair, though, some of the studies are rigorously designed; so it’s actually an interesting area for thinking about psychological research design. And of course it’s a free country, so you can spend your own time and effort proving me wrong if you like, but I wouldn’t think that’s a terribly practical use of your intellect. ↩
  • This analogy only works if you’re from an adversarial legal system like UK/US/Australia. As I understand these things, the French inquisitorial system is quite different. ↩
  • An aside regarding the language you use to talk about hypothesis testing. Firstly, one thing you really want to avoid is the word “prove”: a statistical test really doesn’t prove that a hypothesis is true or false. Proof implies certainty, and as the saying goes, statistics means never having to say you’re certain. On that point almost everyone would agree. However, beyond that there’s a fair amount of confusion. Some people argue that you’re only allowed to make statements like “rejected the null”, “failed to reject the null”, or possibly “retained the null”. According to this line of thinking, you can’t say things like “accept the alternative” or “accept the null”. Personally I think this is too strong: in my opinion, this conflates null hypothesis testing with Karl Popper’s falsificationist view of the scientific process. While there are similarities between falsificationism and null hypothesis testing, they aren’t equivalent. However, while I personally think it’s fine to talk about accepting a hypothesis (on the proviso that “acceptance” doesn’t actually mean that it’s necessarily true, especially in the case of the null hypothesis), many people will disagree. And more to the point, you should be aware that this particular weirdness exists, so that you’re not caught unawares by it when writing up your own results. ↩
  • Strictly speaking, the test I just constructed has \(\alpha = .057\) , which is a bit too generous. However, if I’d chosen 39 and 61 to be the boundaries for the critical region, then the critical region only covers 3.5% of the distribution. I figured that it makes more sense to use 40 and 60 as my critical values, and be willing to tolerate a 5.7% type I error rate, since that’s as close as I can get to a value of \(\alpha = .05\) . ↩
  • The internet seems fairly convinced that Ashley said this, though I can’t for the life of me find anyone willing to give a source for the claim. ↩
  • That’s \(p = .000000000000000000000000136\) for folks that don’t like scientific notation! ↩
  • Note that the p here has nothing to do with a \(p\) value. The p argument in the binom.test() function corresponds to the probability of making a correct response, according to the null hypothesis. In other words, it’s the \(\theta\) value. ↩
  • There’s an R package called compute.es that can be used for calculating a very broad range of effect size measures; but for the purposes of the current book we won’t need it: all of the effect size measures that I’ll talk about here have functions in the lsr package. ↩
  • Although in practice a very small effect size is worrying, because even very minor methodological flaws might be responsible for the effect; and in practice no experiment is perfect, so there are always methodological issues to worry about. ↩
  • Notice that the true population parameter \(\theta\) doesn’t necessarily correspond to an immutable fact of nature. In this context \(\theta\) is just the true probability that people would correctly guess the colour of the card in the other room. As such the population parameter can be influenced by all sorts of things. Of course, this is all on the assumption that ESP actually exists! ↩
  • Although this book describes both Neyman’s and Fisher’s definition of the \(p\) value, most don’t. Most introductory textbooks will only give you the Fisher version. ↩
  • In this case, the Pearson chi-square test of independence (Chapter 12 ; chisq.test() in R) is what we use; see also the prop.test() function. ↩

Learning Statistics with R Copyright © by Danielle Navarro is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License , except where otherwise noted.



PLoS Comput Biol. 2022 Nov; 18(11).

Humans combine value learning and hypothesis testing strategically in multi-dimensional probabilistic reward learning

Mingyu Song

1 Princeton Neuroscience Institute, Princeton University, Princeton, New Jersey, United States of America

Persis A. Baah

2 Department of Psychology, Princeton University, Princeton, New Jersey, United States of America

Ming Bo Cai

3 International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo, Japan

Associated Data

All data and code are available on GitHub: https://github.com/mingyus/humans-combine-value-learning-and-hypothesis-testing .

Realistic and complex decision tasks often allow for many possible solutions. How do we find the correct one? Introspection suggests a process of trying out solutions one after the other until success. However, such methodical serial testing may be too slow, especially in environments with noisy feedback. Alternatively, the underlying learning process may involve implicit reinforcement learning that learns about many possibilities in parallel. Here we designed a multi-dimensional probabilistic active-learning task tailored to study how people learn to solve such complex problems. Participants configured three-dimensional stimuli by selecting features for each dimension and received probabilistic reward feedback. We manipulated task complexity by changing how many feature dimensions were relevant to maximizing reward, as well as whether this information was provided to the participants. To investigate how participants learn the task, we examined models of serial hypothesis testing, feature-based reinforcement learning, and combinations of the two strategies. Model comparison revealed evidence for hypothesis testing that relies on reinforcement-learning when selecting what hypothesis to test. The extent to which participants engaged in hypothesis testing depended on the instructed task complexity: people tended to serially test hypotheses when instructed that there were fewer relevant dimensions, and relied more on gradual and parallel learning of feature values when the task was more complex. This demonstrates a strategic use of task information to balance the costs and benefits of the two methods of learning.

Author summary

When solving complex tasks with many potential solutions, we often try the solutions one at a time until success. However, when the set of solutions is too large to exhaust, or if feedback is noisy, we may also rely on implicit reinforcement learning to evaluate multiple options concurrently. In this study, with a novel task that allows participants to actively search for unknown rules in a large search space, we find that human participants combine both strategies, namely serial hypothesis testing and reinforcement learning, in their decisions. Depending on the complexity of the task participants change the balance between the strategies, in line with their costs and benefits.

Introduction

Learning in a complex environment, with numerous potentially relevant factors and noisy outcomes, can be quite challenging. For example, when learning to make bread, many decisions need to be made: the amount of yeast to use, the flour-to-water ratio, the proof time, the baking temperature. It can be hard to learn the correct decision for each of these factors, especially when the results are variable even if following the same procedure: the ambient temperature may affect rising, the oven temperature may not be as accurate as its marks, etc., making the outcome unreliable.

Learning scenarios like this are quite common in life. In controlled laboratory experiments, each of the key components of such learning—multiple dimensions of features interacting, probabilistic outcomes, and active choice of learning examples—has traditionally been investigated separately. For instance, decisions based on combining multiple factors (features) are common in category learning tasks [ 1 , 2 ] where multidimensional rules determine the category boundaries. However, feedback is often deterministic in these tasks, making it unclear how multidimensional learning occurs when choice outcomes are less reliable. In contrast, the need to integrate and learn from stochastic feedback has been widely studied in probabilistic learning tasks [ 3 – 5 ], but often with simple rules that involve only one relevant feature dimension. Finally, the freedom to choose learning examples (rather than selecting among a few available options) is at the core of active learning [ 6 – 8 ], where studies have focused on testing how well human decisions accord with principles of information gain maximization [ 9 ] or uncertainty-directed exploration [ 10 ].

As few tasks have combined all these components (but see [ 11 ] for active learning with probabilistic multidimensional stimuli), it remains unclear how people learn actively in an environment with complex rules (with multiple and potentially an unknown number of relevant dimensions) and probabilistic feedback. To study this, we developed a novel decision task: participants were asked to configure three-dimensional stimuli by choosing what features to use in each dimension, earning rewards that were probabilistically determined by features in a subset or all of these dimensions. To earn as much reward as possible, participants needed to figure out which dimensions were important through trial-and-error, and learn what specific features yielded rewarding outcomes in those dimensions.

Despite the computational challenge and combinatorial explosion of possible solutions, human beings are remarkably good at solving such complex tasks. Usually, after a few successful or unsuccessful attempts, an amateur baker will gradually figure out the rules for bread-making. Similarly, participants in our task improved their performance over time, and learned to correctly identify rewarding features through experience. To understand how they achieved this, we turned to the extensive literature regarding algorithms that support learning when it is not clear what features are relevant (i.e., representation learning) [ 12 , 13 ]. Previous work has suggested several mechanisms for such learning [ 14 , 15 ]: a value-based reinforcement-learning mechanism that incrementally learns the value of stimuli based on trial-and-error feedback, and a rule-based mechanism that explicitly represents and evaluates hypotheses. In previous studies, the two mechanisms were often examined separately, as which of them is used often depends on the specific task. For instance, in probabilistic reward learning tasks, people have been shown to learn through trial-and-error to identify relevant dimensions, and gradually focus their attention onto the rewarding features in those dimensions [ 3 – 5 ]. In contrast, in category learning, people seem to evaluate the probability of all possible rules via Bayesian inference, with a prior belief favoring simpler rules [ 2 , 16 , 17 ] (note there also exist other strategies in category learning [ 14 , 15 , 18 , 19 ], e.g., exemplar-based models). However, the two learning mechanisms are likely simultaneously engaged in most tasks [ 20 ], and contribute to different extents depending on how efficient they are in each specific setting.
Direct hypothesis-testing can be more efficient when fewer hypotheses are likely and when feedback is relatively deterministic, whereas incremental learning may be more beneficial with numerous possible combinations and stochastic outcomes.

Here, we systematically examined the integration of the two learning mechanisms and how it depends on task condition. Specifically, we varied task complexity by setting the rules such that one, two, or all three dimensions of the stimuli were relevant for obtaining reward; in addition, we manipulated whether such information (i.e., rule dimensionality) was explicitly provided to participants. We fit computational models that represent each learning mechanism, and their combination, to participants’ responses, and compared how well they predicted participants’ choices. We found evidence that people used a combination of the two learning mechanisms when solving our task. Furthermore, when participants were informed of the task complexity, they used this information to set the balance between the two mechanisms, relying more on serial hypothesis testing when the task was simpler, with fewer candidate rules, and more on reinforcement learning when more rules were possible. Our findings shed light on how rule-based and value-based mechanisms cooperate to support representation learning in complex and stochastic scenarios, and suggest that humans use task complexity to evaluate the effectiveness of different learning mechanisms and strategically balance between them.

Experiment: The “build your own icon” task

In our task, stimuli were characterized by features in three dimensions: color (red, green, blue), shape (square, circle, triangle) and texture (plaid, dots, waves). In each of a series of games, a subset of the three dimensions was relevant for reward, meaning that one feature in each of these relevant dimensions would render stimuli more rewarding (henceforth the “rewarding feature”).

To earn rewards and figure out the underlying rule, participants were asked to configure stimuli (“icons”) by selecting features for any of the dimensions ( Fig 1 ); for dimensions in which they did not make a selection, the computer would randomly select a feature. The resulting stimulus was then shown on the screen, and the participant would receive probabilistic reward feedback (one or zero points) based on the stimulus: the more rewarding features included in the stimulus, the higher the reward probability, with the lowest reward probability being p = 0.2 and the highest being p = 0.8 (see Table 1 ). The participants’ goal was to earn as many reward points as possible.

Fig 1.

Participants built stimuli by selecting a feature in zero to three dimensions (marked by black squares). After hitting “Done”, the stimulus showed up on the screen, with features randomly determined for any dimension in which the participant did not make a selection (in this example, circle was randomly determined). Reward feedback was then shown.

Table 1. Each row corresponds to one game type. Across all game types, the reward probabilities were 20% if the stimulus contained no rewarding features, 80% if it contained all rewarding features, and linear interpolations between 20% and 80% if it contained a subset of rewarding features. For example, in a 3D-relevant game, if the stimulus contained two of the three rewarding features, the reward probability for that trial would be 60%. These probabilities guarantee that a participant who performs randomly would have a 40% probability of obtaining a reward across all game types. This can be seen by calculating, for each game type, the chance of randomly choosing a certain number of rewarding features, multiplied by the corresponding reward probability. Equal chance probability across game types ensured that chance behavior would not be informative about the number of relevant dimensions in unknown games.
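
The 40% chance rate can be checked directly. The sketch below (Python; the helper name is ours) enumerates how many rewarding features a fully random chooser picks up in each game type, and weights the linearly interpolated reward probabilities from Table 1:

```python
from math import comb

def chance_reward_probability(n_relevant):
    """Expected reward rate for fully random play in a game with
    n_relevant rewarding dimensions (each rewarding feature is hit
    with probability 1/3 in its dimension)."""
    expected = 0.0
    for k in range(n_relevant + 1):
        # chance of the stimulus containing k of the rewarding features
        p_k = comb(n_relevant, k) * (1 / 3) ** k * (2 / 3) ** (n_relevant - k)
        # reward probability interpolates linearly from 0.2 to 0.8
        p_reward = 0.2 + 0.6 * k / n_relevant
        expected += p_k * p_reward
    return expected

# 1D-, 2D- and 3D-relevant games all give the same 40% chance rate
print([round(chance_reward_probability(n), 3) for n in (1, 2, 3)])
# [0.4, 0.4, 0.4]
```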

Each game had one, two, or three relevant dimensions (henceforth 1D-, 2D-, and 3D-relevant conditions). This information was provided to participants in half of the games (“known” condition), with the other half designated as “unknown” games. This resulted in six game types in total. Each participant played three games of each type for a total of 18 games, in a randomized order. Each game was comprised of 30 trials. The relevant dimensions and rewarding features changed between games.

102 participants were recruited through Amazon Mechanical Turk. In an instruction phase, participants were told that each game could have one, two or three dimensions that were important for reward, and were explicitly informed about the reward probabilities in Table 1. They were tested on their understanding of the instructions, and each played three practice games with informed rules (relevant dimensions and rewarding features). The main experiment then commenced. In “known” games, the number of relevant dimensions was informed before the start of the game in the form of a “hint”; participants were, however, never told which dimensions were relevant or which features were more rewarding. The start of “unknown” games was also signaled; however, no hint was provided in these games. At the end of each game, participants were asked to explicitly report, to their best knowledge, the rewarding feature for each dimension, or indicate that the dimension is irrelevant to reward, as well as their confidence level (0–100) in these judgements. After the experiment, participants received a performance bonus proportional to the points they earned in three randomly selected games.

Learning performance and choice behavior

Across all six game types, participants’ performance improved over the course of games, with overall better performance and faster learning in less complex games, i.e., games with fewer relevant dimensions ( Fig 2A ). A mixed-effects regression on reward probability against trial index, task complexity (1D-/2D-/3D-relevant) and game knowledge (known/unknown) showed significant effects of trial index (estimated slope 0.0012 ± 0.0008, p < .001) and task complexity (estimated slope −0.044 ± 0.007, p < .001), as well as a two-way interaction between trial index and task complexity (estimated slope −0.0027 ± 0.0003, p < .001).

Fig 2.

(A, B): Performance and choices over the course of a game, by game type . (A) Participants’ average probability of reward (based on the number of rewarding features in their configured stimuli), over the course of 1D-, 2D- and 3D-relevant games (left, middle and right columns). Red and blue curves represent “known” and “unknown” conditions, respectively. For all game types, chance reward probability is 0.4 and 0.8 is the maximum reward probability. Shading (ribbons around the lines) represents ±1 s.e.m. across participants. ** p < .01. For grouping of these learning curves by task complexity, see S1 Fig . (B) Same as in (A), but for the number of features selected. (C, D): Responses to post-game questions regarding the rewarding features in each game condition . (C) Average number of correctly-identified rewarding features; (D) Average number of false positive responses, i.e., falsely identifying an irrelevant dimension as relevant. *** p < .001. Error bars represent ±1 s.e.m. across participants.

The overall worse performance in more complex games was not necessarily a failure of learning, but rather the result of limited experience (only 30 trials per game), as participants’ average reward rate across all games was 90.2% of that of an approximately optimal agent (see Methods) playing this same task (87%, 89% and 95% in the 1D-, 2D- and 3D-relevant games, respectively). Participants’ performance was better when informed of the task complexity in 3D-relevant games (paired-sample t-test on reward probability for 3D-relevant games between “known” and “unknown” conditions: t(101) = 3.37, p = .001, uncorrected, same for tests below). There was no effect of game knowledge on performance in simpler games (1D-relevant: t(101) = −1.9, p = .060; 2D-relevant: t(101) = 0.02, p = .98).

Participants also showed distinct choice behavior in different game types ( Fig 2B ): a mixed-effects regression on the number of features selected showed significant effects of trial index (more features were selected over time; estimated slope 0.0087 ± 0.0003, p = .013) and game knowledge (more features were selected in “unknown” games; estimated slope −0.63 ± 0.09, p < .001), two-way interaction effects for all pairs of variables (all p < .05), and a significant three-way interaction ( p < .001). Specifically, in “known” games, participants selected more features when informed that more dimensions were relevant (mixed-effects linear regression slope: 0.29 ± 0.03, p < .001); in “unknown” games, unsurprisingly, the number of selected features did not differ between task complexities ( p = .47).

Participants’ responses to the post-game questions also reflected similar behavioral patterns (see full results in S1(C) Fig). Specifically, we analyzed how often they correctly identified the rewarding features (Fig 2C), and when they falsely identified an irrelevant dimension as relevant (“false positive”, Fig 2D; note that in 3D-relevant games, this measure was 0 by design, thus these games were excluded from this analysis). A two-way repeated-measures ANOVA on correct responses showed a significant main effect of task complexity (F(2,202) = 273.7, p < .001), and a significant interaction between task complexity and game knowledge (F(2,202) = 21.3, p < .001); the ANOVA on false positive responses showed significant main effects of both task complexity (F(1,101) = 32.0, p < .001) and game knowledge (F(1,101) = 93.3, p < .001), and a significant interaction between them (F(1,101) = 90.8, p < .001). Comparing the “known” and “unknown” conditions: in 1D-relevant games, participants’ correct responses did not differ based on condition (Fig 2C; post hoc Tukey test: t(101) = 1.81, p = .46), consistent with the choice behavior in Fig 2A; however, participants made more false positive responses in the “unknown” condition (Fig 2D; t(101) = −6.27, p < .001), indicating that not knowing the dimensionality of the underlying rule led them to incorrectly attribute rewards to features on multiple dimensions, which might be the reason for the larger number of features selected in the “unknown” condition (Fig 2B). In 3D-relevant games, participants identified more correct features in the “known” condition than in the “unknown” condition (Fig 2C; t(101) = 13.53, p < .001), consistent with their better learning performance in “known” 3D-relevant games observed in Fig 2A.

In sum, participants’ behavior was sensitive to both task complexity and game knowledge. They performed better and learned faster in simpler games. Game knowledge had a smaller impact on performance, and participants showed different choice behavior in “known” versus “unknown” games: in “known” games, the number of features they selected was moderated by the instructed task complexity; while in “unknown” games, the number was similar across different complexities.

Modeling two learning mechanisms

To characterize participants’ learning strategy and explain the behavioral differences between game conditions, we considered two candidate learning mechanisms [ 15 , 20 ]: an incremental value-based mechanism that learns the value of stimuli based on trial-and-error feedback, and a rule-based mechanism that explicitly represents possible rules and evaluates them. We tested computational models representing each of these mechanisms, as well as a hybrid combination of the two, by fitting each model to participants’ trial-by-trial choices and comparing how well they predict task behavior. We describe each model below; the mathematical details are provided in Methods.

The value-based mechanism was captured by a feature-based reinforcement learning model [ 3 ]. Reinforcement learning is commonly used to model behavior in probabilistic reward-learning tasks, where participants need to accumulate evidence across multiple trials to estimate the value of each choice. In particular, we used the feature RL with decay model from prior work with a task similar to ours [ 3 ]. This model assumes that participants learn values for each of the nine features using a Rescorla-Wagner update rule [ 21 ]: feature values in the current stimulus are updated proportional to the reward prediction error (the difference between the outcome and the expected reward). The expected reward for each choice (i.e., combination of features selected) is calculated as the sum of its feature values. At decision time, choice probability is determined by comparing the expected reward for all choices using a softmax function. Additionally, values of features not present in the current stimulus are decayed towards zero. This is particularly relevant for features that had been valued previously but are later not consistently selected, i.e., features that the participant presumably no longer deems to have high values, or those originally selected by the computer. The decay mechanism allows their value to decay down to zero despite not being chosen (otherwise, the model updates only the values of chosen features). Note that this feature-based RL model, although simple, is well suited to the additive reward structure of the task, and provides a better fit than more complex RL models, such as a conjunction-based RL model [ 22 ] or an Expert RL model that combines a few RL “experts” each learning different combinations of the dimensions [ 23 ].
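
A minimal sketch of one trial of such an update (Python; the function name, the specific decay form, and the parameters eta and d are illustrative assumptions, not the fitted model’s exact parameterization):

```python
import numpy as np

def feature_rl_with_decay_step(W, chosen, reward, eta=0.3, d=0.1):
    """One trial of a feature RL with decay update (illustrative sketch).

    W      : (3, 3) array of feature values; rows = dimensions
             (color, shape, texture), columns = features.
    chosen : tuple with the displayed feature index per dimension.
    eta, d : assumed learning-rate and decay-rate parameters.
    """
    # Expected reward = sum of the values of the stimulus's features
    V = sum(W[dim, f] for dim, f in enumerate(chosen))
    delta = reward - V  # reward prediction error

    in_stimulus = np.zeros_like(W, dtype=bool)
    for dim, f in enumerate(chosen):
        in_stimulus[dim, f] = True

    W[in_stimulus] += eta * delta  # update features shown this trial
    W[~in_stimulus] *= 1 - d       # decay absent features toward zero
    return W
```

Starting from zeros, a rewarded trial raises the values of exactly the three displayed features, while every other feature shrinks by a factor of 1 − d on each trial in which it is absent.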

In contrast to the value-based mechanism, the rule-based mechanism directly evaluates hypotheses regarding what combinations of features yield the most reward in a game, which we refer to as “rules”. In “known” games, there are 9, 27 and 27 possible rules for 1D-, 2D- and 3D-relevant games, respectively; in “unknown” games, all 63 rules are possible.

There are multiple possibilities for how people learn the correct rule. One is to use Bayesian principles to evaluate the probability that each rule is the correct one; we term this a Bayesian rule-learning model . After each outcome, this model optimally utilizes feedback to calculate the likelihood of each candidate rule, and combines this with the prior belief of the probability that each rule is correct (initially assumed to be uniform across all rules that accord with the “hint”) to obtain the posterior probabilities of each rule. The expected reward for a choice is then calculated by marginalizing over the posterior belief of all possible rules. Mirroring the reinforcement learning model above, in our implementation, the final choice probability was determined by a softmax function over the expected reward from each choice. In a multi-dimensional category learning task, a similar Bayesian rule learning model has been shown to characterize how people learn categories better than reinforcement learning models [ 2 ].
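
The core of this update can be sketched as follows (Python; representing a rule as a tuple of (dimension, rewarding feature) pairs is our own encoding choice, and the likelihoods follow Table 1’s linear interpolation):

```python
def bayesian_rule_update(belief, stimulus, reward):
    """One Bayesian belief update over candidate rules (sketch).

    belief   : dict mapping rule -> probability, where a rule is a
               tuple of (dimension, rewarding_feature) pairs.
    stimulus : dict mapping dimension -> displayed feature.
    reward   : 1 if the trial was rewarded, else 0.
    """
    posterior = {}
    for rule, prior in belief.items():
        # how many of the rule's rewarding features the stimulus contains
        k = sum(stimulus[dim] == feat for dim, feat in rule)
        p_reward = 0.2 + 0.6 * k / len(rule)  # Table 1 interpolation
        likelihood = p_reward if reward else 1 - p_reward
        posterior[rule] = prior * likelihood
    z = sum(posterior.values())
    return {rule: p / z for rule, p in posterior.items()}

# Two candidate 1D rules; a rewarded red stimulus favors the "red" rule
belief = {(("color", "red"),): 0.5, (("color", "blue"),): 0.5}
stimulus = {"color": "red", "shape": "square", "texture": "dots"}
belief = bayesian_rule_update(belief, stimulus, reward=1)
```

Marginalizing the expected reward of each choice over this posterior, and passing it through a softmax, then yields the model’s choice probabilities.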

Bayesian inference is computationally expensive and memory-intensive. A simpler alternative for the rule-based strategy is serial hypothesis testing, which assumes that people only test one rule at a time: if the evidence supports their hypothesis, they will continue with it; otherwise, they switch to a different rule, until the correct one is found. The idea of serial hypothesis testing has deep roots in the category learning literature [ 24 , 25 ]. Recently, it has also been applied in probabilistic reward learning tasks [ 26 ] and was shown to better account for human behavior than the Bayesian model. Following [ 26 ], we considered a random-switch serial hypothesis-testing model (random-switch SHT model; Fig 3 ) that assumes that people test hypotheses about the underlying rule one at a time. When testing a hypothesis, the model estimates its reward probability by counting how often recent choices following this rule were rewarded. The probability of abandoning the current hypothesis and switching to testing a random different hypothesis is inversely proportional to the reward probability. We assumed that people’s choices were often consistent with their hypotheses, but lapsed to random choices with a small ( p = λ) probability.
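
In outline (Python; the switch rule, the handling of the outcome window, and the lapse parameterization here are simplified assumptions rather than the paper’s fitted equations):

```python
import random

def sht_trial(hypothesis, recent_outcomes, hypotheses, lam=0.05):
    """One trial of a random-switch serial-hypothesis-testing agent
    (illustrative sketch).

    hypothesis      : the rule currently under test.
    recent_outcomes : 0/1 rewards on recent trials that followed it.
    hypotheses      : the full set of candidate rules.
    lam             : assumed lapse rate (probability of a random choice).
    """
    # Estimate the current hypothesis's reward probability by counting
    # how often recent choices that followed it were rewarded.
    if recent_outcomes:
        p_hat = sum(recent_outcomes) / len(recent_outcomes)
    else:
        p_hat = 0.5
    # Abandon the hypothesis with probability inversely related to p_hat,
    # switching to a randomly drawn alternative.
    if random.random() < 1 - p_hat:
        hypothesis = random.choice([h for h in hypotheses if h != hypothesis])
    # Choices usually follow the hypothesis, with occasional lapses.
    choice = random.choice(hypotheses) if random.random() < lam else hypothesis
    return hypothesis, choice
```

A consistently rewarded hypothesis (p_hat near 1) is almost never abandoned; a consistently failing one is dropped on every trial.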

Fig 3.

The SHT and RL mechanisms are not necessarily mutually exclusive. We thus also considered a hybrid model by incorporating RL-acquired feature values into the choice of a new hypothesis in the serial hypothesis testing model. In particular, when switching hypotheses, the hybrid model favored hypotheses that contain recently rewarded features. We term this model value-based serial hypothesis testing model (value-based SHT model; Fig 3 ; see Methods for detailed equations for all models).

Evidence for a hybrid learning mechanism

We fit all four models to participants’ choice data in this task and evaluated model fits using leave-one-game-out cross-validation ( Fig 4A and S2(A) Fig ). Among them, the Bayesian rule learning model, even though optimal in utilizing feedback information, showed the worst fit to participants’ choices (likelihood per trial: 0.045 ± 0.003; mean ± s.e.m.). This was potentially because the large hypothesis space (up to 63 hypotheses) made exact Bayesian inference intractable. Both the feature RL with decay model and the random-switch SHT model showed better fits (likelihood per trial: 0.118 ± 0.008 and 0.160 ± 0.009, respectively). Compared to the Bayesian model, both models have lower computation and memory load: the RL model learns nine feature values individually and later combines them; the random-switch SHT model limits the consideration of hypotheses to one at a time. The hybrid value-based SHT model fit the data best (better than either component model; likelihood per trial: 0.202 ± 0.009), suggesting that participants used both learning strategies when solving this task.

Fig 4.

(A) Geometric average likelihood per trial for each model (i.e., average total log likelihood divided by number of trials and exponentiated). Higher values indicate better model fits. Dashed lines indicate chance. Error bars represent ±1 s.e.m. across participants. (B, C) Simulation of the best-fitting value-based SHT model. The same learning curves as in Fig 2 but for model simulation.

There was additional evidence for the involvement of both learning mechanisms in participants’ behavior. The rule-based mechanism was evident from the influence of task instructions: both the numbers of features selected ( Fig 2B ) and the reported rewarding features in the post-game questions ( Fig 2C and 2D ) differed between “known” and “unknown” conditions. There is no direct way to incorporate such influences in a reinforcement learning model, but a rule-learning model can easily do so, for instance, by constraining the hypothesis spaces according to the instructions ( S3 Fig : the number of features selected differs between known and unknown games for SHT models but not the RL model). In fact, participants adapted their prior beliefs based on their knowledge of the game types ( S2(B) Fig ): in known games, they assigned a higher prior probability to the hypotheses that are consistent with the task instructions; in unknown games, they deemed more complex rules more likely a priori . On the other hand, the influence of value-based learning was evident in the order in which participants clicked on features to make selections. In most cases, participants followed the spatial order in which dimensions appeared on the screen, either top-to-bottom or the reverse. When the clicks violated the spatial orders, however, they followed the order of learned feature values, starting from the most valuable feature, at a frequency significantly above chance ( t 101 = 7.63, p < .001). Such behavior of following the order of learned feature values instead of the spatial order was more frequent in trials when participants switched hypotheses than when they continued testing the same hypothesis ( t 101 = 5.71, p < .001; in this analysis, for simplicity, switch trials were identified based on changes in choice), further supporting the value-based SHT model.

In sum, participants’ strategies in this task could not be explained by either reinforcement learning or serial hypothesis testing strategies alone. The combined hybrid model explained participants’ behavior best, also capturing the dependence of performance on task complexity ( Fig 4B ) and the qualitative differences between choice curves in “known” and “unknown” conditions ( Fig 4C ), which neither component model could capture ( S3 Fig ).

The contribution of the two mechanisms depends on task complexity

Given evidence that participants used both learning strategies in this task, we next asked to what extent each strategy contributed to decision making. We addressed this question by comparing the hybrid model with the two component models: the difference in likelihood per trial between the hybrid model and each component model was taken as a proxy for the contribution of the mechanism not included in the component model. Note that we can treat the RL and SHT models as component models. This is because setting the learning rate to zero effectively “turns off” the RL process, reducing the hybrid model to the random-switch SHT model. Similarly, setting model parameters such that hypotheses are switched every trial “turns off” the SHT process, resulting in a model very similar to the feature RL model (the only difference is the likelihood of returning to the previous hypothesis or choice).

Across participants, a higher contribution of SHT was associated with a faster reaction time ( Fig 5A ; Pearson correlation: r = −0.27, p = .01), and a higher contribution of RL was associated with a higher reward rate ( Fig 5B ; r = 0.23, p = .02). These results suggest that, comparatively, serial hypothesis testing was an overall faster and less effortful strategy, and augmenting hypothesis testing with values yielded more reward.

Fig 5.

(A) The contribution of serial hypothesis testing (SHT) was inversely correlated with reaction time such that participants who responded faster used SHT to a greater extent. (B) The contribution of reinforcement learning (RL) was correlated with average reward rate: participants for whom adding the RL component improved the model fit to a greater extent earned more rewards on the task, on average. Each dot represents one participant. (C, D) Contribution of RL and SHT for each game type. The contribution of each component was measured as the difference in likelihood per trial between the hybrid value-based SHT model and the other component model (SHT: the feature RL with decay model; RL: the random-switch SHT model). Error bars represent ±1 s.e.m. across participants.

To optimize for reward and reduce mental effort costs, it is advantageous to rely on the serial hypothesis testing strategy when the task is simpler, for instance, in lower-dimensional games with smaller hypothesis spaces. Indeed, when tested separately, the correlation between reward rate and contribution of RL was only significant for 2D- and 3D-relevant games (1D: r = −0.03, p = .75; 2D: r = 0.27, p < .01; 3D: r = 0.32, p < .01; these correlations were significantly different between 2D- and 3D-relevant games and 1D-relevant games [ 27 ]: z = −2.3, p = .023 for 2D vs 1D games, and z = −2.7, p = .007 for 3D vs 1D games). In contrast, with a larger hypothesis space, serial hypothesis testing is less efficient, and there should be a higher incentive to use the value learning strategy.

We indeed observed such a strategic trade-off between the two learning mechanisms: in “known” games, the contribution of hypothesis testing decreased as the dimensionality of the task increased ( Fig 5C ; estimated slope in a mixed-effect linear regression: −0.0631 ± 0.0051, p < .001), whereas the contribution of value learning increased with task complexity ( Fig 5D ; estimated slope: 0.0178 ± 0.0013, p < .001). In contrast, in “unknown” games, in which task complexity information was unavailable to participants, the contribution of the two mechanisms was more stable across game conditions (estimated slopes: −0.0144 ± 0.0042 for SHT, p < .001; −0.0011 ± 0.0012 for RL, p = .389, consistent with a significant three-way interaction between task complexity, game knowledge and model component in a repeated measures ANOVA on likelihood difference per trial in Fig 5C and 5D : F (2, 202) = 47.9, p < .001). Taken together, these results suggest that participants took advantage of information regarding task complexity to strategically balance the use of two complementary learning mechanisms.

Discussion

Using a novel “build your own icon” task, we studied learning of multi-dimensional rules with probabilistic feedback as a proxy for real-world learning in situations where it is unknown a priori what aspects of the task are relevant to solving it, and where learners have agency to intervene on the environment and test hypotheses. In our task, participants created stimuli and tried to earn more rewards by identifying the most rewarding stimulus features. Participants performed this task at various signalled or unsignalled complexity levels (i.e., rewarding features were in one, two or three stimulus dimensions). They demonstrated learning in all conditions, with their performance and strategies influenced by task condition. Through behavioral analyses and computational modeling, we investigated the use of two distinct but complementary learning mechanisms: serial hypothesis testing that evaluates one possible rule at a time and is therefore simple and fast to use, but results in slow learning when many rules are possible and must be tested sequentially, and reinforcement learning that learns about all features in parallel and is more accurate in the long run, but requires maintaining and updating more information. We found that a hybrid model that incorporated the advantages of both mechanisms explained participants’ behavior best. In addition, we showed that human participants demonstrated a strategic balance between the two mechanisms depending on task complexity, suggesting that they were able to gauge which mechanism is more suitable in each condition. Specifically, they tended to use the simpler and faster serial hypothesis testing strategy when they knew that fewer dimensions matter in the decision, but relied more on incrementally learning feature values when they knew multiple dimensions were important.

The current study ties together large bodies of work on reward learning and category learning in multi-dimensional environments. Previous studies have extensively investigated how humans learn complex but deterministic categorization rules [ 1 , 2 , 15 ], as well as how they learn through trial-and-error to identify a single relevant dimension [ 3 , 22 , 28 , 29 ]. The former type of task is hard to learn because of the unknown form of the underlying rules, while the latter focuses on how humans integrate information over time in stochastic environments. Both are common challenges for human decision-making, and they often co-occur in daily tasks: in new situations, we often do not know a priori what aspects of the task are relevant to its correct solution, and feedback may be stochastic due to inherent task properties, or may merely appear stochastic because, even in a deterministic task, the learner does not know which dimensions drive outcomes. We therefore imposed both challenges to investigate human learning strategies under such realistically complex scenarios. Our results help unite the various findings on value-based and rule-based strategies in previous studies. We show that learning in complex and stochastic environments engages both strategies, with participants combining them flexibly according to the demands of the task. This can potentially explain why value-based strategies are often observed in probabilistic learning tasks [ 3 – 5 ], and rule-based strategies in category learning tasks [ 2 ].

A few studies have pursued a similar path. For example, Choung and colleagues [ 30 ] studied a similar probabilistic reward-learning task with multiple relevant dimensions. They examined hypothesis-testing strategies based on values learned with naïve RL models. Through model comparison, they showed that values learned alongside hypothesis testing were carried over when hypotheses switched, consistent with our value-based SHT model. The novelty of our work is in systematically manipulating the complexity of the environment and participants’ knowledge about it, to help provide a comprehensive understanding of how people’s learning strategy adapts to different situations. Another similar set of tasks is contextual bandit problems [ 31 – 33 ], where the amount of reward for each bandit (option) is determined by the context (thus leading to multi-dimensional rules that depend on both stimulus and context). In these tasks, participants were found to use a Gaussian process learning strategy to generalize previous experience to similar instances. Gaussian processes define a probabilistic distribution over the underlying rules, from which one can sample candidate rules as hypotheses. For example, in a task with binary contextual features [ 31 ], participants were shown to consider alternative options that were expected to lead to improvements upon the current one, consistent with the rule-based strategy discovered in the current task.

Still, we considered only a simple linear combination of multiple dimensions to determine reward: each relevant dimension contributed equally to reward probability, in an additive manner. In everyday tasks, the composition can be more complex, with different dimensions contributing differently to rewards [ 11 , 29 ] and potential interactions between dimensions. We postulate that similar hybrid strategies will be adopted regardless. However, it can be hard to model the hypothesis-testing strategy in such scenarios, due to the much larger hypothesis space. An important question is how people construct their hypothesis space, and how likely they deem each hypothesis a priori . There is evidence that people favor simpler hypotheses [ 16 ]. They also may not have a fixed hypothesis space, but instead construct new hypotheses only when the existing ones can no longer account for observations [ 34 ], or they may modify their existing hypotheses on the go with small changes [ 35 ].

It is worth noting the unique free-configuration design of the current task. In most representation-learning tasks, stimuli (i.e., the combination of features) are pre-determined, and participants are asked to select between several available options, or make category judgements. These tasks are easy to perform, but it is hard to isolate participants’ preference for single features. Our task directly probed people’s preference (or lack thereof) in each of the three dimensions. In addition, we were able to hold baseline reward probability constant across different game types (participants responding randomly would always earn reward with p = 0.4) while varying the complexity of underlying rules, which avoided providing information on rule complexity in “unknown” games. Our free-configuration task also resembles many daily life decisions where choices across multiple dimensions have to be made voluntarily, from ordering a pizza takeout, to planning a weekend getaway trip.

Along with these advantages, the active-learning free-configuration design may also alter the strategy people use, compared to a passive learning scenario. On the one hand, free-choice may encourage hypothesis testing, making this strategy more efficient by allowing participants to seek direct evidence on their hypotheses. On the other hand, learning may be hindered due to confirmation bias, commonly observed in self-directed rule-learning tasks (aka “positive test strategy” [ 36 ]). Indeed, participants over-estimated the number of rewarding features in 1D “unknown” games as compared to “known games” ( Fig 2D ), suggesting that they failed to prune their hypotheses when the underlying rule was simpler. To fully understand the impact of free choice, future work can compare active and passive settings with a “yoked” design. This can help understand whether the findings reported here can be generalized to passive-learning tasks, and what may be unique to the active-learning setting.

To model the integration of the two learning strategies, we introduced the hybrid value-based SHT model. The assumptions in this model are relatively minimal, which can be a reason why the hybrid model failed to quantitatively predict the number of features participants selected ( Fig 4C ). To improve model prediction, we explored several alternatives for the model’s assumptions ( S4 Fig ; see Methods for details): (1) not always testing a hypothesis: if none of the hypotheses has a high value, the participant can decide not to test a hypothesis, and let the computer configure a completely random stimulus instead; (2) flexible threshold for determining whether to switch hypothesis or not, based on reward probability of the corresponding game condition ( Table 1 ); (3) favoring choices that are supersets of the current hypothesis: rather than designing stimuli consistent with the current hypothesis (with a lapse rate), participants may tend to select more features than what their hypothesis specifies. The first and third alternative assumptions improved model fits, but the second did not. We then considered a “full” model that used the better alternative for each assumption. This more complex model improved average likelihood per trial on held-out games by 0.033 ± 0.006. In terms of predicting the number of features selected by participants, however, this model behaved similarly to the original hybrid model ( S3 Fig ). For simplicity, we therefore reported the original hybrid model in the Results. We note that, despite the additional assumptions, the full model predictions still deviated from human behavior, e.g., it under-predicted the differences in the number of selected features between the “known” and “unknown” conditions, compared to the empirical data. 
This may be due to the simplified assumptions on hypothesis testing: for example, in the model, only one hypothesis was tested at each point in time, and hypothesis switching was purely based on values rather than systematically sweeping through features in a dimension, or decreasing the number of features chosen.

The flexibility of the value-based SHT model opens up the space for exploring more complex hypothesis-testing strategies. For instance, hypotheses may be formed in a hierarchical manner when the rule complexity is unknown, i.e., participants may first reason about the dimensionality of the game, and then the exact rule. Currently, the hypothesis-switching policy depends only on values, whereas participants may start from simpler rules, and switch to more complex rules, as suggested in the SUSTAIN model [ 37 ], or vice versa, starting with complex rules and then pruning them to only the necessary components. Another possibility is models that test multiple hypotheses in parallel. In the current model, only one hypothesis is tested at a time, yet participants may consider multiple possibilities simultaneously, for instance, the current configuration and all its subsets. Further, the current study did not evaluate the role of uncertainty-directed exploration [ 10 ] and when to terminate it during learning. This is due to the large number of options available in the current task, making the optimal uncertainty-directed policy intractable. Future studies can design targeted tasks to investigate this question. Lastly, the current model assumes that learning of feature values happens in parallel to and independently of hypothesis-testing. However, value learning may also be affected by hypothesis testing. For example, the amount of value update can be gated by the current hypothesis [ 20 , 38 ]. The current modeling framework (and openly accessible data) can be used in future work to systematically examine these and other alternative models.

In conclusion, we studied human active learning in complex and stochastic environments, with a novel self-configuration decision task. Through behavioral analyses and computational model comparison, our study revealed the strategic integration of two complementary learning mechanisms: serial hypothesis testing using reinforcement-learning values to select new hypotheses. Rule-based and gradual learning systems are often considered opponents or alternatives, whereas our results suggest cooperation rather than arbitration. This may be a general rule in complex, realistic decision tasks. When the going gets rough, the brain would do best to optimally integrate all the methods at its disposal.

Ethics statement

This study was approved by the Institutional Review Board at Princeton University (record number 11968). Formal written consent was obtained from each participant before they started the experiment.

Experimental procedure and participant exclusion criteria

Participants were recruited online from Amazon Mechanical Turk. They received a base payment of $12 for completing the task, with a performance-based bonus of $0.15 per reward point earned in three randomly-chosen games (one for each task complexity).

Participants went through a comprehensive instruction phase before starting the main task. During the instruction, they were first introduced to the “icons”, and asked to build a few examples. They were then explained the general rules of the experiment, including the complexity levels and their respective reward probabilities (as in Table 1 ). Participants were tested about these rules and probabilities with a set of multiple-choice questions. For each task complexity level, they were given an example rule, and asked about the reward probability of a few stimuli to test their understanding. Participants had to answer all questions correctly within a fixed number of attempts (5 for questions on the general rules, and 3 for all the other tests). In addition, they played a practice game in each complexity level with the rules informed (including what dimensions were relevant and what features were more rewarding; this information was not available in the main task, even in “known” games, where only the number of relevant dimensions was informed, see details below). During the experimental games, participants were required to respond within 5 seconds on each trial. Participants who did not pass the comprehension tests or missed five consecutive trials at any time in the experiment were not permitted to continue the experiment.

The main task consisted of 18 experimental games. Among them, half were “known” games, in which participants were informed of the number of relevant dimensions (1, 2 or 3) before the game started; the other half were “unknown” games. This corresponded to six game types in total. Each participant played three games of each type in a randomized order. Each game was comprised of 30 trials.

At the end of each game, participants were asked to report the rewarding feature for each dimension through a multiple-choice question, or indicate that this dimension was irrelevant to reward. They were also asked to rate their confidence level (0–100) in these judgements.

A total of 106 participants completed the entire experiment; 4 were excluded from our analyses due to poor performance (an overall reward probability below 0.468, two standard deviations below the group average).

Approximately optimal agent

It is computationally intractable to solve the optimal policy for this task. Therefore we trained a deep Q-network (DQN) [ 39 ] on the task to approximate the optimal solution, and compared participants’ performance with this well-trained DQN agent. Specifically, this DQN model uses Bayes rule to update belief states, and deep RL to learn (or approximate) the optimal decision policy.

Computational models of human behavior

Feature-based reinforcement learning with decay model.

The feature RL with decay model maintains values ( V ) for each of the nine features (denoted by f i , j ; i and j are indices for dimensions and features respectively). At decision time, the expected reward ( ER ) for each possible stimulus configuration c is calculated as the sum of its feature values:

where c i denotes the feature on dimension i of configuration c . For dimensions that are unspecified in the configuration (i.e., those the computer will choose randomly), the model uses the average value of all three features.

The choice probability is determined based on ER ( c ) using a softmax function, with β as a free parameter:

Feature values are updated according to a Rescorla-Wagner update rule, with separate learning rates for features that were selected by the participant ( η = η s ) and those that were randomly determined ( η = η r ). Values of features not in the current stimulus s t are decayed towards zero with a factor d ∈ [0, 1]. η s , η r and d are free parameters.

where r t is the reward outcome (0 or 1) on trial t , and s t i indicates the feature on dimension i of s t .
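The model just described can be sketched as follows. This is a minimal reading of the text, not the paper's exact equations: in particular, the use of a single shared prediction error r_t − ER(s_t) for all updated features, and the multiplicative decay toward zero, are assumptions consistent with the description.

```python
import numpy as np

def expected_reward(V, config):
    """Eq 1: expected reward as the sum of feature values. V is a 3x3
    array (dimension x feature); config holds one feature index per
    dimension, or None where the computer chooses randomly (then the
    average of that dimension's three feature values is used)."""
    return sum(V[i].mean() if f is None else V[i, f]
               for i, f in enumerate(config))

def choice_probabilities(V, configs, beta):
    """Eq 2: softmax over expected rewards with inverse temperature beta."""
    er = np.array([expected_reward(V, c) for c in configs])
    p = np.exp(beta * (er - er.max()))
    return p / p.sum()

def rw_update(V, stimulus, selected, r, eta_s, eta_r, d):
    """Rescorla-Wagner update with decay. stimulus gives the realized
    feature per dimension; selected[i] says whether the participant chose
    it (learning rate eta_s) or the computer did (eta_r). The shared
    prediction error r - ER(stimulus) is an assumption of this sketch.
    Features absent from the stimulus decay toward zero by factor d."""
    V = V.copy()
    delta = r - expected_reward(V, stimulus)
    for i, f in enumerate(stimulus):
        V[i, f] += (eta_s if selected[i] else eta_r) * delta
    for i in range(3):
        for f in range(3):
            if stimulus[i] != f:
                V[i, f] *= d
    return V
```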

Bayesian rule learning model

The Bayesian rule-learning model maintains a probabilistic belief distribution over all possible hypotheses (denoted by h ). Note that the set of possible hypotheses (the hypothesis space) depends on the current task complexity: in known games, there are 9, 27 and 27 possible hypotheses in 1D, 2D and 3D games, respectively; in unknown games, all 63 hypotheses are possible. After each trial, the belief distribution is updated according to Bayes rule:

At decision time, the expected reward for each choice is calculated by marginalizing over the belief distribution:

The expected reward is then used to determine the choice probability as in Eq 2 .
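A sketch of this Bayesian learner is below. The 63-hypothesis enumeration follows the text; the reward_prob function is only an illustrative stand-in for the paper's actual reward probabilities (its Table 1), rising linearly from the 0.4 chance level to the 0.8 maximum as more of a hypothesis's features appear in the configuration.

```python
import itertools
import numpy as np

# Hypotheses: one rewarding feature (0-2) for each dimension in a non-empty
# subset of the 3 dimensions -> 9 + 27 + 27 = 63 hypotheses.
HYPOTHESES = [dict(zip(dims, feats))
              for k in (1, 2, 3)
              for dims in itertools.combinations(range(3), k)
              for feats in itertools.product(range(3), repeat=k)]

def reward_prob(h, config):
    """Illustrative reward rule (assumed, not the paper's Table 1):
    chance 0.4, rising to 0.8 when all rewarding features of h appear."""
    matches = sum(config[d] == f for d, f in h.items())
    return 0.4 + 0.4 * matches / len(h)

def bayes_update(prior, config, r):
    """Posterior over hypotheses after observing reward r in {0, 1}."""
    lik = np.array([reward_prob(h, config) if r else 1 - reward_prob(h, config)
                    for h in HYPOTHESES])
    post = prior * lik
    return post / post.sum()

def marginal_expected_reward(belief, config):
    """Expected reward of a choice, marginalized over the belief."""
    return sum(b * reward_prob(h, config)
               for b, h in zip(belief, HYPOTHESES))
```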

We note that this model is not strictly optimal, even with no decision noise, as it maximizes the expected reward on the current trial, but not the total reward over a game.

Random-switch serial hypothesis testing (SHT) model

The random-switch SHT model assumes the participant tests one hypothesis at any given time. We do not directly observe what hypothesis the participant is testing, and need to infer that from their choices. We do so by using the change-point detection model in [ 26 ]. The basic idea is to infer the current hypothesis (denoted by h t ) from all the choices the participant has made and the reward outcomes they received so far in the current game (together denoted by d 1: t −1 ); see Supplementary Methods in S1 Text for implementation details. Once we obtain the posterior probability distribution over the current hypothesis P ( h t | d 1: t −1 ), we can use it to predict choice:

In order to calculate P ( h t | d 1: t −1 ), we consider the generative model of participants’ choices. First, we determine the participant’s hypothesis space: In “known” games, participants were informed about the number of relevant dimensions, which limits the set of possible hypotheses in these games. The way people interpret and follow instructions, however, may vary. Thus, we parameterize the hypothesis space (i.e., people’s prior over all possible hypotheses) with two weight parameters w l and w h (before normalization):

Here, D ( h ) is the dimensionality of hypothesis h (how many rewarding features are in h ), and D is the informed number of relevant dimensions of the current game. If a participant strictly follows the instruction, w l = w h = 0, i.e., only hypotheses with the same dimensionality as the instruction are considered; if the participant does not use the instruction information at all, w l = w h = 1, i.e., all 63 hypotheses are considered to be equally likely. For “unknown” games, the model uses the average P ( h ) of 1D, 2D and 3D “known” games to determine the prior probability of 1D, 2D and 3D hypotheses.
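The parameterized prior can be sketched directly from this description: hypotheses whose dimensionality matches the instruction get weight 1, lower-dimensional ones w_l, higher-dimensional ones w_h, followed by normalization; unknown games average the three known-game priors.

```python
import numpy as np

# 9 one-dimensional, 27 two-dimensional, 27 three-dimensional hypotheses
DIMS_OF_H = np.array([1] * 9 + [2] * 27 + [3] * 27)

def hypothesis_prior(D, w_l, w_h):
    """Prior over the 63 hypotheses in a known game with D instructed
    relevant dimensions: weight 1 for matching dimensionality, w_l for
    lower, w_h for higher, then normalized."""
    w = np.where(DIMS_OF_H == D, 1.0,
                 np.where(DIMS_OF_H < D, w_l, w_h))
    return w / w.sum()

def prior_unknown(w_l, w_h):
    """Unknown games: average of the 1D, 2D and 3D known-game priors."""
    return sum(hypothesis_prior(D, w_l, w_h) for D in (1, 2, 3)) / 3
```

With w_l = w_h = 0 the prior concentrates on the instructed dimensionality (strictly following instructions); with w_l = w_h = 1 it is uniform over all 63 hypotheses (ignoring instructions), matching the two limiting cases in the text.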

The generative model of participants’ choice behavior is assumed to contain three parts: the hypothesis-testing policy (whether to stay with the current hypothesis or switch to a new one), the hypothesis-switching policy (what the next hypothesis should be when switching hypotheses), and the choice policy given the currently tested hypothesis. The first two policies together determine the transition from the hypothesis on the previous trial to the current one, and the choice policy determines the mapping between the current hypothesis and the choice.

Following [ 26 ], we consider the following hypothesis testing policy: on each trial, the participant estimates the reward probability of the current hypothesis. Using a uniform Dirichlet prior, this is equivalent to counting how many times they have been rewarded since they started testing this hypothesis. The estimated reward probability is then compared to a soft threshold θ to determine whether to stay with this hypothesis or to switch to a different one:

where P̂_reward = (reward count + 1) / (trial count + 2) is the estimated reward probability, and β stay and θ are free parameters. If the participant decides to switch, they randomly switch to any other hypothesis according to the prior over hypotheses specified in Eq 7 (i.e., the random hypothesis-switch policy):
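The stay/switch decision can be sketched as below. The count-based estimate follows the text exactly; the logistic form of the soft threshold is an assumed implementation consistent with the description.

```python
import math

def p_switch(reward_count, trial_count, theta, beta_stay):
    """Probability of abandoning the current hypothesis: the count-based
    reward estimate (uniform Dirichlet prior, i.e. Laplace smoothing) is
    compared to the soft threshold theta via a logistic function (an
    assumed functional form)."""
    p_hat = (reward_count + 1) / (trial_count + 2)
    return 1 / (1 + math.exp(-beta_stay * (theta - p_hat)))
```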

Finally, we assume a choice policy where participants configure stimuli according to their hypothesis most of the time, but with a lapse rate of λ choose any configuration randomly.

Value-based serial hypothesis testing model

The value-based SHT model is the same as the random-switch SHT model, except that it uses a value-based hypothesis-switch policy. It maintains a set of feature values updated according to the feature RL with decay model, as in Eq 3 (but with a single learning rate), and calculates the expected reward for each alternative hypothesis by adding up its constituent feature values, similar to Eq 1 but for h instead of c . The probability of switching to h t ≠ h t −1 is:

where β switch is a free parameter.
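A sketch of this value-based switch policy: a softmax over the summed feature values of each candidate hypothesis, excluding the currently tested one. Hypotheses are represented as dicts mapping dimension to rewarding feature, and V is a 3x3 value array as in the feature RL model.

```python
import numpy as np

def switch_probs(V, hypotheses, current, beta_switch):
    """Softmax over summed feature values of each candidate hypothesis;
    the currently tested hypothesis (index `current`) is excluded."""
    er = np.array([sum(V[d, f] for d, f in h.items()) for h in hypotheses])
    w = np.exp(beta_switch * (er - er.max()))
    w[current] = 0.0               # cannot "switch" to the same hypothesis
    return w / w.sum()
```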

Variants of the value-based SHT model

We considered several variants of the value-based SHT model by modifying the hypothesis-testing policy and the choice policy of the baseline value-based SHT model described above.

Not always testing a hypothesis

In the experiment, the participant could choose not to select any feature, and let the computer configure a random stimulus. Many participants did so, especially in the beginning of each game, potentially due to not having a good candidate hypothesis in mind. To model this, we add a soft threshold on hypothesis testing: if the expected reward of the best candidate hypothesis is below a threshold θ test , participants will be unlikely to test any hypothesis:

β test and θ test are additional free parameters of this model. This mechanism was applied to the first trial of each game and at hypothesis switch points.
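This gate on whether to test at all can be sketched as a logistic function of the best candidate's expected reward; as above, the logistic form is an assumption consistent with "soft threshold".

```python
import math

def p_test(best_candidate_er, theta_test, beta_test):
    """Soft threshold on whether to test any hypothesis at all: if even
    the best candidate's expected reward falls below theta_test, the
    participant likely lets the computer pick a random stimulus."""
    return 1 / (1 + math.exp(-beta_test * (best_candidate_er - theta_test)))
```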

Alternative hypothesis-testing policy: Using reward probability information

In the experiment, participants were informed of the reward probabilities for all game conditions ( Table 1 ). Our baseline model did not make use of this information. One way to use such information is to consider a target reward probability RP target for the current hypothesis h . If the hypothesis dimension D ( h ) is equal to or larger than the instructed dimension of a game (in known games) D , the hypothesis should attain the highest possible reward probability if all features in h are rewarding features so RP target = 0.8. However, if D ( h ) < D , the target should be lower. For example, when testing the same one-dimensional hypothesis, participants should expect a higher reward probability if they are in a 1D game ( RP target = 0.8) compared to in a 3D game ( RP target = 0.4). In “known” games, we therefore assumed that participants set their thresholds θ for switching hypotheses according to this target reward probability, with a free-parameter offset δ :

For “unknown” games, we assume participants use the average RP target of 1D, 2D and 3D “known” games, such that RP target = 0.6, 0.733 and 0.8 for 1D, 2D and 3D hypotheses, respectively.
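The unknown-game targets stated in the text can be sketched as a lookup; note that the sign convention for the offset δ is an assumption here (the text only says the threshold is set "with a free-parameter offset").

```python
def target_rp_unknown(dims_of_hypothesis):
    """Target reward probability in unknown games: the average of the
    known-game targets, i.e. 0.6, 0.733 and 0.8 for 1D, 2D and 3D
    hypotheses (values stated in the text)."""
    return {1: 0.6, 2: 0.733, 3: 0.8}[dims_of_hypothesis]

def switch_threshold(rp_target, delta):
    """Flexible switching threshold: target reward probability plus a
    fitted offset delta (sign convention assumed)."""
    return rp_target + delta
```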

Alternative choice policy: Selecting more features than prescribed by the hypothesis

In the baseline model, participants’ choices are assumed to be aligned with their current hypothesis, unless they lapse in their choice. In the experiment, however, we observed an overall tendency to select more features than instructed ( Fig 2B ). This was not surprising as there was no cost to selecting more features. In fact, it was strictly optimal to always make selections on all dimensions, as there was always a best feature within each dimension (at least equally good as the other two), and holding all features fixed helps test the current hypothesis (the computer randomly chooses features for any unselected dimensions, meaning that reward attained could be due to those features and not the hypothesis tested). Thus, we assumed in this alternative model that participants may select more features than their current hypothesis h t . The probability for choices that are supersets of h t was determined by the difference in the numbers of dimensions compared to h t , with a decay rate k as a free parameter:

In this model, participants could still lapse, meaning that all choices that are not supersets of h t were equally likely, with probabilities that summed to λ.
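The superset choice policy can be sketched as follows. The k**(extra dimensions) weighting and the lapse mass follow the description; the exact normalization scheme is an assumption of this sketch.

```python
import itertools

# all 63 non-empty configurations, each a dict {dimension: feature}
CONFIGS = [dict(zip(dims, feats))
           for n in (1, 2, 3)
           for dims in itertools.combinations(range(3), n)
           for feats in itertools.product(range(3), repeat=n)]

def superset_choice_probs(h, k, lam):
    """Supersets of the current hypothesis h share probability mass
    1 - lam, weighted by k**(number of extra dimensions); all other
    configurations evenly share the lapse mass lam."""
    def is_superset(c):
        return all(c.get(d) == f for d, f in h.items())
    w = [k ** (len(c) - len(h)) if is_superset(c) else 0.0 for c in CONFIGS]
    total = sum(w)
    n_other = sum(1 for x in w if x == 0.0)
    return [(1 - lam) * x / total if x > 0 else lam / n_other for x in w]
```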

Model fitting and model comparison

We fit the models to each participant’s data using maximum likelihood estimation. We used the minimize function (L-BFGS-B algorithm) in Python package scipy.optimize as the optimizer; each optimization was repeated 10 times with random starting points. Models were evaluated with leave-one-game-out cross-validation: the likelihood of each game was calculated using the parameters obtained by fitting the other 17 games; the geometric average likelihood per trial across all held-out games is reported (i.e., total log likelihood across all trials a participant played divided by number of trials and exponentiated, and then averaged over participants).
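The cross-validation loop can be sketched as below. A coarse grid search stands in for the paper's scipy.optimize L-BFGS-B optimizer so the sketch needs nothing beyond NumPy, and the one-parameter Bernoulli model is purely an illustrative stand-in for the behavioral models.

```python
import numpy as np

def fit_loo_cv(games, neg_log_lik, grid):
    """Leave-one-game-out cross-validation: fit the parameter on all
    other games, evaluate on the held-out game, and return the geometric
    average likelihood per trial across all held-out trials."""
    total_ll, total_trials = 0.0, 0
    for i, held_out in enumerate(games):
        train = [g for j, g in enumerate(games) if j != i]
        best = min(grid, key=lambda p: sum(neg_log_lik(p, g) for g in train))
        total_ll -= neg_log_lik(best, held_out)
        total_trials += len(held_out)
    return np.exp(total_ll / total_trials)

# Toy stand-in model: a single Bernoulli reward-probability parameter.
def nll(p, game):
    return -sum(np.log(p if r else 1 - p) for r in game)

rng = np.random.default_rng(0)
games = [list(rng.random(30) < 0.7) for _ in range(18)]
score = fit_loo_cv(games, nll, grid=np.linspace(0.05, 0.95, 19))
```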

Supporting information

(A, B) Same as Fig 2A and 2B but aggregated by known vs. unknown games. (C) Post-game responses to questions about the rewarding features in each game condition. Kwn = known games, Unk = unknown games. After each game, participants were asked to report the rewarding feature for each dimension, or indicate this dimension as irrelevant to reward. Responses are classified into five categories. Correct feature: correctly identifying a rewarding feature; Incorrect feature: incorrectly reporting a non-rewarding feature as rewarding for a relevant dimension; Miss relevance: reporting a relevant dimension as irrelevant; False positive: incorrectly reporting a rewarding feature for an irrelevant dimension; Correct rejection: correctly identifying an irrelevant dimension. (D, E, F) The type of feature selection, the number of features changed in choices, and the type of choice change as a function of trial index, broken down by game types. (D) The number of features selected by participants was broken down into three types: correct, incorrect or false positive (i.e., selecting a feature when that dimension was irrelevant), and summed across three dimensions. Over the game, the number of correct features increased and the number of incorrect features decreased, consistent across all game types and indicating learning. The trends were mostly consistent between known and unknown games, except for 1D games: false positive responses decreased in the known condition but stayed steady in the unknown condition. These results are consistent with post-game questions ( Fig 2D ; participants were more likely to make false-positive responses in 1D unknown games compared to 1D known games). Interestingly, when games were more complex (e.g., 2D games), participants were unable to reduce false positive responses over time even in the known condition.
(E) The average number of features changed from one choice to the next, for all trials (upper panel) and only for trials with a choice change (lower panel). Overall, participants changed more features in their choice at the beginning of a game, and this decreased over time. The pattern was mostly consistent across game types, except for 1D games: the reduction was slower in the known condition compared to the unknown condition. Specifically, in 1D known games, participants continued to change their choices in the later part of the game, despite already obtaining a high reward rate, suggesting that they were trying to further narrow down and find the exact rewarding feature, potentially driven by the game instruction (one dimension was relevant). This is consistent with a lower false-positive rate in 1D known games compared to 1D unknown games. In 3D games, this pattern was reversed, likely because participants knew there was no need to narrow down in 3D known games after achieving the maximal reward rate. (F) Choice change was divided into five categories: adding features (e.g. red to red circle), dropping features (e.g. red circle to red), switching within dimension (e.g. red circle to blue circle), switching across dimensions (e.g. red to circle), and all other changes (any mixture of the previous four types, e.g. red circle to blue). Among the five types, switching within dimension was the most common. There were very few occurrences of the mixture type (“Others”), whereas for a random-choice policy, this would be the most common type. This suggests that participants tended to make local, systematic changes in their choices, further supporting a serial hypothesis testing process.

(A) Model fits broken down for each game type. (B) The fitted prior probability for 1/2/3D hypothesis (x-axis) in different game types (subplots) in the main value-based SHT model. In known games, participants had a higher prior probability for the hypotheses consistent with the task instructions (darker red bars). In unknown games, more complex hypotheses were deemed a priori more likely.

Top and fourth rows are identical to Figs 2A, 2B , 4B and 4C , respectively.

(A) A diagram of the SHT models compared in the main text. Different variants for each model assumption are presented in colored boxes: in gray are the assumptions adopted by the baseline model; colors denote the different variants tested. (B) Difference in average likelihood per trial between variants of the SHT models and the baseline value-based SHT model. Each model except the full model is only different from the baseline model by one assumption as noted in the label; the full model adopts the better alternative in every assumption. Bar colors correspond to those in panel A, except for the full model (in white). Specifically, the purple bar corresponds to the random-switch SHT model. Error bars represent ±1 s.e.m. across participants.

Funding Statement

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability


Mathematics LibreTexts

8.1: The Elements of Hypothesis Testing


Learning Objectives

  • To understand the logical framework of tests of hypotheses.
  • To learn basic terminology connected with hypothesis testing.
  • To learn fundamental facts about hypothesis testing.

Types of Hypotheses

A hypothesis about a population parameter is an assertion about the value of that parameter. As in the introductory example we will be concerned with testing the truth of two competing hypotheses, only one of which can be true.

Definition: null hypothesis and alternative hypothesis

  • The null hypothesis , denoted \(H_0\), is the statement about the population parameter that is assumed to be true unless there is convincing evidence to the contrary.
  • The alternative hypothesis , denoted \(H_a\), is a statement about the population parameter that is contradictory to the null hypothesis, and is accepted as true only if there is convincing evidence in favor of it.

Definition: statistical procedure

Hypothesis testing is a statistical procedure in which a choice is made between a null hypothesis and an alternative hypothesis based on information in a sample.

The end result of a hypothesis testing procedure is a choice of one of the following two possible conclusions:

  • Reject \(H_0\) (and therefore accept \(H_a\)), or
  • Fail to reject \(H_0\) (and therefore fail to accept \(H_a\)).

The null hypothesis typically represents the status quo, or what has historically been true. In the example of the respirators, we would believe the claim of the manufacturer unless there is reason not to do so, so the null hypothesis is \(H_0:\mu =75\). The alternative hypothesis in the example is the contradictory statement \(H_a:\mu <75\). The null hypothesis will always be an assertion containing an equals sign, but depending on the situation the alternative hypothesis can have any one of three forms: with the symbol \(<\), as in the example just discussed, with the symbol \(>\), or with the symbol \(\neq\). The following two examples illustrate the latter two cases.

Example \(\PageIndex{1}\)

A publisher of college textbooks claims that the average price of all hardbound college textbooks is \(\$127.50\). A student group believes that the actual mean is higher and wishes to test their belief. State the relevant null and alternative hypotheses.

The default option is to accept the publisher’s claim unless there is compelling evidence to the contrary. Thus the null hypothesis is \(H_0:\mu =127.50\). Since the student group thinks that the average textbook price is greater than the publisher’s figure, the alternative hypothesis in this situation is \(H_a:\mu >127.50\).

Example \(\PageIndex{2}\)

The recipe for a bakery item is designed to result in a product that contains \(8\) grams of fat per serving. The quality control department samples the product periodically to ensure that the production process is working as designed. State the relevant null and alternative hypotheses.

The default option is to assume that the product contains the amount of fat it was formulated to contain unless there is compelling evidence to the contrary. Thus the null hypothesis is \(H_0:\mu =8.0\). Since to contain either more fat than desired or to contain less fat than desired are both an indication of a faulty production process, the alternative hypothesis in this situation is that the mean is different from \(8.0\), so \(H_a:\mu \neq 8.0\).

In Example \(\PageIndex{1}\), the textbook example, it might seem more natural that the publisher’s claim be that the average price is at most \(\$127.50\), not exactly \(\$127.50\). If the claim were made this way, then the null hypothesis would be \(H_0:\mu \leq 127.50\), and the value \(\$127.50\) given in the example would be the one that is least favorable to the publisher’s claim, the null hypothesis. It is always true that if the null hypothesis is retained for its least favorable value, then it is retained for every other value.

Thus in order to make the null and alternative hypotheses easy for the student to distinguish, in every example and problem in this text we will always present one of the two competing claims about the value of a parameter with an equality. The claim expressed with an equality is the null hypothesis. This is the same as always stating the null hypothesis in the least favorable light. So in the introductory example about the respirators, we stated the manufacturer’s claim as “the average is \(75\) minutes” instead of the perhaps more natural “the average is at least \(75\) minutes,” essentially reducing the presentation of the null hypothesis to its worst case.

The first step in hypothesis testing is to identify the null and alternative hypotheses.

The Logic of Hypothesis Testing

Although we will study hypothesis testing in situations other than for a single population mean (for example, for a population proportion instead of a mean or in comparing the means of two different populations), in this section the discussion will always be given in terms of a single population mean \(\mu\).

The null hypothesis always has the form \(H_0:\mu =\mu _0\) for a specific number \(\mu _0\) (in the respirator example \(\mu _0=75\), in the textbook example \(\mu _0=127.50\), and in the baked goods example \(\mu _0=8.0\)). Since the null hypothesis is accepted unless there is strong evidence to the contrary, the test procedure is based on the initial assumption that \(H_0\) is true. This point is so important that we will repeat it in a display:

The test procedure is based on the initial assumption that \(H_0\) is true.

The criterion for judging between \(H_0\) and \(H_a\) based on the sample data is: if the value of \(\overline{X}\) would be highly unlikely to occur if \(H_0\) were true, but favors the truth of \(H_a\), then we reject \(H_0\) in favor of \(H_a\). Otherwise we do not reject \(H_0\).

Supposing for now that \(\overline{X}\) follows a normal distribution, when the null hypothesis is true the density function for the sample mean \(\overline{X}\) must be as in Figure \(\PageIndex{1}\): a bell curve centered at \(\mu _0\). Thus if \(H_0\) is true then \(\overline{X}\) is likely to take a value near \(\mu _0\) and is unlikely to take values far away. Our decision procedure therefore reduces simply to:

  • if \(H_a\) has the form \(H_a:\mu <\mu _0\) then reject \(H_0\) if \(\bar{x}\) is far to the left of \(\mu _0\);
  • if \(H_a\) has the form \(H_a:\mu >\mu _0\) then reject \(H_0\) if \(\bar{x}\) is far to the right of \(\mu _0\);
  • if \(H_a\) has the form \(H_a:\mu \neq \mu _0\) then reject \(H_0\) if \(\bar{x}\) is far away from \(\mu _0\) in either direction.

Figure \(\PageIndex{1}\): The density curve of \(\overline{X}\) under \(H_0\), a bell curve centered at \(\mu _0\).

Think of the respirator example, for which the null hypothesis is \(H_0:\mu =75\), the claim that the average time air is delivered for all respirators is \(75\) minutes. If the sample mean is \(75\) or greater then we certainly would not reject \(H_0\) (since there is no issue with an emergency respirator delivering air even longer than claimed).

If the sample mean is slightly less than \(75\) then we would logically attribute the difference to sampling error and also not reject \(H_0\) either.

Values of the sample mean that are smaller and smaller are less and less likely to come from a population for which the population mean is \(75\). Thus if the sample mean is far less than \(75\), say around \(60\) minutes or less, then we would certainly reject \(H_0\), because we know that it is highly unlikely that the average of a sample would be so low if the population mean were \(75\). This is the rare event criterion for rejection: what we actually observed \((\overline{X}<60)\) would be so rare an event if \(\mu =75\) were true that we regard it as much more likely that the alternative hypothesis \(\mu <75\) holds.

In summary, to decide between \(H_0\) and \(H_a\) in this example we would select a “rejection region” of values sufficiently far to the left of \(75\), based on the rare event criterion, and reject \(H_0\) if the sample mean \(\overline{X}\) lies in the rejection region, but not reject \(H_0\) if it does not.

The Rejection Region

Each different form of the alternative hypothesis \(H_a\) has its own kind of rejection region:

  • if (as in the respirator example) \(H_a\) has the form \(H_a:\mu <\mu _0\), we reject \(H_0\) if \(\bar{x}\) is far to the left of \(\mu _0\), that is, to the left of some number \(C\), so the rejection region has the form of an interval \((-\infty ,C]\);
  • if (as in the textbook example) \(H_a\) has the form \(H_a:\mu >\mu _0\), we reject \(H_0\) if \(\bar{x}\) is far to the right of \(\mu _0\), that is, to the right of some number \(C\), so the rejection region has the form of an interval \([C,\infty )\);
  • if (as in the baked good example) \(H_a\) has the form \(H_a:\mu \neq \mu _0\), we reject \(H_0\) if \(\bar{x}\) is far away from \(\mu _0\) in either direction, that is, either to the left of some number \(C\) or to the right of some other number \(C′\), so the rejection region has the form of the union of two intervals \((-\infty ,C]\cup [C',\infty )\).

The key issue in our line of reasoning is the question of how to determine the number \(C\) or numbers \(C\) and \(C′\), called the critical value or critical values of the statistic, that determine the rejection region.

Definition: critical values

The critical value or critical values of a test of hypotheses are the number or numbers that determine the rejection region.

Suppose the rejection region is a single interval, so we need to select a single number \(C\). Here is the procedure for doing so. We select a small probability, denoted \(\alpha\), say \(1\%\), which we take as our definition of “rare event:” an event is “rare” if its probability of occurrence is less than \(\alpha\). (In all the examples and problems in this text the value of \(\alpha\) will be given already.) The probability that \(\overline{X}\) takes a value in an interval is the area under its density curve and above that interval, so as shown in Figure \(\PageIndex{2}\) (drawn under the assumption that \(H_0\) is true, so that the curve centers at \(\mu _0\)) the critical value \(C\) is the value of \(\overline{X}\) that cuts off a tail area \(\alpha\) in the probability density curve of \(\overline{X}\). When the rejection region is in two pieces, that is, composed of two intervals, the total area above both of them must be \(\alpha\), so the area above each one is \(\alpha /2\), as also shown in Figure \(\PageIndex{2}\).
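Under these assumptions (\(\overline{X}\) normal with known mean \(\mu _0\) and standard error), the critical values can be computed with `scipy.stats.norm.ppf`; this is a minimal sketch of the procedure just described:

```python
from scipy.stats import norm

def critical_values(mu0, se, alpha, tail):
    """Critical value(s) of the sample mean cutting off tail area alpha,
    assuming X-bar ~ Normal(mu0, se) under H0."""
    if tail == "left":                       # rejection region (-inf, C]
        return mu0 + norm.ppf(alpha) * se
    if tail == "right":                      # rejection region [C, inf)
        return mu0 + norm.ppf(1 - alpha) * se
    z = norm.ppf(1 - alpha / 2)              # two tails, area alpha/2 in each
    return mu0 - z * se, mu0 + z * se
```

For instance, `critical_values(8.0, 0.067, 0.10, "two")` reproduces the pair of values \((7.89,\ 8.11)\) found in Example \(\PageIndex{3}\) below.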

Figure \(\PageIndex{2}\): Rejection regions: the critical value \(C\) cuts off a tail of area \(\alpha\); for a two-piece rejection region, each tail has area \(\alpha /2\).

The number \(\alpha\) is the total area of a tail or a pair of tails.

Example \(\PageIndex{3}\)

In the context of Example \(\PageIndex{2}\), suppose that it is known that the population is normally distributed with standard deviation \(\sigma =0.15\) gram, and suppose that the test of hypotheses \(H_0:\mu =8.0\) versus \(H_a:\mu \neq 8.0\) will be performed with a sample of size \(5\). Construct the rejection region for the test for the choice \(\alpha =0.10\). Explain the decision procedure and interpret it.

If \(H_0\) is true then the sample mean \(\overline{X}\) is normally distributed with mean and standard deviation

\[\begin{align} \mu _{\overline{X}} &=\mu \nonumber \\[5pt] &=8.0 \nonumber \end{align} \nonumber \]

\[\begin{align} \sigma _{\overline{X}}&=\dfrac{\sigma}{\sqrt{n}} \nonumber \\[5pt] &= \dfrac{0.15}{\sqrt{5}} \nonumber\\[5pt] &=0.067 \nonumber \end{align} \nonumber \]

Since \(H_a\) contains the \(\neq\) symbol the rejection region will be in two pieces, each one corresponding to a tail of area \(\alpha /2=0.10/2=0.05\). From Figure 7.1.6, \(z_{0.05}=1.645\), so \(C\) and \(C′\) are \(1.645\) standard deviations of \(\overline{X}\) to the right and left of its mean \(8.0\):

\[C=8.0-(1.645)(0.067) = 7.89 \; \; \text{and}\; \; C'=8.0 + (1.645)(0.067) = 8.11 \nonumber \]

The result is shown in Figure \(\PageIndex{3}\) (\(\alpha =0.10\)).

The decision procedure is: take a sample of size \(5\) and compute the sample mean \(\bar{x}\). If \(\bar{x}\) is either \(7.89\) grams or less or \(8.11\) grams or more then reject the hypothesis that the average amount of fat in all servings of the product is \(8.0\) grams in favor of the alternative that it is different from \(8.0\) grams. Otherwise do not reject the hypothesis that the average amount is \(8.0\) grams.

The reasoning is that if the true average amount of fat per serving were \(8.0\) grams then there would be less than a \(10\%\) chance that a sample of size \(5\) would produce a mean of either \(7.89\) grams or less or \(8.11\) grams or more. Hence if that happened it would be more likely that the value \(8.0\) is incorrect (always assuming that the population standard deviation is \(0.15\) gram).
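This probability claim can be checked numerically with `scipy.stats.norm` (a sketch; the small discrepancy from \(0.10\) comes from rounding \(C\) and \(C'\) to two decimals):

```python
from math import sqrt
from scipy.stats import norm

# Under H0: X-bar ~ Normal(8.0, 0.15/sqrt(5)), values from the example.
se = 0.15 / sqrt(5)                                    # ≈ 0.067
p_reject = norm.cdf(7.89, loc=8.0, scale=se) + norm.sf(8.11, loc=8.0, scale=se)
# p_reject is close to 0.10: the chance, if H0 were true, that a sample of
# size 5 lands in the rejection region.
```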

Because the rejection regions are computed based on areas in tails of distributions, as shown in Figure \(\PageIndex{2}\), hypothesis tests are classified according to the form of the alternative hypothesis in the following way.

Definitions: Test classifications

  • If \(H_a\) has the form \(\mu \neq \mu _0\) the test is called a two-tailed test .
  • If \(H_a\) has the form \(\mu < \mu _0\) the test is called a left-tailed test .
  • If \(H_a\) has the form \(\mu > \mu _0\) the test is called a right-tailed test .

Each of the last two forms is also called a one-tailed test .

Two Types of Errors

The format of the testing procedure in general terms is to take a sample and use the information it contains to come to a decision about the two hypotheses. As stated before our decision will always be either

  • reject the null hypothesis \(H_0\) in favor of the alternative \(H_a\) presented, or
  • do not reject the null hypothesis \(H_0\) in favor of the alternative \(H_a\) presented.

There are four possible outcomes of the hypothesis testing procedure, as shown in the following table:

| | \(H_0\) is true | \(H_0\) is false |
|---|---|---|
| Do not reject \(H_0\) | Correct decision | Type II error |
| Reject \(H_0\) | Type I error | Correct decision |

As the table shows, there are two ways to be right and two ways to be wrong. Typically to reject \(H_0\) when it is actually true is a more serious error than to fail to reject it when it is false, so the former error is labeled “ Type I ” and the latter error “ Type II ”.

Definition: Type I and Type II errors

In a test of hypotheses:

  • A Type I error is the decision to reject \(H_0\) when it is in fact true.
  • A Type II error is the decision not to reject \(H_0\) when it is in fact not true.

Unless we perform a census we do not have certain knowledge, so we do not know whether our decision matches the true state of nature or if we have made an error. We reject \(H_0\) if what we observe would be a “rare” event if \(H_0\) were true. But rare events are not impossible: they occur with probability \(\alpha\). Thus when \(H_0\) is true, a rare event will be observed in the proportion \(\alpha\) of repeated similar tests, and \(H_0\) will be erroneously rejected in those tests. Thus \(\alpha\) is the probability that in following the testing procedure to decide between \(H_0\) and \(H_a\) we will make a Type I error.

Definition: level of significance

The number \(\alpha\) that is used to determine the rejection region is called the level of significance of the test. It is the probability that the test procedure will result in a Type I error .

The probability of making a Type II error is too complicated to discuss in a beginning text, so we will say no more about it than this: for a fixed sample size, choosing \(\alpha\) smaller in order to reduce the chance of making a Type I error has the effect of increasing the chance of making a Type II error . The only way to simultaneously reduce the chances of making either kind of error is to increase the sample size.
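This trade-off can be illustrated by simulation (a sketch using assumed values \(\mu _0=75\), \(\sigma =10\), \(n=25\), chosen only for illustration): when \(H_0\) is true, rejections occur in about \(\alpha\) of repeated tests; when \(H_0\) is false, shrinking \(\alpha\) raises the Type II error rate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def reject(sample, mu0, sigma, alpha):
    """Left-tailed z-test: reject H0: mu = mu0 when the sample mean is far below mu0."""
    z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
    return z <= norm.ppf(alpha)

n_sims, n, sigma = 20_000, 25, 10.0
# Type I error rate: H0 true (mu = 75); rejections happen in about alpha of tests.
type1 = np.mean([reject(rng.normal(75, sigma, n), 75, sigma, 0.05)
                 for _ in range(n_sims)])
# Type II error rate: H0 false (mu = 70); a smaller alpha makes it larger.
type2_a05 = np.mean([not reject(rng.normal(70, sigma, n), 75, sigma, 0.05)
                     for _ in range(n_sims)])
type2_a01 = np.mean([not reject(rng.normal(70, sigma, n), 75, sigma, 0.01)
                     for _ in range(n_sims)])
```

With these settings the simulated Type I rate lands near \(0.05\), while the Type II rate grows noticeably when \(\alpha\) is tightened from \(0.05\) to \(0.01\).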

Standardizing the Test Statistic

Hypotheses testing will be considered in a number of contexts, and great unification as well as simplification results when the relevant sample statistic is standardized by subtracting its mean from it and then dividing by its standard deviation. The resulting statistic is called a standardized test statistic . In every situation treated in this and the following two chapters the standardized test statistic will have either the standard normal distribution or Student’s \(t\)-distribution.

Definition: hypothesis test

A standardized test statistic for a hypothesis test is the statistic that is formed by subtracting from the statistic of interest its mean and dividing by its standard deviation.

For example, reviewing Example \(\PageIndex{3}\), if instead of working with the sample mean \(\overline{X}\) we instead work with the test statistic

\[\frac{\overline{X}-8.0}{0.067} \nonumber \]

then the distribution involved is standard normal and the critical values are just \(\pm z_{0.05}\). The extra work that was done to find that \(C=7.89\) and \(C′=8.11\) is eliminated. In every hypothesis test in this book the standardized test statistic will be governed by either the standard normal distribution or Student’s \(t\)-distribution. Information about rejection regions is summarized in the following tables:

| Symbol in \(H_a\) | Terminology | Rejection region (standard normal \(Z\)) |
|---|---|---|
| \(<\) | Left-tailed test | \((-\infty ,-z_\alpha ]\) |
| \(>\) | Right-tailed test | \([z_\alpha ,\infty )\) |
| \(\neq\) | Two-tailed test | \((-\infty ,-z_{\alpha /2}]\cup [z_{\alpha /2},\infty )\) |

| Symbol in \(H_a\) | Terminology | Rejection region (Student’s \(t\)) |
|---|---|---|
| \(<\) | Left-tailed test | \((-\infty ,-t_\alpha ]\) |
| \(>\) | Right-tailed test | \([t_\alpha ,\infty )\) |
| \(\neq\) | Two-tailed test | \((-\infty ,-t_{\alpha /2}]\cup [t_{\alpha /2},\infty )\) |
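A quick numerical check, using the values from Example \(\PageIndex{3}\) and a few hypothetical sample means, that the standardized rule gives the same decisions as the critical values \(C=7.89\) and \(C'=8.11\):

```python
from scipy.stats import norm

# Values from Example 3: mu0 = 8.0, se = 0.067, two-tailed test at alpha = 0.10.
z_crit = norm.ppf(0.95)                      # z_{0.05} ≈ 1.645
for xbar in (7.80, 7.95, 8.05, 8.20):        # hypothetical sample means
    by_critical_values = xbar <= 7.89 or xbar >= 8.11
    by_z_statistic = abs((xbar - 8.0) / 0.067) >= z_crit
    assert by_critical_values == by_z_statistic
```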

Every instance of hypothesis testing discussed in this and the following two chapters will have a rejection region like one of the six forms tabulated in the tables above.

No matter what the context a test of hypotheses can always be performed by applying the following systematic procedure, which will be illustrated in the examples in the succeeding sections.

Systematic Hypothesis Testing Procedure: Critical Value Approach

  • Identify the null and alternative hypotheses.
  • Identify the relevant test statistic and its distribution.
  • Compute from the data the value of the test statistic.
  • Construct the rejection region.
  • Compare the value computed in Step 3 to the rejection region constructed in Step 4 and make a decision. Formulate the decision in the context of the problem, if applicable.

The procedure that we have outlined in this section is called the “Critical Value Approach” to hypothesis testing to distinguish it from an alternative but equivalent approach that will be introduced at the end of Section 8.3.
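The five-step procedure can be sketched as a one-sample \(z\)-test with known \(\sigma\) (the concrete numbers in the usage note below are illustrative assumptions, not values given in the text):

```python
from math import sqrt
from scipy.stats import norm

def z_test(xbar, n, mu0, sigma, alpha, tail):
    """Critical value approach for H0: mu = mu0, with sigma known.

    tail: 'left' (Ha: mu < mu0), 'right' (Ha: mu > mu0), 'two' (Ha: mu != mu0).
    Returns the test statistic and the decision.
    """
    # Steps 1-2: hypotheses identified; test statistic Z is standard normal under H0.
    z = (xbar - mu0) / (sigma / sqrt(n))                 # Step 3: compute statistic
    if tail == "left":                                   # Step 4: rejection region
        in_region = z <= norm.ppf(alpha)
    elif tail == "right":
        in_region = z >= norm.ppf(1 - alpha)
    else:
        in_region = abs(z) >= norm.ppf(1 - alpha / 2)
    # Step 5: compare and decide.
    return z, ("reject H0" if in_region else "fail to reject H0")
```

For instance, with the respirator hypotheses and assumed values \(\bar{x}=73\), \(n=30\), \(\sigma =6\), the call `z_test(73, 30, 75, 6, 0.05, "left")` rejects \(H_0\).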

Key Takeaway

  • A test of hypotheses is a statistical process for deciding between two competing assertions about a population parameter.
  • The testing procedure is formalized in a five-step procedure.

The Importance of Hypothesis Testing


A hypothesis is a theory or proposition set forth as an explanation for the occurrence of some observed phenomenon, asserted either as a provisional conjecture to guide investigation, called a working hypothesis, or accepted as highly probable in light of established facts. A scientific hypothesis can become a theory or ultimately a law of nature if it is supported by repeatable experiments. Hypothesis testing is common in statistics as a method of making decisions using data. In other words, testing a hypothesis is trying to determine if your observation of some phenomenon is likely to have really occurred based on statistics.

Statistical Hypothesis Testing

Statistical hypothesis testing, also called confirmatory data analysis, is often used to decide whether experimental results contain enough information to cast doubt on conventional wisdom. For example, at one time it was thought that people of certain races or color had inferior intelligence compared to Caucasians. A hypothesis was made that intelligence is not based on race or color. People of various races, colors and cultures were given intelligence tests and the data was analyzed. Statistical hypothesis testing then showed that the results were statistically significant: the similar intelligence measurements across races were not merely due to sampling error.

Null and Alternative Hypotheses

Before testing for phenomena, you form a hypothesis of what might be happening. Your hypothesis or guess about what’s occurring might be, for example, that certain groups are different from each other, or that intelligence is not correlated with skin color, or that some treatment has an effect on an outcome measure. From this, there are two possibilities: a “null hypothesis” that nothing happened, or there were no differences, or no cause and effect; or that you were correct in your theory, which is labeled the “alternative hypothesis.” In short, when you test a statistical hypothesis, you are trying to see if something happened and are comparing against the possibility that nothing happened. Confusingly, you are trying to disprove that nothing happened. If you disprove that nothing happened, then you can conclude that something happened.

Importance of Hypothesis Testing

According to the San Jose State University Statistics Department, hypothesis testing is one of the most important concepts in statistics because it is how you decide if something really happened, or if certain treatments have positive effects, or if groups differ from each other or if one variable predicts another. In short, you want to show that your data are statistically significant and unlikely to have occurred by chance alone. In essence then, a hypothesis test is a test of significance.

Possible Conclusions

Once the statistics are collected and you test your hypothesis against the likelihood of chance, you draw your final conclusion. If you reject the null hypothesis, you are claiming that your result is statistically significant and that it did not happen by luck or chance. As such, the outcome supports the alternative hypothesis. If you fail to reject the null hypothesis, you must conclude that you did not find an effect or difference in your study. This method is how many pharmaceutical drugs and medical procedures are tested.
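This decision can be sketched in Python with `scipy.stats.ttest_1samp`; the scores below are made-up illustrative data for a one-sample test of \(H_0:\mu =50\):

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical outcome scores for ten treated subjects; H0: population mean = 50.
scores = np.array([54, 52, 55, 53, 51, 56, 50, 54, 52, 53])
t_stat, p_value = ttest_1samp(scores, popmean=50)
decision = "reject H0" if p_value <= 0.05 else "fail to reject H0"
```

Here the sample mean of \(53\) is far enough above \(50\), relative to the spread, that the null hypothesis is rejected; failing to reject would instead mean no detectable effect was found.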


  • Dictionary.com: Definition of Hypothesis
  • San Jose State University Statistics Department: Introduction to Hypothesis Testing

About the Author

Sirah Dubois is currently a PhD student in food science after having completed her master's degree in nutrition at the University of Alberta. She has worked in private practice as a dietitian in Edmonton, Canada and her nutrition-related articles have appeared in The Edmonton Journal newspaper.


