
How to Perform Hypothesis Testing in Python (With Examples)

A hypothesis test is a formal statistical test we use to reject or fail to reject some statistical hypothesis.

This tutorial explains how to perform the following hypothesis tests in Python:

  • One sample t-test
  • Two sample t-test
  • Paired samples t-test

Let’s jump in!

Example 1: One Sample t-test in Python

A one sample t-test is used to test whether or not the mean of a population is equal to some value.

For example, suppose we want to know whether or not the mean weight of a certain species of turtle is equal to 310 pounds.

To test this, we go out and collect a simple random sample of turtles with the following weights:

Weights : 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303

The following code shows how to use the ttest_1samp() function from the scipy.stats library to perform a one sample t-test:
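A minimal sketch of this test is shown here. The variable name data is illustrative; ttest_1samp() receives the sample and the hypothesized population mean (popmean) and runs a two-sided test by default.

```python
import scipy.stats as stats

# Turtle weights from the sample above (the variable name is illustrative)
data = [300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303]

# Test H0: population mean = 310 (two-sided by default)
result = stats.ttest_1samp(a=data, popmean=310)

print(f"t statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
```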

The t test statistic is -1.5848 and the corresponding two-sided p-value is 0.1389.

The two hypotheses for this particular one sample t-test are as follows:

  • H0: µ = 310 (the mean weight for this species of turtle is 310 pounds)
  • HA: µ ≠ 310 (the mean weight is not 310 pounds)

Because the p-value of our test (0.1389) is greater than alpha = 0.05, we fail to reject the null hypothesis of the test.

We do not have sufficient evidence to say that the mean weight for this particular species of turtle is different from 310 pounds.

Example 2: Two Sample t-test in Python

A two sample t-test is used to test whether or not the means of two populations are equal.

For example, suppose we want to know whether or not two different species of turtles have the same mean weight.

To test this, we collect a simple random sample of turtles from each species with the following weights:

Sample 1 : 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303

Sample 2 : 335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305

The following code shows how to use the ttest_ind() function from the scipy.stats library to perform this two sample t-test:
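A minimal sketch of this test follows. The variable names sample1 and sample2 are illustrative, and equal_var=True (the SciPy default) is assumed, which gives the standard pooled-variance test.

```python
import scipy.stats as stats

# Turtle weights for each species (the variable names are illustrative)
sample1 = [300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303]
sample2 = [335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305]

# equal_var=True runs the standard pooled-variance two sample t-test
result = stats.ttest_ind(a=sample1, b=sample2, equal_var=True)

print(f"t statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
```

If the two populations cannot be assumed to have equal variances, passing equal_var=False runs Welch's t-test instead.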

The t test statistic is -2.1009 and the corresponding two-sided p-value is 0.0463.

The two hypotheses for this particular two sample t-test are as follows:

  • H0: µ1 = µ2 (the mean weight between the two species is equal)
  • HA: µ1 ≠ µ2 (the mean weight between the two species is not equal)

Since the p-value of the test (0.0463) is less than .05, we reject the null hypothesis.

This means we have sufficient evidence to say that the mean weight between the two species is not equal.

Example 3: Paired Samples t-test in Python

A paired samples t-test is used to compare the means of two samples when each observation in one sample can be paired with an observation in the other sample.

For example, suppose we want to know whether or not a certain training program is able to increase the max vertical jump (in inches) of basketball players.

To test this, we may recruit a simple random sample of 12 college basketball players and measure each of their max vertical jumps. Then, we may have each player use the training program for one month and then measure their max vertical jump again at the end of the month.

The following data shows the max jump height (in inches) before and after using the training program for each player:

Before : 22, 24, 20, 19, 19, 20, 22, 25, 24, 23, 22, 21

After : 23, 25, 20, 24, 18, 22, 23, 28, 24, 25, 24, 20

The following code shows how to use the ttest_rel() function from the scipy.stats library to perform this paired samples t-test:
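A minimal sketch of this test follows. The variable names before and after are illustrative; ttest_rel() pairs the observations by position, so both lists must list the players in the same order.

```python
import scipy.stats as stats

# Max vertical jump (inches) for the same 12 players, in the same order
before = [22, 24, 20, 19, 19, 20, 22, 25, 24, 23, 22, 21]
after = [23, 25, 20, 24, 18, 22, 23, 28, 24, 25, 24, 20]

# Paired test on the per-player differences (before - after)
result = stats.ttest_rel(a=before, b=after)

print(f"t statistic: {result.statistic:.4f}")
print(f"p-value: {result.pvalue:.4f}")
```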

The t test statistic is -2.5289 and the corresponding two-sided p-value is 0.0280.

The two hypotheses for this particular paired samples t-test are as follows:

  • H0: µ1 = µ2 (the mean jump height before and after using the program is equal)
  • HA: µ1 ≠ µ2 (the mean jump height before and after using the program is not equal)

Since the p-value of the test (0.0280) is less than .05, we reject the null hypothesis.

This means we have sufficient evidence to say that the mean jump height before and after using the training program is not equal.

Additional Resources

You can use the following online calculators to automatically perform various t-tests:

  • One Sample t-test Calculator
  • Two Sample t-test Calculator
  • Paired Samples t-test Calculator

What Is Hypothesis Testing? Types and Python Code Example


Curiosity has always been a part of human nature. Since the beginning of time, this has been one of the most important tools for birthing civilizations. Still, our curiosity grows — it tests and expands our limits. Humanity has explored the plains of land, water, and air. We've built underwater habitats where we could live for weeks. Our civilization has explored various planets. We've explored land to an unlimited degree.

These things were possible because humans asked questions and searched until they found answers. However, for us to get these answers, a proven method must be used and followed through to validate our results. Historically, philosophers assumed the earth was flat and you would fall off when you reached the edge. While philosophers like Aristotle argued that the earth was spherical based on the formation of the stars, they could not prove it at the time.

This is because they didn't have adequate resources to explore space or mathematically prove Earth's shape. It was a Greek mathematician named Eratosthenes who calculated the earth's circumference with incredible precision. He used scientific methods to show that the Earth was not flat. Since then, other methods have been used to prove the Earth's spherical shape.

When there are questions or statements that are yet to be tested and confirmed based on some scientific method, they are called hypotheses. Basically, we have two types of hypotheses: null and alternate.

A null hypothesis is one's default belief or argument about a subject matter. In the case of the earth's shape, the null hypothesis was that the earth was flat.

An alternate hypothesis is a belief or argument a person might try to establish. Aristotle and Eratosthenes argued that the earth was spherical.

Other examples of alternate hypotheses include:

  • The weather may have an impact on a person's mood.
  • More people wear suits on Mondays compared to other days of the week.
  • Children are more likely to be brilliant if both parents are in academia.

What is Hypothesis Testing?

Hypothesis testing is the act of testing whether a hypothesis or inference is true. When an alternate hypothesis is introduced, we test it against the null hypothesis to know which is correct. Let's use a plant experiment by a 12-year-old student to see how this works.

The hypothesis is that a plant will grow taller when given a certain type of fertilizer. The student takes two samples of the same plant, fertilizes one, and leaves the other unfertilized. He measures the plants' height every few days and records the results in a table.

After a week or two, he compares the final height of both plants to see which grew taller. If the plant given fertilizer grew taller, the result supports the hypothesis; if not, the hypothesis is not supported. With only one plant in each group, this is evidence rather than proof, which is why real studies use larger samples and a statistical test. Still, this simple experiment shows how to form a hypothesis, test it experimentally, and analyze the results.

In hypothesis testing, there are two types of error: Type I and Type II.

When we reject the null hypothesis in a case where it is correct, we've committed a Type I error. Type II errors occur when we fail to reject the null hypothesis when it is incorrect.

In our plant experiment above, the null hypothesis is that fertilizer makes no difference to plant growth. If fertilizer truly has no effect but the student concludes from his measurements that it helps plants grow, he has committed a Type I error (a false positive).

However, if fertilizer truly does help plants grow but the student concludes that it makes no difference, he has committed a Type II error (a false negative) because he has failed to reject a null hypothesis that is false.
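The Type I error rate can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are hypothetical draws from a normal population whose true mean is 0, so the null hypothesis is true in every simulated experiment, and the sample sizes and seed are arbitrary choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments, n_per_sample = 2000, 30

# Every sample comes from a population whose true mean is 0,
# so the null hypothesis (mean = 0) is true in every experiment.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_sample))

# One t-test per row; any rejection here is a Type I error.
p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
type_1_rate = np.mean(p_values < alpha)

print(f"Share of true nulls rejected: {type_1_rate:.3f}")
```

The share of rejections should land close to alpha, which is what the significance level is designed to control.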

What are the Steps in Hypothesis Testing?

The following steps explain how we can test a hypothesis:

Step #1 - Define the Null and Alternative Hypotheses

Before making any test, we must first define what we are testing and what the default assumption is about the subject. In this article, we'll be testing if the average weight of 10-year-old children is more than 32kg.

Our null hypothesis is that 10-year-old children weigh 32 kg on average. Our alternate hypothesis is that the average weight is more than 32 kg. H0 denotes a null hypothesis, while H1 denotes an alternate hypothesis.

Step #2 - Choose a Significance Level

The significance level, usually written as alpha, is the threshold we use to decide whether a result is statistically significant. It is the probability of a Type I error that we are willing to accept, and it ensures that we are not just relying on luck but have enough evidence to support our claims. We set the significance level before conducting the test, and we later compare the p-value produced by the test against it.

The p-value is the probability of seeing a result at least as extreme as the one we observed if the null hypothesis were true. A lower p-value means stronger evidence against the null hypothesis. A p-value is not the probability that the null hypothesis is true; it only tells us how surprising our data would be if chance alone were at work. A significance level of 0.05 is widely used in most fields of science. For our test, the significance level will be 0.05.

Step #3 - Collect Data and Calculate a Test Statistic

You can obtain your data from online data stores or conduct your research directly. Data can be scraped or researched online. The methodology might depend on the research you are trying to conduct.

We can calculate our test using any of the appropriate hypothesis tests. This can be a T-test, Z-test, Chi-squared, and so on. There are several hypothesis tests, each suiting different purposes and research questions. In this article, we'll use the T-test to run our hypothesis, but I'll explain the Z-test, and chi-squared too.

The t-test is used to compare means when we don't know the population standard deviation: either the mean of one sample against a fixed value, or the means of two groups against each other. It's a parametric test, meaning it makes assumptions about the distribution of the data. These assumptions include that the data is approximately normally distributed and, for the standard two-sample version, that the variances of the two groups are equal. In a simpler and more practical sense, imagine that we have test scores in a class for males and females, but we don't know how spread out the scores are in the wider population. We can use a t-test to see if there's a real difference between the group averages.

The Z-test is used for the same kinds of comparison when the population standard deviation is known. It is also a parametric test and assumes that the data is normally distributed, or that the sample is large enough for the sample mean to be approximately normal. In our class test example, if we already know how spread out the scores are in both groups, we can use the z-test instead of the t-test to see if there's a difference in the average scores.

The chi-squared test is used to examine the relationship between two or more categorical variables. It is a non-parametric test, meaning it does not assume that the data follows a normal distribution, although it does expect the counts in each category to be reasonably large. It can be used to test a variety of hypotheses, including whether two or more groups have equal proportions.
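As a brief illustration of the chi-squared test, the sketch below uses SciPy's chi2_contingency() on a hypothetical 2x2 table of counts (the numbers and the group labels are made up for this example). Yates' continuity correction is switched off so the statistic matches the textbook formula.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = group A / group B, columns = passed / failed
observed = np.array([[30, 10],
                     [20, 20]])

# correction=False turns off Yates' continuity correction for 2x2 tables
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(f"chi-squared: {chi2:.4f}, degrees of freedom: {dof}")
print(f"p-value: {p_value:.4f}")
```

A small p-value here would suggest that the pass rate differs between the two groups.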

Step #4 - Decide on the Null Hypothesis Based on the Test Statistic and Significance Level

After conducting our test and calculating the test statistic, we convert it to a p-value and compare that p-value to the predetermined significance level. If the p-value is less than the significance level (equivalently, if the test statistic falls beyond the critical value for that level), we reject the null hypothesis, indicating that there is sufficient evidence to support our alternative hypothesis.

Conversely, if the p-value is greater than or equal to the significance level, we fail to reject the null hypothesis, signifying that we do not have enough statistical evidence to conclude in favor of the alternative hypothesis.
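The decision rule can be sketched in two equivalent ways: comparing the p-value with the significance level, or comparing the test statistic with a critical value. The sketch below reuses the t statistic reported in the one sample turtle example earlier (13 observations, so 12 degrees of freedom); the variable names are illustrative.

```python
from scipy import stats

alpha = 0.05
t_stat, df = -1.5848, 12  # one sample turtle example: n = 13, so df = 12

# Two-sided p-value: probability of a |t| at least this large under H0
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Two-sided critical value for the chosen significance level
t_crit = stats.t.ppf(1 - alpha / 2, df)

reject_by_p_value = p_value < alpha
reject_by_critical_value = abs(t_stat) > t_crit

print(f"p-value: {p_value:.4f}, critical value: {t_crit:.4f}")
print("Reject H0" if reject_by_p_value else "Fail to reject H0")
```

Both comparisons always lead to the same decision; the p-value form is simply the one most software reports.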

Step #5 - Interpret the Results

Depending on the decision made in the previous step, we can interpret the result in the context of our study and the practical implications. For our case study, we can interpret whether we have significant evidence to support our claim that the average weight of 10 year old children is more than 32kg or not.

For our test, we are generating random dummy data for the weight of the children. We'll use a t-test to evaluate whether our hypothesis is correct or not.
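A minimal sketch of such a program is shown here, reconstructed from the description in the rest of this section. The seed and the names null_mean and alpha are assumptions added for illustration, and because the data is random the exact numbers will differ from any particular run.

```python
import numpy as np
from scipy import stats

np.random.seed(42)  # assumption: seeded only so the sketch is repeatable

# 100 random integer weights from 20 to 39 kg (the upper bound 40 is excluded)
data = np.random.randint(20, 40, 100)

null_mean = 32  # H0: the average weight is 32 kg
alpha = 0.05    # significance level

# Two-sided one sample t-test of the sample mean against null_mean
t_stat, p_value = stats.ttest_1samp(data, null_mean)

print("t-statistic:", t_stat)
print("p-value:", p_value)

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```

Note that this call is two-sided, so it tests whether the mean differs from 32 kg in either direction. To test specifically for "more than 32 kg", SciPy 1.6 and later accept alternative="greater" in ttest_1samp().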

For a better understanding, let's look at what each block of code does.

The first block is the import statement, where we import numpy and scipy.stats. NumPy is a Python library used for scientific computing. It has a large library of functions for working with arrays. SciPy is a library for mathematical functions. It has a stats module for performing statistical functions, and that's what we'll be using for our t-test.

The weights of the children were generated at random since we aren't working with an actual dataset. The random module within the NumPy library provides a function for generating random integers, which is randint.

The randint function takes three arguments. The first (20) is the lower bound of the random numbers to be generated. The second (40) is the upper bound, which is exclusive, so the largest value that can be generated is 39. The third (100) specifies the number of random integers to generate. That is, we are generating random weight values for 100 children. In real circumstances, these weight samples would have been obtained by weighing the required number of children.

Using the code above, we declared our null and alternate hypotheses stating the average weight of a 10-year-old in both cases.

t_stat and p_value are the variables in which we'll store the results of our function call. stats.ttest_1samp is the function that calculates our test. It takes two arguments: the first is the data variable that stores the array of weights for the children (or the dataset, when working with real-world data), and the second (32) is the value against which we'll test the mean of that array.

The code above prints the values of t_stat and p_value.

Lastly, we evaluated our p_value against our significance value, which is 0.05. If our p_value is less than 0.05, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis. Below is the output of this program. Our null hypothesis was rejected.

In this article, we discussed the importance of hypothesis testing. We highlighted how science has advanced human knowledge and civilization through formulating and testing hypotheses.

We discussed Type I and Type II errors in hypothesis testing and how they underscore the importance of careful consideration and analysis in scientific inquiry. They reinforce the idea that conclusions should be drawn based on thorough statistical analysis rather than assumptions or biases.

We also generated a sample dataset using the relevant Python libraries and used the needed functions to calculate and test our alternate hypothesis.


What Is Hypothesis Testing in Python: A Hands-On Tutorial

In software testing, there is an approach known as property-based testing that leverages the concept of formal specification of code behavior and focuses on asserting properties that hold true for a wide range of inputs rather than individual test cases.

Hypothesis is an open-source Python library for property-based testing. It provides a framework for generating diverse and random test data, allowing development and testing teams to thoroughly test their code against a broad spectrum of inputs.

In this blog, we will explore the fundamentals of Hypothesis testing in Python using Selenium and Playwright. We’ll learn various aspects of Hypothesis testing, from basic usage to advanced strategies, and demonstrate how it can improve the robustness and reliability of the codebase.


What Is a Hypothesis Library?


Hypothesis is a property-based testing library that automates test data generation based on properties or invariants defined by the developers and testers.

In property-based testing, instead of specifying individual test cases, developers define general properties that the code should satisfy. Hypothesis then generates a wide range of input data to test these properties automatically.

Property-based testing using Hypothesis allows developers and testers to focus on defining the behavior of their code rather than writing specific test cases, resulting in more comprehensive testing coverage and the discovery of edge cases and unexpected behavior.

Writing property-based tests usually consists of deciding on guarantees our code should make – properties that should always hold, regardless of what the world throws at the code.

Examples of such guarantees can be:

  • Your code shouldn’t throw an exception or should only throw a particular type of exception (this works particularly well if you have a lot of internal assertions).
  • If you delete an object, it is no longer visible.
  • If you serialize and then deserialize a value, you get the same value back.
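The last guarantee in the list above can be sketched as a property-based test. This example assumes JSON as the serialization format and dictionaries of text keys and integer values as the input; it uses the @given decorator and the strategies covered later in this article.

```python
import json

from hypothesis import given, strategies as st


@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trip(value):
    # Property: decoding an encoded value returns the original value
    assert json.loads(json.dumps(value)) == value
```

No individual input is written by hand: the test states the property once and Hypothesis supplies the dictionaries.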

Before we proceed further, it’s worthwhile to understand decorators in Python a bit since the Hypothesis library exposes decorators that we need to use to write tests.

In Python, decorators are a powerful feature that allows you to modify or extend the behavior of functions or classes without changing their source code. Decorators are essentially functions themselves, which take another function (or class) as input and return a new function (or class) with added functionality.

Decorators are denoted by the @ symbol followed by the name of the decorator function placed directly before the definition of the function or class to be modified.

Let us understand this with the help of an example:
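The following is a minimal sketch of such a decorator. The names authenticate() and create_post() come from the discussion in this section, while the shape of the user dictionary and the error raised are assumptions made for illustration.

```python
from functools import wraps


def authenticate(func):
    """Run an authentication check before calling the wrapped function."""
    @wraps(func)
    def wrapper(user, *args, **kwargs):
        # Hypothetical check: the user dict carries an is_authenticated flag
        if not user.get("is_authenticated", False):
            raise PermissionError("User is not authenticated")
        return func(user, *args, **kwargs)
    return wrapper


@authenticate
def create_post(user, title):
    return f"{user['name']} created post: {title}"


print(create_post({"name": "ana", "is_authenticated": True}, "Hello"))
```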


In the example above, only authenticated users are allowed to create_post() . The logic to check authentication is wrapped in its own function, authenticate() .

This function can now be applied by writing @authenticate directly above any function that needs it, and Python will run the code of authenticate() before calling the decorated function.

If we no longer need the authentication logic in the future, we can simply remove the @authenticate line without disturbing the core logic. Thus, decorators are a powerful construct in Python that allows repetitive logic to be plugged into any function or method.

Now that we know the concept of Python decorators, let us understand the decorators that Hypothesis provides.

Hypothesis @given Decorator

This decorator turns a test function that accepts arguments into a randomized test. It serves as the main entry point to Hypothesis.

The @given decorator can be used to specify which arguments of a function should be parameterized over. We can use either positional or keyword arguments, but not a mixture of both.

    hypothesis.given(*_given_arguments, **_given_kwargs)

Some valid declarations of the @given decorator are:

    @given(integers(), integers())
    def a(x, y):
        pass

    @given(integers())
    def b(x, y):
        pass

    @given(y=integers())
    def c(x, y):
        pass

    @given(x=integers())
    def d(x, y):
        pass

    @given(x=integers(), y=integers())
    def e(x, **kwargs):
        pass

    @given(x=integers(), y=integers())
    def f(x, *args, **kwargs):
        pass

    class SomeTest(TestCase):
        @given(integers())
        def test_a_thing(self, x):
            pass

Some invalid declarations of @given are:

    @given(integers(), integers(), integers())
    def g(x, y):
        pass

    @given(integers())
    def h(x, *args):
        pass

    @given(integers(), x=integers())
    def i(x, y):
        pass

    @given()
    def j(x, y):
        pass
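To see @given in a complete, runnable form, here is a small sketch with two illustrative properties (the test names and the properties themselves are made up for this example). One uses positional strategies and the other uses the keyword form.

```python
from hypothesis import given
from hypothesis.strategies import integers


@given(integers(), integers())
def test_addition_is_commutative(x, y):
    # Hypothesis calls this function many times with generated x and y
    assert x + y == y + x


@given(y=integers())
def test_subtracting_itself_gives_zero(y):
    # Keyword form: the strategy is bound to the argument named y
    assert y - y == 0
```

Running these under pytest, or simply calling the functions, lets Hypothesis generate the inputs.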

Hypothesis @example Decorator

When writing production-grade applications, Hypothesis's ability to generate a wide range of input test data plays a crucial role in ensuring robustness.

However, there are certain inputs/scenarios the testing team might deem mandatory to be tested as part of every test run. Hypothesis has the @example decorator in such cases where we can specify values we always want to be tested. The @example decorator works for all strategies.

Let’s understand by tweaking the factorial test example.
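A minimal sketch of such a test is shown here, assuming the factorial() function and property used later in this article. Floor division is used in this sketch so the arithmetic stays exact for large values.

```python
from hypothesis import example, given, strategies as st


def factorial(num: int) -> int:
    if num < 0:
        raise ValueError("Input must be >= 0")
    fact = 1
    for i in range(1, num + 1):
        fact *= i
    return fact


@given(st.integers(min_value=1, max_value=30))
@example(41)  # always tested, in addition to the generated inputs
def test_factorial(num: int):
    # Property: n! divided by (n - 1)! equals n
    assert factorial(num) // factorial(num - 1) == num
```

Values passed to @example do not have to fall inside the range of the strategy, which is why 41 is accepted even though the strategy stops at 30.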


The above test will always run for the input value 41, along with the other test data generated by the Hypothesis st.integers() strategy.

By now, we understand that the crux of Hypothesis is to test a function for a wide range of inputs. These inputs are generated automatically, and Hypothesis lets us configure the range of inputs. Under the hood, strategies take care of generating this test data with the correct data type.

Hypothesis offers a wide range of strategies such as integers, text, booleans, datetimes, etc. For more complex scenarios, which we will see a bit later in this blog, Hypothesis also lets us set up composite strategies.

While not exhaustive, here is a summary of what the strategies available as part of the Hypothesis library do:

  • Generates None values.
  • Generates boolean values (True or False).
  • Generates integer values.
  • Generates floating-point values.
  • Generates unicode text strings.
  • Generates single unicode characters.
  • Generates lists of elements.
  • Generates tuples of elements.
  • Generates dictionaries with specified keys and values.
  • Generates sets of elements.
  • Generates binary data.
  • Generates datetime objects.
  • Generates timedelta objects.
  • Chooses one of the given strategies with equal probability.
  • Chooses values from a given sequence with equal probability.
  • Generates date objects.
  • Generates a single value.
  • Generates strings that match a given regular expression.
  • Generates UUID objects.
  • Generates complex numbers.
  • Generates fraction objects.
  • Builds objects using a provided constructor and strategy for each argument.
  • Generates arbitrary data values.
  • Generates values that are shared between different parts of a test.
  • Generates recursively structured data.
  • Generates data based on the outcome of other strategies.
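As a short sketch of how a few of these strategies combine in practice, the tests below use lists(), integers(), sampled_from(), one_of(), none(), booleans(), and text() from hypothesis.strategies. The properties and test names are illustrative only.

```python
from hypothesis import given, strategies as st


@given(
    st.lists(st.integers(), max_size=20),
    st.sampled_from(["asc", "desc"]),
)
def test_sort_preserves_elements(values, order):
    result = sorted(values, reverse=(order == "desc"))
    # Sorting must neither drop nor invent elements
    assert len(result) == len(values)
    assert sorted(result) == sorted(values)


@given(st.one_of(st.none(), st.booleans(), st.text(max_size=5)))
def test_one_of_yields_listed_types(value):
    # one_of() draws from any of the strategies it is given
    assert value is None or isinstance(value, (bool, str))
```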

Let's see how to set up a test environment to perform Hypothesis testing in Python.

  • Create a separate virtual environment for this project using Python's built-in venv module.

  • Activate the newly created virtual environment using the activate script present within the environment.

  • Install the Hypothesis library necessary for property-based testing using the pip install hypothesis command. The installed packages can be viewed using the pip list command. At the time of writing this blog, the latest version of Hypothesis is 6.102.4. For this article, we have used Hypothesis version 6.99.6.

  • Install the python-dotenv, pytest, Playwright, and Selenium packages, which we will need to run the tests on the cloud. We will talk about this in more detail later in the blog.


With the setup done, let us now understand Hypothesis testing in Python with various examples, starting with the introductory one and then working toward more complex ones.


Let's now start writing tests to understand how we can leverage the Hypothesis library to perform Python automation.

For this, let’s look at one test scenario to understand Hypothesis testing in Python.

Test Scenario:


This is what the initial implementation of the function looks like:

    def factorial(num: int) -> int:
        if num < 0:
            raise ValueError("Input must be >= 0")
        fact = 1
        for i in range(1, num + 1):
            fact *= i
        return fact

It takes in an integer as an input. If the input is negative, it raises an error; if not, it uses the range() function to generate the numbers from 1 to num, iterates over them, calculates the factorial, and returns it.

Let’s now write a test using the Hypothesis library to test the above function:

    from hypothesis import given, strategies as st

    @given(st.integers(min_value=1, max_value=30))
    def test_factorial(num: int):
        fact_num_result = factorial(num)
        fact_num_minus_one_result = factorial(num - 1)
        result = fact_num_result / fact_num_minus_one_result
        assert num == result

Code Walkthrough:

Let’s now understand the step-by-step code walkthrough for Hypothesis testing in Python.

Step 1: From the Hypothesis library, we import the given decorator and the strategies module.

Step 2: Using the imported given and strategies, we set our test strategy of passing integer inputs within the range of 1 to 30 to the function under test using the min_value and max_value arguments.

Step 3: We write the actual test_factorial where the integer generated by our strategy is passed automatically by Hypothesis into the value num.

Using this value, we call the factorial function once for num and once for num - 1.

Next, we divide the factorial of num by the factorial of num - 1 and assert that the result of the operation is equal to the original num.

Test Execution:

Let's now execute our hypothesis test using the pytest -v -k "test_factorial" command.

Hypothesis finds no counterexample, which tells us that our function satisfies the property for the inputs it tried, i.e., integers from 1 to 30.

We can also view detailed statistics of the Hypothesis run by passing the --hypothesis-show-statistics argument to the pytest command:

    pytest -v --hypothesis-show-statistics -k "test_factorial"

The difference between the reuse and generate phase in the output above is explained below:

  • Reuse Phase: During the reuse phase, Hypothesis replays examples saved in its example database from earlier runs, in particular inputs that previously made the test fail. If a test fails, Hypothesis then tries to shrink the failing example to a minimal failing case in a separate shrink phase.

This phase typically has a very short runtime, as it only involves replaying existing test data. The output provides statistics about the typical runtimes and the number of passing, failing, and invalid examples encountered during this phase.

  • Generate Phase: During the generate phase, Hypothesis generates new test data based on the defined strategies. This phase involves generating a wide range of inputs to test the properties defined by the developer.

The output provides statistics about the typical runtimes and the number of passing, failing, and invalid examples generated during this phase.

While this helped us understand what passing tests look like with Hypothesis, it's also worthwhile to understand how Hypothesis can catch bugs in the code.

Let's rewrite the factorial() function with an obvious bug, i.e., remove the check for negative input values.

    def factorial(num: int) -> int:
        # if num < 0:
        #     raise ValueError("Number must be >= 0")
        fact = 1
        for i in range(1, num + 1):
            fact *= i
        return fact

We also tweak the test to remove the min_value and max_value arguments.

    @given(st.integers())
    def test_factorial(num: int):
        fact_num_result = factorial(num)
        fact_num_minus_one_result = factorial(num - 1)
        result = int(fact_num_result / fact_num_minus_one_result)
        assert num == result

Let us now rerun the test with the same command:

    pytest -v --hypothesis-show-statistics -k test_factorial

We can clearly see how Hypothesis has caught the bug immediately, which is shown in the above output. Hypothesis presents the input that resulted in the failing test under the Falsifying example section of the output.
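To see why the property breaks, it helps to evaluate it by hand for a few inputs. The sketch below copies the buggy function under the hypothetical name buggy_factorial() and checks the same property without Hypothesis; the helper property_holds() is also introduced only for this illustration.

```python
def buggy_factorial(num: int) -> int:
    # The negative-input check is missing, as in the modified version above
    fact = 1
    for i in range(1, num + 1):
        fact *= i
    return fact


def property_holds(num: int) -> bool:
    # Same property as the test: n! / (n - 1)! should equal n
    return int(buggy_factorial(num) / buggy_factorial(num - 1)) == num


print(property_holds(5))   # 120 / 24 is 5, so the property holds
print(property_holds(0))   # both calls return 1, and 1 is not 0
print(property_holds(-3))  # both calls return 1, and 1 is not -3
```

For zero and for negative inputs, range() is empty or nearly so, and the function quietly returns 1 instead of raising an error, so the property fails.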


So far, we've performed Hypothesis testing locally. This works nicely for unit tests, but when setting up automation for building more robust and resilient test suites, we can leverage a cloud grid like LambdaTest that supports automation testing tools like Selenium and Playwright.

LambdaTest is an AI-powered test orchestration and execution platform that enables developers and testers to perform automation testing with Selenium and Playwright at scale. It provides a remote test lab of 3000+ real environments.

How to Perform Hypothesis Testing in Python Using Cloud Selenium Grid?

Selenium is an open-source suite of tools and libraries for web automation. When combined with a cloud grid, it can help you perform Hypothesis testing in Python with Selenium at scale.

Let’s look at one test scenario to understand Hypothesis testing in Python with Selenium.

The code to set up a connection to LambdaTest Selenium Grid is stored in a file.

    import os
    from time import sleep

    import urllib3
    from dotenv import load_dotenv
    from selenium import webdriver
    from selenium.webdriver import FirefoxOptions

    # Load LT_USERNAME and LT_ACCESS_KEY from the environment (.env file)
    load_dotenv()
    username = os.getenv('LT_USERNAME', None)
    access_key = os.getenv('LT_ACCESS_KEY', None)


    class CrossBrowserSetup:
        def __init__(self):
            urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
            # Remote grid URL built from the credentials
            self.remote_url = "https://" + str(username) + ":" + str(access_key) + ""

        def add(self, browsertype):
            web_driver = None
            if browsertype == "Firefox":
                ff_options = FirefoxOptions()
                ff_options.browser_version = "latest"
                ff_options.platform_name = "Windows 11"

                lt_options = {}
                lt_options["build"] = "Build: FF: Hypothesis Testing with Selenium & Pytest"
                lt_options["project"] = "Project: FF: Hypothesis Testing with Selenium & Pytest"
                lt_options["name"] = "Test: FF: Hypothesis Testing with Selenium & Pytest"
                lt_options["browserName"] = "Firefox"
                lt_options["browserVersion"] = "latest"
                lt_options["platformName"] = "Windows 11"
                lt_options["console"] = "error"
                lt_options["w3c"] = True
                lt_options["headless"] = False
                ff_options.set_capability('LT:Options', lt_options)

                web_driver = webdriver.Remote(
                    command_executor=self.remote_url,
                    options=ff_options
                )
                self.driver = web_driver
                self.driver.get("")
                sleep(1)

            # A driver exists only if the browser type was Firefox
            if web_driver is not None:
                web_driver.execute_script("lambda-status=passed")
                web_driver.quit()
                return True
            else:
                return False

The test file contains the code to check that the tests will only run on the Firefox browser.

    from hypothesis import example, given, settings
    import hypothesis.strategies as strategy
    from src.crossbrowser_selenium import CrossBrowserSetup


    @settings(deadline=None)
    @given(strategy.just("Firefox"))
    # @given(strategy.just("Chrome"))
    def test_add(browsertype_1):
        cbt = CrossBrowserSetup()
        assert True == cbt.add(browsertype_1)

Let’s now understand the step-by-step code walkthrough for Hypothesis testing in Python using Selenium Grid.

Step 1: We import the necessary Selenium methods to initiate a connection to LambdaTest Selenium Grid.

The FirefoxOptions() method is used to configure the setup when connecting to LambdaTest Selenium Grid using Firefox.

 FirefoxOptions() method is used to configure the setup

Step 2: We call load_dotenv() from the python-dotenv package to load LT_USERNAME and LT_ACCESS_KEY, the credentials required to access the LambdaTest Selenium Grid, which are stored as environment variables.

The LT_ACCESS_KEY can be obtained from your LambdaTest Profile > Account Settings > Password & Security .

Step 3: We initialize the CrossBrowserSetup class, which prepares the remote connection URL using the username and access_key.

Step 4: The add() method is responsible for checking the browsertype and then setting the capabilities of the LambdaTest Selenium Grid.

LambdaTest offers a variety of capabilities, such as cross browser testing , which means we can test on various operating systems such as Windows, Linux, and macOS and multiple browsers such as Chrome, Firefox, Edge, and Safari.

For the purpose of this blog, we will be testing that connection to the LambdaTest Selenium Grid should only happen if the browsertype is Firefox.

Step 5: If the connection to LambdaTest succeeds, add() returns True; otherwise, it returns False.

Let’s now walk through the test file step by step.

Step 1: We import the @given decorator and the Hypothesis strategies module. We also import the CrossBrowserSetup class.

Step 2: @settings(deadline=None) ensures the test doesn’t time out if the connection to the LambdaTest Grid takes longer than Hypothesis allows for a single example by default.

We use the @given decorator with strategy.just("Firefox") so that the browsertype_1 argument of test_add() is always Firefox. We then create an instance of the CrossBrowserSetup class, call the add() method with browsertype_1, and assert that it returns True.

The commented-out strategy @given(strategy.just('Chrome')) demonstrates that the add() method, when called with Chrome, returns False.
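The behaviour of just() can be tried out without a Selenium Grid. The following is a minimal offline sketch: connects_only_for_firefox is a hypothetical stand-in for CrossBrowserSetup.add() that mirrors only the True/False contract described above and opens no browser.

```python
from hypothesis import given, settings, strategies as st


def connects_only_for_firefox(browsertype):
    # Hypothetical stand-in for CrossBrowserSetup.add(): no network call is made.
    return browsertype == "Firefox"


@settings(deadline=None, database=None)
@given(st.just("Firefox"))
def test_add_firefox(browsertype):
    # st.just() always yields the same single value.
    assert connects_only_for_firefox(browsertype) is True


@settings(deadline=None, database=None)
@given(st.just("Chrome"))
def test_add_chrome(browsertype):
    assert connects_only_for_firefox(browsertype) is False


if __name__ == "__main__":
    # A @given-decorated test can be called directly, outside pytest.
    test_add_firefox()
    test_add_chrome()
    print("both properties hold")
```

Because just() produces exactly one value, it is mainly useful for pinning one argument while other arguments vary.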

Let’s now run the test using pytest -k “”.

We can see that the test has passed, and the Web Automation Dashboard reflects that the connection to the Selenium Grid has been successful.

On opening one of the execution runs, we can see a detailed step-by-step test execution.

How to Perform Hypothesis Testing in Python Using Cloud Playwright Grid?

Playwright is a popular open-source tool for end-to-end testing developed by Microsoft. When combined with a cloud grid, it can help you perform Hypothesis testing in Python at scale.

Let’s look at one test scenario to understand Hypothesis testing in Python with Playwright.

import os
from dotenv import load_dotenv
from playwright.sync_api import expect, sync_playwright
from hypothesis import given, strategies as st
import subprocess
import urllib.parse
import json

load_dotenv()

capabilities = {
    'browserName': 'Chrome',  # Browsers allowed: `Chrome`, `MicrosoftEdge`, `pw-chromium`, `pw-firefox` and `pw-webkit`
    'browserVersion': 'latest',
    'LT:Options': {
        'platform': 'Windows 11',
        'build': 'Playwright Hypothesis Demo Build',
        'name': 'Playwright Locators Test For Windows 11 & Chrome',
        'user': os.getenv('LT_USERNAME'),
        'accessKey': os.getenv('LT_ACCESS_KEY'),
        'network': True,
        'video': True,
        'visual': True,
        'console': True,
        'tunnel': False,   # Add tunnel configuration if testing locally hosted webpage
        'tunnelName': '',  # Optional
        'geoLocation': '', # country code can be fetched from
    }
}

def interact_with_lambdatest(quantity):
    with sync_playwright() as playwright:
        playwrightVersion = str(subprocess.getoutput('playwright --version')).strip().split(" ")[1]
        capabilities['LT:Options']['playwrightClientVersion'] = playwrightVersion
        lt_cdp_url = 'wss://' + urllib.parse.quote(json.dumps(capabilities))
        browser = playwright.chromium.connect(lt_cdp_url)
        page = browser.new_page()
        page.goto("")
        page.get_by_role("button", name="Shop by Category").click()
        page.get_by_role("link", name="MP3 Players").click()
        page.get_by_role("link", name="HTC Touch HD HTC Touch HD HTC Touch HD HTC Touch HD").click()
        page.get_by_role("button", name="Add to Cart").click(click_count=quantity)
        page.get_by_role("link", name="Checkout ")
        unit_price = float(page.get_by_role("cell", name="$146.00").first.inner_text().replace("$",""))
        page.evaluate("_ => {}", "lambdatest_action: {\"action\": \"setTestStatus\", \"arguments\": {\"status\":\"" + "Passed" + "\", \"remark\": \"" + "pass" + "\"}}" )
        page.close()
        total_price = quantity * unit_price
        return total_price

quantity_strategy = st.integers(min_value=1, max_value=10)

@given(quantity=quantity_strategy)
def test_website_interaction(quantity):
    assert interact_with_lambdatest(quantity) == quantity * 146.00

Let’s now understand the step-by-step code walkthrough for Hypothesis testing in Python using Playwright Grid.

Step 1: To connect to the LambdaTest Playwright Grid, we need a Username and Access Key, which can be obtained from the Profile page > Account Settings > Password & Security.

We use the python-dotenv module to load the Username and Access Key, which are stored as environment variables.

The capabilities dictionary is used to set up the Playwright Grid on LambdaTest.

We configure the Grid to use Windows 11 and the latest version of Chrome.


Step 3: The function interact_with_lambdatest interacts with the LambdaTest eCommerce Playground website to simulate adding a product to the cart and proceeding to checkout.

It starts a Playwright session and retrieves the version of the Playwright being used. The LambdaTest CDP URL is created with the appropriate capabilities. It connects to the Chromium browser instance on LambdaTest.

A new page instance is created, and the LambdaTest eCommerce Playground website is navigated. The specified product is added to the cart by clicking through the required buttons and links. The unit price of the product is extracted from the web page.

The browser page is then closed.


Step 4: We define a Hypothesis strategy quantity_strategy using st.integers to generate random integers representing product quantities. The generated integers range from 1 to 10, inclusive.

Using the @given decorator from the Hypothesis library, we define a property-based test function test_website_interaction that takes a quantity parameter generated by the quantity_strategy .

Inside the test function, we use the interact_with_lambdatest function to simulate interacting with the website and calculate the total price based on the generated quantity.

We assert that the total_price returned by interact_with_lambdatest matches the expected value calculated as quantity * 146.00.

Test Execution

Let’s now run the test on the Playwright Cloud Grid using pytest -v -k “ ”

The LambdaTest Web Automation Dashboard shows successfully passed tests.

How to Perform Hypothesis Testing in Python With Date Strategy?

In the previous test scenario, we saw a simple example that used the integers() strategy from Hypothesis. Let’s now look at another strategy, dates(), which is well suited to testing date-based functions.

The output of a Hypothesis run can also be customized to produce detailed results. Often, we may wish to see more verbose output when executing a Hypothesis test.

To do so, we have two options: use the @settings decorator, or pass --hypothesis-verbosity=<verbosity_level> on the pytest command line.

from hypothesis import Verbosity, settings, given, strategies as st
from datetime import datetime, timedelta

def generate_expiry_alert(expiry_date):
    current_date =   # right-hand side missing in the source
    days_until_expiry = (expiry_date - current_date).days
    return days_until_expiry <= 45

@given(expiry_date=st.dates())
@settings(verbosity=Verbosity.verbose, max_examples=1000)
def test_expiry_alert_generation(expiry_date):
    alert_generated = generate_expiry_alert(expiry_date)
    # Check if the alert is generated correctly based on the expiry date
    days_until_expiry = (expiry_date -   # remainder of this expression missing in the source
    expected_alert = days_until_expiry <= 45
    assert alert_generated == expected_alert

Let’s now understand the code step-by-step.

Step 1: The function generate_expiry_alert() takes an expiry_date as input and returns a boolean indicating whether the number of days from the current date to expiry_date is less than or equal to 45.

Step 2: To ensure we test generate_expiry_alert() for a wide range of date inputs, we use the dates() strategy.

We also enable verbose logging and set max_examples=1000, which asks Hypothesis to generate at most 1000 date inputs.

Step 3: On each input generated by Hypothesis in Step 2, we call the generate_expiry_alert() function and store the returned boolean in alert_generated.

We then compare the value returned by generate_expiry_alert() with a locally calculated copy and assert that they match.
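The steps above can be sketched in a runnable form. This is a variation under stated assumptions rather than the original listing: the reference date is passed in as a second argument, today (not part of the original function), so that the function and the test cannot disagree if the clock rolls over midnight between two calls. The expected value is also computed with a different expression than the function uses, so the check is not simply the same arithmetic twice.

```python
from datetime import date, timedelta

from hypothesis import given, settings, strategies as st

ALERT_WINDOW_DAYS = 45


def generate_expiry_alert(expiry_date, today):
    # `today` is an added parameter (an assumption of this sketch) so that
    # results are reproducible regardless of when the test runs.
    return (expiry_date - today).days <= ALERT_WINDOW_DAYS


# Cap `today` so that today + 45 days never overflows date.max.
reference_dates = st.dates(max_value=date.max - timedelta(days=ALERT_WINDOW_DAYS))


@settings(max_examples=200, database=None)
@given(expiry_date=st.dates(), today=reference_dates)
def test_expiry_alert_generation(expiry_date, today):
    alert = generate_expiry_alert(expiry_date, today)
    # Independent oracle: the alert fires exactly when the expiry date
    # falls on or before today + 45 days.
    expected = expiry_date <= today + timedelta(days=ALERT_WINDOW_DAYS)
    assert alert == expected


if __name__ == "__main__":
    test_expiry_alert_generation()
    print("property holds")
```

Passing the clock in as data is a common way to make date logic testable; a production function could default today to date.today().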


We execute the test using the command below in verbose mode, which allows us to see the test input dates generated by Hypothesis.

pytest -s --hypothesis-show-statistics --hypothesis-verbosity=debug -k "test_expiry_alert_generation"

As we can see, Hypothesis ran 1000 tests, 2 with reused data and 998 with unique newly generated data, and found no issues with the code.

Now, imagine the trouble we would have had to take to write 1000 tests manually using traditional example-based testing.

How to Perform Hypothesis Testing in Python With Composite Strategies?

So far, we’ve been using simple standalone examples to demo the power of Hypothesis. Let’s now move on to more complicated scenarios.

The website offers customer reward points. A class, UserRewards, tracks the customer reward points and their spending.

The implementation of the UserRewards class is stored in a file for better readability.

class UserRewards:
    def __init__(self, initial_points):
        self.reward_points = initial_points

    def get_reward_points(self):
        return self.reward_points

    def spend_reward_points(self, spent_points):
        if spent_points <= self.reward_points:
            self.reward_points -= spent_points
            return True
        else:
            return False

The tests for the UserRewards class are stored in a separate test file.

from hypothesis import given, strategies as st
from src.user_rewards import UserRewards

reward_points_strategy = st.integers(min_value=0, max_value=1000)

@given(initial_points=reward_points_strategy)
def test_get_reward_points(initial_points):
    user_rewards = UserRewards(initial_points)
    assert user_rewards.get_reward_points() == initial_points

@given(initial_points=reward_points_strategy, spend_amount=st.integers(min_value=0, max_value=1000))
def test_spend_reward_points(initial_points, spend_amount):
    user_rewards = UserRewards(initial_points)
    remaining_points = user_rewards.get_reward_points()
    if spend_amount <= initial_points:
        assert user_rewards.spend_reward_points(spend_amount)
        remaining_points -= spend_amount
    else:
        assert not user_rewards.spend_reward_points(spend_amount)
    assert user_rewards.get_reward_points() == remaining_points

Let’s now understand what is happening in both the class file and the test file step by step, starting with the UserRewards class.

Step 1: The class takes in a single argument initial_points to initialize the object.

Step 2: The get_reward_points() function returns the customer's current reward points.

Step 3: The spend_reward_points() method takes spent_points as input. If spent_points is less than or equal to the customer's current point balance, it subtracts spent_points from reward_points and returns True; otherwise, it returns False.


That is it for our simple UserRewards class. Next, let’s walk through the test file step by step.

Step 1: We import the @given decorator and strategies from Hypothesis and the UserRewards class.

Step 2: Since reward points will always be integers, we use the integers() Hypothesis strategy, bounded to values from 0 to 1000, and store it in a reward_points_strategy variable.

Step 3: Using reward_points_strategy as the input, we run test_get_reward_points() on integers that Hypothesis draws from the range 0 to 1000.

For each input, we initialize the UserRewards class and assert that the method get_reward_points() returns the same value as the initial_points .

Step 4: To test the spend_reward_points() function, we generate two inputs: initial_points, using the reward_points_strategy we defined in Step 2, and spend_amount, which simulates the spending of points.

Step 5: We write test_spend_reward_points, which takes initial_points and spend_amount as arguments and initializes the UserRewards class with initial_points.

We also initialize a remaining_points variable to track the points remaining after the spend.


Step 6: If spend_amount is less than or equal to the initial_points allocated to the customer, we assert that spend_reward_points returns True and update remaining_points; otherwise, we assert that spend_reward_points returns False.

Step 7: Lastly, we assert that get_reward_points returns the expected remaining_points, which should reflect any points spent.


Let’s now run the test and see if Hypothesis is able to find any bugs in the code.

pytest -s --hypothesis-show-statistics --hypothesis-verbosity=debug -k "test_user_rewards"


To test whether Hypothesis indeed works, let’s make a small change to UserRewards by commenting out the logic that deducts spent_points in the spend_reward_points() function.

We run the test suite again using the command pytest -s --hypothesis-show-statistics -k "test_user_rewards".

This time, Hypothesis highlights the failures correctly.

Thus, we can catch bugs and potential side effects of code changes early, which makes Hypothesis well suited to unit testing and regression testing.

To understand composite strategies a bit more, let’s now test the shopping cart functionality and see how composite strategy can help write robust tests for even the most complicated of real-world scenarios.

The ShoppingCart class handles the shopping cart feature of the website.

Let’s view the implementation of the ShoppingCart class written in the file.

import random
from enum import Enum, auto

# Note: the dictionary key expression is missing from the source wherever
# an empty `[]` or a bare `if in` appears below.

class Item(Enum):
    """Item type"""
    LUNIX_CAMERA = auto()
    IMAC = auto()
    HTC_TOUCH = auto()
    CANNON_EOS = auto()
    IPOD_TOUCH = auto()
    APPLE_VISION_PRO = auto()
    COFMACBOOKFEE = auto()
    GALAXY_S24 = auto()

    def __str__(self):
        return   # returned expression missing in the source

class ShoppingCart:
    def __init__(self):
        self.items = {}

    def add_item(self, item: Item, price: int | float, quantity: int = 1) -> None:
        if in self.items:
            self.items[]["quantity"] += quantity
        else:
            self.items[] = {"price": price, "quantity": quantity}

    def remove_item(self, item: Item, quantity: int = 1) -> None:
        if in self.items:
            if self.items[]["quantity"] <= quantity:
                del self.items[]
            else:
                self.items[]["quantity"] -= quantity

    def get_total_price(self):
        total_price = 0
        for item in self.items.values():
            total_price += item["price"] * item["quantity"]
        return total_price

Let’s now view the tests written to verify the correct behavior of all aspects of the ShoppingCart class stored in a separate file.

from typing import Callable
from hypothesis import given, strategies as st
from hypothesis.strategies import SearchStrategy
from src.shopping_cart import ShoppingCart, Item

# Note: the dictionary key expression is missing from the source wherever
# an empty `[]` or a bare `assert in` appears below.

@st.composite
def items_strategy(draw: Callable[[SearchStrategy[Item]], Item]):
    return draw(st.sampled_from(list(Item)))

@st.composite
def price_strategy(draw: Callable[[SearchStrategy[int]], int]):
    return draw(st.integers(min_value=1, max_value=100))

@st.composite
def qty_strategy(draw: Callable[[SearchStrategy[int]], int]):
    return draw(st.integers(min_value=1, max_value=10))

@given(items_strategy(), price_strategy(), qty_strategy())
def test_add_item_hypothesis(item, price, quantity):
    cart = ShoppingCart()
    # Add items to cart
    cart.add_item(item=item, price=price, quantity=quantity)
    # Assert that the quantity of items in the cart is equal to the number of items added
    assert in cart.items
    assert cart.items[]["quantity"] == quantity

@given(items_strategy(), price_strategy(), qty_strategy())
def test_remove_item_hypothesis(item, price, quantity):
    cart = ShoppingCart()
    print("Adding Items")
    # Add items to cart
    cart.add_item(item=item, price=price, quantity=quantity)
    cart.add_item(item=item, price=price, quantity=quantity)
    print(cart.items)
    # Remove item from cart
    print(f"Removing Item {item}")
    quantity_before = cart.items[]["quantity"]
    cart.remove_item(item=item)
    quantity_after = cart.items[]["quantity"]
    # Assert that if we remove an item, the quantity of items in the cart is equal to the number of items added - 1
    assert quantity_before == quantity_after + 1

@given(items_strategy(), price_strategy(), qty_strategy())
def test_calculate_total_hypothesis(item, price, quantity):
    cart = ShoppingCart()
    # Add items to cart
    cart.add_item(item=item, price=price, quantity=quantity)
    cart.add_item(item=item, price=price, quantity=quantity)
    # Remove item from cart
    cart.remove_item(item=item)
    # Calculate total
    total = cart.get_total_price()
    assert total == cart.items[]["price"] * cart.items[]["quantity"]

Code Walkthrough of ShoppingCart class:

Let’s now understand what is happening in the ShoppingCart class step-by-step.

Step 1: We import the Python built-in Enum class and the auto() method.

The auto() function from the enum module automatically assigns sequential integer values to enumeration members, simplifying the process of defining enumerations with incremental values.


We define an Item enum corresponding to items available for sale on the LambdaTest eCommerce Playground website.

Step 2: We initialize the ShoppingCart class with an empty dictionary of items.

Step 3: The add_item() method takes in the item, price, and quantity as input and adds it to the shopping cart state held in the item dictionary.


Step 4: The remove_item() method takes in an item and quantity and removes it from the shopping cart state indicated by the item dictionary.

Step 5: The get_total_price() method iterates over the item dictionary, multiplies each quantity by its price, and returns the total_price of items in the cart.


Code Walkthrough of test_shopping_cart:

Let’s now understand step-by-step the tests written to ensure the correct working of the ShoppingCart class.

Step 1: First, we set up the imports, including the @given decorator, strategies, and the ShoppingCart class and Item enum.

SearchStrategy is the type that all Hypothesis strategies share. A strategy represents a set of rules for generating valid inputs to test a specific property or behavior of a function or program. Here, SearchStrategy is used only in the type annotations of the draw argument.

Step 2: We use the @st.composite decorator to define a custom Hypothesis strategy named items_strategy. This strategy takes a single argument, draw, which is a callable used to draw values from other strategies.

The st.sampled_from strategy randomly samples values from a given iterable. Within the strategy, we use draw(st.sampled_from(list(Item))) to draw a random Item instance from a list of all enum members.

Each time the items_strategy is used in a Hypothesis test, it will generate a random instance of the Item enum for testing purposes.
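A composite strategy becomes more useful when one drawn value depends on another. The following minimal sketch assumes a hypothetical cart_line strategy (not part of the original test file) and a shortened Item enum; it draws an item, a price, and a quantity, and then a removal count that can never exceed the quantity.

```python
from enum import Enum, auto

from hypothesis import given, settings, strategies as st


class Item(Enum):
    # Shortened enum for illustration only.
    IMAC = auto()
    HTC_TOUCH = auto()
    IPOD_TOUCH = auto()


@st.composite
def cart_line(draw):
    # Three independent draws, then one that depends on an earlier draw.
    item = draw(st.sampled_from(list(Item)))
    price = draw(st.integers(min_value=1, max_value=100))
    quantity = draw(st.integers(min_value=1, max_value=10))
    to_remove = draw(st.integers(min_value=0, max_value=quantity))
    return item, price, quantity, to_remove


@settings(database=None)
@given(cart_line())
def test_cart_line_is_consistent(line):
    item, price, quantity, to_remove = line
    assert isinstance(item, Item)
    assert 1 <= price <= 100
    assert 0 <= to_remove <= quantity <= 10


if __name__ == "__main__":
    test_cart_line_is_consistent()
    print("composite strategy produces consistent lines")
```

Expressing the dependency inside the strategy keeps invalid combinations from ever reaching the test body, so no filtering is needed there.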


Step 3: The price_strategy follows the same logic as the items_strategy but generates an integer value between 1 and 100.

Step 4: The qty_strategy follows the same logic as the items_strategy but generates an integer value between 1 and 10.


Step 5: We use the @given decorator from the Hypothesis library to define a property-based test.

The items_strategy() , price_strategy() , and qty_strategy() functions are used to generate random values for the item, price, and quantity parameters, respectively.

Inside the test function, we create a new instance of a ShoppingCart .

We then add an item to the cart using the generated values for item, price, and quantity.

Finally, we assert that the item was successfully added to the cart and that the quantity matches the generated quantity.

Step 6: We use the @given decorator from the Hypothesis library to define a property-based test.

The items_strategy(), price_strategy() , and qty_strategy() functions are used to generate random values for the item, price, and quantity parameters, respectively.

Inside the test function, we create a new instance of a ShoppingCart . We then add the same item to the cart twice to simulate two quantity additions to the cart.

We remove one instance of the item from the cart. After that, we compare the item quantity before and after removal to ensure it decreases by 1.

The test verifies the behavior of the remove_item() method of the ShoppingCart class by testing it with randomly generated inputs for item, price , and quantity.


Step 7: We use the @given decorator from the Hypothesis library to define a property-based test.

The items_strategy(), price_strategy(), and qty_strategy() functions are used to generate random values for the item, price, and quantity parameters, respectively.

We add the same item to the cart twice to ensure it’s present, then remove one instance of the item from the cart. After that, we calculate the total price of items remaining in the cart.

Finally, we assert that the total price matches the price of one item times its remaining quantity.

The test verifies the correctness of the get_total_price() method of the ShoppingCart class by testing it with randomly generated inputs for item, price , and quantity .

Let’s now run the test using the command pytest --hypothesis-show-statistics -k "test_shopping_cart".

We can verify that Hypothesis found no issues with the ShoppingCart class.

Let’s now amend the price_strategy and qty_strategy to remove the min_value and max_value arguments.

And rerun the test with pytest -k "test_shopping_cart".

The test run clearly reveals that we have bugs in handling scenarios where quantity and price are passed as 0.
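The quantity case can be reproduced deterministically without Hypothesis. The sketch below uses MiniCart, a hypothetical trimmed-down cart keyed by item name (the key expression is not visible in the original listing, so this is an assumption), and a helper that repeats the add-twice-then-remove-one sequence from the removal test.

```python
class MiniCart:
    # Hypothetical, trimmed-down cart keyed by item name (an assumption).
    def __init__(self):
        self.items = {}

    def add_item(self, name, price, quantity=1):
        if name in self.items:
            self.items[name]["quantity"] += quantity
        else:
            self.items[name] = {"price": price, "quantity": quantity}

    def remove_item(self, name, quantity=1):
        if name in self.items:
            if self.items[name]["quantity"] <= quantity:
                del self.items[name]
            else:
                self.items[name]["quantity"] -= quantity


def remove_one_after_adding_twice(quantity):
    cart = MiniCart()
    cart.add_item("IMAC", price=10, quantity=quantity)
    cart.add_item("IMAC", price=10, quantity=quantity)
    before = cart.items["IMAC"]["quantity"]
    cart.remove_item("IMAC")
    after = cart.items["IMAC"]["quantity"]  # KeyError if the entry was deleted
    return before, after


if __name__ == "__main__":
    print(remove_one_after_adding_twice(3))
    try:
        remove_one_after_adding_twice(0)
    except KeyError:
        print("quantity 0: the item was deleted, so the lookup fails")
```

With a quantity of 0, the cart holds 0 units after both additions, remove_item() deletes the entry because 0 <= 1, and the follow-up lookup fails. The bounded strategies never generate this input, which is why the earlier run passed.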

This also shows that setting the test inputs correctly, so that testing is comprehensive, is key to writing robust and resilient tests.

Choosing min_value and max_value should only be done when we know beforehand the bounds of the inputs the function under test will receive. If we are unsure what the inputs are, it is important to derive the right strategies from the behavior of the function under test.

In this blog, we have seen in detail how Hypothesis testing in Python works using the popular Hypothesis library. Hypothesis testing falls under property-based testing and is often better than traditional example-based testing at surfacing edge cases.

We also explored Hypothesis strategies and how we can use the @composite decorator to write custom strategies for testing complex functionalities.

We also saw how Hypothesis testing in Python can be performed with popular test automation frameworks like Selenium and Playwright. In addition, by performing Hypothesis testing in Python with LambdaTest on Cloud Grid, we can set up effective automation tests to enhance our confidence in the code we’ve written.

What are the three types of Hypothesis tests?

There are three main types of hypothesis tests based on the direction of the alternative hypothesis:

  • Right-tailed test: This tests if a parameter is greater than a certain value.
  • Left-tailed test: This tests if a parameter is less than a certain value.
  • Two-tailed test: This tests for any non-directional difference, either greater or lesser than the hypothesized value.
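In SciPy, the direction of the test is selected with the alternative argument of the t-test functions (available in SciPy 1.6 and later). The sketch below reuses the turtle weights and the hypothesized mean of 310 from the one sample example at the start of this tutorial; the variable names are illustrative.

```python
from scipy import stats

weights = [300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303]
mu_0 = 310  # hypothesized population mean

# Same data and test statistic, three different alternative hypotheses.
two_tailed = stats.ttest_1samp(weights, popmean=mu_0, alternative="two-sided")
right_tailed = stats.ttest_1samp(weights, popmean=mu_0, alternative="greater")
left_tailed = stats.ttest_1samp(weights, popmean=mu_0, alternative="less")

print(f"t statistic:          {two_tailed.statistic:.4f}")
print(f"two-tailed p-value:   {two_tailed.pvalue:.4f}")
print(f"right-tailed p-value: {right_tailed.pvalue:.4f}")
print(f"left-tailed p-value:  {left_tailed.pvalue:.4f}")
```

The two one-tailed p-values always sum to 1, and the two-tailed p-value is twice the smaller of them, so the choice of direction has to be made before looking at the data.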

What is Hypothesis testing in the ML model?

Hypothesis testing is a statistical approach used to evaluate the performance and validity of machine learning models. It helps us determine if a pattern observed in the training data likely holds true for unseen data (generalizability).

Mastering Hypothesis Testing in Python: A Step-by-Step Guide

Hypothesis Testing in Python

Hypothesis testing is a statistical technique that allows us to draw conclusions about a population based on a sample of data. It is often used in fields like medicine, psychology, and economics to test the effectiveness of new treatments, analyze consumer behavior, or estimate the impact of policy changes.

In Python, hypothesis testing is facilitated by modules such as scipy.stats and statsmodels.stats . In this article, we’ll explore three examples of hypothesis testing in Python: the one sample t-test, the two sample t-test, and the paired samples t-test.

For each test, we’ll provide a brief explanation of the underlying concepts, an example of a research question that can be answered using the test, and a step-by-step guide to performing the test in Python. Let’s get started!

One Sample t-test

The one sample t-test is used to compare a sample mean to a known or hypothesized population mean. This allows us to determine whether the sample mean is significantly different from the population mean.

The test assumes that the data are normally distributed and that the sample is randomly drawn from the population.

Example research question: Is the mean weight of a species of turtle significantly different from a known or hypothesized value?

Step-by-step guide:

  • Define the null hypothesis (H0) and alternative hypothesis (Ha).

The null hypothesis is typically that the population mean is equal to the hypothesized value. The alternative hypothesis is that they are not equal.

For example:

H0: The mean weight of a species of turtle is 100 grams.
Ha: The mean weight of a species of turtle is not 100 grams.

  • Collect a random sample of data.

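In practice, the sample comes from real measurements, usually imported from a file. For a demonstration, Python’s random module can be used to draw a random subset from a larger dataset. For example: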

weight_sample = [95, 105, 110, 98, 102, 116, 101, 99, 104, 108]

  • Calculate the sample mean (x), sample standard deviation (s), and standard error (SE).

import numpy as np

x = sum(weight_sample)/len(weight_sample)
s = np.std(weight_sample, ddof=1)  # ddof=1 gives the sample standard deviation
SE = s / (len(weight_sample)**0.5)

  • Calculate the t-value using the formula: t = (x - μ) / (SE) , where μ is the hypothesized population mean.

t = (x - 100) / SE

  • Calculate the p-value using a t-distribution table or a Python function like scipy.stats.ttest_1samp() .

import scipy.stats

p_value = scipy.stats.ttest_1samp(weight_sample, 100).pvalue

  • Compare the p-value to the level of significance (α), typically set to 0.05.

If the p-value is less than α, reject the null hypothesis and conclude that there is sufficient evidence to support the alternative hypothesis.

If the p-value is greater than or equal to α, fail to reject the null hypothesis and conclude that there is insufficient evidence to support the alternative hypothesis.
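Putting the steps together, the sketch below computes the t-value and p-value by hand for the sample above and checks them against scipy.stats.ttest_1samp(). Names such as mu_0, t_manual, and p_manual are illustrative. The manual calculation uses the sample standard deviation (ddof=1), which is what SciPy uses internally.

```python
import numpy as np
from scipy import stats

weight_sample = [95, 105, 110, 98, 102, 116, 101, 99, 104, 108]
mu_0 = 100     # hypothesized population mean from H0
alpha = 0.05   # level of significance

n = len(weight_sample)
x_bar = np.mean(weight_sample)
s = np.std(weight_sample, ddof=1)   # sample standard deviation (n - 1 denominator)
se = s / np.sqrt(n)                 # standard error of the mean
t_manual = (x_bar - mu_0) / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library call should agree with the manual calculation.
result = stats.ttest_1samp(weight_sample, popmean=mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.4f}")

if result.pvalue < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

If np.std() is called with its default ddof=0, the manual t-value will not match SciPy, because the default divides by n rather than n - 1.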

Two Sample t-test

The two sample t-test is used to compare the means of two independent samples. This allows us to determine whether the means are significantly different from each other.

The test assumes that the data are normally distributed and that the samples are randomly drawn from their respective populations.

Example research question: Is the mean weight of two different species of turtles significantly different from each other?

The null hypothesis is typically that the two population means are equal. The alternative hypothesis is that they are not equal.

H0: The mean weight of species A is equal to the mean weight of species B.
Ha: The mean weight of species A is not equal to the mean weight of species B.

  • Collect two random samples of data.

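In practice, the samples come from real measurements, usually imported from a file. For a demonstration, Python's random module can be used to draw random subsets from larger datasets. For example:

species_a = [95, 105, 110, 98, 102]
species_b = [116, 101, 99, 104, 108]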

  • Calculate the sample means (x1, x2), sample standard deviations (s1, s2), and pooled standard error (SE).

import numpy as np

x1 = sum(species_a)/len(species_a)
x2 = sum(species_b)/len(species_b)
s1 = np.std(species_a, ddof=1)  # ddof=1 gives the sample standard deviation
s2 = np.std(species_b, ddof=1)
n1 = len(species_a)
n2 = len(species_b)
SE = (((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2))**0.5 * (1/n1 + 1/n2)**0.5

  • Calculate the t-value using the formula: t = (x1 - x2) / (SE) , where x1 and x2 are the sample means.

t = (x1 - x2) / SE

  • Calculate the p-value using a t-distribution table or a Python function like scipy.stats.ttest_ind() .

p_value = scipy.stats.ttest_ind(species_a, species_b).pvalue
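As a check on the steps above, the following sketch computes the pooled t-value by hand and compares it with scipy.stats.ttest_ind(). It also shows Welch's variant (equal_var=False), which does not assume equal population variances. Variable names such as pooled_var and t_manual are illustrative.

```python
import numpy as np
from scipy import stats

species_a = [95, 105, 110, 98, 102]
species_b = [116, 101, 99, 104, 108]

n1, n2 = len(species_a), len(species_b)
x1, x2 = np.mean(species_a), np.mean(species_b)
s1 = np.std(species_a, ddof=1)  # sample standard deviations
s2 = np.std(species_b, ddof=1)

# Pooled variance and pooled standard error of the difference in means.
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_manual = (x1 - x2) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# ttest_ind pools the variances by default (equal_var=True).
pooled = stats.ttest_ind(species_a, species_b)
# Welch's t-test drops the equal-variance assumption.
welch = stats.ttest_ind(species_a, species_b, equal_var=False)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"pooled: t = {pooled.statistic:.4f}, p = {pooled.pvalue:.4f}")
print(f"welch:  t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```

When the two samples have the same size, as here, the pooled and Welch t-values coincide and only the degrees of freedom (and therefore the p-value) can differ.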

Paired Samples t-test

The paired samples t-test is used to compare the means of two related samples. This allows us to determine whether the means are significantly different from each other, while accounting for individual differences between the samples.

The test assumes that the differences between paired observations are normally distributed.

Example research question: Is there a significant difference in the max vertical jump of basketball players before and after a training program?

The null hypothesis is typically that the mean difference is equal to zero. The alternative hypothesis is that it is not equal to zero.

H0: The mean difference in max vertical jump before and after training is zero.
Ha: The mean difference in max vertical jump before and after training is not zero.

  • Collect two related samples of data.

This can be done by measuring the same variable in the same subjects before and after a treatment or intervention. For example:

before = [72, 69, 77, 71, 76]
after = [80, 70, 75, 74, 78]

  • Calculate the differences between the paired observations and the sample mean difference (d), sample standard deviation (s), and standard error (SE).

import numpy as np

differences = [after[i]-before[i] for i in range(len(before))]
d = sum(differences)/len(differences)
s = np.std(differences, ddof=1)  # ddof=1 gives the sample standard deviation
SE = s / (len(differences)**0.5)

  • Calculate the t-value using the formula: t = (d - μ) / (SE) , where μ is the hypothesized population mean difference (usually zero).

t = (d - 0) / SE

  • Calculate the p-value using a t-distribution table or a Python function like scipy.stats.ttest_rel() .

p_value = scipy.stats.ttest_rel(after, before).pvalue
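A paired samples t-test is equivalent to a one sample t-test on the pairwise differences against a mean of zero. The short sketch below illustrates this with the data above; the variable names paired and one_sample are illustrative.

```python
import numpy as np
from scipy import stats

before = [72, 69, 77, 71, 76]
after = [80, 70, 75, 74, 78]

# Pairwise differences (after - before) for each player.
differences = np.array(after) - np.array(before)

# Paired test on the two related samples.
paired = stats.ttest_rel(after, before)
# One sample test on the differences against a mean of zero.
one_sample = stats.ttest_1samp(differences, popmean=0)

print(f"mean difference: {differences.mean():.2f}")
print(f"ttest_rel:   t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"ttest_1samp: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```

This equivalence is why the normality assumption applies to the differences rather than to the two samples individually.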

Two Sample t-test in Python

The two sample t-test is used to compare two independent samples and determine if there is a significant difference between the means of the two populations. In this test, the null hypothesis is that the two population means are equal, while the alternative hypothesis is that they are not equal.

Example research question: Is the mean weight of two different species of turtles significantly different from each other?

Step-by-step guide:

  • Define the null hypothesis (H0) and alternative hypothesis (Ha). The null hypothesis is that the mean weight of the two turtle species is the same.

The alternative hypothesis is that they are not equal. For example:

H0: The mean weight of species A is equal to the mean weight of species B.

Ha: The mean weight of species A is not equal to the mean weight of species B.

  • Collect a random sample of data for each species. For example:

species_a = [4.3, 3.9, 5.1, 4.6, 4.2, 4.8]
species_b = [4.9, 5.2, 5.5, 5.3, 5.0, 4.7]

  • Calculate the sample mean (x1, x2), sample standard deviation (s1, s2), and pooled standard error (SE).

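import numpy as np
from scipy.stats import ttest_ind

x1 = np.mean(species_a)
x2 = np.mean(species_b)
s1 = np.std(species_a, ddof=1)  # ddof=1 gives the sample standard deviation
s2 = np.std(species_b, ddof=1)
n1 = len(species_a)
n2 = len(species_b)
# Pooled standard error, matching the default equal_var=True of ttest_ind()
SE = np.sqrt(((n1-1)*s1**2 + (n2-1)*s2**2)/(n1+n2-2)) * np.sqrt(1/n1 + 1/n2)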

  • Calculate the p-value using a t-distribution table or a Python function like ttest_ind() .

p_value = ttest_ind(species_a, species_b).pvalue

If the p-value is less than α, reject the null hypothesis and conclude that there is sufficient evidence to support the alternative hypothesis. If the p-value is greater than α, fail to reject the null hypothesis and conclude that there is insufficient evidence to support the alternative hypothesis.

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

In this example, if the p-value is less than 0.05, we would reject the null hypothesis and conclude that there is a significant difference between the mean weight of the two turtle species.

Paired Samples t-test in Python

The paired samples t-test is used to compare the means of two related samples. In this test, the null hypothesis is that the mean of the paired differences is equal to zero, while the alternative hypothesis is that it is not zero.

Example research question: Is there a significant difference in the max vertical jump of basketball players before and after a training program? Step-by-step guide:

  • Define the null hypothesis (H0) and alternative hypothesis (Ha). For example:

H0: The mean difference in max vertical jump before and after the training program is zero.

Ha: The mean difference in max vertical jump before and after the training program is not zero.

  • Collect two related samples of data, such as the max vertical jump of basketball players before and after a training program. For example:

before_training = [58, 64, 62, 70, 68]
after_training = [62, 66, 64, 74, 70]

import numpy as np

differences = [after - before for before, after in zip(before_training, after_training)]
d = np.mean(differences)
s = np.std(differences, ddof=1)  # sample standard deviation (np.std defaults to ddof=0)
n = len(differences)
SE = s / np.sqrt(n)

  • Calculate the p-value using a t-distribution table or a Python function like ttest_rel() .

from scipy.stats import ttest_rel

p_value = ttest_rel(after_training, before_training).pvalue

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

In this example, if the p-value is less than 0.05, we would reject the null hypothesis and conclude that there is a significant difference in the max vertical jump of basketball players before and after the training program.
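The paired test reduces to a one sample test on the differences, which makes it easy to verify by hand. The sketch below uses only the standard library and the sample data from this example. The variable names are illustrative, and it stops at the t-statistic rather than the p-value.

```python
import math
import statistics

before_training = [58, 64, 62, 70, 68]
after_training = [62, 66, 64, 74, 70]

# One difference per player: after minus before
differences = [after - before for before, after in zip(before_training, after_training)]

mean_difference = statistics.mean(differences)
# statistics.stdev() is the sample standard deviation (divides by n - 1)
standard_error = statistics.stdev(differences) / math.sqrt(len(differences))

t_stat = mean_difference / standard_error
dof = len(differences) - 1
print(f"mean difference = {mean_difference}, t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The resulting t-statistic is compared against a t-distribution with n - 1 degrees of freedom, where n is the number of pairs.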

Hypothesis testing is an essential tool in statistical analysis, which gives us insights into populations based on limited data. The two sample t-test and paired samples t-test are two popular statistical methods that enable researchers to compare means of samples and determine whether they are significantly different.

With the help of Python, hypothesis testing in practice is made more accessible and convenient than ever before. In this article, we have provided a step-by-step guide to performing these tests in Python, enabling researchers to perform rigorous analyses that generate meaningful and accurate results.

In conclusion, hypothesis testing in Python is a crucial step in drawing conclusions about populations from data samples. The three common hypothesis tests covered here (the one-sample t-test, two-sample t-test, and paired samples t-test) can be applied to a wide range of research questions.

By setting null and alternative hypotheses, collecting data, calculating means and standard deviations, computing the t-statistic and its p-value, and comparing the p-value with the chosen significance level α, we can determine whether there is enough evidence to reject the null hypothesis. With these methods, scientists can draw more accurate and better-informed conclusions about real-world problems and make critical decisions when needed.

Continued practice with Python's hypothesis testing tools helps researchers apply these methods to better effect.

Pytest With Eric

How to Use Hypothesis and Pytest for Robust Property-Based Testing in Python

There will always be cases you didn’t consider, making this an ongoing maintenance job. Unit testing solves only some of these issues.

Example-Based Testing vs Property-Based Testing

Project set up, getting started, prerequisites.

Simple Example

Source code.


def find_largest_smallest_item(input_array: list) -> tuple:
    """
    Function to find the largest and smallest items in an array
    :param input_array: Input array
    :return: Tuple of largest and smallest items
    """
    if len(input_array) == 0:
        raise ValueError
    # Set the initial values of largest and smallest to the first item in the array
    largest = input_array[0]
    smallest = input_array[0]

    # Iterate through the array
    for i in range(1, len(input_array)):
        # If the current item is larger than the current value of largest, update largest
        if input_array[i] > largest:
            largest = input_array[i]
        # If the current item is smaller than the current value of smallest, update smallest
        if input_array[i] < smallest:
            smallest = input_array[i]

    return largest, smallest

def sort_array(input_array: list, sort_key: str) -> list:
    """
    Function to sort an array
    :param sort_key: Sort key
    :param input_array: Input array
    :return: Sorted array
    """
    if len(input_array) == 0:
        raise ValueError
    if sort_key not in input_array[0]:
        raise KeyError
    if not isinstance(input_array[0][sort_key], int):
        raise TypeError
    sorted_data = sorted(input_array, key=lambda x: x[sort_key], reverse=True)
    return sorted_data


def reverse_string(input_string) -> str:
    """
    Function to reverse a string
    :param input_string: Input string
    :return: Reversed string
    """
    return input_string[::-1]

def complex_string_operation(input_string: str) -> str:
    """
    Function to perform a complex string operation
    :param input_string: Input string
    :return: Transformed string
    """
    # Remove Whitespace
    input_string = input_string.strip().replace(" ", "")

    # Convert to Upper Case
    input_string = input_string.upper()

    # Remove vowels
    vowels = ("A", "E", "I", "O", "U")
    for x in input_string.upper():
        if x in vowels:
            input_string = input_string.replace(x, "")

    return input_string

Simple Example — Unit Tests

Example-based testing.

import logging
from src.random_operations import (
    find_largest_smallest_item,
    sort_array,
    reverse_string,
    complex_string_operation,
)

logger = logging.getLogger(__name__)

# Example Based Unit Testing
def test_find_largest_smallest_item():
    assert find_largest_smallest_item([1, 2, 3]) == (3, 1)

def test_reverse_string():
    assert reverse_string("hello") == "olleh"

def test_sort_array():
    data = [
        {"name": "Alice", "age": 25},
        {"name": "Bob", "age": 30},
        {"name": "Charlie", "age": 20},
        {"name": "David", "age": 35},
    ]
    assert sort_array(data, "age") == [
        {"name": "David", "age": 35},
        {"name": "Bob", "age": 30},
        {"name": "Alice", "age": 25},
        {"name": "Charlie", "age": 20},
    ]

def test_complex_string_operation():
    assert complex_string_operation(" Hello World ") == "HLLWRLD"

Running The Unit Test

Property-based testing.

import pytest
from hypothesis import given, strategies as st
from hypothesis import assume as hypothesis_assume

# Property Based Unit Testing
@given(st.lists(st.integers(), min_size=1, max_size=25))
def test_find_largest_smallest_item_hypothesis(input_list):
    assert find_largest_smallest_item(input_list) == (max(input_list), min(input_list))

@given(
    st.lists(
        st.fixed_dictionaries({"name": st.text(), "age": st.integers()}),
    )
)
def test_sort_array_hypothesis(input_list):
    # An empty list must raise ValueError
    if len(input_list) == 0:
        with pytest.raises(ValueError):
            sort_array(input_list, "age")

    # Only non-empty lists are checked against the sorting property
    hypothesis_assume(len(input_list) > 0)
    assert sort_array(input_list, "age") == sorted(
        input_list, key=lambda x: x["age"], reverse=True
    )

@given(st.text())
def test_reverse_string_hypothesis(input_string):
    assert reverse_string(input_string) == input_string[::-1]

@given(st.text())
def test_complex_string_operation_hypothesis(input_string):
    assert complex_string_operation(input_string) == input_string.strip().replace(
        " ", ""
    ).upper().replace("A", "").replace("E", "").replace("I", "").replace(
        "O", ""
    ).replace(
        "U", ""
    )

Complex Example

Source code.

from enum import Enum, auto

class Item(Enum):
    """Item type"""

    APPLE = auto()
    ORANGE = auto()
    BANANA = auto()
    CHOCOLATE = auto()
    CANDY = auto()
    GUM = auto()
    COFFEE = auto()
    TEA = auto()
    SODA = auto()
    WATER = auto()

    def __str__(self):
        # Assumption: render the member name, e.g. "APPLE"
        return self.name

class ShoppingCart:
    def __init__(self):
        """
        Creates a shopping cart object with an empty dictionary of items
        """
        self.items = {}

    def add_item(self, item: Item, price: int | float, quantity: int = 1) -> None:
        """
        Adds an item to the shopping cart
        :param quantity: Quantity of the item
        :param item: Item to add
        :param price: Price of the item
        :return: None
        """
        # Items are keyed by name, e.g. "APPLE", as the unit tests expect
        if item.name in self.items:
            self.items[item.name]["quantity"] += quantity
        else:
            self.items[item.name] = {"price": price, "quantity": quantity}

    def remove_item(self, item: Item, quantity: int = 1) -> None:
        """
        Removes an item from the shopping cart
        :param quantity: Quantity of the item
        :param item: Item to remove
        :return: None
        """
        if item.name in self.items:
            # Drop the entry entirely once no units would remain
            if self.items[item.name]["quantity"] <= quantity:
                del self.items[item.name]
            else:
                self.items[item.name]["quantity"] -= quantity

    def get_total_price(self):
        total = 0
        for item in self.items.values():
            total += item["price"] * item["quantity"]
        return total

    def view_cart(self) -> None:
        """
        Prints the contents of the shopping cart
        :return: None
        """
        print("Shopping Cart:")
        for item, price in self.items.items():
            print("- {}: ${}".format(item, price))

    def clear_cart(self) -> None:
        """
        Clears the shopping cart
        :return: None
        """
        self.items = {}

Complex Example — Unit Tests

import pytest
from src.shopping_cart import ShoppingCart, Item

@pytest.fixture
def cart():
    return ShoppingCart()

# Define Items
apple = Item.APPLE
orange = Item.ORANGE
gum = Item.GUM
soda = Item.SODA
water = Item.WATER
coffee = Item.COFFEE
tea = Item.TEA

# Example Based Testing
def test_add_item(cart):
    cart.add_item(apple, 1.00)
    cart.add_item(orange, 1.00)
    cart.add_item(gum, 2.00)
    cart.add_item(soda, 2.50)
    assert cart.items == {
        "APPLE": {"price": 1.0, "quantity": 1},
        "ORANGE": {"price": 1.0, "quantity": 1},
        "GUM": {"price": 2.0, "quantity": 1},
        "SODA": {"price": 2.5, "quantity": 1},
    }

def test_remove_item(cart):
    cart.add_item(orange, 1.00)
    cart.add_item(tea, 3.00)
    cart.add_item(coffee, 3.00)
    # Removing the only orange leaves tea and coffee in the cart
    cart.remove_item(orange)
    assert cart.items == {
        "TEA": {"price": 3.0, "quantity": 1},
        "COFFEE": {"price": 3.0, "quantity": 1},
    }

def test_total(cart):
    cart.add_item(orange, 1.00)
    cart.add_item(apple, 2.00)
    cart.add_item(soda, 2.00)
    cart.add_item(soda, 2.00)
    cart.add_item(water, 1.00)
    cart.add_item(gum, 2.50)
    # Remove one of the two sodas so the total matches the assertion
    cart.remove_item(soda)
    assert cart.get_total_price() == 8.50

def test_clear_cart(cart):
    cart.add_item(apple, 1.00)
    cart.add_item(soda, 2.00)
    cart.add_item(water, 1.00)
    cart.clear_cart()
    assert cart.items == {}

from typing import Callable
from hypothesis import given, strategies as st
from hypothesis.strategies import SearchStrategy
from src.shopping_cart import ShoppingCart, Item

# Create a strategy for items
@st.composite
def items_strategy(draw: Callable[[SearchStrategy[Item]], Item]):
    return draw(st.sampled_from(list(Item)))

# Create a strategy for price
@st.composite
def price_strategy(draw: Callable[[SearchStrategy[float]], float]):
    return round(draw(st.floats(min_value=0.01, max_value=100, allow_nan=False)), 2)

# Create a strategy for quantity
@st.composite
def qty_strategy(draw: Callable[[SearchStrategy[int]], int]):
    return draw(st.integers(min_value=1, max_value=10))

@given(items_strategy(), price_strategy(), qty_strategy())
def test_add_item_hypothesis(item, price, quantity):
    cart = ShoppingCart()

    # Add items to cart
    cart.add_item(item=item, price=price, quantity=quantity)

    # Assert that the quantity of items in the cart is equal to the number of items added
    assert item.name in cart.items
    assert cart.items[item.name]["quantity"] == quantity

@given(items_strategy(), price_strategy(), qty_strategy())
def test_remove_item_hypothesis(item, price, quantity):
    cart = ShoppingCart()

    print("Adding Items")
    # Add items to cart
    cart.add_item(item=item, price=price, quantity=quantity)
    cart.add_item(item=item, price=price, quantity=quantity)

    # Remove item from cart
    print(f"Removing Item {item}")
    quantity_before = cart.items[item.name]["quantity"]
    cart.remove_item(item=item)
    quantity_after = cart.items[item.name]["quantity"]

    # Assert that if we remove an item, the quantity of items in the cart is equal to the number of items added - 1
    assert quantity_before == quantity_after + 1

@given(items_strategy(), price_strategy(), qty_strategy())
def test_calculate_total_hypothesis(item, price, quantity):
    cart = ShoppingCart()

    # Add items to cart
    cart.add_item(item=item, price=price, quantity=quantity)
    cart.add_item(item=item, price=price, quantity=quantity)

    # Remove item from cart
    cart.remove_item(item=item)

    # Calculate total
    total = cart.get_total_price()
    assert total == cart.items[item.name]["price"] * cart.items[item.name]["quantity"]

Discover Bugs With Hypothesis

Define your own hypothesis strategies.


# Create a strategy for items
@st.composite
def items_strategy(draw: Callable[[SearchStrategy[Item]], Item]):
    return draw(st.sampled_from(list(Item)))

# Create a strategy for price
@st.composite
def price_strategy(draw: Callable[[SearchStrategy[float]], float]):
    return round(draw(st.floats(min_value=0.01, max_value=100, allow_nan=False)), 2)

# Create a strategy for quantity
@st.composite
def qty_strategy(draw: Callable[[SearchStrategy[int]], int]):
    return draw(st.integers(min_value=1, max_value=10))

Model-Based Testing in Hypothesis

Additional reading.

Hypothesis testing: Testing a Sample Statistic


Converting P-Values

P-values are probabilities. Translating from a probability into a significant or not significant result involves setting a significance threshold between 0 and 1. P-values less than this threshold are considered significant and p-values higher than this threshold are considered not significant.

Significance Threshold

The significance threshold is used to convert a p-value into a yes/no or a true/false result. After running a hypothesis test and obtaining a p-value, we can interpret the outcome based on whether the p-value is higher or lower than the threshold. A p-value lower than the significance threshold is considered significant and would result in the rejection of the null hypothesis. A p-value higher than the significance threshold is considered not significant.

Hypothesis Testing Errors

When using significance thresholds with hypothesis testing, two kinds of errors may occur. A type I error, also known as a false positive, happens when we incorrectly find a significant result. A type II error, also known as a false negative, happens when we incorrectly find a non-significant result:

                          Null hypothesis is true    Null hypothesis is false
P-value significant       Type I error               Correct!
P-value not significant   Correct!                   Type II error

Type I Error Rate

A significance threshold is used to convert a p-value into a yes/no or a true/false result. This introduces the possibility of an error: that we conclude something is true based on our test when it is actually not true. A type I error occurs when we calculate a “significant” p-value when we shouldn’t have. It turns out that the significance threshold we use for a hypothesis test is equal to our probability of making a type I error.

Multiple Hypothesis Test Error Rate

When working with a single hypothesis test, the type I error rate is equal to the significance threshold and is therefore easy for a researcher to control. However, when running multiple hypothesis tests, the probability of at least one type I error increases beyond the significance threshold for each test. The probability of an error occurring when running multiple hypothesis tests is 1-(1-a)^n, where a is the significance threshold and n is the number of tests.
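To see how quickly this probability grows, the formula can be evaluated for a few values of n. This is a small sketch, and the function name familywise_error_rate is hypothetical. Like the formula itself, it assumes the tests are independent.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:>2} tests: {rate:.3f}")
```

With a threshold of 0.05, ten independent tests already carry roughly a 40% chance of at least one false positive.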

Binomial Hypothesis Tests

Binomial hypothesis tests compare the number of observed “successes” among a sample of “trials” to an expected population-level probability of success. They are used for a sample of one binary categorical variable. For example, if we want to test whether a coin is fair, we might flip it 100 times and count how many heads we get. Suppose we get 40 heads in 100 flips. Then the number of observed successes would be 40, the number of trials would be 100, and the expected population-level probability of success would be 0.5 (the probability of heads for a fair coin).

Null and Alternative Hypotheses

Hypothesis tests start with a null and alternative hypothesis; the null hypothesis describes no difference from the expected population value; the alternative describes a particular kind of difference from an expected population value (less than, greater than, or different from). For example if we wanted to perform a hypothesis test examining if there is a significant difference between the temperature on earth in 1990 as compared to the temperature in 2020, we could define the following null and alternative hypotheses:

  • Null: The average temperature on earth in 1990 was the same as the average temperature in 2020
  • Alternative: The average temperature on earth in 1990 was less than the average temperature in 2020

When running a hypothesis test, it is common to report a p-value as the main outcome for the test. A p-value is the probability of observing some range of sample statistics (described by the alternative hypothesis) if the null hypothesis is true. For example, in a binomial test to determine whether a coin is fair, the p-value is equal to the proportion of the null distribution that falls in the range described by the alternative hypothesis. The null and alternative hypotheses for this test are as follows:

  • Null: The probability of heads is 0.5
  • Alternative: The probability of heads is less than 0.5

Simulating Hypothesis Tests

A binomial hypothesis test can be simulated for the following null and alternative hypotheses:

  • Null: The probability that a visitor to a website makes a purchase is 0.10
  • Alternative: The probability that a visitor to a website makes a purchase is less than 0.10.

The p-value is calculated for an observed sample of 500 visitors where 41 of them made a purchase.
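One way to run such a simulation is sketched below, using only the standard library. The function name, the number of simulations, and the seed are illustrative assumptions. The idea is to generate many samples of 500 visitors under the null hypothesis and count how often they contain 41 or fewer purchases.

```python
import random


def simulate_p_value(observed, n_trials, null_prob, n_simulations=2000, seed=0):
    """Estimate P(successes <= observed) under the null by simulation."""
    rng = random.Random(seed)
    at_most_observed = 0
    for _ in range(n_simulations):
        # One simulated sample: each visitor purchases with probability null_prob
        purchases = sum(rng.random() < null_prob for _ in range(n_trials))
        if purchases <= observed:
            at_most_observed += 1
    return at_most_observed / n_simulations


p_value = simulate_p_value(observed=41, n_trials=500, null_prob=0.10)
print(f"simulated p-value: {p_value:.3f}")
```

Because the alternative is "less than", only simulated samples with at most the observed count contribute to the p-value. More simulations give a more stable estimate.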

Binomial Tests in Python

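The scipy.stats library of Python has a function called binom_test(), which is used to perform a binomial test. binom_test() accepts four inputs: the number of observed successes, the number of total trials, an expected probability of success, and the alternative hypothesis, which can be ‘two-sided’, ‘greater’, or ‘less’. Note that binom_test() was deprecated and then removed in SciPy 1.12; its replacement, binomtest(), takes the same four inputs and returns a result object whose pvalue attribute holds the p-value.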
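For the one-sided ‘less’ alternative, the exact p-value is simply the binomial probability of seeing at most the observed number of successes. The sketch below computes it with the standard library to show what the test calculates. The helper name binomial_p_value_less is hypothetical, and this is an illustration rather than SciPy's implementation.

```python
from math import comb


def binomial_p_value_less(successes, trials, null_prob):
    """P(X <= successes) when X ~ Binomial(trials, null_prob)."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes + 1)
    )


# Website example: 41 purchases among 500 visitors, null probability 0.10
website_p_value = binomial_p_value_less(41, 500, 0.10)
# Coin example: 40 heads in 100 flips, null probability 0.5
coin_p_value = binomial_p_value_less(40, 100, 0.5)

print(f"website: {website_p_value:.3f}, coin: {coin_p_value:.3f}")
```

For a ‘greater’ alternative the sum would instead run from the observed count up to the number of trials.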

One-Sample T-Tests

One-sample t-tests are used to compare a sample mean to an expected population mean. They are used for a sample of one quantitative variable. For example, we could use a one-sample t-test to determine if the average amount of time customers spend browsing a shoe boutique is longer than 10 minutes.

One-Sample T-Tests In Python

A one-sample t-test can be implemented in Python using the ttest_1samp() function from scipy.stats. The function requires a sample distribution and expected population mean. It returns the t-statistic and the p-value.

Binomial and T-Test Assumptions

Before running a one-sample t-test, it is important to check the following assumptions.

  • The sample should be independently and randomly sampled from the population of interest
  • The sample should be normally distributed or the sample size should be large


Table of Contents

  • Testing Your Python Code with Hypothesis
  • Installing & Using Hypothesis
  • A Quick Example
  • Understanding Hypothesis
  • Using Hypothesis Strategies
  • Filtering and Mapping Strategies
  • Composing Strategies
  • Constraints & Satisfiability
  • Writing Reusable Strategies with Functions

  • @composite: Declarative Strategies
  • @example: Explicitly Testing Certain Values

Hypothesis Example: Roman Numeral Converter

I can think of several Python packages that greatly improved the quality of the software I write. Two of them are pytest and hypothesis. The former adds an ergonomic framework for writing tests and fixtures and a feature-rich test runner. The latter adds property-based testing that can ferret out all but the most stubborn bugs using clever algorithms, and that’s the package we’ll explore in this course.

In an ordinary test you interface with the code you want to test by generating one or more inputs to test against, and then you validate that it returns the right answer. But that, then, raises a tantalizing question: what about all the inputs you didn’t test? Your code coverage tool may well report 100% test coverage, but that does not, ipso facto , mean the code is bug-free.

One of the defining features of Hypothesis is its ability to generate test cases automatically in a manner that has the following qualities:

Repeated invocations of your tests result in reproducible outcomes, even though Hypothesis does use randomness to generate the data.

You are given a detailed answer that explains how your test failed and why it failed. Hypothesis makes it clear how you, the human, can reproduce the invariant that caused your test to fail.

You can refine its strategies and tell it where or what it should or should not search for. At no point are you compelled to modify your code to suit the whims of Hypothesis if it generates nonsensical data.

So let’s look at how Hypothesis can help you discover errors in your code.

You can install hypothesis by typing pip install hypothesis . It has few dependencies of its own, and should install and run everywhere.

Hypothesis plugs into pytest and unittest by default, so you don’t have to do anything to make it work with it. In addition, Hypothesis comes with a CLI tool you can invoke with hypothesis . But more on that in a bit.

I will use pytest throughout to demonstrate Hypothesis, but it works equally well with the builtin unittest module.

Before I delve into the details of Hypothesis, let’s start with a simple example: a naive CSV writer and reader. A topic that seems simple enough: how hard is it to separate fields of data with a comma and then read it back in later?

But of course CSV is frighteningly hard to get right. The US and UK use '.' as a decimal separator, but in large parts of the world they use ',' which of course results in immediate failure. So then you start quoting things, and now you need a state machine that can distinguish quoted from unquoted; and what about nested quotes, etc.

The naive CSV reader and writer is an excellent stand-in for any number of complex projects where the requirements outwardly seem simple but there lurks a large number of edge cases that you must take into account.

Here the writer simply string quotes each field before joining them together with ',' . The reader does the opposite: it assumes each field is quoted after it is split by the comma.
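A writer and reader matching that description might look like the following sketch. The function names naive_write_csv_row and naive_read_csv_row are assumptions for illustration and not necessarily the names used in the original code.

```python
def naive_write_csv_row(fields):
    """Quote every field and join the quoted fields with commas."""
    return ",".join(f'"{field}"' for field in fields)


def naive_read_csv_row(row):
    """Split on commas and strip the surrounding quotes from each field."""
    return [field[1:-1] for field in row.split(",")]


row = naive_write_csv_row(["Hello", "World"])
print(row)
print(naive_read_csv_row(row))
```

Neither function inspects the contents of a field, which is exactly what makes them naive.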

A naive roundtrip pytest proves the code “works”:

And evidently so:

And for a lot of code that’s where the testing would begin and end. A couple of lines of code to test a couple of functions that outwardly behave in a manner that anybody can read and understand. Now let’s look at what a Hypothesis test would look like, and what happens when we run it:

At first blush there’s nothing here that you couldn’t divine the intent of, even if you don’t know Hypothesis. I’m asking for the argument fields to have a list ranging from one element of generated text up to ten. Aside from that, the test operates in exactly the same manner as before.

Now watch what happens when I run the test:

Hypothesis quickly found an example that broke our code. As it turns out, a list of [','] breaks our code. We get two fields back after round-tripping the code through our CSV writer and reader — uncovering our first bug.

In a nutshell, this is what Hypothesis does. But let’s look at it in detail.

Simply put, Hypothesis generates data using a number of configurable strategies . Strategies range from simple to complex. A simple strategy may generate bools; another integers. You can combine strategies to make larger ones, such as lists or dicts that match certain patterns or structures you want to test. You can clamp their outputs based on certain constraints, like only positive integers or strings of a certain length. You can also write your own strategies if you have particularly complex requirements.

Strategies are the gateway to property-based testing and are a fundamental part of how Hypothesis works. You can find a detailed list of all the strategies in the Strategies reference of their documentation or in the hypothesis.strategies module.

The best way to get a feel for what each strategy does in practice is to import them from the hypothesis.strategies module and call the example() method on an instance:

You may have noticed that the floats example included inf in the list. By default, all strategies will – where feasible – attempt to test all legal (but possibly obscure) forms of values you can generate of that type. That is particularly important as corner cases like inf or NaN are legal floating-point values but, I imagine, not something you’d ordinarily test against yourself.

And that’s one pillar of how Hypothesis tries to find bugs in your code: by testing edge cases that you would likely miss yourself. If you ask it for a text() strategy you’re as likely to be given Western characters as you are a mishmash of unicode and escape-encoded garbage. Understanding why Hypothesis generates the examples it does is a useful way to think about how your code may interact with data it has no control over.

Now if it were simply generating text or numbers from an inexhaustible source of numbers or strings, it wouldn’t catch as many errors as it actually does. The reason for that is that each test you write is subjected to a battery of examples drawn from the strategies you’ve designed. If a test case fails, it’s put aside and tested again but with a reduced subset of inputs, if possible. In Hypothesis it’s known as shrinking the search space to try and find the smallest possible result that will cause your code to fail. So instead of a 10,000-length string, if it can find one that’s only 3 or 4, it will try to show that to you instead.

You can tell Hypothesis to filter or map the examples it draws to further reduce them if the strategy does not meet your requirements:

Here I ask for integers where the number is greater than 0 and is evenly divisible by 8. Hypothesis will then attempt to generate examples that meet the constraints you have imposed on it.

You can also map , which works in much the same way as filter. Here I’m asking for lowercase ASCII and then uppercasing them:

Having said that, using either when you don’t have to (I could have asked for uppercase ASCII characters to begin with) is likely to result in slower strategies.
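The two operations can be sketched together as below. The strategy and test names are illustrative assumptions. Because a filter that rejects most draws can trip one of Hypothesis's health checks, that check is relaxed here.

```python
import string

from hypothesis import HealthCheck, given, settings, strategies as st

# filter(): keep only positive integers that are evenly divisible by 8
multiples_of_eight = st.integers(min_value=1).filter(lambda n: n % 8 == 0)

# map(): draw lowercase ASCII text, then transform each example to uppercase
shouting = st.text(alphabet=string.ascii_lowercase, min_size=1).map(str.upper)


@settings(suppress_health_check=[HealthCheck.filter_too_much])
@given(multiples_of_eight, shouting)
def test_filter_and_map(number, word):
    assert number > 0 and number % 8 == 0
    assert word == word.upper()
```

The filter discards roughly seven of every eight integers it draws, which illustrates the cost mentioned above. Generating an integer and multiplying it by 8 with map would avoid the discarded draws entirely.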

A third option, flatmap , lets you build strategies from strategies; but that deserves closer scrutiny, so I’ll talk about it later.

You can tell Hypothesis to pick one of a number of strategies by composing strategies with | or st.one_of() :

This is an essential feature when you have to draw from multiple sources of examples for a single data point.
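Both spellings can be sketched side by side. The variable and test names below are illustrative, and the two strategies are intended to be equivalent.

```python
from hypothesis import given, strategies as st

# Operator form: an example is either an integer or a piece of text
int_or_text = st.integers() | st.text()

# Function form of the same composition
int_or_text_again = st.one_of(st.integers(), st.text())


@given(int_or_text, int_or_text_again)
def test_one_of(first, second):
    assert isinstance(first, (int, str))
    assert isinstance(second, (int, str))
```

The function form tends to read better once more than two or three strategies are combined.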

When you ask Hypothesis to draw an example it takes into account the constraints you may have imposed on it: only positive integers; only lists of numbers that add up to exactly 100; any filter() calls you may have applied; and so on. Those are constraints. You’re taking something that was once unbounded (with respect to the strategy you’re drawing an example from, that is) and introducing additional limitations that constrain the possible range of values it can give you.

But consider what happens if I pass filters that will yield nothing at all:

At some point Hypothesis will give up and declare it cannot find anything that satisfies that strategy and its constraints.

Hypothesis gives up after a while if it’s not able to draw an example. Usually that indicates an invariant in the constraints you’ve placed that makes it hard or impossible to draw examples from. In the example above, I asked for numbers that are simultaneously below zero and greater than zero, which is an impossible request.

As you can see, the strategies are simple functions, and they behave as such. You can therefore refactor each strategy into reusable patterns:

The benefit of this approach is that if you discover edge cases that Hypothesis does not account for, you can update the pattern in one place and observe its effects on your code. It’s functional and composable.
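A minimal sketch of that refactoring, assuming a hypothetical need for people's names, could look like this. The functions names() and full_names() are invented for the illustration.

```python
import string

from hypothesis import given, strategies as st


def names():
    """Reusable strategy: short, non-empty ASCII names (an illustrative choice)."""
    return st.text(alphabet=string.ascii_letters, min_size=1, max_size=20)


def full_names():
    """Builds on names(): a (first_name, last_name) tuple."""
    return st.tuples(names(), names())


@given(full_names())
def test_full_names(full_name):
    first_name, last_name = full_name
    assert first_name and last_name
```

If names later need to allow hyphens or apostrophes, only names() has to change, and every strategy built on top of it picks up the new behaviour.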

One caveat of this approach is that you cannot draw examples and expect Hypothesis to behave correctly. So I don’t recommend you call example() on a strategy only to pass it into another strategy.

For that, you want the @composite decorator.

@composite : Declarative Strategies

If the previous approach is unabashedly functional in nature, this approach is imperative.

The @composite decorator lets you write imperative Python code instead. If you cannot easily structure your strategy with the built-in ones, or if you require more granular control over the values it emits, you should consider the @composite strategy.

Instead of returning a compound strategy object like you would above, you instead draw examples using a special function you’re given access to in the decorated function.

This example draws two randomized names and returns them as a tuple:

Note that the @composite decorator passes in a special draw callable that you must use to draw samples from. You cannot – well, you can, but you shouldn’t – use the example() method on the strategy object you get back. Doing so will break Hypothesis’s ability to synthesize test cases properly.

Because the code is imperative you’re free to modify the drawn examples to your liking. But what if you’re given an example you don’t like or one that breaks a known invariant you don’t wish to test for? For that you can use the assume() function to state the assumptions that Hypothesis must meet if you try to draw an example from generate_full_name .

Let’s say that first_name and last_name must not be equal:

Like the assert statement in Python, the assume() function teaches Hypothesis what is, and is not, a valid example. You can use this to great effect to generate complex compound strategies.
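Putting the pieces together, a composite strategy with an assumption might be sketched like this. The name generate_full_name comes from the text above, while the choice of ASCII letters for the names is an assumption made to keep the example small.

```python
import string

from hypothesis import assume, given, strategies as st


@st.composite
def generate_full_name(draw):
    # draw() pulls a concrete example out of a strategy
    first_name = draw(st.text(alphabet=string.ascii_letters, min_size=1))
    last_name = draw(st.text(alphabet=string.ascii_letters, min_size=1))
    # Discard any draw where the two names are identical
    assume(first_name != last_name)
    return (first_name, last_name)


@given(generate_full_name())
def test_names_are_distinct(full_name):
    first_name, last_name = full_name
    assert first_name != last_name
```

Calling generate_full_name() returns a strategy, so it can be passed to @given or combined with other strategies like any built-in one.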

I recommend you observe the following rules of thumb if you write imperative strategies with @composite :

If you want to draw a succession of examples to initialize, say, a list or a custom object with values that meet certain criteria you should use filter , where possible, and assume to teach Hypothesis why the value(s) you drew and subsequently discarded weren’t any good.

The example above uses assume() to teach Hypothesis that first_name and last_name must not be equal.

If you can put your functional strategies in separate functions, you should. It encourages code re-use and if your strategies are failing (or not generating the sort of examples you’d expect) you can inspect each strategy in turn. Large nested strategies are harder to untangle and harder still to reason about.

If you can express your requirements with filter and map or the builtin constraints (like min_size or max_size ), you should. Imperative strategies that use assume may take more time to converge on a valid example.

@example : Explicitly Testing Certain Values

Occasionally you’ll come across a handful of cases that either fail or used to fail, and you want to ensure that Hypothesis does not forget to test them, or to indicate to yourself or your fellow developers that certain values are known to cause issues and should be tested explicitly.

The @example decorator does just that:

You can add as many as you like.
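As a small sketch, the test below pins two explicit values alongside the generated ones. The property being tested and the pinned values are illustrative assumptions.

```python
from hypothesis import example, given, strategies as st


@given(st.text())
@example("")   # explicit edge case: the empty string
@example(",")  # a value assumed, for illustration, to have caused trouble before
def test_reversing_twice_is_identity(value):
    assert value[::-1][::-1] == value
```

The explicit examples are run in addition to the examples Hypothesis generates from the strategy.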

Let’s say I wanted to write a simple converter to and from Roman numerals.

Here I’m collecting Roman numerals into numerals , one at a time, by looping over SYMBOLS of valid numerals, subtracting the value of the symbol from number , until the while loop’s condition ( number >= 1 ) is False .

The test is also simple and serves as a smoke test. I generate a random integer and convert it to Roman numerals with to_roman . When it’s all said and done I turn the string of numerals into a set and check that all members of the set are legal Roman numerals.

Now if I run pytest on it, it seems to hang. But thanks to Hypothesis’s debug mode I can inspect why:

Ah. Instead of testing with tiny numbers like a human would ordinarily do, it used a fantastically large one… which is altogether slow.

OK, so there’s at least one gotcha; it’s not really a bug , but it’s something to think about: limiting the maximum value. I’m only going to limit the test, but it would be reasonable to limit it in the code also.

After changing max_value to something sensible, like st.integers(max_value=5000), the test now fails with another error:

It seems our code’s not able to handle the number 0! Which… is correct. The Romans didn’t really use the number zero as we would today; that invention came later, so they had a bunch of workarounds to deal with the absence of something. But that’s neither here nor there in our example. Let’s instead set min_value=1 also, as there is no support for negative numbers either:

OK… not bad. We’ve shown that, given a random assortment of numbers within our defined range of values, we do indeed get something resembling Roman numerals.

One of the hardest things about Hypothesis is framing questions to your testable code in a way that tests its properties but without you, the developer, knowing the answer (necessarily) beforehand. So one simple way to test that there’s at least something semi-coherent coming out of our to_roman function is to check that it can generate the very numerals we defined in SYMBOLS from before:

Here I’m sampling from a tuple of the SYMBOLS from earlier. The sampling algorithm’ll decide what values it wants to give us, all we care about is that we are given examples like ("I", 1) or ("V", 5) to compare against.

So let’s run pytest again:

Oops. The Roman numeral V is equal to 5 and yet we get five IIIII ? A closer examination reveals that, indeed, the code only yields sequences of I equal to the number we pass it. There’s a logic error in our code.

In the example above I loop over the elements in the SYMBOLS dictionary but as it’s ordered the first element is always I . And as the smallest representable value is 1, we end up with just that answer. It’s technically correct as you can count with just I but it’s not very useful.

Fixing it is easy though:

Rerunning the test yields a pass. Now we know that, at the very least, our to_roman function is capable of mapping numbers that are equal to any symbol in SYMBOLS .
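The original listing is not reproduced here, so the following is only a sketch of what a fixed converter might look like. It assumes SYMBOLS maps each numeral to its value and walks the symbols from largest to smallest; the subtraction rule is deliberately not handled yet.

```python
SYMBOLS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def to_roman(number: int) -> str:
    """Convert a positive integer to purely additive Roman numerals."""
    numerals = []
    # Visit the largest symbol first so that 5 becomes "V" rather than "IIIII".
    for symbol, value in sorted(SYMBOLS.items(), key=lambda kv: kv[1], reverse=True):
        while number >= value:
            numerals.append(symbol)
            number -= value
    return "".join(numerals)


if __name__ == "__main__":
    print(to_roman(5))  # V
    print(to_roman(6))  # VI
    print(to_roman(4))  # IIII (additive form only at this stage)
```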

Now the litmus test is taking the numeral we’re given and making sense of it. So let’s write a function that converts a Roman numeral back into decimal:

Like to_roman we walk through each character, get the numeral’s numeric value, and add it to the running total. The test is a simple roundtrip test, as to_roman has an inverse function from_roman (and vice versa) such that from_roman(to_roman(x)) == x.

Invertible functions are easier to test because you can feed the output of one into the other and check that you get the original value back. Not every function has an inverse, though.
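A roundtrip property along these lines could be sketched as follows. The converters are simplified additive stand-ins (assumptions, since the original listings are not shown), and the 1 to 5000 range mirrors the limits chosen earlier.

```python
from hypothesis import given, strategies as st

SYMBOLS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def to_roman(number: int) -> str:
    # Additive conversion, largest symbol first.
    numerals = []
    for symbol, value in sorted(SYMBOLS.items(), key=lambda kv: kv[1], reverse=True):
        while number >= value:
            numerals.append(symbol)
            number -= value
    return "".join(numerals)


def from_roman(numerals: str) -> int:
    # Additive parsing: sum the value of every character.
    return sum(SYMBOLS[char] for char in numerals)


@given(st.integers(min_value=1, max_value=5000))
def test_roundtrip(number):
    # Converting to numerals and back must return the original number.
    assert from_roman(to_roman(number)) == number


if __name__ == "__main__":
    test_roundtrip()
```

The test never states what the numerals for a given number should be; it only checks that the two functions agree with each other.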

Running the test yields a pass:

So now we’re in a pretty good place. But there’s a slight oversight in our Roman numeral converters, though: they don’t respect the subtraction rule for some of the numerals. For instance VI is 6; but IV is 4. The value XI is 11; and IX is 9. Only some (sigh) numerals exhibit this property.

So let’s write another test. This time it’ll fail as we’ve yet to write the modified code. Luckily we know the subtractive numerals we must accommodate:

Pretty simple test. Check that certain numerals yield the value, and that the values yield the right numeral.

With an extensive test suite we should feel fairly confident making changes to the code. If we break something, one of our preexisting tests will fail.

The rules around which numerals are subtractive are rather subjective. The SUBTRACTIVE_SYMBOLS dictionary holds the most common ones. So all we need to do is read ahead in the numerals list to see if there is a two-character numeral with a prescribed value, and if so use that instead of the usual value.

The to_roman change is simple. A union of the two numeral symbol dictionaries is all it takes. The code already understands how to turn numbers into numerals; we just added a few more.

This method requires Python 3.9 or later.
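As a hedged sketch of both changes: the contents of SUBTRACTIVE_SYMBOLS below are an assumption (the six commonly used pairs), and the read-ahead parser is one plausible way to implement the idea described above, not necessarily the original code.

```python
SYMBOLS = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Assumed contents: the six most common subtractive pairs.
SUBTRACTIVE_SYMBOLS = {"IV": 4, "IX": 9, "XL": 40, "XC": 90, "CD": 400, "CM": 900}

# Dictionary union with the | operator needs Python 3.9 or later.
ALL_SYMBOLS = SYMBOLS | SUBTRACTIVE_SYMBOLS


def to_roman(number: int) -> str:
    numerals = []
    # Largest value first, now including the two-character numerals.
    for symbol, value in sorted(ALL_SYMBOLS.items(), key=lambda kv: kv[1], reverse=True):
        while number >= value:
            numerals.append(symbol)
            number -= value
    return "".join(numerals)


def from_roman(numerals: str) -> int:
    total = 0
    index = 0
    while index < len(numerals):
        pair = numerals[index:index + 2]
        if pair in SUBTRACTIVE_SYMBOLS:
            # Read ahead: a known two-character numeral wins.
            total += SUBTRACTIVE_SYMBOLS[pair]
            index += 2
        else:
            total += SYMBOLS[numerals[index]]
            index += 1
    return total


if __name__ == "__main__":
    print(to_roman(1994))         # MCMXCIV
    print(from_roman("MCMXCIV"))  # 1994
```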

If done right, running the tests should yield a pass:

And that’s it. We now have useful tests and a functional Roman numeral converter that converts to and from with ease. But one thing we didn’t do is create a strategy that generates Roman numerals using st.text() . A custom composite strategy to generate both valid and invalid Roman numerals to test the ruggedness of our converter is left as an exercise to you.

In the next part of this course we’ll look at more advanced testing strategies.

Unlike a tool like faker that generates realistic-looking test data for fixtures or demos, Hypothesis is a property-based tester . It uses heuristics and clever algorithms to find inputs that break your code.

When testing a function that does not have an inverse to compare the result against (unlike our Roman numeral converter, which works both ways), you often have to approach your code as though it were a black box where you relinquish control of the inputs and outputs. That is harder, but makes for less brittle code.

It’s perfectly fine to mix and match tests. Hypothesis is useful for flushing out invariants you would never think of. Combine it with known inputs and outputs to jump start your testing for the first 80%, and augment it with Hypothesis to catch the remaining 20%.

Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously evaluating these assumptions against empirical evidence. This article sheds light on the significance of hypothesis testing and the critical steps involved in the process.

What is Hypothesis Testing?

A hypothesis is an assumption or idea, specifically a statistical claim about an unknown population parameter. For example, a judge assumes a person is innocent and verifies this by reviewing evidence and hearing testimony before reaching a verdict.

Hypothesis testing is a statistical method used to make a statistical decision from experimental data. It starts from an assumption about a population parameter and evaluates two mutually exclusive statements about the population to determine which statement is better supported by the sample data.

To test the validity of the claim or assumption about the population parameter:

  • A sample is drawn from the population and analyzed.
  • The results of the analysis are used to decide whether the data support the claim or not.

Example: You claim that the average height in the class is 30, or that boys are taller than girls. Each of these is an assumption, and we need a statistical way to test it: a mathematical procedure for deciding whether the data support what we are assuming.

Defining Hypotheses

  • Null hypothesis (H 0 ): In statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured cases or no difference among groups. In other words, it is a basic assumption made based on knowledge of the problem. Example : A company’s mean production is 50 units per day, i.e. H 0 : [Tex]\mu = 50[/Tex].
  • Alternative hypothesis (H 1 ): The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis. Example: A company’s mean production is not equal to 50 units per day, i.e. H 1 : [Tex]\mu \ne 50[/Tex].

Key Terms of Hypothesis Testing

  • Level of significance : The probability of rejecting the null hypothesis when it is actually true. Because 100% certainty is not possible, we select a level of significance, denoted by [Tex]\alpha[/Tex] , which is usually 0.05 or 5%. This means we accept a 5% risk of wrongly rejecting a true null hypothesis.
  • P-value: The P value , or calculated probability, is the probability of obtaining results at least as extreme as those observed when the null hypothesis (H 0 ) is true. If your P-value is less than the chosen significance level, you reject the null hypothesis, i.e. the sample provides evidence in favor of the alternative hypothesis.
  • Test Statistic: The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis. It is compared to a critical value, or converted to a p-value, to decide whether the observed results are statistically significant.
  • Critical value : The critical value in statistics is a threshold or cutoff point used to determine whether to reject the null hypothesis in a hypothesis test.
  • Degrees of freedom: Degrees of freedom describe how many values are free to vary when estimating a parameter. They are related to the sample size and determine the shape of the reference distribution (for example, the t-distribution).

Why do we use Hypothesis Testing?

Hypothesis testing is an important procedure in statistics. It evaluates two mutually exclusive population statements to determine which statement is better supported by the sample data. When we say that findings are statistically significant, it is hypothesis testing that justifies the claim.

One-Tailed and Two-Tailed Test

One tailed test focuses on one direction, either greater than or less than a specified value. We use a one-tailed test when there is a clear directional expectation based on prior knowledge or theory. The critical region is located on only one side of the distribution curve. If the sample falls into this critical region, the null hypothesis is rejected in favor of the alternative hypothesis.

One-Tailed Test

There are two types of one-tailed test:

  • Left-Tailed (Left-Sided) Test: The alternative hypothesis asserts that the true parameter value is less than the value stated in the null hypothesis. Example: H 0 : [Tex]\mu \geq 50 [/Tex] and H 1 : [Tex]\mu < 50 [/Tex]
  • Right-Tailed (Right-Sided) Test : The alternative hypothesis asserts that the true parameter value is greater than the value stated in the null hypothesis. Example: H 0 : [Tex]\mu \leq 50 [/Tex] and H 1 : [Tex]\mu > 50 [/Tex]

Two-Tailed Test

A two-tailed test considers both directions, greater than and less than a specified value. We use a two-tailed test when there is no specific directional expectation and we want to detect any significant difference.

Example: H 0 : [Tex]\mu = [/Tex] 50 and H 1 : [Tex]\mu \neq 50 [/Tex]


What are Type 1 and Type 2 errors in Hypothesis Testing?

In hypothesis testing, Type I and Type II errors are two possible errors that researchers can make when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions made regarding the null hypothesis and the alternative hypothesis.

  • Type I error: We reject the null hypothesis although it is true. The probability of a Type I error is denoted by alpha ( [Tex]\alpha [/Tex] ).
  • Type II error : We fail to reject the null hypothesis although it is false. The probability of a Type II error is denoted by beta ( [Tex]\beta [/Tex] ).

| Decision | Null Hypothesis is True | Null Hypothesis is False |
| --- | --- | --- |
| Fail to reject the null hypothesis (Accept) | Correct Decision | Type II Error (False Negative) |
| Reject the null hypothesis | Type I Error (False Positive) | Correct Decision |
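To make the Type I error rate concrete, here is a small simulation sketch using only the standard library. All parameters (a true mean of 200, a standard deviation of 5, samples of 25, 2000 trials, the seed) are illustrative assumptions. Because the null hypothesis is true in every trial, each rejection is a Type I error, and the observed rate should land near alpha.

```python
import random
from statistics import NormalDist, mean


def type_i_error_rate(trials=2000, n=25, mu0=200.0, sigma=5.0, alpha=0.05, seed=0):
    """Estimate how often a two-tailed z-test rejects a true null hypothesis."""
    rng = random.Random(seed)
    critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    rejections = 0
    for _ in range(trials):
        # The data really do come from a population with mean mu0.
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z = (mean(sample) - mu0) / (sigma / n ** 0.5)
        if abs(z) > critical:
            rejections += 1  # a false positive
    return rejections / trials


rate = type_i_error_rate()
print(f"Observed Type I error rate: {rate:.3f}")
```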

How does Hypothesis Testing work?

Step 1: Define Null and Alternative Hypotheses

State the null hypothesis ( [Tex]H_0 [/Tex] ), representing no effect, and the alternative hypothesis ( [Tex]H_1 [/Tex] ), suggesting an effect or difference.

We first identify the claim we want to test, keeping in mind that the two hypotheses must contradict one another. The tests discussed here assume normally distributed data.

Step 2: Choose the Significance Level

Select a significance level ( [Tex]\alpha [/Tex] ), typically 0.05, as the threshold for rejecting the null hypothesis. It is the probability of rejecting the null hypothesis when it is actually true, and it is fixed before the test is run. The p-value computed from the data is later compared against this threshold.

Step 3: Collect and Analyze Data

Gather relevant data through observation or experimentation. Analyze the data using appropriate statistical methods to obtain a test statistic.

Step 4: Calculate the Test Statistic

In this step the data are evaluated and a score is computed from the characteristics of the sample. The choice of test statistic depends on the type of hypothesis test being conducted.

There are various hypothesis tests, each appropriate for a different goal, such as the Z-test , Chi-square test , T-test , and so on.

  • Z-test : Used when the population standard deviation is known; the Z-statistic is the usual choice.
  • t-test : Used when the population standard deviation is unknown and the sample size is small; the t-statistic is more appropriate.
  • Chi-square test : Used for categorical data or for testing independence in contingency tables.
  • F-test : Often used in analysis of variance (ANOVA) to compare variances or test the equality of means across multiple groups.

With a small dataset and an unknown population standard deviation, the T-test is the more appropriate choice.

T-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

Step 5: Compare the Test Statistic

In this stage, we decide whether to reject or fail to reject the null hypothesis. There are two ways to make this decision.

Method A: Using Critical Values

Compare the test statistic with the tabulated critical value (for a two-tailed test, use the absolute value of the test statistic):

  • If Test Statistic > Critical Value: Reject the null hypothesis.
  • If Test Statistic ≤ Critical Value: Fail to reject the null hypothesis.

Note: Critical values are predetermined threshold values used to make a decision in hypothesis testing. To determine them, we typically refer to a statistical distribution table , such as the normal distribution or t-distribution table.

Method B: Using P-values

We can also reach a conclusion using the p-value:

  • If the p-value is less than or equal to the significance level, i.e. ( [Tex]p\leq\alpha [/Tex] ), you reject the null hypothesis. This indicates that the observed results are unlikely to have occurred by chance alone, providing evidence in favor of the alternative hypothesis.
  • If the p-value is greater than the significance level, i.e. ( [Tex]p > \alpha[/Tex] ), you fail to reject the null hypothesis. This suggests that the observed results are consistent with what would be expected under the null hypothesis.

Note : The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. To determine the p-value, we typically use statistical software or refer to a statistical distribution table , such as the normal distribution or t-distribution table.
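Methods A and B can be compared side by side for a two-tailed z-test. The sketch below uses only the standard library; the helper name and the input z = 2.04 are illustrative assumptions.

```python
from statistics import NormalDist


def two_sided_z_decision(z, alpha=0.05):
    """Return the critical value, the p-value, and both reject decisions."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    critical_value = std_normal.inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    reject_by_critical_value = abs(z) > critical_value  # Method A
    reject_by_p_value = p_value <= alpha                # Method B
    return critical_value, p_value, reject_by_critical_value, reject_by_p_value


critical_value, p_value, by_critical, by_p = two_sided_z_decision(2.04)
print(f"critical value = {critical_value:.3f}, p-value = {p_value:.4f}")
print(f"Method A rejects: {by_critical}, Method B rejects: {by_p}")
```

For a given test and alpha, the two methods are two views of the same comparison, so they lead to the same decision.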

Step 6: Interpret the Results

At last, we can conclude our experiment using method A or B.

Calculating test statistic

To test a hypothesis about a population parameter we use statistical functions . For normally distributed data , we use the z-score, p-value, and level of significance (alpha) to weigh the evidence for or against our hypothesis.

1. Z-statistics:

Used when the population standard deviation is known.

[Tex]z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}[/Tex]

  • [Tex]\bar{x} [/Tex] is the sample mean,
  • μ represents the population mean,
  • σ is the population standard deviation,
  • and n is the size of the sample.

2. T-Statistics

The T-test is used when the population standard deviation is unknown and the sample is small (commonly n < 30).

The t-statistic calculation is given by:

[Tex]t=\frac{\bar{x}-\mu}{s/\sqrt{n}} [/Tex]

  • t = t-score,
  • x̄ = sample mean
  • μ = population mean,
  • s = standard deviation of the sample,
  • n = sample size
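The formula can be evaluated directly with the standard library. This sketch reuses the turtle weights and the hypothesized mean of 310 pounds from Example 1 earlier in this document; the function name is an illustrative choice.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n))."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (divides by n - 1)
    return (mean(sample) - mu0) / (s / sqrt(n))


weights = [300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303]
t_statistic = one_sample_t(weights, mu0=310)
print(f"t = {t_statistic:.4f}")
```

The result agrees with the test statistic of -1.5848 reported for Example 1.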

3. Chi-Square Test

The Chi-Square Test for Independence is used for categorical (non-normally distributed) data:

[Tex]\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}[/Tex]

  • [Tex]O_{ij}[/Tex] is the observed frequency in cell [Tex]{ij} [/Tex]
  • i and j are the row and column indices, respectively.
  • [Tex]E_{ij}[/Tex] is the expected frequency in cell [Tex]{ij}[/Tex] , calculated as : [Tex]\frac{\text{Row total} \times \text{Column total}}{\text{Total observations}}[/Tex]
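As a worked illustration of the formula, the sketch below computes the expected frequencies and the statistic in plain Python. The 2×2 table of observed counts is made up purely for illustration.

```python
def chi_square_statistic(observed):
    """Sum (O_ij - E_ij)^2 / E_ij over every cell of a contingency table."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            # Expected frequency: row total * column total / total observations.
            e_ij = row_totals[i] * column_totals[j] / total
            chi2 += (o_ij - e_ij) ** 2 / e_ij
    return chi2


# Hypothetical counts: rows are two groups, columns are two outcomes.
observed = [[10, 20], [20, 10]]
chi2 = chi_square_statistic(observed)
print(f"chi-square = {chi2:.4f}")
```

Every expected frequency in this table is 30 × 30 / 60 = 15, so each cell contributes 25 / 15 and the statistic is 100 / 15 ≈ 6.67.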

Real life Examples of Hypothesis Testing

Let’s examine hypothesis testing using two real life situations,

Case A: Does a New Drug Affect Blood Pressure?

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market, they need to conduct a study to assess its impact on blood pressure.

  • Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
  • After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Step 1 : Define the Hypothesis

  • Null Hypothesis : (H 0 )The new drug has no effect on blood pressure.
  • Alternate Hypothesis : (H 1 )The new drug has an effect on blood pressure.

Step 2: Define the Significance level

Let’s set the significance level at 0.05: we reject the null hypothesis if the evidence suggests less than a 5% chance of observing the results due to random variation alone.

Step 3 : Compute the test statistic

Using a paired T-test, analyze the data to obtain a test statistic and a p-value.

The test statistic (here, the T-statistic) is calculated from the differences between the blood pressure measurements after and before treatment.

t = m/(s/√n)

  • m  = mean of the differences d i = X after, i − X before, i
  • s  = sample standard deviation of the differences d i
  • n  = sample size

Then m = -3.9, s ≈ 1.37 and n = 10.

Using the formula for the paired t-test, we calculate the T-statistic: t = -3.9 / (1.37 / √10) ≈ -9.

Step 4: Find the p-value

The calculated t-statistic is -9 with df = 9 degrees of freedom; you can find the p-value using statistical software or a t-distribution table.

Thus, p-value = 8.538051223166285e-06.

Step 5: Result

  • If the p-value is less than or equal to 0.05, the researchers reject the null hypothesis.
  • If the p-value is greater than 0.05, they fail to reject the null hypothesis.

Conclusion: Since the p-value (8.538051223166285e-06) is less than the significance level (0.05), the researchers reject the null hypothesis. There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.

Python Implementation of Case A

Let’s implement this hypothesis test in Python, where we test whether a new drug affects blood pressure. For this example, we will use a paired T-test from the scipy.stats library.

SciPy is a Python library for scientific and mathematical computation.

We will implement our first real-life problem in Python:

```python
import numpy as np
from scipy import stats

# Data
before_treatment = np.array([120, 122, 118, 130, 125, 128, 115, 121, 123, 119])
after_treatment = np.array([115, 120, 112, 128, 122, 125, 110, 117, 119, 114])

# Step 1: Null and Alternate Hypotheses
# Null Hypothesis: The new drug has no effect on blood pressure.
# Alternate Hypothesis: The new drug has an effect on blood pressure.
null_hypothesis = "The new drug has no effect on blood pressure."
alternate_hypothesis = "The new drug has an effect on blood pressure."

# Step 2: Significance Level
alpha = 0.05

# Step 3: Paired T-test
t_statistic, p_value = stats.ttest_rel(after_treatment, before_treatment)

# Step 4: Calculate T-statistic manually
m = np.mean(after_treatment - before_treatment)
s = np.std(after_treatment - before_treatment, ddof=1)  # using ddof=1 for sample standard deviation
n = len(before_treatment)
t_statistic_manual = m / (s / np.sqrt(n))

# Step 5: Decision
if p_value <= alpha:
    decision = "Reject"
else:
    decision = "Fail to reject"

# Conclusion
if decision == "Reject":
    conclusion = "There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different."
else:
    conclusion = "There is insufficient evidence to claim a significant difference in average blood pressure before and after treatment with the new drug."

# Display results
print("T-statistic (from scipy):", t_statistic)
print("P-value (from scipy):", p_value)
print("T-statistic (calculated manually):", t_statistic_manual)
print(f"Decision: {decision} the null hypothesis at alpha={alpha}.")
print("Conclusion:", conclusion)
```

Output:

```
T-statistic (from scipy): -9.0
P-value (from scipy): 8.538051223166285e-06
T-statistic (calculated manually): -9.0
Decision: Reject the null hypothesis at alpha=0.05.
Conclusion: There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.
```

In the above example, given the T-statistic of approximately -9 and an extremely small p-value, the results indicate a strong case to reject the null hypothesis at a significance level of 0.05. 

  • The results suggest that the new drug has a significant effect on lowering blood pressure.
  • The negative T-statistic indicates that the mean blood pressure after treatment is lower than the mean blood pressure before treatment.

Case B : Cholesterol level in a population

Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.

Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.

Hypothesized Population Mean (μ): 200 mg/dL

Population Standard Deviation (σ): 5 mg/dL(given for this problem)

Step 1: Define the Hypothesis

  • Null Hypothesis (H 0 ): The average cholesterol level in a population is 200 mg/dL.
  • Alternate Hypothesis (H 1 ): The average cholesterol level in a population is different from 200 mg/dL.

Step 2: Define the Significance Level

As the direction of deviation is not given, we assume a two-tailed test. Based on a normal distribution table, the critical values for a significance level of 0.05 (two-tailed) are approximately -1.96 and 1.96.

Step 3: Compute the Test Statistic

The sample mean of the 25 measurements is 202.04 mg/dL. The test statistic is calculated using the z formula: Z = [Tex](202.04 - 200) / (5 \div \sqrt{25}) [/Tex] , which gives Z ≈ 2.04.

Step 4: Result

Since the absolute value of the test statistic (2.04) is greater than the critical value (1.96), we reject the null hypothesis and conclude that there is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.

Python Implementation of Case B

```python
import scipy.stats as stats
import math
import numpy as np

# Given data
sample_data = np.array(
    [205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208,
     200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205])
population_std_dev = 5
population_mean = 200
sample_size = len(sample_data)

# Step 1: Define the Hypotheses
# Null Hypothesis (H0): The average cholesterol level in a population is 200 mg/dL.
# Alternate Hypothesis (H1): The average cholesterol level in a population is different from 200 mg/dL.

# Step 2: Define the Significance Level
alpha = 0.05  # Two-tailed test

# Critical values for a significance level of 0.05 (two-tailed)
critical_value_left = stats.norm.ppf(alpha / 2)
critical_value_right = -critical_value_left

# Step 3: Compute the test statistic
sample_mean = sample_data.mean()
z_score = (sample_mean - population_mean) / \
    (population_std_dev / math.sqrt(sample_size))

# Step 4: Result
# Check if the absolute value of the test statistic is greater than the critical values
if abs(z_score) > max(abs(critical_value_left), abs(critical_value_right)):
    print("Reject the null hypothesis.")
    print("There is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.")
else:
    print("Fail to reject the null hypothesis.")
    print("There is not enough evidence to conclude that the average cholesterol level in the population is different from 200 mg/dL.")
```

Output:

```
Reject the null hypothesis.
There is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.
```

Limitations of Hypothesis Testing

  • Although a useful technique, hypothesis testing does not offer a comprehensive grasp of the topic being studied. Without fully reflecting the intricacy or whole context of the phenomena, it concentrates on certain hypotheses and statistical significance.
  • The accuracy of hypothesis testing results is contingent on the quality of available data and the appropriateness of statistical methods used. Inaccurate data or poorly formulated hypotheses can lead to incorrect conclusions.
  • Relying solely on hypothesis testing may cause analysts to overlook significant patterns or relationships in the data that are not captured by the specific hypotheses being tested. This limitation underscores the importance of complementing hypothesis testing with other analytical approaches.

Hypothesis testing stands as a cornerstone in statistical analysis, enabling data scientists to navigate uncertainties and draw credible inferences from sample data. By systematically defining null and alternative hypotheses, choosing significance levels, and leveraging statistical tests, researchers can assess the validity of their assumptions. The article also elucidates the critical distinction between Type I and Type II errors, providing a comprehensive understanding of the nuanced decision-making process inherent in hypothesis testing. The real-life example of testing a new drug’s effect on blood pressure using a paired T-test showcases the practical application of these principles, underscoring the importance of statistical rigor in data-driven decision-making.

Frequently Asked Questions (FAQs)

1. What are the 3 types of hypothesis test?

There are three types of hypothesis tests: right-tailed, left-tailed, and two-tailed. Right-tailed tests assess whether a parameter is greater than a value, and left-tailed tests whether it is smaller. Two-tailed tests check for a difference in either direction.

2. What are the 4 components of hypothesis testing?

  • Null Hypothesis ( [Tex]H_0 [/Tex] ): No effect or difference exists.
  • Alternative Hypothesis ( [Tex]H_1 [/Tex] ): An effect or difference exists.
  • Significance Level ( [Tex]\alpha [/Tex] ): Risk of rejecting the null hypothesis when it is true (Type I error).
  • Test Statistic: Numerical value representing the observed evidence against the null hypothesis.

3. What is hypothesis testing in ML?

It is a statistical method for evaluating the performance and validity of machine learning models. It tests specific hypotheses about model behavior, such as whether features influence predictions or whether a model generalizes well to unseen data.

4. What is the difference between Pytest and Hypothesis in Python?

Pytest is a general-purpose testing framework for Python code, while Hypothesis is a property-based testing framework for Python that generates test cases from specified properties of the code.


Details and advanced features

This is an account of slightly less common Hypothesis features that you don’t need to get started but will nevertheless make your life easier.

Additional test output

Normally the output of a failing test will look something like:

With the repr of each keyword argument being printed.

Sometimes this isn’t enough, either because you have a value with a __repr__() method that isn’t very descriptive or because you need to see the output of some intermediate steps of your test. That’s where the note function comes in:

Report this value for the minimal failing example.

The note is printed for the minimal failing example of the test in order to include any additional information you might need in your test.

Test statistics

If you are using pytest you can see a number of statistics about the executed tests by passing the command line argument --hypothesis-show-statistics . This will include some general statistics about the test:

For example if you ran the following with --hypothesis-show-statistics :

You would see:

The final “Stopped because” line is particularly important to note: It tells you the setting value that determined when the test should stop trying new examples. This can be useful for understanding the behaviour of your tests. Ideally you’d always want this to be max_examples .

In some cases (such as filtered and recursive strategies) you will see events mentioned which describe some aspect of the data generation:

You would see something like:

You can also mark custom events in a test using the event function:

Record an event that occurred during this test. Statistics on the number of test runs with each event will be reported at the end if you run Hypothesis in statistics reporting mode.

Event values should be strings or convertible to them. If an optional payload is given, it will be included in the string for Test statistics .

You will then see output like:

Arguments to event can be any hashable type, but two events will be considered the same if they are the same when converted to a string with str .

Making assumptions

Sometimes Hypothesis doesn’t give you exactly the right sort of data you want - it’s mostly of the right shape, but some examples won’t work and you don’t want to care about them. You can just ignore these by aborting the test early, but this runs the risk of accidentally testing a lot less than you think you are. Also it would be nice to spend less time on bad examples - if you’re running 100 examples per test (the default) and it turns out 70 of those examples don’t match your needs, that’s a lot of wasted time.

Calling assume is like an assert that marks the example as bad, rather than failing the test.

This allows you to specify properties that you assume will be true, and let Hypothesis try to avoid similar examples in future.

For example suppose you had the following test:

Running this gives us:

This is annoying. We know about NaN and don’t really care about it, but as soon as Hypothesis finds a NaN example it will get distracted by that and tell us about it. Also the test will fail and we want it to pass.

So let’s block off this particular example:

And this passes without a problem.
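The listings for this example are not reproduced above, so the following is a sketch consistent with the description: a property over floats that would fail for NaN, with assume used to discard NaN inputs. The test name and the exact property are assumptions.

```python
from math import isnan

from hypothesis import assume, given, strategies as st


@given(st.floats())
def test_negation_is_self_inverse(x):
    # NaN never compares equal to itself, so mark those examples as bad
    # instead of letting them fail the test.
    assume(not isnan(x))
    assert x == -(-x)


if __name__ == "__main__":
    test_negation_is_self_inverse()
```

Unlike an early return, assume tells Hypothesis that the example was rejected, so it does not count toward the examples that were actually tested.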

In order to avoid the easy trap where you assume a lot more than you intended, Hypothesis will fail a test when it can’t find enough examples passing the assumption.

If we’d written:

Then on running we’d have got the exception:

How good is assume?

Hypothesis has an adaptive exploration strategy to try to avoid things which falsify assumptions, which should generally result in it still being able to find examples in hard to find situations.

Suppose we had the following:

Unsurprisingly this fails and gives the falsifying example [] .

Adding assume(xs) to this removes the trivial empty example and gives us [0] .

Adding assume(all(x > 0 for x in xs)) and it passes: the sum of a list of positive integers is positive.

The reason that this should be surprising is not that it doesn’t find a counter-example, but that it finds enough examples at all.

In order to make sure something interesting is happening, suppose we wanted to try this for long lists. e.g. suppose we added an assume(len(xs) > 10) to it. This should basically never find an example: a naive strategy would find fewer than one in a thousand examples, because if each element of the list is negative with probability one-half, you’d have to have ten of these go the right way by chance. In the default configuration Hypothesis gives up long before it’s tried 1000 examples (by default it tries 200).

Here’s what happens if we try to run this:

In: test_sum_is_positive()

As you can see, Hypothesis doesn’t find many examples here, but it finds some - enough to keep it happy.

In general if you can shape your strategies better to your tests you should - for example integers(1, 1000) is a lot better than assume(1 <= x <= 1000) , but assume will take you a long way if you can’t.

Defining strategies

The type of object that is used to explore the examples given to your test function is called a SearchStrategy. These are created using the functions exposed in the hypothesis.strategies module.

Many of these strategies expose a variety of arguments you can use to customize generation. For example, for integers you can specify the minimum and maximum values you want. If you want to see exactly what a strategy produces you can ask for an example:
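A short sketch, assuming the `min_value` and `max_value` arguments of `integers()` and the `.example()` method, which is meant for interactive exploration rather than for use inside tests:

```python
from hypothesis.strategies import integers

# Bound the generated integers with min_value and max_value.
small_ints = integers(min_value=0, max_value=10)

# .example() draws a single value; handy at the REPL, not inside tests.
value = small_ints.example()
print(value)
```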

Many strategies are built out of other strategies. For example, if you want to define a tuple you need to say what goes in each element:
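As an illustration, a strategy for pairs can be sketched with `tuples()`, passing one element strategy per position (the choice of `integers()` and `booleans()` here is arbitrary):

```python
from hypothesis.strategies import booleans, integers, tuples

# A strategy for (int, bool) pairs: one element strategy per tuple position.
pairs = tuples(integers(), booleans())

pair = pairs.example()
print(pair)
```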

Further details are available in a separate document.

The gory details of given parameters

given is a decorator for turning a test function that accepts arguments into a randomized test.

This is the main entry point to Hypothesis.

The @given decorator may be used to specify which arguments of a function should be parametrized over. You can use either positional or keyword arguments, but not a mixture of both.

For example all of the following are valid uses:
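A sketch of a few such uses, based on the rules described in this section (function and class names are illustrative):

```python
from hypothesis import given
from hypothesis.strategies import integers


@given(integers(), integers())
def a(x, y):
    pass


@given(integers())
def b(x, y):
    # Only y, the rightmost argument, is supplied by Hypothesis.
    pass


@given(y=integers())
def c(x, y):
    pass


@given(x=integers())
def d(x, y):
    pass


class SomeTest:
    @given(integers())
    def test_a_thing(self, x):
        # self is passed through as normal; only x is parametrized.
        pass
```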

The following are not:
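A sketch of some invalid uses, assuming the misuse is reported as `InvalidArgument` when the test is called rather than when it is decorated (function names are illustrative):

```python
from hypothesis import given
from hypothesis.errors import InvalidArgument
from hypothesis.strategies import integers


@given(integers(), integers(), integers())
def g(x, y):
    # Invalid: three strategies for a function with only two arguments.
    pass


@given(integers(), x=integers())
def i(x, y):
    # Invalid: mixes positional and keyword arguments to given.
    pass


@given()
def j(x, y):
    # Invalid: given was called without any arguments.
    pass


for bad in (g, i, j):
    try:
        bad()
    except InvalidArgument as err:
        print(err)
```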

The rules for determining what are valid uses of given are as follows:

You may pass any keyword argument to given.

Positional arguments to given are equivalent to the rightmost named arguments for the test function.

Positional arguments may not be used if the underlying test function has varargs, arbitrary keywords, or keyword-only arguments.

Functions tested with given may not have any defaults.

The reason for the “rightmost named arguments” behaviour is so that using @given with instance methods works: self will be passed to the function as normal and not be parametrized over.

The function returned by given has all the same arguments as the original test, minus those that are filled in by @given. Check the notes on framework compatibility to see how this affects other testing libraries you may be using.

Targeted example generation

Targeted property-based testing combines the advantages of both search-based and property-based testing. Instead of being completely random, T-PBT uses a search-based component to guide the input generation towards values that have a higher probability of falsifying a property. This explores the input space more effectively and requires fewer tests to find a bug or achieve a high confidence in the system being tested than random PBT. ( Löscher and Sagonas )

This is not always a good idea - for example calculating the search metric might take time better spent running more uniformly-random test cases, or your target metric might accidentally lead Hypothesis away from bugs - but if there is a natural metric like “floating-point error”, “load factor” or “queue length”, we encourage you to experiment with targeted testing.

Calling target() with an int or float observation gives Hypothesis feedback with which to guide its search for inputs that will cause an error, in addition to all the usual heuristics. Observations must always be finite.

Hypothesis will try to maximize the observed value over several examples; almost any metric will work so long as it makes sense to increase it. For example, -abs(error) is a metric that increases as error approaches zero.

Example metrics:

Number of elements in a collection, or tasks in a queue

Mean or maximum runtime of a task (or both, if you use label )

Compression ratio for data (perhaps per-algorithm or per-level)

Number of steps taken by a state machine

The optional label argument can be used to distinguish between and therefore separately optimise distinct observations, such as the mean and standard deviation of a dataset. It is an error to call target() with any label more than once per test case.
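A minimal sketch of a targeted test, using the length of a generated list as the metric (the test, the metric, and the label text are all illustrative choices):

```python
from hypothesis import given, target
from hypothesis.strategies import integers, lists


@given(lists(integers(), max_size=50))
def test_sorting_preserves_length(xs):
    # Report the list length so the search is nudged towards longer lists.
    # The label is optional; it matters when several metrics are reported.
    target(float(len(xs)), label="list length")
    assert len(sorted(xs)) == len(xs)
```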

The more examples you run, the better this technique works.

As a rule of thumb, the targeting effect is noticeable above max_examples=1000, and immediately obvious by around ten thousand examples per label used by your test.

Test statistics include the best score seen for each label, which can help avoid the threshold problem when the minimal example shrinks right down to the threshold of failure (issue #2180).

We recommend that users also skim the papers introducing targeted PBT, from ISSTA 2017 and ICST 2018. For the curious, the initial implementation in Hypothesis uses hill-climbing search via a mutating fuzzer, with some tactics inspired by simulated annealing to avoid getting stuck and endlessly mutating a local maximum.

Custom function execution

Hypothesis provides you with a hook that lets you control how it runs examples.

This lets you do things like set up and tear down around each example, run examples in a subprocess, transform coroutine tests into normal tests, etc. For example, TransactionTestCase in the Django extra runs each example in a separate database transaction.

The way this works is by introducing the concept of an executor. An executor is essentially a function that takes a block of code and runs it. The default executor is:
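In sketch form, assuming the name `default_executor`, it simply calls the function it receives and returns the result:

```python
def default_executor(function):
    # Run the block of code and hand back whatever it returns.
    return function()
```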

You define executors by defining a method execute_example on a class. Any test methods on that class with @given used on them will use self.execute_example as an executor with which to run tests. For example, the following executor runs all its code twice:
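A sketch of such a class, with a simple recording list standing in for whatever unreliable operation the test would really perform (the class and test names are illustrative):

```python
from hypothesis import given
from hypothesis.strategies import integers

calls = []


class TestTryReallyHard:
    @given(integers())
    def test_something(self, i):
        # Stand-in for some unreliable operation on i.
        calls.append(i)

    def execute_example(self, f):
        # Run every example twice; return the result of the second run.
        f()
        return f()
```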

Note: The functions you use in map, etc. will run inside the executor. i.e. they will not be called until you invoke the function passed to execute_example .

An executor must be able to handle being passed a function which returns None, otherwise it won’t be able to run normal test cases. So for example the following executor is invalid:

and should be rewritten as:
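A sketch showing both forms side by side (the class names are illustrative). The invalid version always calls the result of `f()`, while the valid version only does so when that result is callable:

```python
class InvalidExecutor:
    def execute_example(self, f):
        # Breaks on ordinary tests: f() returns None, and None is not callable.
        f()()


class ValidExecutor:
    def execute_example(self, f):
        result = f()
        # Only call the result when the test actually returned a callable.
        if callable(result):
            result = result()
        return result
```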

An alternative hook is provided for use by test runner extensions such as pytest-trio, which cannot use the execute_example method. This is not recommended for end-users - it is better to write a complete test function directly, perhaps by using a decorator to perform the same transformation before applying @given.

For authors of test runners however, assigning to the inner_test attribute of the hypothesis attribute of the test will replace the interior test.

The new inner_test must accept and pass through all the *args and **kwargs expected by the original test.

If the end user has also specified a custom executor using the execute_example method, it - and all other execution-time logic - will be applied to the new inner test assigned by the test runner.

Making random code deterministic

While Hypothesis’ example generation can be used for nondeterministic tests, debugging anything nondeterministic is usually a very frustrating exercise. To make things worse, our example shrinking relies on the same input causing the same failure each time - though we show the un-shrunk failure and a decent error message if it doesn’t.

By default, Hypothesis will handle the global random and numpy.random random number generators for you, and you can register others:

register_random(r) registers (a weakref to) the given Random-like instance r for management by Hypothesis.

You can pass instances of structural subtypes of random.Random (i.e., objects with seed, getstate, and setstate methods) to register_random(r) to have their states seeded and restored in the same way as the global PRNGs from the random and numpy.random modules.

All global PRNGs, from e.g. simulation or scheduling frameworks, should be registered to prevent flaky tests. Hypothesis will ensure that the PRNG state is consistent for all test runs, always seeding them to zero and restoring the previous state after the test, or reproducibly varying the seed if you choose to use the random_module() strategy.

register_random only makes weakrefs to r, thus r will only be managed by Hypothesis as long as it has active references elsewhere at runtime. The pattern register_random(MyRandom()) will raise a ReferenceError to help protect users from this issue. This check does not occur for the PyPy interpreter. See the following example for an illustration of this issue.
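A minimal sketch of the safe pattern, assuming a hypothetical `MyRandom` subclass of `random.Random`. The instance is bound to a module-level name so that a strong reference outlives the weakref held by Hypothesis:

```python
import random

from hypothesis import register_random


class MyRandom(random.Random):
    """Hypothetical project-specific PRNG."""


# Keep a strong reference (here, a module-level name) so the instance
# outlives the weakref that Hypothesis stores.
my_rng = MyRandom()
register_random(my_rng)

# By contrast, register_random(MyRandom()) passes an object with no other
# references, so there would be nothing left for Hypothesis to manage.
```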

Inferred strategies

In some cases, Hypothesis can work out what to do when you omit arguments. This is based on introspection, not magic, and therefore has well-defined limits.

builds() will check the signature of the target (using signature()). If there are required arguments with type annotations and no strategy was passed to builds(), from_type() is used to fill them in. You can also pass the value ... (Ellipsis) as a keyword argument, to force this inference for arguments with a default value.

@given does not perform any implicit inference for required arguments, as this would break compatibility with pytest fixtures. Instead, ... (Ellipsis) can be used as a keyword argument to explicitly fill in an argument from its type annotation. You can also use the hypothesis.infer alias if writing a literal ... seems too weird.

@given(...) can also be specified to fill all arguments from their type annotations.
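The three forms of inference can be sketched as follows, assuming a reasonably recent Hypothesis release and a hypothetical `Point` dataclass whose fields carry type annotations:

```python
from dataclasses import dataclass

from hypothesis import given
from hypothesis.strategies import builds


@dataclass
class Point:
    # Hypothetical example type with annotated, required fields.
    x: int
    y: int


@given(builds(Point))
def test_builds_infers_fields(p):
    # No strategies were passed to builds(), so x and y are inferred from int.
    assert isinstance(p.x, int) and isinstance(p.y, int)


@given(n=...)
def test_single_argument_inferred(n: int):
    # The Ellipsis asks @given to fill n from its type annotation.
    assert isinstance(n, int)


@given(...)
def test_all_arguments_inferred(n: int, s: str):
    # A bare @given(...) fills every argument from its annotation.
    assert isinstance(n, int) and isinstance(s, str)
```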

Limitations

Hypothesis does not inspect PEP 484 type comments at runtime. While from_type() will work as usual, inference in builds() and @given will only work if you manually create the __annotations__ attribute (e.g. by using @annotations(...) and @returns(...) decorators).

The typing module changes between different Python releases, including at minor versions. These are all supported on a best-effort basis, but you may encounter problems. Please report them to us, and consider updating to a newer version of Python as a workaround.

Type annotations in Hypothesis

If you install Hypothesis and use mypy 0.590+, or another PEP 561-compatible tool, the type checker should automatically pick up our type hints.

Hypothesis’ type hints may make breaking changes between minor releases.

Upstream tools and conventions about type hints remain in flux - for example the typing module itself is provisional - and we plan to support the latest version of this ecosystem, as well as older versions where practical.

We may also find more precise ways to describe the type of various interfaces, or change their type and runtime behaviour together in a way which is otherwise backwards-compatible. We often omit type hints for deprecated features or arguments, as an additional form of warning.

There are known issues inferring the type of examples generated by deferred(), recursive(), one_of(), dictionaries(), and fixed_dictionaries(). We will fix these, and require correspondingly newer versions of Mypy for type hinting, as the ecosystem improves.

Writing downstream type hints

Projects that provide Hypothesis strategies and use type hints may wish to annotate their strategies too. This is a supported use-case, again on a best-effort provisional basis. For example:
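A sketch of what such annotations might look like, assuming a hypothetical generic `CustomContainer` type provided by the downstream project:

```python
from typing import Generic, TypeVar

from hypothesis.strategies import SearchStrategy, integers

ExampleType = TypeVar("ExampleType")


class CustomContainer(Generic[ExampleType]):
    # Hypothetical downstream container type.
    def __init__(self, item: ExampleType) -> None:
        self.item = item


def container_strategy(
    elements: SearchStrategy[ExampleType],
) -> SearchStrategy[CustomContainer[ExampleType]]:
    # Wrap each generated element in a CustomContainer.
    return elements.map(CustomContainer)


int_containers: SearchStrategy[CustomContainer[int]] = container_strategy(integers())
```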

SearchStrategy is the type of all strategy objects. It is a generic type, and covariant in the type of the examples it creates. For example:

integers() is of type SearchStrategy[int].

lists(integers()) is of type SearchStrategy[List[int]].

SearchStrategy[Dog] is a subtype of SearchStrategy[Animal] if Dog is a subtype of Animal (as seems likely).

SearchStrategy should only be used in type hints. Please do not inherit from, compare to, or otherwise use it in any way outside of type hints. The only supported way to construct objects of this type is to use the functions provided by the hypothesis.strategies module!

The Hypothesis pytest plugin

Hypothesis includes a tiny plugin to improve integration with pytest, which is activated by default (but does not affect other test runners). It provides extra information and convenient access to config options.

pytest --hypothesis-show-statistics can be used to display test and data generation statistics.

pytest --hypothesis-profile=<profile name> can be used to load a settings profile.

pytest --hypothesis-verbosity=<level name> can be used to override the current verbosity level.

pytest --hypothesis-seed=<an int> can be used to reproduce a failure with a particular seed.

pytest --hypothesis-explain can be used to temporarily enable the explain phase.

Finally, all tests that are defined with Hypothesis automatically have @pytest.mark.hypothesis applied to them. See the pytest documentation for information on working with markers.

Pytest will load the plugin automatically if Hypothesis is installed. You don’t need to do anything at all to use it.

Use with external fuzzers

Sometimes, you might want to point a traditional fuzzer such as python-afl, pythonfuzz, or Google’s atheris (for Python and native extensions) at your code. Wouldn’t it be nice if you could use any of your @given tests as fuzz targets, instead of converting bytestrings into your objects by hand?

Depending on the input to fuzz_one_input , one of three things will happen:

If the bytestring was invalid, for example because it was too short or failed a filter or assume() too many times, fuzz_one_input returns None.

If the bytestring was valid and the test passed, fuzz_one_input returns a canonicalised and pruned buffer which will replay that test case. This is provided as an option to improve the performance of mutating fuzzers, but can safely be ignored.

If the test failed, i.e. raised an exception, fuzz_one_input will add the pruned buffer to the Hypothesis example database and then re-raise that exception. All you need to do to reproduce, minimize, and de-duplicate all the failures found via fuzzing is run your test suite!

Note that the interpretation of both input and output bytestrings is specific to the exact version of Hypothesis you are using and the strategies given to the test, just like the example database and @reproduce_failure decorator.
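A minimal sketch of driving a test this way, assuming `fuzz_one_input` is reached through the `hypothesis` attribute of the decorated test. A fixed buffer stands in for the bytes a real fuzzer would supply, and the test itself is illustrative:

```python
from hypothesis import given
from hypothesis.strategies import integers, lists


@given(lists(integers()))
def test_reversing_twice_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs


# A real fuzzer would supply these bytes; a fixed buffer stands in here.
fuzz_target = test_reversing_twice_is_identity.hypothesis.fuzz_one_input
result = fuzz_target(b"\x00" * 64)

# None means the buffer was rejected; otherwise a replayable buffer comes back.
print(result)
```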

Interaction with settings

fuzz_one_input uses just enough of Hypothesis’ internals to drive your test function with a fuzzer-provided bytestring, and most settings therefore have no effect in this mode. We recommend running your tests the usual way before fuzzing to get the benefits of healthchecks, as well as afterwards to replay, shrink, deduplicate, and report whatever errors were discovered.

The database setting is used by fuzzing mode - adding failures to the database to be replayed when you next run your tests is our preferred reporting mechanism and response to the ‘fuzzer taming’ problem.

The verbosity and stateful_step_count settings work as usual.

The deadline, derandomize, max_examples, phases, print_blob, report_multiple_bugs, and suppress_health_check settings do not affect fuzzing mode.

Thread-Safety Policy

As discussed in issue #2719, Hypothesis is not truly thread-safe and that’s unlikely to change in the future. This policy therefore describes what you can expect if you use Hypothesis with multiple threads.

Running tests in multiple processes, e.g. with pytest -n auto, is fully supported and we test this regularly in CI - thanks to process isolation, we only need to ensure that DirectoryBasedExampleDatabase can’t tread on its own toes too badly. If you find a bug here we will fix it ASAP.

Running separate tests in multiple threads is not something we design or test for, and is not formally supported. That said, anecdotally it does mostly work and we would like it to keep working - we accept reasonable patches and low-priority bug reports. The main risks here are global state, shared caches, and cached strategies.

Using multiple threads within a single test, or running a single test simultaneously in multiple threads, makes it pretty easy to trigger internal errors. We usually accept patches for such issues unless readability or single-thread performance suffer.

Hypothesis assumes that tests are single-threaded, or do a sufficiently-good job of pretending to be single-threaded. Tests that use helper threads internally should be OK, but the user must be careful to ensure that test outcomes are still deterministic. In particular it counts as nondeterministic if helper-thread timing changes the sequence of dynamic draws using e.g. the data() strategy.

Interacting with any Hypothesis APIs from helper threads might do weird/bad things, so avoid that too - we rely on thread-local variables in a few places, and haven’t explicitly tested/audited how they respond to cross-thread API calls. While data() and equivalents are the most obvious danger, other APIs might also be subtly affected.


