model hypothesis testing in r

Hypothesis Tests in R

This tutorial covers basic hypothesis testing in R.

Normality tests
Shapiro-Wilk normality test
Kolmogorov-Smirnov test
Comparing central tendencies: Tests with continuous / discrete data
One-sample t-test : Normally-distributed sample vs. expected mean
Two-sample t-test : Two normally-distributed samples
Wilcoxen rank sum : Two non-normally-distributed samples
Weighted two-sample t-test : Two continuous samples with weights
Comparing proportions: Tests with categorical data
Chi-squared goodness of fit test : Sampled frequencies of categorical values vs. expected frequencies
Chi-squared independence test : Two sampled frequencies of categorical values
Weighted chi-squared independence test : Two weighted sampled frequencies of categorical values
Comparing multiple groups: Tests with categorical and continuous / discrete data
Analysis of Variation (ANOVA) : Normally-distributed samples in groups defined by categorical variable(s)
Kruskal-Wallace One-Way Analysis of Variance : Nonparametric test of the significance of differences between two or more groups

Hypothesis Testing

Science is "knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method" (Merriam-Webster 2022) .

The idealized world of the scientific method is question-driven , with the collection and analysis of data determined by the formulation of research questions and the testing of hypotheses. Hypotheses are tentative assumptions about what the answers to your research questions may be.

Formulate questions: How can I understand some phenomenon?
Literature review: What does existing research say about my questions?
Formulate hypotheses: What do I think the answers to my questions will be?
Collect data: What data can I gather to test my hypothesis?
Test hypotheses: Does the data support my hypothesis?
Communicate results: Who else needs to know about this?
Formulate questions: Frame missing knowledge about a phenomenon as research question(s).
Literature review: A literature review is an investigation of what existing research says about the phenomenon you are studying. A thorough literature review is essential to identify gaps in existing knowledge you can fill, and to avoid unnecessarily duplicating existing research.
Formulate hypotheses: Develop possible answers to your research questions.
Collect data: Acquire data that supports or refutes the hypothesis.
Test hypotheses: Run tools to determine if the data corroborates the hypothesis.
Communicate results: Share your findings with the broader community that might find them useful.

While the process of knowledge production is, in practice, often more iterative than this waterfall model, the testing of hypotheses is usually a fundamental element of scientific endeavors involving quantitative data.

The Problem of Induction

The scientific method looks to the past or present to build a model that can be used to infer what will happen in the future. General knowledge asserts that given a particular set of conditions, a particular outcome will or is likely to occur.

The problem of induction is that we cannot be 100% certain that what we are assuming is a general principle is not, in fact, specific to the particular set of conditions when we made our empirical observations. We cannot prove that that such principles will hold true under future conditions or different locations that we have not yet experienced (Vickers 2014) .

The problem of induction is often associated with the 18th-century British philosopher David Hume . This problem is especially vexing in the study of human beings, where behaviors are a function of complex social interactions that vary over both space and time.

Falsification

One way of addressing the problem of induction was proposed by the 20th-century Viennese philosopher Karl Popper .

Rather than try to prove a hypothesis is true, which we cannot do because we cannot know all possible situations that will arise in the future, we should instead concentrate on falsification , where we try to find situations where a hypothesis is false. While you cannot prove your hypothesis will always be true, you only need to find one situation where the hypothesis is false to demonstrate that the hypothesis can be false (Popper 1962) .

If a hypothesis is not demonstrated to be false by a particular test, we have corroborated that hypothesis. While corroboration does not "prove" anything with 100% certainty, by subjecting a hypothesis to multiple tests that fail to demonstrate that it is false, we can have increasing confidence that our hypothesis reflects reality.

Null and Alternative Hypotheses

In scientific inquiry, we are often concerned with whether a factor we are considering (such as taking a specific drug) results in a specific effect (such as reduced recovery time).

To evaluate whether a factor results in an effect, we will perform an experiment and / or gather data. For example, in a clinical drug trial, half of the test subjects will be given the drug, and half will be given a placebo (something that appears to be the drug but is actually a neutral substance).

Because the data we gather will usually only be a portion (sample) of total possible people or places that could be affected (population), there is a possibility that the sample is unrepresentative of the population. We use a statistical test that considers that uncertainty when assessing whether an effect is associated with a factor.

Statistical testing begins with an alternative hypothesis (H 1 ) that states that the factor we are considering results in a particular effect. The alternative hypothesis is based on the research question and the type of statistical test being used.
Because of the problem of induction , we cannot prove our alternative hypothesis. However, under the concept of falsification , we can evaluate the data to see if there is a significant probability that our data falsifies our alternative hypothesis (Wilkinson 2012) .
The null hypothesis (H 0 ) states that the factor has no effect. The null hypothesis is the opposite of the alternative hypothesis. The null hypothesis is what we are testing when we perform a hypothesis test.

The output of a statistical test like the t-test is a p -value. A p -value is the probability that any effects we see in the sampled data are the result of random sampling error (chance).

If a p -value is greater than the significance level (0.05 for 5% significance) we fail to reject the null hypothesis since there is a significant possibility that our results falsify our alternative hypothesis.
If a p -value is lower than the significance level (0.05 for 5% significance) we reject the null hypothesis and have corroborated (provided evidence for) our alternative hypothesis.

The calculation and interpretation of the p -value goes back to the central limit theorem , which states that random sampling error has a normal distribution.

Using our example of a clinical drug trial, if the mean recovery times for the two groups are close enough together that there is a significant possibility ( p > 0.05) that the recovery times are the same (falsification), we fail to reject the null hypothesis.

However, if the mean recovery times for the two groups are far enough apart that the probability they are the same is under the level of significance ( p < 0.05), we reject the null hypothesis and have corroborated our alternative hypothesis.

Significance means that an effect is "probably caused by something other than mere chance" (Merriam-Webster 2022) .

The significance level (α) is the threshold for significance and, by convention, is usually 5%, 10%, or 1%, which corresponds to 95% confidence, 90% confidence, or 99% confidence, respectively.
A factor is considered statistically significant if the probability that the effect we see in the data is a result of random sampling error (the p -value) is below the chosen significance level.
A statistical test is used to evaluate whether a factor being considered is statistically significant (Gallo 2016) .

Type I vs. Type II Errors

Although we are making a binary choice between rejecting and failing to reject the null hypothesis, because we are using sampled data, there is always the possibility that the choice we have made is an error.

There are two types of errors that can occur in hypothesis testing.

Type I error (false positive) occurs when a low p -value causes us to reject the null hypothesis, but the factor does not actually result in the effect.
Type II error (false negative) occurs when a high p -value causes us to fail to reject the null hypothesis, but the factor does actually result in the effect.

The numbering of the errors reflects the predisposition of the scientific method to be fundamentally skeptical . Accepting a fact about the world as true when it is not true is considered worse than rejecting a fact about the world that actually is true.

Statistical Significance vs. Importance

When we fail to reject the null hypothesis, we have found information that is commonly called statistically significant . But there are multiple challenges with this terminology.

First, statistical significance is distinct from importance (NIST 2012) . For example, if sampled data reveals a statistically significant difference in cancer rates, that does not mean that the increased risk is important enough to justify expensive mitigation measures. All statistical results require critical interpretation within the context of the phenomenon being observed. People with different values and incentives can have different interpretations of whether statistically significant results are important.

Second, the use of 95% probability for defining confidence intervals is an arbitrary convention. This creates a good vs. bad binary that suggests a "finality and certitude that are rarely justified." Alternative approaches like Beyesian statistics that express results as probabilities can offer more nuanced ways of dealing with complexity and uncertainty (Clayton 2022) .

Science vs. Non-science

Not all ideas can be falsified, and Popper uses the distinction between falsifiable and non-falsifiable ideas to make a distinction between science and non-science. In order for an idea to be science it must be an idea that can be demonstrated to be false.

While Popper asserts there is still value in ideas that are not falsifiable, such ideas are not science in his conception of what science is. Such non-science ideas often involve questions of subjective values or unseen forces that are complex, amorphous, or difficult to objectively observe.

Example Data

As example data, this tutorial will use a table of anonymized individual responses from the CDC's Behavioral Risk Factor Surveillance System . The BRFSS is a "system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services" (CDC 2019) .

A CSV file with the selected variables used in this tutorial is available here and can be imported into R with read.csv() .

Guidance on how to download and process this data directly from the CDC website is available here...

Variable Types

The publicly-available BRFSS data contains a wide variety of discrete, ordinal, and categorical variables. Variables often contain special codes for non-responsiveness or missing (NA) values. Examples of how to clean these variables are given here...

The BRFSS has a codebook that gives the survey questions associated with each variable, and the way that responses are encoded in the variable values.

Normality Tests

Tests are commonly divided into two groups depending on whether they are built on the assumption that the continuous variable has a normal distribution.

Parametric tests presume a normal distribution.
Non-parametric tests can work with normal and non-normal distributions.

The distinction between parametric and non-parametric techniques is especially important when working with small numbers of samples (less than 40 or so) from a larger population.

The normality tests given below do not work with large numbers of values, but with many statistical techniques, violations of normality assumptions do not cause major problems when large sample sizes are used. (Ghasemi and Sahediasi 2012) .

The Shapiro-Wilk Normality Test

Data: A continuous or discrete sampled variable
R Function: shapiro.test()
Null hypothesis (H 0 ): The population distribution from which the sample is drawn is not normal
History: Samuel Sanford Shapiro and Martin Wilk (1965)

This is an example with random values from a normal distribution.

This is an example with random values from a uniform (non-normal) distribution.

The Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov is a more-generalized test than the Shapiro-Wilks test that can be used to test whether a sample is drawn from any type of distribution.

Data: A continuous or discrete sampled variable and a reference probability distribution
R Function: ks.test()
Null hypothesis (H 0 ): The population distribution from which the sample is drawn does not match the reference distribution
History: Andrey Kolmogorov (1933) and Nikolai Smirnov (1948)
pearson.test() The Pearson Chi-square Normality Test from the nortest library. Lower p-values (closer to 0) means to reject the reject the null hypothesis that the distribution IS normal.

Modality Tests of Samples

Comparing two central tendencies: tests with continuous / discrete data, one sample t-test (two-sided).

The one-sample t-test tests the significance of the difference between the mean of a sample and an expected mean.

Data: A continuous or discrete sampled variable and a single expected mean (μ)
Parametric (normal distributions)
R Function: t.test()
Null hypothesis (H 0 ): The means of the sampled distribution matches the expected mean.
History: William Sealy Gosset (1908)

t = ( Χ - μ) / (σ̂ / √ n )

t : The value of t used to find the p-value
Χ : The sample mean
μ: The population mean
σ̂: The estimate of the standard deviation of the population (usually the stdev of the sample
n : The sample size

T-tests should only be used when the population is at least 20 times larger than its respective sample. If the sample size is too large, the low p-value makes the insignificant look significant. .

For example, we test a hypothesis that the mean weight in IL in 2020 is different than the 2005 continental mean weight.

Walpole et al. (2012) estimated that the average adult weight in North America in 2005 was 178 pounds. We could presume that Illinois is a comparatively normal North American state that would follow the trend of both increased age and increased weight (CDC 2021) .

The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight changed between 2005 and 2020 in Illinois.

One Sample T-Test (One-Sided)

Because we were expecting an increase, we can modify our hypothesis that the mean weight in 2020 is higher than the continental weight in 2005. We can perform a one-sided t-test using the alternative="greater" parameter.

The low p-value leads us to again reject the null hypothesis and corroborate our alternative hypothesis that mean weight in 2020 is higher than the continental weight in 2005.

Note that this does not clearly evaluate whether weight increased specifically in Illinois, or, if it did, whether that was caused by an aging population or decreasingly healthy diets. Hypotheses based on such questions would require more detailed analysis of individual data.

Although we can see that the mean cancer incidence rate is higher for counties near nuclear plants, there is the possiblity that the difference in means happened by accident and the nuclear plants have nothing to do with those higher rates.

The t-test allows us to test a hypothesis. Note that a t-test does not "prove" or "disprove" anything. It only gives the probability that the differences we see between two areas happened by chance. It also does not evaluate whether there are other problems with the data, such as a third variable, or inaccurate cancer incidence rate estimates.

Note that this does not prove that nuclear power plants present a higher cancer risk to their neighbors. It simply says that the slightly higher risk is probably not due to chance alone. But there are a wide variety of other other related or unrelated social, environmental, or economic factors that could contribute to this difference.

Box-and-Whisker Chart

One visualization commonly used when comparing distributions (collections of numbers) is a box-and-whisker chart. The boxes show the range of values in the middle 25% to 50% to 75% of the distribution and the whiskers show the extreme high and low values.

Although Google Sheets does not provide the capability to create box-and-whisker charts, Google Sheets does have candlestick charts , which are similar to box-and-whisker charts, and which are normally used to display the range of stock price changes over a period of time.

This video shows how to create a candlestick chart comparing the distributions of cancer incidence rates. The QUARTILE() function gets the values that divide the distribution into four equally-sized parts. This shows that while the range of incidence rates in the non-nuclear counties are wider, the bulk of the rates are below the rates in nuclear counties, giving a visual demonstration of the numeric output of our t-test.

While categorical data can often be reduced to dichotomous data and used with proportions tests or t-tests, there are situations where you are sampling data that falls into more than two categories and you would like to make hypothesis tests about those categories. This tutorial describes a group of tests that can be used with that type of data.

Two-Sample T-Test

When comparing means of values from two different groups in your sample, a two-sample t-test is in order.

The two-sample t-test tests the significance of the difference between the means of two different samples.

Two normally-distributed, continuous or discrete sampled variables, OR
A normally-distributed continuous or sampled variable and a parallel dichotomous variable indicating what group each of the values in the first variable belong to
Null hypothesis (H 0 ): The means of the two sampled distributions are equal.

For example, given the low incomes and delicious foods prevalent in Mississippi, we might presume that average weight in Mississippi would be higher than in Illinois.

We test a hypothesis that the mean weight in IL in 2020 is less than the 2020 mean weight in Mississippi.

The low p-value leads us to reject the null hypothesis and corroborate our alternative hypothesis that mean weight in Illinois is less than in Mississippi.

While the difference in means is statistically significant, it is small (182 vs. 187), which should lead to caution in interpretation that you avoid using your analysis simply to reinforce unhelpful stigmatization.

Wilcoxen Rank Sum Test (Mann-Whitney U-Test)

The Wilcoxen rank sum test tests the significance of the difference between the means of two different samples. This is a non-parametric alternative to the t-test.

Data: Two continuous sampled variables
Non-parametric (normal or non-normal distributions)
R Function: wilcox.test()
Null hypothesis (H 0 ): For randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.
History: Frank Wilcoxon (1945) and Henry Mann and Donald Whitney (1947)

The test is is implemented with the wilcox.test() function.

When the test is performed on one sample in comparison to an expected value around which the distribution is symmetrical (μ), the test is known as a Mann-Whitney U test .
When the test is performed to compare two samples, the test is known as a Wilcoxon rank sum test .

For this example, we will use AVEDRNK3: During the past 30 days, on the days when you drank, about how many drinks did you drink on the average?

1 - 76: Number of drinks
77: Don’t know/Not sure
99: Refused
NA: Not asked or Missing

The histogram clearly shows this to be a non-normal distribution.

Continuing the comparison of Illinois and Mississippi from above, we might presume that with all that warm weather and excellent food in Mississippi, they might be inclined to drink more. The means of average number of drinks per month seem to suggest that Mississippians do drink more than Illinoians.

We can test use wilcox.test() to test a hypothesis that the average amount of drinking in Illinois is different than in Mississippi. Like the t-test, the alternative can be specified as two-sided or one-sided, and for this example we will test whether the sampled Illinois value is indeed less than the Mississippi value.

The low p-value leads us to reject the null hypothesis and corroborates our hypothesis that average drinking is lower in Illinois than in Mississippi. As before, this tells us nothing about why this is the case.

Weighted Two-Sample T-Test

The downloadable BRFSS data is raw, anonymized survey data that is biased by uneven geographic coverage of survey administration (noncoverage) and lack of responsiveness from some segments of the population (nonresponse). The X_LLCPWT field (landline, cellphone weighting) is a weighting factor added by the CDC that can be assigned to each response to compensate for these biases.

The wtd.t.test() function from the weights library has a weights parameter that can be used to include a weighting factor as part of the t-test.

Comparing Proportions: Tests with Categorical Data

Chi-squared goodness of fit.

Tests the significance of the difference between sampled frequencies of different values and expected frequencies of those values
Data: A categorical sampled variable and a table of expected frequencies for each of the categories
R Function: chisq.test()
Null hypothesis (H 0 ): The relative proportions of categories in one variable are different from the expected proportions
History: Karl Pearson (1900)
Example Question: Are the voting preferences of voters in my district significantly different from the current national polls?

For example, we test a hypothesis that smoking rates changed between 2000 and 2020.

In 2000, the estimated rate of adult smoking in Illinois was 22.3% (Illinois Department of Public Health 2004) .

The variable we will use is SMOKDAY2: Do you now smoke cigarettes every day, some days, or not at all?

1: Current smoker - now smokes every day
2: Current smoker - now smokes some days
3: Not at all
7: Don't know
NA: Not asked or missing - NA is used for people who have never smoked

We subset only yes/no responses in Illinois and convert into a dummy variable (yes = 1, no = 0).

The listing of the table as percentages indicates that smoking rates were halved between 2000 and 2020, but since this is sampled data, we need to run a chi-squared test to make sure the difference can't be explained by the randomness of sampling.

In this case, the very low p-value leads us to reject the null hypothesis and corroborates the alternative hypothesis that smoking rates changed between 2000 and 2020.

Chi-Squared Contingency Analysis / Test of Independence

Tests the significance of the difference between frequencies between two different groups
Data: Two categorical sampled variables
Null hypothesis (H 0 ): The relative proportions of one variable are independent of the second variable.

We can also compare categorical proportions between two sets of sampled categorical variables.

The chi-squared test can is used to determine if two categorical variables are independent. What is passed as the parameter is a contingency table created with the table() function that cross-classifies the number of rows that are in the categories specified by the two categorical variables.

The null hypothesis with this test is that the two categories are independent. The alternative hypothesis is that there is some dependency between the two categories.

For this example, we can compare the three categories of smokers (daily = 1, occasionally = 2, never = 3) across the two categories of states (Illinois and Mississippi).

The low p-value leads us to reject the null hypotheses that the categories are independent and corroborates our hypotheses that smoking behaviors in the two states are indeed different.

p-value = 1.516e-09

Weighted Chi-Squared Contingency Analysis

As with the weighted t-test above, the weights library contains the wtd.chi.sq() function for incorporating weighting into chi-squared contingency analysis.

As above, the even lower p-value leads us to again reject the null hypothesis that smoking behaviors are independent in the two states.

Suppose that the Macrander campaign would like to know how partisan this election is. If people are largely choosing to vote along party lines, the campaign will seek to get their base voters out to the polls. If people are splitting their ticket, the campaign may focus their efforts more broadly.

In the example below, the Macrander campaign took a small poll of 30 people asking who they wished to vote for AND what party they most strongly affiliate with.

The output of table() shows fairly strong relationship between party affiliation and candidates. Democrats tend to vote for Macrander, while Republicans tend to vote for Stewart, while independents all vote for Miller.

This is reflected in the very low p-value from the chi-squared test. This indicates that there is a very low probability that the two categories are independent. Therefore we reject the null hypothesis.

In contrast, suppose that the poll results had showed there were a number of people crossing party lines to vote for candidates outside their party. The simulated data below uses the runif() function to randomly choose 50 party names.

The contingency table() shows no clear relationship between party affiliation and candidate. This is validated quantitatively by the chi-squared test. The fairly high p-value of 0.4018 indicates a 40% chance that the two categories are independent. Therefore, we fail to reject the null hypothesis and the campaign should focus their efforts on the broader electorate.

The warning message given by the chisq.test() function indicates that the sample size is too small to make an accurate analysis. The simulate.p.value = T parameter adds Monte Carlo simulation to the test to improve the estimation and get rid of the warning message. However, the best way to get rid of this message is to get a larger sample.

Comparing Categorical and Continuous Variables

Analysis of variation (anova).

Analysis of Variance (ANOVA) is a test that you can use when you have a categorical variable and a continuous variable. It is a test that considers variability between means for different categories as well as the variability of observations within groups.

There are a wide variety of different extensions of ANOVA that deal with covariance (ANCOVA), multiple variables (MANOVA), and both of those together (MANCOVA). These techniques can become quite complicated and also assume that the values in the continuous variables have a normal distribution.

Data: One or more categorical (independent) variables and one continuous (dependent) sampled variable
R Function: aov()
Null hypothesis (H 0 ): There is no difference in means of the groups defined by each level of the categorical (independent) variable
History: Ronald Fisher (1921)
Example Question: Do low-, middle- and high-income people vary in the amount of time they spend watching TV?

As an example, we look at the continuous weight variable (WEIGHT2) split into groups by the eight income categories in INCOME2: Is your annual household income from all sources?

1: Less than $10,000
2: $10,000 to less than $15,000
3: $15,000 to less than $20,000
4: $20,000 to less than $25,000
5: $25,000 to less than $35,000
6: $35,000 to less than $50,000
7: $50,000 to less than $75,000)
8: $75,000 or more

The barplot() of means does show variation among groups, although there is no clear linear relationship between income and weight.

To test whether this variation could be explained by randomness in the sample, we run the ANOVA test.

The low p-value leads us to reject the null hypothesis that there is no difference in the means of the different groups, and corroborates the alternative hypothesis that mean weights differ based on income group.

However, it gives us no clear model for describing that relationship and offers no insights into why income would affect weight, especially in such a nonlinear manner.

Suppose you are performing research into obesity in your city. You take a sample of 30 people in three different neighborhoods (90 people total), collecting information on health and lifestyle. Two variables you collect are height and weight so you can calculate body mass index . Although this index can be misleading for some populations (notably very athletic people), ordinary sedentary people can be classified according to BMI:

Average BMI in the US from 2007-2010 was around 28.6 and rising, standard deviation of around 5 .

You would like to know if there is a difference in BMI between different neighborhoods so you can know whether to target specific neighborhoods or make broader city-wide efforts. Since you have more than two groups, you cannot use a t-test().

Kruskal-Wallace One-Way Analysis of Variance

A somewhat simpler test is the Kruskal-Wallace test which is a nonparametric analogue to ANOVA for testing the significance of differences between two or more groups.

R Function: kruskal.test()
Null hypothesis (H 0 ): The samples come from the same distribution.
History: William Kruskal and W. Allen Wallis (1952)

For this example, we will investigate whether mean weight varies between the three major US urban states: New York, Illinois, and California.

To test whether this variation could be explained by randomness in the sample, we run the Kruskal-Wallace test.

The low p-value leads us to reject the null hypothesis that the samples come from the same distribution. This corroborates the alternative hypothesis that mean weights differ based on state.

A convienent way of visualizing a comparison between continuous and categorical data is with a box plot , which shows the distribution of a continuous variable across different groups:

A percentile is the level at which a given percentage of the values in the distribution are below: the 5th percentile means that five percent of the numbers are below that value.

The quartiles divide the distribution into four parts. 25% of the numbers are below the first quartile. 75% are below the third quartile. 50% are below the second quartile, making it the median.

Box plots can be used with both sampled data and population data.

The first parameter to the box plot is a formula: the continuous variable as a function of (the tilde) the second variable. A data= parameter can be added if you are using variables in a data frame.

The chi-squared test can be used to determine if two categorical variables are independent of each other.

The Complete Guide: Hypothesis Testing in R

A hypothesis test is a formal statistical test we use to reject or fail to reject some statistical hypothesis.

This tutorial explains how to perform the following hypothesis tests in R:

One sample t-test
Two sample t-test
Paired samples t-test

We can use the t.test() function in R to perform each type of test:

x, y: The two samples of data.
alternative: The alternative hypothesis of the test.
mu: The true value of the mean.
paired: Whether to perform a paired t-test or not.
var.equal: Whether to assume the variances are equal between the samples.
conf.level: The confidence level to use.

The following examples show how to use this function in practice.

Example 1: One Sample t-test in R

A one sample t-test is used to test whether or not the mean of a population is equal to some value.

For example, suppose we want to know whether or not the mean weight of a certain species of some turtle is equal to 310 pounds. We go out and collect a simple random sample of turtles with the following weights:

Weights : 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303

The following code shows how to perform this one sample t-test in R:

From the output we can see:

t-test statistic: -1.5848
degrees of freedom: 12
p-value: 0.139
95% confidence interval for true mean: [303.4236, 311.0379]
mean of turtle weights: 307.230

Since the p-value of the test (0.139) is not less than .05, we fail to reject the null hypothesis.

This means we do not have sufficient evidence to say that the mean weight of this species of turtle is different from 310 pounds.

Example 2: Two Sample t-test in R

A two sample t-test is used to test whether or not the means of two populations are equal.

For example, suppose we want to know whether or not the mean weight between two different species of turtles is equal. To test this, we collect a simple random sample of turtles from each species with the following weights:

Sample 1 : 300, 315, 320, 311, 314, 309, 300, 308, 305, 303, 305, 301, 303

Sample 2 : 335, 329, 322, 321, 324, 319, 304, 308, 305, 311, 307, 300, 305

The following code shows how to perform this two sample t-test in R:

t-test statistic: -2.1009
degrees of freedom: 19.112
p-value: 0.04914
95% confidence interval for true mean difference: [-14.74, -0.03]
mean of sample 1 weights: 307.2308
mean of sample 2 weights: 314.6154

Since the p-value of the test (0.04914) is less than .05, we reject the null hypothesis.

This means we have sufficient evidence to say that the mean weight between the two species is not equal.

Example 3: Paired Samples t-test in R

A paired samples t-test is used to compare the means of two samples when each observation in one sample can be paired with an observation in the other sample.

For example, suppose we want to know whether or not a certain training program is able to increase the max vertical jump (in inches) of basketball players.

To test this, we may recruit a simple random sample of 12 college basketball players and measure each of their max vertical jumps. Then, we may have each player use the training program for one month and then measure their max vertical jump again at the end of the month.

The following data shows the max jump height (in inches) before and after using the training program for each player:

Before : 22, 24, 20, 19, 19, 20, 22, 25, 24, 23, 22, 21

After : 23, 25, 20, 24, 18, 22, 23, 28, 24, 25, 24, 20

The following code shows how to perform this paired samples t-test in R:

t-test statistic: -2.5289
degrees of freedom: 11
p-value: 0.02803
95% confidence interval for true mean difference: [-2.34, -0.16]
mean difference between before and after: -1.25

Since the p-value of the test (0.02803) is less than .05, we reject the null hypothesis.

This means we have sufficient evidence to say that the mean jump height before and after using the training program is not equal.

Additional Resources

Use the following online calculators to automatically perform various t-tests:

One Sample t-test Calculator Two Sample t-test Calculator Paired Samples t-test Calculator

How to Calculate Mode from Frequency Table (With Examples)

How to use write.table in r (with examples), related posts, how to create a stem-and-leaf plot in spss, how to create a correlation matrix in spss, how to convert date of birth to age..., excel: how to highlight entire row based on..., how to add target line to graph in..., excel: how to use if function with negative..., excel: how to use if function with text..., excel: how to use greater than or equal..., excel: how to use if function with multiple..., how to extract number from string in pandas.

An R Introduction to Statistics

Hypothesis Testing

$fractal-10h$

In the following tutorials, we demonstrate the procedure of hypothesis testing in R first with the intuitive critical value approach. Then we discuss the popular p-value approach as alternative.

Lower Tail Test of Population Mean with Known Variance
Upper Tail Test of Population Mean with Known Variance
Two-Tailed Test of Population Mean with Known Variance
Lower Tail Test of Population Mean with Unknown Variance
Upper Tail Test of Population Mean with Unknown Variance
Two-Tailed Test of Population Mean with Unknown Variance
Lower Tail Test of Population Proportion
Upper Tail Test of Population Proportion
Two-Tailed Test of Population Proportion
Elementary Statistics with R
hypothesis testing
significance level
type I error

R Tutorial eBook

R Tutorials

Combining Vectors
Vector Arithmetics
Vector Index
Numeric Index Vector
Logical Index Vector
Named Vector Members
Matrix Construction
Named List Members
Data Frame Column Vector
Data Frame Column Slice
Data Frame Row Slice
Data Import
Frequency Distribution of Qualitative Data
Relative Frequency Distribution of Qualitative Data
Category Statistics
Frequency Distribution of Quantitative Data
Relative Frequency Distribution of Quantitative Data
Cumulative Frequency Distribution
Cumulative Frequency Graph
Cumulative Relative Frequency Distribution
Cumulative Relative Frequency Graph
Stem-and-Leaf Plot
Scatter Plot
Interquartile Range
Standard Deviation
Correlation Coefficient
Central Moment
Binomial Distribution
Poisson Distribution
Continuous Uniform Distribution
Exponential Distribution
Normal Distribution
Chi-squared Distribution
Student t Distribution
F Distribution
Point Estimate of Population Mean
Interval Estimate of Population Mean with Known Variance
Interval Estimate of Population Mean with Unknown Variance
Sampling Size of Population Mean
Point Estimate of Population Proportion
Interval Estimate of Population Proportion
Sampling Size of Population Proportion
Type II Error in Lower Tail Test of Population Mean with Known Variance
Type II Error in Upper Tail Test of Population Mean with Known Variance
Type II Error in Two-Tailed Test of Population Mean with Known Variance
Type II Error in Lower Tail Test of Population Mean with Unknown Variance
Type II Error in Upper Tail Test of Population Mean with Unknown Variance
Type II Error in Two-Tailed Test of Population Mean with Unknown Variance
Population Mean Between Two Matched Samples
Population Mean Between Two Independent Samples
Comparison of Two Population Proportions
Multinomial Goodness of Fit
Chi-squared Test of Independence
Completely Randomized Design
Randomized Block Design
Factorial Design
Wilcoxon Signed-Rank Test
Mann-Whitney-Wilcoxon Test
Kruskal-Wallis Test
Estimated Simple Regression Equation
Coefficient of Determination
Significance Test for Linear Regression
Confidence Interval for Linear Regression
Prediction Interval for Linear Regression
Residual Plot
Standardized Residual
Normal Probability Plot of Residuals
Estimated Multiple Regression Equation
Multiple Coefficient of Determination
Adjusted Coefficient of Determination
Significance Test for MLR
Confidence Interval for MLR
Prediction Interval for MLR
Estimated Logistic Regression Equation
Significance Test for Logistic Regression
Distance Matrix by GPU
Hierarchical Cluster Analysis
Kendall Rank Coefficient
Significance Test for Kendall's Tau-b
Support Vector Machine with GPU
Support Vector Machine with GPU, Part II
Bayesian Classification with Gaussian Process
Hierarchical Linear Model
Installing GPU Packages

Want to create or adapt books like this? Learn more about how Pressbooks supports open publishing practices.

11 Hypothesis testing

The process of induction is the process of assuming the simplest law that can be made to harmonize with our experience. This process, however, has no logical foundation but only a psychological one. It is clear that there are no grounds for believing that the simplest course of events will really happen. It is an hypothesis that the sun will rise tomorrow: and this means that we do not know whether it will rise. – Ludwig Wittgenstein 157

In the last chapter, I discussed the ideas behind estimation, which is one of the two “big ideas” in inferential statistics. It’s now time to turn out attention to the other big idea, which is hypothesis testing . In its most abstract form, hypothesis testing really a very simple idea: the researcher has some theory about the world, and wants to determine whether or not the data actually support that theory. However, the details are messy, and most people find the theory of hypothesis testing to be the most frustrating part of statistics. The structure of the chapter is as follows. Firstly, I’ll describe how hypothesis testing works, in a fair amount of detail, using a simple running example to show you how a hypothesis test is “built”. I’ll try to avoid being too dogmatic while doing so, and focus instead on the underlying logic of the testing procedure. 158 Afterwards, I’ll spend a bit of time talking about the various dogmas, rules and heresies that surround the theory of hypothesis testing.

11.1 A menagerie of hypotheses

Eventually we all succumb to madness. For me, that day will arrive once I’m finally promoted to full professor. Safely ensconced in my ivory tower, happily protected by tenure, I will finally be able to take leave of my senses (so to speak), and indulge in that most thoroughly unproductive line of psychological research: the search for extrasensory perception (ESP). 159

Let’s suppose that this glorious day has come. My first study is a simple one, in which I seek to test whether clairvoyance exists. Each participant sits down at a table, and is shown a card by an experimenter. The card is black on one side and white on the other. The experimenter takes the card away, and places it on a table in an adjacent room. The card is placed black side up or white side up completely at random, with the randomisation occurring only after the experimenter has left the room with the participant. A second experimenter comes in and asks the participant which side of the card is now facing upwards. It’s purely a one-shot experiment. Each person sees only one card, and gives only one answer; and at no stage is the participant actually in contact with someone who knows the right answer. My data set, therefore, is very simple. I have asked the question of $N$ people, and some number $X$ of these people have given the correct response. To make things concrete, let’s suppose that I have tested $N = 100$ people, and $X = 62$ of these got the answer right… a surprisingly large number, sure, but is it large enough for me to feel safe in claiming I’ve found evidence for ESP? This is the situation where hypothesis testing comes in useful. However, before we talk about how to test hypotheses, we need to be clear about what we mean by hypotheses.

11.1.1 Research hypotheses versus statistical hypotheses

The first distinction that you need to keep clear in your mind is between research hypotheses and statistical hypotheses. In my ESP study, my overall scientific goal is to demonstrate that clairvoyance exists. In this situation, I have a clear research goal: I am hoping to discover evidence for ESP. In other situations I might actually be a lot more neutral than that, so I might say that my research goal is to determine whether or not clairvoyance exists. Regardless of how I want to portray myself, the basic point that I’m trying to convey here is that a research hypothesis involves making a substantive, testable scientific claim… if you are a psychologist, then your research hypotheses are fundamentally about psychological constructs. Any of the following would count as research hypotheses :

Listening to music reduces your ability to pay attention to other things. This is a claim about the causal relationship between two psychologically meaningful concepts (listening to music and paying attention to things), so it’s a perfectly reasonable research hypothesis.
Intelligence is related to personality . Like the last one, this is a relational claim about two psychological constructs (intelligence and personality), but the claim is weaker: correlational not causal.
Intelligence is* speed of information processing . This hypothesis has a quite different character: it’s not actually a relational claim at all. It’s an ontological claim about the fundamental character of intelligence (and I’m pretty sure it’s wrong). It’s worth expanding on this one actually: It’s usually easier to think about how to construct experiments to test research hypotheses of the form “does X affect Y?” than it is to address claims like “what is X?” And in practice, what usually happens is that you find ways of testing relational claims that follow from your ontological ones. For instance, if I believe that intelligence is* speed of information processing in the brain, my experiments will often involve looking for relationships between measures of intelligence and measures of speed. As a consequence, most everyday research questions do tend to be relational in nature, but they’re almost always motivated by deeper ontological questions about the state of nature.

Notice that in practice, my research hypotheses could overlap a lot. My ultimate goal in the ESP experiment might be to test an ontological claim like “ESP exists”, but I might operationally restrict myself to a narrower hypothesis like “Some people can `see’ objects in a clairvoyant fashion”. That said, there are some things that really don’t count as proper research hypotheses in any meaningful sense:

Love is a battlefield . This is too vague to be testable. While it’s okay for a research hypothesis to have a degree of vagueness to it, it has to be possible to operationalise your theoretical ideas. Maybe I’m just not creative enough to see it, but I can’t see how this can be converted into any concrete research design. If that’s true, then this isn’t a scientific research hypothesis, it’s a pop song. That doesn’t mean it’s not interesting – a lot of deep questions that humans have fall into this category. Maybe one day science will be able to construct testable theories of love, or to test to see if God exists, and so on; but right now we can’t, and I wouldn’t bet on ever seeing a satisfying scientific approach to either.
The first rule of tautology club is the first rule of tautology club . This is not a substantive claim of any kind. It’s true by definition. No conceivable state of nature could possibly be inconsistent with this claim. As such, we say that this is an unfalsifiable hypothesis, and as such it is outside the domain of science. Whatever else you do in science, your claims must have the possibility of being wrong.
More people in my experiment will say “yes” than “no” . This one fails as a research hypothesis because it’s a claim about the data set, not about the psychology (unless of course your actual research question is whether people have some kind of “yes” bias!). As we’ll see shortly, this hypothesis is starting to sound more like a statistical hypothesis than a research hypothesis.

As you can see, research hypotheses can be somewhat messy at times; and ultimately they are scientific claims. Statistical hypotheses are neither of these two things. Statistical hypotheses must be mathematically precise, and they must correspond to specific claims about the characteristics of the data generating mechanism (i.e., the “population”). Even so, the intent is that statistical hypotheses bear a clear relationship to the substantive research hypotheses that you care about! For instance, in my ESP study my research hypothesis is that some people are able to see through walls or whatever. What I want to do is to “map” this onto a statement about how the data were generated. So let’s think about what that statement would be. The quantity that I’m interested in within the experiment is $P(\mbox{“correct”})$ , the true-but-unknown probability with which the participants in my experiment answer the question correctly. Let’s use the Greek letter $\theta$ (theta) to refer to this probability. Here are four different statistical hypotheses:

If ESP doesn’t exist and if my experiment is well designed, then my participants are just guessing. So I should expect them to get it right half of the time and so my statistical hypothesis is that the true probability of choosing correctly is $\theta = 0.5$ .
Alternatively, suppose ESP does exist and participants can see the card. If that’s true, people will perform better than chance. The statistical hypotheis would be that $\theta > 0.5$ .
A third possibility is that ESP does exist, but the colours are all reversed and people don’t realise it (okay, that’s wacky, but you never know…). If that’s how it works then you’d expect people’s performance to be below chance. This would correspond to a statistical hypothesis that $\theta < 0.5$ .
Finally, suppose ESP exists, but I have no idea whether people are seeing the right colour or the wrong one. In that case, the only claim I could make about the data would be that the probability of making the correct answer is not equal to 50. This corresponds to the statistical hypothesis that $\theta \neq 0.5$ .

All of these are legitimate examples of a statistical hypothesis because they are statements about a population parameter and are meaningfully related to my experiment.

What this discussion makes clear, I hope, is that when attempting to construct a statistical hypothesis test the researcher actually has two quite distinct hypotheses to consider. First, he or she has a research hypothesis (a claim about psychology), and this corresponds to a statistical hypothesis (a claim about the data generating population). In my ESP example, these might be

And the key thing to recognise is this: a statistical hypothesis test is a test of the statistical hypothesis, not the research hypothesis . If your study is badly designed, then the link between your research hypothesis and your statistical hypothesis is broken. To give a silly example, suppose that my ESP study was conducted in a situation where the participant can actually see the card reflected in a window; if that happens, I would be able to find very strong evidence that $\theta \neq 0.5$ , but this would tell us nothing about whether “ESP exists”.

11.1.2 Null hypotheses and alternative hypotheses

So far, so good. I have a research hypothesis that corresponds to what I want to believe about the world, and I can map it onto a statistical hypothesis that corresponds to what I want to believe about how the data were generated. It’s at this point that things get somewhat counterintuitive for a lot of people. Because what I’m about to do is invent a new statistical hypothesis (the “null” hypothesis, $H_0$ ) that corresponds to the exact opposite of what I want to believe, and then focus exclusively on that, almost to the neglect of the thing I’m actually interested in (which is now called the “alternative” hypothesis, $H_1$ ). In our ESP example, the null hypothesis is that $\theta = 0.5$ , since that’s what we’d expect if ESP didn’t exist. My hope, of course, is that ESP is totally real, and so the alternative to this null hypothesis is $\theta \neq 0.5$ . In essence, what we’re doing here is dividing up the possible values of $\theta$ into two groups: those values that I really hope aren’t true (the null), and those values that I’d be happy with if they turn out to be right (the alternative). Having done so, the important thing to recognise is that the goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Most people find this pretty weird.

The best way to think about it, in my experience, is to imagine that a hypothesis test is a criminal trial 160 … the trial of the null hypothesis . The null hypothesis is the defendant, the researcher is the prosecutor, and the statistical test itself is the judge. Just like a criminal trial, there is a presumption of innocence: the null hypothesis is deemed to be true unless you, the researcher, can prove beyond a reasonable doubt that it is false. You are free to design your experiment however you like (within reason, obviously!), and your goal when doing so is to maximise the chance that the data will yield a conviction… for the crime of being false. The catch is that the statistical test sets the rules of the trial, and those rules are designed to protect the null hypothesis – specifically to ensure that if the null hypothesis is actually true, the chances of a false conviction are guaranteed to be low. This is pretty important: after all, the null hypothesis doesn’t get a lawyer. And given that the researcher is trying desperately to prove it to be false, someone has to protect it.

11.2 Two types of errors

Before going into details about how a statistical test is constructed, it’s useful to understand the philosophy behind it. I hinted at it when pointing out the similarity between a null hypothesis test and a criminal trial, but I should now be explicit. Ideally, we would like to construct our test so that we never make any errors. Unfortunately, since the world is messy, this is never possible. Sometimes you’re just really unlucky: for instance, suppose you flip a coin 10 times in a row and it comes up heads all 10 times. That feels like very strong evidence that the coin is biased (and it is!), but of course there’s a 1 in 1024 chance that this would happen even if the coin was totally fair. In other words, in real life we always have to accept that there’s a chance that we did the wrong thing. As a consequence, the goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them.

At this point, we need to be a bit more precise about what we mean by “errors”. Firstly, let’s state the obvious: it is either the case that the null hypothesis is true, or it is false; and our test will either reject the null hypothesis or retain it. 161 So, as the table below illustrates, after we run the test and make our choice, one of four things might have happened:

As a consequence there are actually two different types of error here. If we reject a null hypothesis that is actually true, then we have made a type I error . On the other hand, if we retain the null hypothesis when it is in fact false, then we have made a type II error .

Remember how I said that statistical testing was kind of like a criminal trial? Well, I meant it. A criminal trial requires that you establish “beyond a reasonable doubt” that the defendant did it. All of the evidentiary rules are (in theory, at least) designed to ensure that there’s (almost) no chance of wrongfully convicting an innocent defendant. The trial is designed to protect the rights of a defendant: as the English jurist William Blackstone famously said, it is “better that ten guilty persons escape than that one innocent suffer.” In other words, a criminal trial doesn’t treat the two types of error in the same way~… punishing the innocent is deemed to be much worse than letting the guilty go free. A statistical test is pretty much the same: the single most important design principle of the test is to control the probability of a type I error, to keep it below some fixed probability. This probability, which is denoted $\alpha$ , is called the significance level of the test (or sometimes, the size of the test). And I’ll say it again, because it is so central to the whole set-up~… a hypothesis test is said to have significance level $\alpha$ if the type I error rate is no larger than $\alpha$ .

So, what about the type II error rate? Well, we’d also like to keep those under control too, and we denote this probability by $\beta$ . However, it’s much more common to refer to the power of the test, which is the probability with which we reject a null hypothesis when it really is false, which is $1-\beta$ . To help keep this straight, here’s the same table again, but with the relevant numbers added:

A “powerful” hypothesis test is one that has a small value of $\beta$ , while still keeping $\alpha$ fixed at some (small) desired level. By convention, scientists make use of three different $\alpha$ levels: $.05$ , $.01$ and $.001$ . Notice the asymmetry here~… the tests are designed to ensure that the $\alpha$ level is kept small, but there’s no corresponding guarantee regarding $\beta$ . We’d certainly like the type II error rate to be small, and we try to design tests that keep it small, but this is very much secondary to the overwhelming need to control the type I error rate. As Blackstone might have said if he were a statistician, it is “better to retain 10 false null hypotheses than to reject a single true one”. To be honest, I don’t know that I agree with this philosophy – there are situations where I think it makes sense, and situations where I think it doesn’t – but that’s neither here nor there. It’s how the tests are built.

11.3 Test statistics and sampling distributions

At this point we need to start talking specifics about how a hypothesis test is constructed. To that end, let’s return to the ESP example. Let’s ignore the actual data that we obtained, for the moment, and think about the structure of the experiment. Regardless of what the actual numbers are, the form of the data is that $X$ out of $N$ people correctly identified the colour of the hidden card. Moreover, let’s suppose for the moment that the null hypothesis really is true: ESP doesn’t exist, and the true probability that anyone picks the correct colour is exactly $\theta = 0.5$ . What would we expect the data to look like? Well, obviously, we’d expect the proportion of people who make the correct response to be pretty close to 50%. Or, to phrase this in more mathematical terms, we’d say that $X/N$ is approximately $0.5$ . Of course, we wouldn’t expect this fraction to be exactly 0.5: if, for example we tested $N=100$ people, and $X = 53$ of them got the question right, we’d probably be forced to concede that the data are quite consistent with the null hypothesis. On the other hand, if $X = 99$ of our participants got the question right, then we’d feel pretty confident that the null hypothesis is wrong. Similarly, if only $X=3$ people got the answer right, we’d be similarly confident that the null was wrong. Let’s be a little more technical about this: we have a quantity $X$ that we can calculate by looking at our data; after looking at the value of $X$ , we make a decision about whether to believe that the null hypothesis is correct, or to reject the null hypothesis in favour of the alternative. The name for this thing that we calculate to guide our choices is a test statistic .

Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause is to reject the null hypothesis, and which values would cause us to keep it. In order to do so, we need to determine what the sampling distribution of the test statistic would be if the null hypothesis were actually true (we talked about sampling distributions earlier in Section 10.3.1 ). Why do we need this? Because this distribution tells us exactly what values of $X$ our null hypothesis would lead us to expect. And therefore, we can use this distribution as a tool for assessing how closely the null hypothesis agrees with our data.

$The sampling distribution for our test statistic $X$ when the null hypothesis is true. For our ESP scenario, this is a binomial distribution. Not surprisingly, since the null hypothesis says that the probability of a correct response is $\theta = .5$, the sampling distribution says that the most likely value is 50 (our of 100) correct responses. Most of the probability mass lies between 40 and 60.$

Figure 11.1: The sampling distribution for our test statistic $X$ when the null hypothesis is true. For our ESP scenario, this is a binomial distribution. Not surprisingly, since the null hypothesis says that the probability of a correct response is $\theta = .5$ , the sampling distribution says that the most likely value is 50 (our of 100) correct responses. Most of the probability mass lies between 40 and 60.

How do we actually determine the sampling distribution of the test statistic? For a lot of hypothesis tests this step is actually quite complicated, and later on in the book you’ll see me being slightly evasive about it for some of the tests (some of them I don’t even understand myself). However, sometimes it’s very easy. And, fortunately for us, our ESP example provides us with one of the easiest cases. Our population parameter $\theta$ is just the overall probability that people respond correctly when asked the question, and our test statistic $X$ is the count of the number of people who did so, out of a sample size of $N$ . We’ve seen a distribution like this before, in Section 9.4 : that’s exactly what the binomial distribution describes! So, to use the notation and terminology that I introduced in that section, we would say that the null hypothesis predicts that $X$ is binomially distributed, which is written \[ X \sim \mbox{Binomial}(\theta,N) \] Since the null hypothesis states that $\theta = 0.5$ and our experiment has $N=100$ people, we have the sampling distribution we need. This sampling distribution is plotted in Figure 11.1 . No surprises really: the null hypothesis says that $X=50$ is the most likely outcome, and it says that we’re almost certain to see somewhere between 40 and 60 correct responses.

11.4 Making decisions

Okay, we’re very close to being finished. We’ve constructed a test statistic ( $X$ ), and we chose this test statistic in such a way that we’re pretty confident that if $X$ is close to $N/2$ then we should retain the null, and if not we should reject it. The question that remains is this: exactly which values of the test statistic should we associate with the null hypothesis, and which exactly values go with the alternative hypothesis? In my ESP study, for example, I’ve observed a value of $X=62$ . What decision should I make? Should I choose to believe the null hypothesis, or the alternative hypothesis?

11.4.1 Critical regions and critical values

To answer this question, we need to introduce the concept of a critical region for the test statistic $X$ . The critical region of the test corresponds to those values of $X$ that would lead us to reject null hypothesis (which is why the critical region is also sometimes called the rejection region). How do we find this critical region? Well, let’s consider what we know:

$X$ should be very big or very small in order to reject the null hypothesis.
If the null hypothesis is true, the sampling distribution of $X$ is Binomial $(0.5, N)$ .
If $\alpha =.05$ , the critical region must cover 5% of this sampling distribution.

It’s important to make sure you understand this last point: the critical region corresponds to those values of $X$ for which we would reject the null hypothesis, and the sampling distribution in question describes the probability that we would obtain a particular value of $X$ if the null hypothesis were actually true. Now, let’s suppose that we chose a critical region that covers 20% of the sampling distribution, and suppose that the null hypothesis is actually true. What would be the probability of incorrectly rejecting the null? The answer is of course 20%. And therefore, we would have built a test that had an $\alpha$ level of $0.2$ . If we want $\alpha = .05$ , the critical region is only allowed to cover 5% of the sampling distribution of our test statistic.

Figure 11.2: The critical region associated with the hypothesis test for the ESP study, for a hypothesis test with a significance level of $\alpha = .05$ . The plot itself shows the sampling distribution of $X$ under the null hypothesis: the grey bars correspond to those values of $X$ for which we would retain the null hypothesis. The black bars show the critical region: those values of $X$ for which we would reject the null. Because the alternative hypothesis is two sided (i.e., allows both $\theta <.5$ and $\theta >.5$ ), the critical region covers both tails of the distribution. To ensure an $\alpha$ level of $.05$ , we need to ensure that each of the two regions encompasses 2.5% of the sampling distribution.

As it turns out, those three things uniquely solve the problem: our critical region consists of the most extreme values , known as the tails of the distribution. This is illustrated in Figure 11.2 . As it turns out, if we want $\alpha = .05$ , then our critical regions correspond to $X \leq 40$ and $X \geq 60$ . 162 That is, if the number of people saying “true” is between 41 and 59, then we should retain the null hypothesis. If the number is between 0 to 40 or between 60 to 100, then we should reject the null hypothesis. The numbers 40 and 60 are often referred to as the critical values , since they define the edges of the critical region.

At this point, our hypothesis test is essentially complete: (1) we choose an $\alpha$ level (e.g., $\alpha = .05$ , (2) come up with some test statistic (e.g., $X$ ) that does a good job (in some meaningful sense) of comparing $H_0$ to $H_1$ , (3) figure out the sampling distribution of the test statistic on the assumption that the null hypothesis is true (in this case, binomial) and then (4) calculate the critical region that produces an appropriate $\alpha$ level (0-40 and 60-100). All that we have to do now is calculate the value of the test statistic for the real data (e.g., $X = 62$ ) and then compare it to the critical values to make our decision. Since 62 is greater than the critical value of 60, we would reject the null hypothesis. Or, to phrase it slightly differently, we say that the test has produced a significant result.

11.4.2 A note on statistical “significance”

Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners. – Attributed to G. O. Ashley 163

A very brief digression is in order at this point, regarding the word “significant”. The concept of statistical significance is actually a very simple one, but has a very unfortunate name. If the data allow us to reject the null hypothesis, we say that “the result is statistically significant ”, which is often shortened to “the result is significant”. This terminology is rather old, and dates back to a time when “significant” just meant something like “indicated”, rather than its modern meaning, which is much closer to “important”. As a result, a lot of modern readers get very confused when they start learning statistics, because they think that a “significant result” must be an important one. It doesn’t mean that at all. All that “statistically significant” means is that the data allowed us to reject a null hypothesis. Whether or not the result is actually important in the real world is a very different question, and depends on all sorts of other things.

11.4.3 The difference between one sided and two sided tests

There’s one more thing I want to point out about the hypothesis test that I’ve just constructed. If we take a moment to think about the statistical hypotheses I’ve been using, \[ \begin{array}{cc} H_0 : & \theta = .5 \\ H_1 : & \theta \neq .5 \end{array} \] we notice that the alternative hypothesis covers both the possibility that $\theta < .5$ and the possibility that $\theta > .5$ . This makes sense if I really think that ESP could produce better-than-chance performance or worse-than-chance performance (and there are some people who think that). In statistical language, this is an example of a two-sided test . It’s called this because the alternative hypothesis covers the area on both “sides” of the null hypothesis, and as a consequence the critical region of the test covers both tails of the sampling distribution (2.5% on either side if $\alpha =.05$ ), as illustrated earlier in Figure 11.2 .

However, that’s not the only possibility. It might be the case, for example, that I’m only willing to believe in ESP if it produces better than chance performance. If so, then my alternative hypothesis would only covers the possibility that $\theta > .5$ , and as a consequence the null hypothesis now becomes $\theta \leq .5$ : \[ \begin{array}{cc} H_0 : & \theta \leq .5 \\ H_1 : & \theta > .5 \end{array} \] When this happens, we have what’s called a one-sided test , and when this happens the critical region only covers one tail of the sampling distribution. This is illustrated in Figure 11.3 .

Figure 11.3: The critical region for a one sided test. In this case, the alternative hypothesis is that $\theta > .05$ , so we would only reject the null hypothesis for large values of $X$ . As a consequence, the critical region only covers the upper tail of the sampling distribution; specifically the upper 5% of the distribution. Contrast this to the two-sided version earlier)

11.5 The $p$ value of a test

In one sense, our hypothesis test is complete; we’ve constructed a test statistic, figured out its sampling distribution if the null hypothesis is true, and then constructed the critical region for the test. Nevertheless, I’ve actually omitted the most important number of all: the $p$ value . It is to this topic that we now turn. There are two somewhat different ways of interpreting a $p$ value, one proposed by Sir Ronald Fisher and the other by Jerzy Neyman. Both versions are legitimate, though they reflect very different ways of thinking about hypothesis tests. Most introductory textbooks tend to give Fisher’s version only, but I think that’s a bit of a shame. To my mind, Neyman’s version is cleaner, and actually better reflects the logic of the null hypothesis test. You might disagree though, so I’ve included both. I’ll start with Neyman’s version…

11.5.1 A softer view of decision making

One problem with the hypothesis testing procedure that I’ve described is that it makes no distinction at all between a result this “barely significant” and those that are “highly significant”. For instance, in my ESP study the data I obtained only just fell inside the critical region – so I did get a significant effect, but was a pretty near thing. In contrast, suppose that I’d run a study in which $X=97$ out of my $N=100$ participants got the answer right. This would obviously be significant too, but my a much larger margin; there’s really no ambiguity about this at all. The procedure that I described makes no distinction between the two. If I adopt the standard convention of allowing $\alpha = .05$ as my acceptable Type I error rate, then both of these are significant results.

This is where the $p$ value comes in handy. To understand how it works, let’s suppose that we ran lots of hypothesis tests on the same data set: but with a different value of $\alpha$ in each case. When we do that for my original ESP data, what we’d get is something like this

When we test ESP data ( $X=62$ successes out of $N=100$ observations) using $\alpha$ levels of .03 and above, we’d always find ourselves rejecting the null hypothesis. For $\alpha$ levels of .02 and below, we always end up retaining the null hypothesis. Therefore, somewhere between .02 and .03 there must be a smallest value of $\alpha$ that would allow us to reject the null hypothesis for this data. This is the $p$ value; as it turns out the ESP data has $p = .021$ . In short:

$p$ is defined to be the smallest Type I error rate ( $\alpha$ ) that you have to be willing to tolerate if you want to reject the null hypothesis.

If it turns out that $p$ describes an error rate that you find intolerable, then you must retain the null. If you’re comfortable with an error rate equal to $p$ , then it’s okay to reject the null hypothesis in favour of your preferred alternative.

In effect, $p$ is a summary of all the possible hypothesis tests that you could have run, taken across all possible $\alpha$ values. And as a consequence it has the effect of “softening” our decision process. For those tests in which $p \leq \alpha$ you would have rejected the null hypothesis, whereas for those tests in which $p > \alpha$ you would have retained the null. In my ESP study I obtained $X=62$ , and as a consequence I’ve ended up with $p = .021$ . So the error rate I have to tolerate is 2.1%. In contrast, suppose my experiment had yielded $X=97$ . What happens to my $p$ value now? This time it’s shrunk to $p = 1.36 \times 10^{-25}$ , which is a tiny, tiny 164 Type I error rate. For this second case I would be able to reject the null hypothesis with a lot more confidence, because I only have to be “willing” to tolerate a type I error rate of about 1 in 10 trillion trillion in order to justify my decision to reject.

11.5.2 The probability of extreme data

The second definition of the $p$ -value comes from Sir Ronald Fisher, and it’s actually this one that you tend to see in most introductory statistics textbooks. Notice how, when I constructed the critical region, it corresponded to the tails (i.e., extreme values) of the sampling distribution? That’s not a coincidence: almost all “good” tests have this characteristic (good in the sense of minimising our type II error rate, $\beta$ ). The reason for that is that a good critical region almost always corresponds to those values of the test statistic that are least likely to be observed if the null hypothesis is true. If this rule is true, then we can define the $p$ -value as the probability that we would have observed a test statistic that is at least as extreme as the one we actually did get. In other words, if the data are extremely implausible according to the null hypothesis, then the null hypothesis is probably wrong.

11.5.3 A common mistake

Okay, so you can see that there are two rather different but legitimate ways to interpret the $p$ value, one based on Neyman’s approach to hypothesis testing and the other based on Fisher’s. Unfortunately, there is a third explanation that people sometimes give, especially when they’re first learning statistics, and it is absolutely and completely wrong . This mistaken approach is to refer to the $p$ value as “the probability that the null hypothesis is true”. It’s an intuitively appealing way to think, but it’s wrong in two key respects: (1) null hypothesis testing is a frequentist tool, and the frequentist approach to probability does not allow you to assign probabilities to the null hypothesis… according to this view of probability, the null hypothesis is either true or it is not; it cannot have a “5% chance” of being true. (2) even within the Bayesian approach, which does let you assign probabilities to hypotheses, the $p$ value would not correspond to the probability that the null is true; this interpretation is entirely inconsistent with the mathematics of how the $p$ value is calculated. Put bluntly, despite the intuitive appeal of thinking this way, there is no justification for interpreting a $p$ value this way. Never do it.

11.6 Reporting the results of a hypothesis test

When writing up the results of a hypothesis test, there’s usually several pieces of information that you need to report, but it varies a fair bit from test to test. Throughout the rest of the book I’ll spend a little time talking about how to report the results of different tests (see Section 12.1.9 for a particularly detailed example), so that you can get a feel for how it’s usually done. However, regardless of what test you’re doing, the one thing that you always have to do is say something about the $p$ value, and whether or not the outcome was significant.

The fact that you have to do this is unsurprising; it’s the whole point of doing the test. What might be surprising is the fact that there is some contention over exactly how you’re supposed to do it. Leaving aside those people who completely disagree with the entire framework underpinning null hypothesis testing, there’s a certain amount of tension that exists regarding whether or not to report the exact $p$ value that you obtained, or if you should state only that $p < \alpha$ for a significance level that you chose in advance (e.g., $p<.05$ ).

11.6.1 The issue

To see why this is an issue, the key thing to recognise is that $p$ values are terribly convenient. In practice, the fact that we can compute a $p$ value means that we don’t actually have to specify any $\alpha$ level at all in order to run the test. Instead, what you can do is calculate your $p$ value and interpret it directly: if you get $p = .062$ , then it means that you’d have to be willing to tolerate a Type I error rate of 6.2% to justify rejecting the null. If you personally find 6.2% intolerable, then you retain the null. Therefore, the argument goes, why don’t we just report the actual $p$ value and let the reader make up their own minds about what an acceptable Type I error rate is? This approach has the big advantage of “softening” the decision making process – in fact, if you accept the Neyman definition of the $p$ value, that’s the whole point of the $p$ value. We no longer have a fixed significance level of $\alpha = .05$ as a bright line separating “accept” from “reject” decisions; and this removes the rather pathological problem of being forced to treat $p = .051$ in a fundamentally different way to $p = .049$ .

This flexibility is both the advantage and the disadvantage to the $p$ value. The reason why a lot of people don’t like the idea of reporting an exact $p$ value is that it gives the researcher a bit too much freedom. In particular, it lets you change your mind about what error tolerance you’re willing to put up with after you look at the data. For instance, consider my ESP experiment. Suppose I ran my test, and ended up with a $p$ value of .09. Should I accept or reject? Now, to be honest, I haven’t yet bothered to think about what level of Type I error I’m “really” willing to accept. I don’t have an opinion on that topic. But I do have an opinion about whether or not ESP exists, and I definitely have an opinion about whether my research should be published in a reputable scientific journal. And amazingly, now that I’ve looked at the data I’m starting to think that a 9% error rate isn’t so bad, especially when compared to how annoying it would be to have to admit to the world that my experiment has failed. So, to avoid looking like I just made it up after the fact, I now say that my $\alpha$ is .1: a 10% type I error rate isn’t too bad, and at that level my test is significant! I win.

In other words, the worry here is that I might have the best of intentions, and be the most honest of people, but the temptation to just “shade” things a little bit here and there is really, really strong. As anyone who has ever run an experiment can attest, it’s a long and difficult process, and you often get very attached to your hypotheses. It’s hard to let go and admit the experiment didn’t find what you wanted it to find. And that’s the danger here. If we use the “raw” $p$ -value, people will start interpreting the data in terms of what they want to believe, not what the data are actually saying… and if we allow that, well, why are we bothering to do science at all? Why not let everyone believe whatever they like about anything, regardless of what the facts are? Okay, that’s a bit extreme, but that’s where the worry comes from. According to this view, you really must specify your $\alpha$ value in advance, and then only report whether the test was significant or not. It’s the only way to keep ourselves honest.

11.6.2 Two proposed solutions

In practice, it’s pretty rare for a researcher to specify a single $\alpha$ level ahead of time. Instead, the convention is that scientists rely on three standard significance levels: .05, .01 and .001. When reporting your results, you indicate which (if any) of these significance levels allow you to reject the null hypothesis. This is summarised in Table 11.1 . This allows us to soften the decision rule a little bit, since $p<.01$ implies that the data meet a stronger evidentiary standard than $p<.05$ would. Nevertheless, since these levels are fixed in advance by convention, it does prevent people choosing their $\alpha$ level after looking at the data.

Nevertheless, quite a lot of people still prefer to report exact $p$ values. To many people, the advantage of allowing the reader to make up their own mind about how to interpret $p = .06$ outweighs any disadvantages. In practice, however, even among those researchers who prefer exact $p$ values it is quite common to just write $p<.001$ instead of reporting an exact value for small $p$ . This is in part because a lot of software doesn’t actually print out the $p$ value when it’s that small (e.g., SPSS just writes $p = .000$ whenever $p<.001$ ), and in part because a very small $p$ value can be kind of misleading. The human mind sees a number like .0000000001 and it’s hard to suppress the gut feeling that the evidence in favour of the alternative hypothesis is a near certainty. In practice however, this is usually wrong. Life is a big, messy, complicated thing: and every statistical test ever invented relies on simplifications, approximations and assumptions. As a consequence, it’s probably not reasonable to walk away from any statistical analysis with a feeling of confidence stronger than $p<.001$ implies. In other words, $p<.001$ is really code for “as far as this test is concerned, the evidence is overwhelming.”

In light of all this, you might be wondering exactly what you should do. There’s a fair bit of contradictory advice on the topic, with some people arguing that you should report the exact $p$ value, and other people arguing that you should use the tiered approach illustrated in Table 11.1 . As a result, the best advice I can give is to suggest that you look at papers/reports written in your field and see what the convention seems to be. If there doesn’t seem to be any consistent pattern, then use whichever method you prefer.

11.7 Running the hypothesis test in practice

At this point some of you might be wondering if this is a “real” hypothesis test, or just a toy example that I made up. It’s real. In the previous discussion I built the test from first principles, thinking that it was the simplest possible problem that you might ever encounter in real life. However, this test already exists: it’s called the binomial test , and it’s implemented by an R function called binom.test() . To test the null hypothesis that the response probability is one-half p = .5 , 165 using data in which x = 62 of n = 100 people made the correct response, here’s how to do it in R:

Right now, this output looks pretty unfamiliar to you, but you can see that it’s telling you more or less the right things. Specifically, the $p$ -value of 0.02 is less than the usual choice of $\alpha = .05$ , so you can reject the null. We’ll talk a lot more about how to read this sort of output as we go along; and after a while you’ll hopefully find it quite easy to read and understand. For now, however, I just wanted to make the point that R contains a whole lot of functions corresponding to different kinds of hypothesis test. And while I’ll usually spend quite a lot of time explaining the logic behind how the tests are built, every time I discuss a hypothesis test the discussion will end with me showing you a fairly simple R command that you can use to run the test in practice.

11.8 Effect size, sample size and power

In previous sections I’ve emphasised the fact that the major design principle behind statistical hypothesis testing is that we try to control our Type I error rate. When we fix $\alpha = .05$ we are attempting to ensure that only 5% of true null hypotheses are incorrectly rejected. However, this doesn’t mean that we don’t care about Type II errors. In fact, from the researcher’s perspective, the error of failing to reject the null when it is actually false is an extremely annoying one. With that in mind, a secondary goal of hypothesis testing is to try to minimise $\beta$ , the Type II error rate, although we don’t usually talk in terms of minimising Type II errors. Instead, we talk about maximising the power of the test. Since power is defined as $1-\beta$ , this is the same thing.

11.8.1 The power function

$Sampling distribution under the *alternative* hypothesis, for a population parameter value of $\theta = 0.55$. A reasonable proportion of the distribution lies in the rejection region.$

Figure 11.4: Sampling distribution under the alternative hypothesis, for a population parameter value of $\theta = 0.55$ . A reasonable proportion of the distribution lies in the rejection region.

Let’s take a moment to think about what a Type II error actually is. A Type II error occurs when the alternative hypothesis is true, but we are nevertheless unable to reject the null hypothesis. Ideally, we’d be able to calculate a single number $\beta$ that tells us the Type II error rate, in the same way that we can set $\alpha = .05$ for the Type I error rate. Unfortunately, this is a lot trickier to do. To see this, notice that in my ESP study the alternative hypothesis actually corresponds to lots of possible values of $\theta$ . In fact, the alternative hypothesis corresponds to every value of $\theta$ except 0.5. Let’s suppose that the true probability of someone choosing the correct response is 55% (i.e., $\theta = .55$ ). If so, then the true sampling distribution for $X$ is not the same one that the null hypothesis predicts: the most likely value for $X$ is now 55 out of 100. Not only that, the whole sampling distribution has now shifted, as shown in Figure 11.4 . The critical regions, of course, do not change: by definition, the critical regions are based on what the null hypothesis predicts. What we’re seeing in this figure is the fact that when the null hypothesis is wrong, a much larger proportion of the sampling distribution distribution falls in the critical region. And of course that’s what should happen: the probability of rejecting the null hypothesis is larger when the null hypothesis is actually false! However $\theta = .55$ is not the only possibility consistent with the alternative hypothesis. Let’s instead suppose that the true value of $\theta$ is actually 0.7. What happens to the sampling distribution when this occurs? The answer, shown in Figure 11.5 , is that almost the entirety of the sampling distribution has now moved into the critical region. Therefore, if $\theta = 0.7$ the probability of us correctly rejecting the null hypothesis (i.e., the power of the test) is much larger than if $\theta = 0.55$ . In short, while $\theta = .55$ and $\theta = .70$ are both part of the alternative hypothesis, the Type II error rate is different.

$Sampling distribution under the *alternative* hypothesis, for a population parameter value of $\theta = 0.70$. Almost all of the distribution lies in the rejection region.$

Figure 11.5: Sampling distribution under the alternative hypothesis, for a population parameter value of $\theta = 0.70$ . Almost all of the distribution lies in the rejection region.

$The probability that we will reject the null hypothesis, plotted as a function of the true value of $\theta$. Obviously, the test is more powerful (greater chance of correct rejection) if the true value of $\theta$ is very different from the value that the null hypothesis specifies (i.e., $\theta=.5$). Notice that when $\theta$ actually is equal to .5 (plotted as a black dot), the null hypothesis is in fact true: rejecting the null hypothesis in this instance would be a Type I error.$

Figure 11.6: The probability that we will reject the null hypothesis, plotted as a function of the true value of $\theta$ . Obviously, the test is more powerful (greater chance of correct rejection) if the true value of $\theta$ is very different from the value that the null hypothesis specifies (i.e., $\theta=.5$ ). Notice that when $\theta$ actually is equal to .5 (plotted as a black dot), the null hypothesis is in fact true: rejecting the null hypothesis in this instance would be a Type I error.

What all this means is that the power of a test (i.e., $1-\beta$ ) depends on the true value of $\theta$ . To illustrate this, I’ve calculated the expected probability of rejecting the null hypothesis for all values of $\theta$ , and plotted it in Figure 11.6 . This plot describes what is usually called the power function of the test. It’s a nice summary of how good the test is, because it actually tells you the power ( $1-\beta$ ) for all possible values of $\theta$ . As you can see, when the true value of $\theta$ is very close to 0.5, the power of the test drops very sharply, but when it is further away, the power is large.

11.8.2 Effect size

Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned with mice when there are tigers abroad – George Box 1976

The plot shown in Figure 11.6 captures a fairly basic point about hypothesis testing. If the true state of the world is very different from what the null hypothesis predicts, then your power will be very high; but if the true state of the world is similar to the null (but not identical) then the power of the test is going to be very low. Therefore, it’s useful to be able to have some way of quantifying how “similar” the true state of the world is to the null hypothesis. A statistic that does this is called a measure of effect size (e.g. Cohen 1988 ; Ellis 2010 ) . Effect size is defined slightly differently in different contexts, 166 (and so this section just talks in general terms) but the qualitative idea that it tries to capture is always the same: how big is the difference between the true population parameters, and the parameter values that are assumed by the null hypothesis? In our ESP example, if we let $\theta_0 = 0.5$ denote the value assumed by the null hypothesis, and let $\theta$ denote the true value, then a simple measure of effect size could be something like the difference between the true value and null (i.e., $\theta – \theta_0$ ), or possibly just the magnitude of this difference, $\mbox{abs}(\theta – \theta_0)$ .

Why calculate effect size? Let’s assume that you’ve run your experiment, collected the data, and gotten a significant effect when you ran your hypothesis test. Isn’t it enough just to say that you’ve gotten a significant effect? Surely that’s the point of hypothesis testing? Well, sort of. Yes, the point of doing a hypothesis test is to try to demonstrate that the null hypothesis is wrong, but that’s hardly the only thing we’re interested in. If the null hypothesis claimed that $\theta = .5$ , and we show that it’s wrong, we’ve only really told half of the story. Rejecting the null hypothesis implies that we believe that $\theta \neq .5$ , but there’s a big difference between $\theta = .51$ and $\theta = .8$ . If we find that $\theta = .8$ , then not only have we found that the null hypothesis is wrong, it appears to be very wrong. On the other hand, suppose we’ve successfully rejected the null hypothesis, but it looks like the true value of $\theta$ is only .51 (this would only be possible with a large study). Sure, the null hypothesis is wrong, but it’s not at all clear that we actually care , because the effect size is so small. In the context of my ESP study we might still care, since any demonstration of real psychic powers would actually be pretty cool 167 , but in other contexts a 1% difference isn’t very interesting, even if it is a real difference. For instance, suppose we’re looking at differences in high school exam scores between males and females, and it turns out that the female scores are 1% higher on average than the males. If I’ve got data from thousands of students, then this difference will almost certainly be statistically significant , but regardless of how small the $p$ value is it’s just not very interesting. You’d hardly want to go around proclaiming a crisis in boys education on the basis of such a tiny difference would you? It’s for this reason that it is becoming more standard (slowly, but surely) to report some kind of standard measure of effect size along with the the results of the hypothesis test. The hypothesis test itself tells you whether you should believe that the effect you have observed is real (i.e., not just due to chance); the effect size tells you whether or not you should care.

11.8.3 Increasing the power of your study

Not surprisingly, scientists are fairly obsessed with maximising the power of their experiments. We want our experiments to work, and so we want to maximise the chance of rejecting the null hypothesis if it is false (and of course we usually want to believe that it is false!) As we’ve seen, one factor that influences power is the effect size. So the first thing you can do to increase your power is to increase the effect size. In practice, what this means is that you want to design your study in such a way that the effect size gets magnified. For instance, in my ESP study I might believe that psychic powers work best in a quiet, darkened room; with fewer distractions to cloud the mind. Therefore I would try to conduct my experiments in just such an environment: if I can strengthen people’s ESP abilities somehow, then the true value of $\theta$ will go up 168 and therefore my effect size will be larger. In short, clever experimental design is one way to boost power; because it can alter the effect size.

Unfortunately, it’s often the case that even with the best of experimental designs you may have only a small effect. Perhaps, for example, ESP really does exist, but even under the best of conditions it’s very very weak. Under those circumstances, your best bet for increasing power is to increase the sample size. In general, the more observations that you have available, the more likely it is that you can discriminate between two hypotheses. If I ran my ESP experiment with 10 participants, and 7 of them correctly guessed the colour of the hidden card, you wouldn’t be terribly impressed. But if I ran it with 10,000 participants and 7,000 of them got the answer right, you would be much more likely to think I had discovered something. In other words, power increases with the sample size. This is illustrated in Figure 11.7 , which shows the power of the test for a true parameter of $\theta = 0.7$ , for all sample sizes $N$ from 1 to 100, where I’m assuming that the null hypothesis predicts that $\theta_0 = 0.5$ .

$The power of our test, plotted as a function of the sample size $N$. In this case, the true value of $\theta$ is 0.7, but the null hypothesis is that $\theta = 0.5$. Overall, larger $N$ means greater power. (The small zig-zags in this function occur because of some odd interactions between $\theta$, $\alpha$ and the fact that the binomial distribution is discrete; it doesn't matter for any serious purpose)$

Figure 11.7: The power of our test, plotted as a function of the sample size $N$ . In this case, the true value of $\theta$ is 0.7, but the null hypothesis is that $\theta = 0.5$ . Overall, larger $N$ means greater power. (The small zig-zags in this function occur because of some odd interactions between $\theta$ , $\alpha$ and the fact that the binomial distribution is discrete; it doesn’t matter for any serious purpose)

Because power is important, whenever you’re contemplating running an experiment it would be pretty useful to know how much power you’re likely to have. It’s never possible to know for sure, since you can’t possibly know what your effect size is. However, it’s often (well, sometimes) possible to guess how big it should be. If so, you can guess what sample size you need! This idea is called power analysis , and if it’s feasible to do it, then it’s very helpful, since it can tell you something about whether you have enough time or money to be able to run the experiment successfully. It’s increasingly common to see people arguing that power analysis should be a required part of experimental design, so it’s worth knowing about. I don’t discuss power analysis in this book, however. This is partly for a boring reason and partly for a substantive one. The boring reason is that I haven’t had time to write about power analysis yet. The substantive one is that I’m still a little suspicious of power analysis. Speaking as a researcher, I have very rarely found myself in a position to be able to do one – it’s either the case that (a) my experiment is a bit non-standard and I don’t know how to define effect size properly, (b) I literally have so little idea about what the effect size will be that I wouldn’t know how to interpret the answers. Not only that, after extensive conversations with someone who does stats consulting for a living (my wife, as it happens), I can’t help but notice that in practice the only time anyone ever asks her for a power analysis is when she’s helping someone write a grant application. In other words, the only time any scientist ever seems to want a power analysis in real life is when they’re being forced to do it by bureaucratic process. It’s not part of anyone’s day to day work. In short, I’ve always been of the view that while power is an important concept, power analysis is not as useful as people make it sound, except in the rare cases where (a) someone has figured out how to calculate power for your actual experimental design and (b) you have a pretty good idea what the effect size is likely to be. Maybe other people have had better experiences than me, but I’ve personally never been in a situation where both (a) and (b) were true. Maybe I’ll be convinced otherwise in the future, and probably a future version of this book would include a more detailed discussion of power analysis, but for now this is about as much as I’m comfortable saying about the topic.

11.9 Some issues to consider

What I’ve described to you in this chapter is the orthodox framework for null hypothesis significance testing (NHST). Understanding how NHST works is an absolute necessity, since it has been the dominant approach to inferential statistics ever since it came to prominence in the early 20th century. It’s what the vast majority of working scientists rely on for their data analysis, so even if you hate it you need to know it. However, the approach is not without problems. There are a number of quirks in the framework, historical oddities in how it came to be, theoretical disputes over whether or not the framework is right, and a lot of practical traps for the unwary. I’m not going to go into a lot of detail on this topic, but I think it’s worth briefly discussing a few of these issues.

11.9.1 Neyman versus Fisher

The first thing you should be aware of is that orthodox NHST is actually a mash-up of two rather different approaches to hypothesis testing, one proposed by Sir Ronald Fisher and the other proposed by Jerzy Neyman (for a historical summary see Lehmann 2011 ) . The history is messy because Fisher and Neyman were real people whose opinions changed over time, and at no point did either of them offer “the definitive statement” of how we should interpret their work many decades later. That said, here’s a quick summary of what I take these two approaches to be.

First, let’s talk about Fisher’s approach. As far as I can tell, Fisher assumed that you only had the one hypothesis (the null), and what you want to do is find out if the null hypothesis is inconsistent with the data. From his perspective, what you should do is check to see if the data are “sufficiently unlikely” according to the null. In fact, if you remember back to our earlier discussion, that’s how Fisher defines the $p$ -value. According to Fisher, if the null hypothesis provided a very poor account of the data, you could safely reject it. But, since you don’t have any other hypotheses to compare it to, there’s no way of “accepting the alternative” because you don’t necessarily have an explicitly stated alternative. That’s more or less all that there was to it.

In contrast, Neyman thought that the point of hypothesis testing was as a guide to action, and his approach was somewhat more formal than Fisher’s. His view was that there are multiple things that you could do (accept the null or accept the alternative) and the point of the test was to tell you which one the data support. From this perspective, it is critical to specify your alternative hypothesis properly. If you don’t know what the alternative hypothesis is, then you don’t know how powerful the test is, or even which action makes sense. His framework genuinely requires a competition between different hypotheses. For Neyman, the $p$ value didn’t directly measure the probability of the data (or data more extreme) under the null, it was more of an abstract description about which “possible tests” were telling you to accept the null, and which “possible tests” were telling you to accept the alternative.

As you can see, what we have today is an odd mishmash of the two. We talk about having both a null hypothesis and an alternative (Neyman), but usually 169 define the $p$ value in terms of exreme data (Fisher), but we still have $\alpha$ values (Neyman). Some of the statistical tests have explicitly specified alternatives (Neyman) but others are quite vague about it (Fisher). And, according to some people at least, we’re not allowed to talk about accepting the alternative (Fisher). It’s a mess: but I hope this at least explains why it’s a mess.

11.9.2 Bayesians versus frequentists

Earlier on in this chapter I was quite emphatic about the fact that you cannot interpret the $p$ value as the probability that the null hypothesis is true. NHST is fundamentally a frequentist tool (see Chapter 9 ) and as such it does not allow you to assign probabilities to hypotheses: the null hypothesis is either true or it is not. The Bayesian approach to statistics interprets probability as a degree of belief, so it’s totally okay to say that there is a 10% chance that the null hypothesis is true: that’s just a reflection of the degree of confidence that you have in this hypothesis. You aren’t allowed to do this within the frequentist approach. Remember, if you’re a frequentist, a probability can only be defined in terms of what happens after a large number of independent replications (i.e., a long run frequency). If this is your interpretation of probability, talking about the “probability” that the null hypothesis is true is complete gibberish: a null hypothesis is either true or it is false. There’s no way you can talk about a long run frequency for this statement. To talk about “the probability of the null hypothesis” is as meaningless as “the colour of freedom”. It doesn’t have one!

Most importantly, this isn’t a purely ideological matter. If you decide that you are a Bayesian and that you’re okay with making probability statements about hypotheses, you have to follow the Bayesian rules for calculating those probabilities. I’ll talk more about this in Chapter 17 , but for now what I want to point out to you is the $p$ value is a terrible approximation to the probability that $H_0$ is true. If what you want to know is the probability of the null, then the $p$ value is not what you’re looking for!

11.9.3 Traps

As you can see, the theory behind hypothesis testing is a mess, and even now there are arguments in statistics about how it “should” work. However, disagreements among statisticians are not our real concern here. Our real concern is practical data analysis. And while the “orthodox” approach to null hypothesis significance testing has many drawbacks, even an unrepentant Bayesian like myself would agree that they can be useful if used responsibly. Most of the time they give sensible answers, and you can use them to learn interesting things. Setting aside the various ideologies and historical confusions that we’ve discussed, the fact remains that the biggest danger in all of statistics is thoughtlessness . I don’t mean stupidity, here: I literally mean thoughtlessness. The rush to interpret a result without spending time thinking through what each test actually says about the data, and checking whether that’s consistent with how you’ve interpreted it. That’s where the biggest trap lies.

To give an example of this, consider the following example (see Gelman and Stern 2006 ) . Suppose I’m running my ESP study, and I’ve decided to analyse the data separately for the male participants and the female participants. Of the male participants, 33 out of 50 guessed the colour of the card correctly. This is a significant effect ( $p = .03$ ). Of the female participants, 29 out of 50 guessed correctly. This is not a significant effect ( $p = .32$ ). Upon observing this, it is extremely tempting for people to start wondering why there is a difference between males and females in terms of their psychic abilities. However, this is wrong. If you think about it, we haven’t actually run a test that explicitly compares males to females. All we have done is compare males to chance (binomial test was significant) and compared females to chance (binomial test was non significant). If we want to argue that there is a real difference between the males and the females, we should probably run a test of the null hypothesis that there is no difference! We can do that using a different hypothesis test, 170 but when we do that it turns out that we have no evidence that males and females are significantly different ( $p = .54$ ). Now do you think that there’s anything fundamentally different between the two groups? Of course not. What’s happened here is that the data from both groups (male and female) are pretty borderline: by pure chance, one of them happened to end up on the magic side of the $p = .05$ line, and the other one didn’t. That doesn’t actually imply that males and females are different. This mistake is so common that you should always be wary of it: the difference between significant and not-significant is not evidence of a real difference – if you want to say that there’s a difference between two groups, then you have to test for that difference!

The example above is just that: an example. I’ve singled it out because it’s such a common one, but the bigger picture is that data analysis can be tricky to get right. Think about what it is you want to test, why you want to test it, and whether or not the answers that your test gives could possibly make any sense in the real world.

11.10 Summary

Null hypothesis testing is one of the most ubiquitous elements to statistical theory. The vast majority of scientific papers report the results of some hypothesis test or another. As a consequence it is almost impossible to get by in science without having at least a cursory understanding of what a $p$ -value means, making this one of the most important chapters in the book. As usual, I’ll end the chapter with a quick recap of the key ideas that we’ve talked about:

Research hypotheses and statistical hypotheses. Null and alternative hypotheses. (Section 11.1 ).
Type 1 and Type 2 errors (Section 11.2 )
Test statistics and sampling distributions (Section 11.3 )
Hypothesis testing as a decision making process (Section 11.4 )
$p$ -values as “soft” decisions (Section 11.5 )
Writing up the results of a hypothesis test (Section 11.6 )
Effect size and power (Section 11.8 )
A few issues to consider regarding hypothesis testing (Section 11.9 )

Later in the book, in Chapter 17 , I’ll revisit the theory of null hypothesis tests from a Bayesian perspective, and introduce a number of new tools that you can use if you aren’t particularly fond of the orthodox approach. But for now, though, we’re done with the abstract statistical theory, and we can start discussing specific data analysis tools.

Cohen, J. 1988. Statistical Power Analysis for the Behavioral Sciences . 2nd ed. Lawrence Erlbaum.

Ellis, P. D. 2010. The Essential Guide to Effect Sizes: Statistical Power, Meta-Analysis, and the Interpretation of Research Results . Cambridge, UK: Cambridge University Press.

Lehmann, Erich L. 2011. Fisher, Neyman, and the Creation of Classical Statistics . Springer.

Gelman, A., and H. Stern. 2006. “The Difference Between ‘Significant’ and ‘Not Significant’ Is Not Itself Statistically Significant.” The American Statistician 60: 328–31.

The quote comes from Wittgenstein’s (1922) text, Tractatus Logico-Philosphicus . ↩
A technical note. The description below differs subtly from the standard description given in a lot of introductory texts. The orthodox theory of null hypothesis testing emerged from the work of Sir Ronald Fisher and Jerzy Neyman in the early 20th century; but Fisher and Neyman actually had very different views about how it should work. The standard treatment of hypothesis testing that most texts use is a hybrid of the two approaches. The treatment here is a little more Neyman-style than the orthodox view, especially as regards the meaning of the $p$ value. ↩
My apologies to anyone who actually believes in this stuff, but on my reading of the literature on ESP, it’s just not reasonable to think this is real. To be fair, though, some of the studies are rigorously designed; so it’s actually an interesting area for thinking about psychological research design. And of course it’s a free country, so you can spend your own time and effort proving me wrong if you like, but I wouldn’t think that’s a terribly practical use of your intellect. ↩
This analogy only works if you’re from an adversarial legal system like UK/US/Australia. As I understand these things, the French inquisitorial system is quite different. ↩
An aside regarding the language you use to talk about hypothesis testing. Firstly, one thing you really want to avoid is the word “prove”: a statistical test really doesn’t prove that a hypothesis is true or false. Proof implies certainty, and as the saying goes, statistics means never having to say you’re certain. On that point almost everyone would agree. However, beyond that there’s a fair amount of confusion. Some people argue that you’re only allowed to make statements like “rejected the null”, “failed to reject the null”, or possibly “retained the null”. According to this line of thinking, you can’t say things like “accept the alternative” or “accept the null”. Personally I think this is too strong: in my opinion, this conflates null hypothesis testing with Karl Popper’s falsificationist view of the scientific process. While there are similarities between falsificationism and null hypothesis testing, they aren’t equivalent. However, while I personally think it’s fine to talk about accepting a hypothesis (on the proviso that “acceptance” doesn’t actually mean that it’s necessarily true, especially in the case of the null hypothesis), many people will disagree. And more to the point, you should be aware that this particular weirdness exists, so that you’re not caught unawares by it when writing up your own results. ↩
Strictly speaking, the test I just constructed has $\alpha = .057$ , which is a bit too generous. However, if I’d chosen 39 and 61 to be the boundaries for the critical region, then the critical region only covers 3.5% of the distribution. I figured that it makes more sense to use 40 and 60 as my critical values, and be willing to tolerate a 5.7% type I error rate, since that’s as close as I can get to a value of $\alpha = .05$ . ↩
The internet seems fairly convinced that Ashley said this, though I can’t for the life of me find anyone willing to give a source for the claim. ↩
That’s $p = .000000000000000000000000136$ for folks that don’t like scientific notation! ↩
Note that the p here has nothing to do with a $p$ value. The p argument in the binom.test() function corresponds to the probability of making a correct response, according to the null hypothesis. In other words, it’s the $\theta$ value. ↩
There’s an R package called compute.es that can be used for calculating a very broad range of effect size measures; but for the purposes of the current book we won’t need it: all of the effect size measures that I’ll talk about here have functions in the lsr package ↩
Although in practice a very small effect size is worrying, because even very minor methodological flaws might be responsible for the effect; and in practice no experiment is perfect, so there are always methodological issues to worry about. ↩
Notice that the true population parameter $\theta$ doesn’t necessarily correspond to an immutable fact of nature. In this context $\theta$ is just the true probability that people would correctly guess the colour of the card in the other room. As such the population parameter can be influenced by all sorts of things. Of course, this is all on the assumption that ESP actually exists! ↩
Although this book describes both Neyman’s and Fisher’s definition of the $p$ value, most don’t. Most introductory textbooks will only give you the Fisher version. ↩
In this case, the Pearson chi-square test of independence (Chapter 12 ; chisq.test() in R) is what we use; see also the prop.test() function. ↩

Share This Book

Introduction

R installation
Working directory
Getting help
Install packages

Data structures

Data Wrangling

Sort and order
Merge data frames

Programming

Creating functions
If else statement
apply function
sapply function
tapply function

Import & export

Read TXT files
Import CSV files
Read Excel files
Read SQL databases
Export data
plot function
Scatter plot
Density plot
Tutorials Introduction Data wrangling Graphics Statistics See all

HYPOTHESIS TESTING IN R

Hypothesis testing is a statistical procedure used to make decisions or draw conclusions about the characteristics of a population based on information provided by a sample

NORMALITY TESTS

Normality tests are used to evaluate whether a data sample follows a normal distribution. These tests allow to verify if the data have a behavior similar to that of a Gaussian distribution, being useful to determine if the assumptions of certain parametric statistical analyses that require normality in the data are met

Shapiro Wilk normality test

shapiro.test()

Lilliefors normality test

lillie.test()

GOODNESS OF FIT TESTS

These tests are used to verify whether a proposed theoretical distribution adequately matches the observed data. They are useful to assess whether a specific distribution fits the data well, allowing to determine whether a theoretical model accurately represents the observed data distribution

Pearson's Chi-squared test with chisq.test()

chisq.test()

Kolmogorov-Smirnov test in R with ks.test()

Kolmogorov-Smirnov test with ks.test()

Median tests.

Median tests are used to test whether the medians of two or more groups are statistically different, thus identifying whether there are significant differences in medians between populations or treatments

Wilcoxon signed rank test

wilcox.test()

Wilcoxon rank sum test (Mann-Whitney U test)

Kruskal Wallis rank sum test (H test)

kruskal.test()

OTHER TYPES OF TESTS

There are other types of tests, such as tests for comparing means, for equality of variances or for equality of proportions

T-test to compare means

F test with var.test() to compare two variances

Test for proportions with prop.test()

prop.test()

Try adjusting your search query

👉 If you haven’t found what you’re looking for, consider clicking the checkbox to activate the extended search on R CHARTS for additional graphs tutorials, try searching a synonym of your query if possible (e.g., ‘bar plot’ -> ‘bar chart’), search for a more generic query or if you are searching for a specific function activate the functions search or use the functions search bar .

Summary and Analysis of Extension Program Evaluation in R

Salvatore S. Mangiafico

Search Rcompanion.org

Purpose of this Book
Author of this Book
Statistics Textbooks and Other Resources
Why Statistics?
Evaluation Tools and Surveys
Types of Variables
Descriptive Statistics
Confidence Intervals
Basic Plots

Hypothesis Testing and p-values

Reporting Results of Data and Analyses
Choosing a Statistical Test
Independent and Paired Values
Introduction to Likert Data
Descriptive Statistics for Likert Item Data
Descriptive Statistics with the likert Package
Confidence Intervals for Medians
Converting Numeric Data to Categories
Introduction to Traditional Nonparametric Tests
One-sample Wilcoxon Signed-rank Test
Sign Test for One-sample Data
Two-sample Mann–Whitney U Test
Mood’s Median Test for Two-sample Data
Two-sample Paired Signed-rank Test
Sign Test for Two-sample Paired Data
Kruskal–Wallis Test
Mood’s Median Test
Friedman Test
Scheirer–Ray–Hare Test
Aligned Ranks Transformation ANOVA
Nonparametric Regression and Local Regression
Nonparametric Regression for Time Series
Introduction to Permutation Tests
One-way Permutation Test for Ordinal Data
One-way Permutation Test for Paired Ordinal Data
Permutation Tests for Medians and Percentiles
Association Tests for Ordinal Tables
Measures of Association for Ordinal Tables
Introduction to Linear Models
Using Random Effects in Models
What are Estimated Marginal Means?
Estimated Marginal Means for Multiple Comparisons
Factorial ANOVA: Main Effects, Interaction Effects, and Interaction Plots
p-values and R-square Values for Models
Accuracy and Errors for Models
Introduction to Cumulative Link Models (CLM) for Ordinal Data
Two-sample Ordinal Test with CLM
Two-sample Paired Ordinal Test with CLMM
One-way Ordinal Regression with CLM
One-way Repeated Ordinal Regression with CLMM
Two-way Ordinal Regression with CLM
Two-way Repeated Ordinal Regression with CLMM
Introduction to Tests for Nominal Variables
Confidence Intervals for Proportions
Goodness-of-Fit Tests for Nominal Variables
Association Tests for Nominal Variables
Measures of Association for Nominal Variables
Tests for Paired Nominal Data
Cochran–Mantel–Haenszel Test for 3-Dimensional Tables
Cochran’s Q Test for Paired Nominal Data
Models for Nominal Data
Introduction to Parametric Tests
One-sample t-test
Two-sample t-test
Paired t-test
One-way ANOVA
One-way ANOVA with Blocks
One-way ANOVA with Random Blocks
Two-way ANOVA
Repeated Measures ANOVA
Correlation and Linear Regression
Advanced Parametric Methods
Transforming Data
Normal Scores Transformation
Regression for Count Data
Beta Regression for Percent and Proportion Data
An R Companion for the Handbook of Biological Statistics

Initial comments

Traditionally when students first learn about the analysis of experiments, there is a strong focus on hypothesis testing and making decisions based on p -values. Hypothesis testing is important for determining if there are statistically significant effects. However, readers of this book should not place undo emphasis on p -values. Instead, they should realize that p -values are affected by sample size, and that a low p -value does not necessarily suggest a large effect or a practically meaningful effect. Summary statistics, plots, effect size statistics, and practical considerations should be used. The goal is to determine: a) statistical significance, b) effect size, c) practical importance. These are all different concepts, and they will be explored below.

Statistical inference

Most of what we’ve covered in this book so far is about producing descriptive statistics: calculating means and medians, plotting data in various ways, and producing confidence intervals. The bulk of the rest of this book will cover statistical inference: using statistical tests to draw some conclusion about the data. We’ve already done this a little bit in earlier chapters by using confidence intervals to conclude if means are different or not among groups.

As Dr. Nic mentions in her article in the “References and further reading” section, this is the part where people sometimes get stumped. It is natural for most of us to use summary statistics or plots, but jumping to statistical inference needs a little change in perspective. The idea of using some statistical test to answer a question isn’t a difficult concept, but some of the following discussion gets a little theoretical. The video from the Statistics Learning Center in the “References and further reading” section does a good job of explaining the basis of statistical inference.

One important thing to gain from this chapter is an understanding of how to use the p -value, alpha , and decision rule to test the null hypothesis. But once you are comfortable with that, you will want to return to this chapter to have a better understanding of the theory behind this process.

Another important thing is to understand the limitations of relying on p -values, and why it is important to assess the size of effects and weigh practical considerations.

Packages used in this chapter

The packages used in this chapter include:

The following commands will install these packages if they are not already installed:

if(!require(lsr)){install.packages("lsr")}

Hypothesis testing

The null and alternative hypotheses.

The statistical tests in this book rely on testing a null hypothesis, which has a specific formulation for each test. The null hypothesis always describes the case where e.g. two groups are not different or there is no correlation between two variables, etc.

The alternative hypothesis is the contrary of the null hypothesis, and so describes the cases where there is a difference among groups or a correlation between two variables, etc.

Notice that the definitions of null hypothesis and alternative hypothesis have nothing to do with what you want to find or don't want to find, or what is interesting or not interesting, or what you expect to find or what you don’t expect to find. If you were comparing the height of men and women, the null hypothesis would be that the height of men and the height of women were not different. Yet, you might find it surprising if you found this hypothesis to be true for some population you were studying. Likewise, if you were studying the income of men and women, the null hypothesis would be that the income of men and women are not different, in the population you are studying. In this case you might be hoping the null hypothesis is true, though you might be unsurprised if the alternative hypothesis were true. In any case, the null hypothesis will take the form that there is no difference between groups, there is no correlation between two variables, or there is no effect of this variable in our model.

p -value definition

Most of the tests in this book rely on using a statistic called the p -value to evaluate if we should reject, or fail to reject, the null hypothesis.

Given the assumption that the null hypothesis is true , the p -value is defined as the probability of obtaining a result equal to or more extreme than what was actually observed in the data.

We’ll unpack this definition in a little bit.

Decision rule

The p -value for the given data will be determined by conducting the statistical test.

This p -value is then compared to a pre-determined value alpha . Most commonly, an alpha value of 0.05 is used, but there is nothing magic about this value.

If the p -value for the test is less than alpha , we reject the null hypothesis.

If the p -value is greater than or equal to alpha , we fail to reject the null hypothesis.

Coin flipping example

For an example of using the p -value for hypothesis testing, imagine you have a coin you will toss 100 times. The null hypothesis is that the coin is fair—that is, that it is equally likely that the coin will land on heads as land on tails. The alternative hypothesis is that the coin is not fair. Let’s say for this experiment you throw the coin 100 times and it lands on heads 95 times out of those hundred. The p -value in this case would be the probability of getting 95, 96, 97, 98, 99, or 100 heads, or 0, 1, 2, 3, 4, or 5 heads, assuming that the null hypothesis is true .

This is what we call a two-sided test, since we are testing both extremes suggested by our data: getting 95 or greater heads or getting 95 or greater tails. In most cases we will use two sided tests.

You can imagine that the p -value for this data will be quite small. If the null hypothesis is true, and the coin is fair, there would be a low probability of getting 95 or more heads or 95 or more tails.

Using a binomial test, the p -value is < 0.0001.

(Actually, R reports it as < 2.2e-16, which is shorthand for the number in scientific notation, 2.2 x 10 -16 , which is 0.00000000000000022, with 15 zeros after the decimal point.)

Assuming an alpha of 0.05, since the p -value is less than alpha , we reject the null hypothesis. That is, we conclude that the coin is not fair.

binom.test(5, 100, 0.5)

Exact binomial test number of successes = 5, number of trials = 100, p-value < 2.2e-16 alternative hypothesis: true probability of success is not equal to 0.5

Passing and failing example

As another example, imagine we are considering two classrooms, and we have counts of students who passed a certain exam. We want to know if one classroom had statistically more passes or failures than the other.

In our example each classroom will have 10 students. The data is arranged into a contingency table.

Classroom Passed Failed A 8 2 B 3 7

We will use Fisher’s exact test to test if there is an association between Classroom and the counts of passed and failed students. The null hypothesis is that there is no association between Classroom and Passed/Failed , based on the relative counts in each cell of the contingency table.

Input =(" Classroom Passed Failed A 8 2 B 3 7 ") Matrix = as.matrix(read.table(textConnection(Input), header=TRUE, row.names=1)) Matrix

Passed Failed A 8 2 B 3 7

fisher.test(Matrix)

Fisher's Exact Test for Count Data p-value = 0.06978

The reported p -value is 0.070. If we use an alpha of 0.05, then the p -value is greater than alpha , so we fail to reject the null hypothesis. That is, we did not have sufficient evidence to say that there is an association between Classroom and Passed/Failed .

More extreme data in this case would be if the counts in the upper left or lower right (or both!) were greater.

Classroom Passed Failed A 9 1 B 3 7 Classroom Passed Failed A 10 0 B 3 7 and so on, with Classroom B...

In most cases we would want to consider as "extreme" not only the results when Classroom A has a high frequency of passing students, but also results when Classroom B has a high frequency of passing students. This is called a two-sided or two-tailed test. If we were only concerned with one classroom having a high frequency of passing students, relatively, we would instead perform a one-sided test. The default for the fisher.test function is two-sided, and usually you will want to use two-sided tests.

Classroom Passed Failed A 2 8 B 7 3 Classroom Passed Failed A 1 9 B 7 3 Classroom Passed Failed A 0 10 B 7 3 and so on, with Classroom B...

In both cases, "extreme" means there is a stronger association between Classroom and Passed/Failed .

Theory and practice of using p -values

Wait, does this make any sense.

Recall that the definition of the p -value is:

The astute reader might be asking herself, “If I’m trying to determine if the null hypothesis is true or not, why would I start with the assumption that the null hypothesis is true? And why am I using a probability of getting certain data given that a hypothesis is true? Don’t I want to instead determine the probability of the hypothesis given my data?”

The answer is yes , we would like a method to determine the likelihood of our hypothesis being true given our data, but we use the Null Hypothesis Significance Test approach since it is relatively straightforward, and has wide acceptance historically and across disciplines.

In practice we do use the results of the statistical tests to reach conclusions about the null hypothesis.

Technically, the p -value says nothing about the alternative hypothesis. But logically, if the null hypothesis is rejected, then its logical complement, the alternative hypothesis, is supported. Practically, this is how we handle significant p -values, though this practical approach generates disapproval in some theoretical circles.

Statistics is like a jury?

Note the language used when testing the null hypothesis. Based on the results of our statistical tests, we either reject the null hypothesis, or fail to reject the null hypothesis.

This is somewhat similar to the approach of a jury in a trial. The jury either finds sufficient evidence to declare someone guilty, or fails to find sufficient evidence to declare someone guilty.

Failing to convict someone isn’t necessarily the same as declaring someone innocent. Likewise, if we fail to reject the null hypothesis, we shouldn’t assume that the null hypothesis is true. It may be that we didn’t have sufficient samples to get a result that would have allowed us to reject the null hypothesis, or maybe there are some other factors affecting the results that we didn’t account for. This is similar to an “innocent until proven guilty” stance.

Errors in inference

For the most part, the statistical tests we use are based on probability, and our data could always be the result of chance. Considering the coin flipping example above, if we did flip a coin 100 times and came up with 95 heads, we would be compelled to conclude that the coin was not fair. But 95 heads could happen with a fair coin strictly by chance.

We can, therefore, make two kinds of errors in testing the null hypothesis:

• A Type I error occurs when the null hypothesis really is true, but based on our decision rule we reject the null hypothesis. In this case, our result is a false positive ; we think there is an effect (unfair coin, association between variables, difference among groups) when really there isn’t. The probability of making this kind error is alpha , the same alpha we used in our decision rule.

• A Type II error occurs when the null hypothesis is really false, but based on our decision rule we fail to reject the null hypothesis. In this case, our result is a false negative ; we have failed to find an effect that really does exist. The probability of making this kind of error is called beta .

The following table summarizes these errors.

Reality ___________________________________ Decision of Test Null is true Null is false Reject null hypothesis Type I error Correctly (prob. = alpha) reject null (prob. = 1 – beta) Retain null hypothesis Correctly Type II error retain null (prob. = beta) (prob. = 1 – alpha)

Statistical power

The statistical power of a test is a measure of the ability of the test to detect a real effect. It is related to the effect size, the sample size, and our chosen alpha level.

The effect size is a measure of how unfair a coin is, how strong the association is between two variables, or how large the difference is among groups. As the effect size increases or as the number of observations we collect increases, or as the alpha level increases, the power of the test increases.

Statistical power in the table above is indicated by 1 – beta , and power is the probability of correctly rejecting the null hypothesis.

An example should make these relationship clear. Imagine we are sampling a large group of 7 th grade students for their height. That is, the group is the population, and we are sampling a sub-set of these students. In reality, for students in the population, the girls are taller than the boys, but the difference is small (that is, the effect size is small), and there is a lot of variability in students’ heights. You can imagine that in order to detect the difference between girls and boys that we would have to measure many students. If we fail to sample enough students, we might make a Type II error. That is, we might fail to detect the actual difference in heights between sexes.

If we had a different experiment with a larger effect size—for example the weight difference between mature hamsters and mature hedgehogs—we might need fewer samples to detect the difference.

Note also, that our chosen alpha plays a role in the power of our test, too. All things being equal, across many tests, if we decrease our alph a, that is, insist on a lower rate of Type I errors, we are more likely to commit a Type II error, and so have a lower power. This is analogous to a case of a meticulous jury that has a very high standard of proof to convict someone. In this case, the likelihood of a false conviction is low, but the likelihood of a letting a guilty person go free is relatively high.

The 0.05 alpha value is not dogma

The level of alpha is traditionally set at 0.05 in some disciplines, though there is sometimes reason to choose a different value.

One situation in which the alpha level is increased is in preliminary studies in which it is better to include potentially significant effects even if there is not strong evidence for keeping them. In this case, the researcher is accepting an inflated chance of Type I errors in order to decrease the chance of Type II errors.

Imagine an experiment in which you wanted to see if various environmental treatments would improve student learning. In a preliminary study, you might have many treatments, with few observations each, and you want to retain any potentially successful treatments for future study. For example, you might try playing classical music, improved lighting, complimenting students, and so on, and see if there is any effect on student learning. You might relax your alpha value to 0.10 or 0.15 in the preliminary study to see what treatments to include in future studies.

On the other hand, in situations where a Type I, false positive, error might be costly in terms of money or people’s health, a lower alpha can be used, perhaps, 0.01 or 0.001. You can imagine a case in which there is an established treatment for cancer, and a new treatment is being tested. Because the new treatment is likely to be expensive and to hold people’s lives in the balance, a researcher would want to be very sure that the new treatment is more effective than the established treatment. In reality, the researchers would not just lower the alpha level, but also look at the effect size, submit the research for peer review, replicate the study, be sure there were no problems with the design of the study or the data collection, and weigh the practical implications.

The 0.05 alpha value is almost dogma

In theory, as a researcher, you would determine the alpha level you feel is appropriate. That is, the probability of making a Type I error when the null hypothesis is in fact true.

In reality, though, 0.05 is almost always used in most fields for readers of this book. Choosing a different alpha value will rarely go without question. It is best to keep with the 0.05 level unless you have good justification for another value, or are in a discipline where other values are routinely used.

Practical advice

One good practice is to report actual p -values from analyses. It is fine to also simply say, e.g. “The dependent variable was significantly correlated with variable A ( p < 0.05).” But I prefer when possible to say, “The dependent variable was significantly correlated with variable A ( p = 0.026).

It is probably best to avoid using terms like “marginally significant” or “borderline significant” for p -values less than 0.10 but greater than 0.05, though you might encounter similar phrases. It is better to simply report the p -values of tests or effects in straight-forward manner. If you had cause to include certain model effects or results from other tests, they can be reported as e.g., “Variables correlated with the dependent variable with p < 0.15 were A , B , and C .”

Is the p -value every really true?

Considering some of the examples presented, it may have occurred to the reader to ask if the null hypothesis is ever really true. For example, in some population of 7 th graders, if we could measure everyone in the population to a high degree of precision, then there must be some difference in height between girls and boys. This is an important limitation of null hypothesis significance testing. Often, if we have many observations, even small effects will be reported as significant. This is one reason why it is important to not rely too heavily on p -values, but to also look at the size of the effect and practical considerations. In this example, if we sampled many students and the difference in heights was 0.5 cm, even if significant, we might decide that this effect is too small to be of practical importance, especially relative to an average height of 150 cm. (Here, the difference would be 0.3% of the average height).

Effect sizes and practical importance

Practical importance and statistical significance.

It is important to remember to not let p -values be the only guide for drawing conclusions. It is equally important to look at the size of the effects you are measuring, as well as take into account other practical considerations like the costs of choosing a certain path of action.

For example, imagine we want to compare the SAT scores of two SAT preparation classes with a t -test.

Class.A = c(1500, 1505, 1505, 1510, 1510, 1510, 1515, 1515, 1520, 1520) Class.B = c(1510, 1515, 1515, 1520, 1520, 1520, 1525, 1525, 1530, 1530) t.test(Class.A, Class.B)

Welch Two Sample t-test t = -3.3968, df = 18, p-value = 0.003214 mean of x mean of y 1511 1521

The p -value is reported as 0.003, so we would consider there to be a significant difference between the two classes ( p < 0.05).

But we have to ask ourselves the practical question, is a difference of 10 points on the SAT large enough for us to care about? What if enrolling in one class costs significantly more than the other class? Is it worth the extra money for a difference of 10 points on average?

Sizes of effects

It should be remembered that p -values do not indicate the size of the effect being studied. It shouldn’t be assumed that a small p -value indicates a large difference between groups, or vice-versa.

For example, in the SAT example above, the p -value is fairly small, but the size of the effect (difference between classes) in this case is relatively small (10 points, especially small relative to the range of scores students receive on the SAT).

In converse, there could be a relatively large size of the effects, but if there is a lot of variability in the data or the sample size is not large enough, the p -value could be relatively large.

In this example, the SAT scores differ by 100 points between classes, but because the variability is greater than in the previous example, the p -value is not significant.

Class.C = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500) Class.D = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600) t.test(Class.C, Class.D)

Welch Two Sample t-test t = -1.4174, df = 18, p-value = 0.1735 mean of x mean of y 1290 1390

boxplot(cbind(Class.C, Class.D))

p -values and sample sizes

It should also be remembered that p -values are affected by sample size. For a given effect size and variability in the data, as the sample size increases, the p -value is likely to decrease. For large data sets, small effects can result in significant p -values.

As an example, let’s take the data from Class.C and Class.D and double the number of observations for each without changing the distribution of the values in each, and rename them Class.E and Class.F .

Class.E = c(1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500, 1000, 1100, 1200, 1250, 1300, 1300, 1400, 1400, 1450, 1500) Class.F = c(1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600, 1100, 1200, 1300, 1350, 1400, 1400, 1500, 1500, 1550, 1600) t.test(Class.E, Class.F)

Welch Two Sample t-test t = -2.0594, df = 38, p-value = 0.04636 mean of x mean of y 1290 1390

boxplot(cbind(Class.E, Class.F))

Notice that the p -value is lower for the t -test for Class.E and Class.F than it was for Class.C and Class.D . Also notice that the means reported in the output are the same, and the box plots would look the same.

Effect size statistics

One way to account for the effect of sample size on our statistical tests is to consider effect size statistics. These statistics reflect the size of the effect in a standardized way, and are unaffected by sample size.

An appropriate effect size statistic for a t -test is Cohen’s d . It takes the difference in means between the two groups and divides by the pooled standard deviation of the groups. Cohen’s d equals zero if the means are the same, and increases to infinity as the difference in means increases relative to the standard deviation.

In the following, note that Cohen’s d is not affected by the sample size difference in the Class.C / Class.D and the Class.E / Class.F examples.

library(lsr) cohensD(Class.C, Class.D, method = "raw")

cohensD(Class.E, Class.F, method = "raw")

Effect size statistics are standardized so that they are not affected by the units of measurements of the data. This makes them interpretable across different situations, or if the reader is not familiar with the units of measurement in the original data. A Cohen’s d of 1 suggests that the two means differ by one pooled standard deviation. A Cohen’s d of 0.5 suggests that the two means differ by one-half the pooled standard deviation.

For example, if we create new variables— Class.G and Class.H —that are the SAT scores from the previous example expressed as a proportion of a 1600 score, Cohen’s d will be the same as in the previous example.

Class.G = Class.E / 1600 Class.H = Class.F / 1600 Class.G Class.H cohensD(Class.G, Class.H, method="raw")

Good practices for statistical analyses

Statistics is not like a trial.

When analyzing data, the analyst should not approach the task as would a lawyer for the prosecution. That is, the analyst should not be searching for significant effects and tests, but should instead be like an independent investigator using lines of evidence to find out what is most likely to true given the data, graphical analysis, and statistical analysis available.

The problem of multiple p -values

One concept that will be in important in the following discussion is that when there are multiple tests producing multiple p -values, that there is an inflation of the Type I error rate. That is, there is a higher chance of making false-positive errors.

This simply follows mathematically from the definition of alpha . If we allow a probability of 0.05, or 5% chance, of making a Type I error for any one test, as we do more and more tests, the chances that at least one of them having a false positive becomes greater and greater.

p -value adjustment

One way we deal with the problem of multiple p -values in statistical analyses is to adjust p -values when we do a series of tests together (for example, if we are comparing the means of multiple groups).

Don’t use Bonferroni adjustments

There are various p -value adjustments available in R. In some cases, we will use FDR, which stands for false discovery rate , and in R is an alias for the Benjamini and Hochberg method. There are also cases in which we’ll use Tukey range adjustment to correct for the family-wise error rate.

Unfortunately, students in analysis of experiments courses often learn to use Bonferroni adjustment for p -values. This method is simple to do with hand calculations, but is excessively conservative in most situations, and, in my opinion, antiquated.

There are other p -value adjustment methods, and the choice of which one to use is dictated either by which are common in your field of study, or by doing enough reading to understand which are statistically most appropriate for your application.

Preplanned tests

The statistical tests covered in this book assume that tests are preplanned for their p -values to be accurate. That is, in theory, you set out an experiment, collect the data as planned, and then say “I’m going to analyze it with kind of model and do these post-hoc tests afterwards”, and report these results, and that’s all you would do.

Some authors emphasize this idea of preplanned tests. In contrast is an exploratory data analysis approach that relies upon examining the data with plots and using simple tests like correlation tests to suggest what statistical analysis makes sense.

If an experiment is set out in a specific design, then usually it is appropriate to use the analysis suggested by this design.

p -value hacking

It is important when approaching data from an exploratory approach, to avoid committing p -value hacking. Imagine the case in which the researcher collects many different measurements across a range of subjects. The researcher might be tempted to simply try different tests and models to relate one variable to another, for all the variables. He might continue to do this until he found a test with a significant p -value.

But this would be a form of p -value hacking.

Because an alpha value of 0.05 allows us to make a false-positive error five percent of the time, finding one p -value below 0.05 after several successive tests may simply be due to chance.

Some forms of p -value hacking are more egregious. For example, if one were to collect some data, run a test, and then continue to collect data and run tests iteratively until a significant p -value is found.

Publication bias

A related issue in science is that there is a bias to publish, or to report, only significant results. This can also lead to an inflation of the false-positive rate. As a hypothetical example, imagine if there are currently 20 similar studies being conducted testing a similar effect—let’s say the effect of glucosamine supplements on joint pain. If 19 of those studies found no effect and so were discarded, but one study found an effect using an alpha of 0.05, and was published, is this really any support that glucosamine supplements decrease joint pain?

Clarification of terms and reporting on assignments

"statistically significant".

In the context of this book, the term "significant" means "statistically significant".

Whenever the decision rule finds that p < alpha , the difference in groups, the association, or the correlation under consideration is then considered "statistically significant" or "significant".

No effect size or practical considerations enter into determining whether an effect is “significant” or not. The only exception is that test assumptions and requirements for appropriate data must also be met in order for the p -value to be valid.

What you need to consider :

• The null hypothesis

• p , alpha , and the decision rule,

• Your result. That is, whether the difference in groups, the association, or the correlation is significant or not.

What you should report on your assignments:

• The p -value

• The conclusion, e.g. "There was a significant difference in the mean heights of boys and girls in the class." It is best to preface this with the "reject" or "fail to reject" language concerning your decision about the null hypothesis.

“Size of the effect” / “effect size”

In the context of this book, I use the term "size of the effect" to suggest the use of summary statistics to indicate how large an effect is. This may be, for example the difference in two medians. I try reserve the term “effect size” to refer to the use of effect size statistics. This distinction isn’t necessarily common.

Usually you will consider an effect in relation to the magnitude of measurements. That is, you might look at the difference in medians as a percent of the median of one group or of the global median. Or, you might look at the difference in medians in relation to the range of answers. For example, a one-point difference on a 5-point Likert item. Counts might be expressed as proportions of totals or subsets.

What you should report on assignments :

• The size of the effect. That is, the difference in medians or means, the difference in counts, or the proportions of counts among groups.

• Where appropriate, the size of the effect expressed as a percentage or proportion.

• If there is an effect size statistic—such as r , epsilon -squared, phi , Cramér's V , or Cohen's d —: report this and its interpretation (small, medium, large), and incorporate this into your conclusion.

"Practical" / "Practical importance"

If there is a significant result, the question of practical importance asks if the difference or association is large enough to matter in the real world.

If there is no significant result, the question of practical importance asks if the a difference or association is large enough to warrant another look, for example by running another test with a larger sample size or that controls variability in observations better.

• Your conclusion as to whether this effect is large enough to be important in the real world.

• The context, explanation, or support to justify your conclusion.

• In some cases you might include considerations that aren't included in the data presented. Examples might include the cost of one treatment over another, including time investment, or whether there is a large risk in selecting one treatment over another (e.g., if people's lives are on the line).

A few of xkcd comics

Significant.

xkcd.com/882/

Null hypothesis

xkcd.com/892/

xkcd.com/1478/

Experiments, sampling, and causation

Types of experimental designs, experimental designs.

A true experimental design assigns treatments in a systematic manner. The experimenter must be able to manipulate the experimental treatments and assign them to subjects. Since treatments are randomly assigned to subjects, a causal inference can be made for significant results. That is, we can say that the variation in the dependent variable is caused by the variation in the independent variable.

For interval/ratio data, traditional experimental designs can be analyzed with specific parametric models, assuming other model assumptions are met. These traditional experimental designs include:

• Completely random design

• Randomized complete block design

• Factorial

• Split-plot

• Latin square

Quasi-experiment designs

Often a researcher cannot assign treatments to individual experimental units, but can assign treatments to groups. For example, if students are in a specific grade or class, it would not be practical to randomly assign students to grades or classes. But different classes could receive different treatments (such as different curricula). Causality can be inferred cautiously if treatments are randomly assigned and there is some understanding of the factors that affect the outcome.

Observational studies

In observational studies, the independent variables are not manipulated, and no treatments are assigned. Surveys are often like this, as are studies of natural systems without experimental manipulation. Statistical analysis can reveal the relationships among variables, but causality cannot be inferred. This is because there may be other unstudied variables that affect the measured variables in the study.

Good sampling practices are critical for producing good data. In general, samples need to be collected in a random fashion so that bias is avoided.

In survey data, bias is often introduced by a self-selection bias. For example, internet or telephone surveys include only those who respond to these requests. Might there be some relevant difference in the variables of interest between those who respond to such requests and the general population being surveyed? Or bias could be introduced by the researcher selecting some subset of potential subjects, for example only surveying a 4-H program with particularly cooperative students and ignoring other clubs. This is sometimes called “convenience sampling”.

In election forecasting, good pollsters need to account for selection bias and other biases in the survey process. For example, if a survey is done by landline telephone, those being surveyed are more likely to be older than the general population of voters, and so likely to have a bias in their voting patterns.

Plan ahead and be consistent

It is sometimes necessary to change experimental conditions during the course of an experiment. Equipment might fail, or unusual weather may prevent making meaningful measurements.

But in general, it is much better to plan ahead and be consistent with measurements.

Consistency

People sometimes have the tendency to change measurement frequency or experimental treatments during the course of a study. This inevitably causes headaches in trying to analyze data, and makes writing up the results messy. Try to avoid this.

Controls and checks

If you are testing an experimental treatment, include a check treatment that almost certainly will have an effect and a control treatment that almost certainly won’t. A control treatment will receive no treatment and a check treatment will receive a treatment known to be successful. In an educational setting, perhaps a control group receives no instruction on the topic but on another topic, and the check group will receive standard instruction.

Including checks and controls helps with the analysis in a practical sense, since they serve as standard treatments against which to compare the experimental treatments. In the case where the experimental treatments have similar effects, controls and checks allow you say, for example, “Means for the all experimental treatments were similar, but were higher than the mean for control, and lower than the mean for check treatment.”

Include alternate measurements

It often happens that measuring equipment fails or that a certain measurement doesn’t produce the expected results. It is therefore helpful to include measurements of several variables that can capture the potential effects. Perhaps test scores of students won’t show an effect, but a self-assessment question on how much students learned will.

Include covariates

Including additional independent variables that might affect the dependent variable is often helpful in an analysis. In an educational setting, you might assess student age, grade, school, town, background level in the subject, or how well they are feeling that day.

The effects of covariates on the dependent variable may be of interest in itself. But also, including co-variates in an analysis can better model the data, sometimes making treatment effects more clear or making a model better meet model assumptions.

Optional discussion: Alternative methods to the Null Hypothesis Significance Test

The nhst controversy.

Particularly in the fields of psychology and education, there has been much criticism of the null hypothesis significance test approach. From my reading, the main complaints against NHST tend to be:

• Students and researchers don’t really understand the meaning of p -values.

• p -values don’t include important information like confidence intervals or parameter estimates.

• p -values have properties that may be misleading, for example that they do not represent effect size, and that they change with sample size.

• We often treat an alpha of 0.05 as a magical cutoff value.

Personally, I don’t find these to be very convincing arguments against the NHST approach.

The first complaint is in some sense pedantic: Like so many things, students and researchers learn the definition of p -values at some point and then eventually forget. This doesn’t seem to impact the usefulness of the approach.

The second point has weight only if researchers use only p -values to draw conclusions from statistical tests. As this book points out, one should always consider the size of the effects and practical considerations of the effects, as well present finding in table or graphical form, including confidence intervals or measures of dispersion. There is no reason why parameter estimates, goodness-of-fit statistics, and confidence intervals can’t be included when a NHST approach is followed.

The properties in the third point also don’t count much as criticism if one is using p -values correctly. One should understand that it is possible to have a small effect size and a small p -value, and vice-versa. This is not a problem, because p -values and effect sizes are two different concepts. We shouldn’t expect them to be the same. The fact that p -values change with sample size is also in no way problematic to me. It makes sense that when there is a small effect size or a lot of variability in the data that we need many samples to conclude the effect is likely to be real.

(One case where I think the considerations in the preceding point are commonly problematic is when people use statistical tests to check for the normality or homogeneity of data or model residuals. As sample size increases, these tests are better able to detect small deviations from normality or homoscedasticity. Too many people use them and think their model is inappropriate because the test can detect a small effect size, that is, a small deviation from normality or homoscedasticity).

The fourth point is a good one. It doesn’t make much sense to come to one conclusion if our p -value is 0.049 and the opposite conclusion if our p -value is 0.051. But I think this can be ameliorated by reporting the actual p -values from analyses, and relying less on p -values to evaluate results.

Overall it seems to me that these complaints condemn poor practices that the authors observe: not reporting the size of effects in some manner; not including confidence intervals or measures of dispersion; basing conclusions solely on p -values; and not including important results like parameter estimates and goodness-of-fit statistics.

Alternatives to the NHST approach

Estimates and confidence intervals.

One approach to determining statistical significance is to use estimates and confidence intervals. Estimates could be statistics like means, medians, proportions, or other calculated statistics. This approach can be very straightforward, easy for readers to understand, and easy to present clearly.

Bayesian approach

The most popular competitor to the NHST approach is Bayesian inference. Bayesian inference has the advantage of calculating the probability of the hypothesis given the data , which is what we thought we should be doing in the “Wait, does this make any sense?” section above. Essentially it takes prior knowledge about the distribution of the parameters of interest for a population and adds the information from the measured data to reassess some hypothesis related to the parameters of interest. If the reader will excuse the vagueness of this description, it makes intuitive sense. We start with what we suspect to be the case, and then use new data to assess our hypothesis.

One disadvantage of the Bayesian approach is that it is not obvious in most cases what could be used for legitimate prior information. A second disadvantage is that conducting Bayesian analysis is not as straightforward as the tests presented in this book.

References and further reading

[Video] “Understanding statistical inference” from Statistics Learning Center (Dr. Nic). 2015. www.youtube.com/watch?v=tFRXsngz4UQ .

[Video] “Hypothesis tests, p-value” from Statistics Learning Center (Dr. Nic). 2011. www.youtube.com/watch?v=0zZYBALbZgg .

[Video] “Understanding the p-value” from Statistics Learning Center (Dr. Nic). 2011.

www.youtube.com/watch?v=eyknGvncKLw .

[Video] “Important statistical concepts: significance, strength, association, causation” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=FG7xnWmZlPE .

“Understanding statistical inference” from Dr. Nic. 2015. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/understanding-statistical-inference/ .

“Basic concepts of hypothesis testing” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/hypothesistesting.html .

“Hypothesis testing” , section 4.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Hypothesis Testing with One Sample”, sections 9.1–9.2 in Openstax. 2013. Introductory Statistics . openstax.org/textbooks/introductory-statistics .

"Proving causation" from Dr. Nic. 2013. Learn and Teach Statistics & Operations Research. creativemaths.net/blog/proving-causation/ .

[Video] “Variation and Sampling Error” from Statistics Learning Center (Dr. Nic). 2014. www.youtube.com/watch?v=y3A0lUkpAko .

[Video] “Sampling: Simple Random, Convenience, systematic, cluster, stratified” from Statistics Learning Center (Dr. Nic). 2012. www.youtube.com/watch?v=be9e-Q-jC-0 .

“Confounding variables” in McDonald, J.H. 2014. Handbook of Biological Statistics . www.biostathandbook.com/confounding.html .

“Overview of data collection principles” , section 1.3, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Observational studies and sampling strategies” , section 1.4, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

“Experiments” , section 1.5, in Diez, D.M., C.D. Barr , and M. Çetinkaya-Rundel. 2012. OpenIntro Statistics , 2nd ed. www.openintro.org/ .

Exercises F

1. Which of the following pair is the null hypothesis?

A) The number of heads from the coin is not different from the number of tails.

B) The number of heads from the coin is different from the number of tails.

2. Which of the following pair is the null hypothesis?

A) The height of boys is different than the height of girls.

B) The height of boys is not different than the height of girls.

3. Which of the following pair is the null hypothesis?

A) There is an association between classroom and sex. That is, there is a difference in counts of girls and boys between the classes.

B) There is no association between classroom and sex. That is, there is no difference in counts of girls and boys between the classes.

4. We flip a coin 10 times and it lands on heads 7 times. We want to know if the coin is fair.

a. What is the null hypothesis?

b. Looking at the code below, and assuming an alpha of 0.05,

What do you decide (use the reject or fail to reject language)?

c. In practical terms, what do you conclude?

binom.test(7, 10, 0.5)

Exact binomial test number of successes = 7, number of trials = 10, p-value = 0.3438

5. We measure the height of 9 boys and 9 girls in a class, in centimeters. We want to know if one group is taller than the other.

c. In practical terms, what do you conclude? Address the practical importance of the results.

Girls = c(152, 150, 140, 160, 145, 155, 150, 152, 147) Boys = c(144, 142, 132, 152, 137, 147, 142, 144, 139) t.test(Girls, Boys)

Welch Two Sample t-test t = 2.9382, df = 16, p-value = 0.009645 mean of x mean of y 150.1111 142.1111

mean(Boys) sd(Boys) quantile(Boys)

mean(Girls) sd(Girls) quantile(Girls) boxplot(cbind(Girls, Boys))

6. We count the number of boys and girls in two classrooms. We are interested to know if there is an association between the classrooms and the number of girls and boys. That is, does the proportion of boys and girls differ statistically across the two classrooms?

Classroom Girls Boys A 13 7 B 5 15

Input =(" Classroom Girls Boys A 13 7 B 5 15 ") Matrix = as.matrix(read.table(textConnection(Input), header=TRUE, row.names=1)) fisher.test(Matrix)

Fisher's Exact Test for Count Data p-value = 0.02484

Matrix rowSums(Matrix) colSums(Matrix) prop.table(Matrix, margin=1) ### Proportions for each row barplot(t(Matrix), beside = TRUE, legend = TRUE, ylim = c(0, 25), xlab = "Class", ylab = "Count")

7. Why should you not rely solely on p -values to make a decision in the real world? (You should have at least two reasons.)

8. Create your own example to show the importance of considering the size of the effect . Describe the scenario: what the research question is, and what kind of data were collected. You may make up data and provide real results, or report hypothetical results.

9. Create your own example to show the importance of weighing other practical considerations . Describe the scenario: what the research question is, what kind of data were collected, what statistical results were reached, and what other practical considerations were brought to bear.

10. What is 5e-4 in common decimal notation?

Non-commercial reproduction of this content, with attribution, is permitted. For-profit reproduction without permission is prohibited.

If you use the code or information in this site in a published work, please cite it as a source. Also, if you are an instructor and use this book in your course, please let me know. My contact information is on the About the Author of this Book page.

Mangiafico, S.S. 2016. Summary and Analysis of Extension Program Evaluation in R, version 1.20.05, revised 2023. rcompanion.org/handbook/ . (Pdf version: rcompanion.org/documents/RHandbookProgramEvaluation.pdf .)

Statistics with R
R Objects, Numbers, Attributes, Vectors, Coercion
Matrices, Lists, Factors
Data Frames in R
Control Structures in R
Functions in R
Data Basics: Compute Summary Statistics in R
Central Tendency and Spread in R Programming
Data Basics: Plotting – Charts and Graphs
Normal Distribution in R
Skewness of statistical data
Bernoulli Distribution in R
Binomial Distribution in R Programming
Compute Randomly Drawn Negative Binomial Density in R Programming
Poisson Functions in R Programming
How to Use the Multinomial Distribution in R
Beta Distribution in R
Chi-Square Distribution in R
Exponential Distribution in R Programming
Log Normal Distribution in R
Continuous Uniform Distribution in R
Understanding the t-distribution in R
Gamma Distribution in R Programming
How to Calculate Conditional Probability in R?

How to Plot a Weibull Distribution in R

Hypothesis testing in r programming.

One Sample T-test in R
Two sample T-test in R
Paired Sample T-test in R
Type I Error in R
Type II Error in R
Confidence Intervals in R
Covariance and Correlation in R
Covariance Matrix in R
Pearson Correlation in R
Normal Probability Plot in R

Hypothesis testing is a statistical method used to determine whether there is enough evidence to reject a null hypothesis in favor of an alternative hypothesis. In R programming, you can perform various types of hypothesis tests, such as t-tests, chi-squared tests, and ANOVA tests, among others.

In R programming, you can perform hypothesis testing using various built-in functions. Here’s an overview of some commonly used hypothesis testing methods in R:

T-test (one-sample, paired, and independent two-sample)
Chi-square test
ANOVA (Analysis of Variance)
Wilcoxon signed-rank test
Mann-Whitney U test

1. One-sample t-test:

The one-sample t-test is used to compare the mean of a sample to a known value (usually a population mean) to see if there is a significant difference.

2. Two-sample t-test:

The two-sample t-test is used to compare the means of two independent samples to see if there is a significant difference.

3. Paired t-test:

The paired t-test is used to compare the means of two dependent samples, usually to test the effect of a treatment or intervention.

4. Chi-squared test:

The chi-squared test is used to test the association between two categorical variables.

5. One-way ANOVA

For a one-way ANOVA, use the aov() and summary() functions:

6. Wilcoxon signed-rank test

7. mann-whitney u test.

For a Mann-Whitney U test, use the wilcox.test() function with the paired argument set to FALSE :

Steps for conducting a Hypothesis Testing

Hypothesis testing is a statistical method used to make inferences about population parameters based on sample data. In R programming, you can perform various types of hypothesis tests, such as t-tests, chi-squared tests, and ANOVA, depending on the nature of your data and research question.

Here, I’ll walk you through the steps for conducting a t-test (one of the most common hypothesis tests) in R. A t-test is used to compare the means of two groups, often in order to determine whether there’s a significant difference between them.

1. Prepare your data:

First, you’ll need to have your data in R. You can either read data from a file (e.g., using read.csv() ), or you can create vectors directly in R. For this example, I’ll create two sample vectors for Group 1 and Group 2:

2. State your null and alternative hypotheses:

In hypothesis testing, we start with a null hypothesis (H0) and an alternative hypothesis (H1). For a t-test, the null hypothesis is typically that there’s no difference between the means of the two groups, while the alternative hypothesis is that there is a difference. In this example:

H0: μ1 = μ2 (the means of Group 1 and Group 2 are equal)
H1: μ1 ≠ μ2 (the means of Group 1 and Group 2 are not equal)

3. Perform the t-test:

Use the t.test() function to perform the t-test on your data. You can specify the type of t-test (independent samples, paired, or one-sample) with the appropriate arguments. In this case, we’ll perform an independent samples t-test:

4. Interpret the results:

The t-test result will include the t-value, degrees of freedom, and the p-value, among other information. The p-value is particularly important, as it helps you determine whether to accept or reject the null hypothesis. A common significance level (alpha) is 0.05. If the p-value is less than alpha, you can reject the null hypothesis, otherwise you fail to reject it.

5. Make a decision:

Based on the p-value and your chosen significance level, make a decision about whether to reject or fail to reject the null hypothesis. If the p-value is less than 0.05, you would reject the null hypothesis and conclude that there is a significant difference between the means of the two groups.

Keep in mind that this example demonstrates the basic process of hypothesis testing using a t-test in R. Different tests and data may require additional steps, arguments, or functions. Be sure to consult R documentation and resources to ensure you’re using the appropriate test and interpreting the results correctly.

Few more examples of hypothesis tests using R

1. one-sample t-test: compares the mean of a sample to a known value., 2. two-sample t-test: compares the means of two independent samples., 3. paired t-test: compares the means of two paired samples., 4. chi-squared test: tests the independence between two categorical variables., 5. anova: compares the means of three or more independent samples..

Remember to interpret the results (p-value) according to the significance level (commonly 0.05). If the p-value is less than the significance level, you can reject the null hypothesis in favor of the alternative hypothesis.

T-Test in R Programming

Hypothesis Testing in R: A Comprehensive Guide with Code and Examples

Hypothesis testing is a fundamental statistical technique used to make inferences about population parameters based on sample data. It helps us determine whether an observed effect or difference is statistically significant or if it could have occurred by random chance. In this guide, we’ll explore hypothesis testing in R, a powerful statistical programming language, through practical examples and code snippets.

Understanding the Hypothesis Testing Process

Before diving into code examples, let’s grasp the key concepts of hypothesis testing:

Null Hypothesis (H0): This is the default assumption that there is no significant effect, difference, or relationship in the population. It’s often denoted as H0.
Alternative Hypothesis (Ha): This is the statement we want to test; it asserts that there is a significant effect, difference, or relationship in the population. It’s often denoted as Ha.
Significance Level (α): This is the predetermined threshold that defines when we reject the null hypothesis. Common values are 0.05 or 0.01, representing a 5% or 1% chance of making a Type I error (false positive), respectively.
Test Statistic: A statistic calculated from the sample data that measures the strength of evidence against the null hypothesis.
P-value: The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data under the null hypothesis. A smaller p-value suggests stronger evidence against the null hypothesis.
Decision Rule: Based on the p-value, we decide whether to reject the null hypothesis. If the p-value is less than α, we reject H0; otherwise, we fail to reject it.

Now, let’s explore some practical examples of hypothesis testing in R.

Example 1: One-Sample T-Test

Suppose we have a dataset of exam scores, and we want to test if the average score is significantly different from 75.

In this example, we perform a one-sample t-test to determine if the sample mean is significantly different from 75. The resulting p-value will help us make the decision.

Example 2: Two-Sample T-Test

Let’s say we want to compare the exam scores of two different classes (Class A and Class B) to see if there is a significant difference between their average scores.

Here, we perform a two-sample t-test to compare the means of two independent samples (Class A and Class B).

Example 3: Chi-Square Test of Independence

Suppose we have data on the preferred mode of transportation for two groups of people (Group X and Group Y), and we want to test if there is an association between the groups and their transportation preferences.

In this example, we use a chi-square test to determine if there is an association between the groups and their transportation preferences.

Hypothesis testing is a powerful tool for making data-driven decisions in various fields, from medicine to business. In R, you can conduct a wide range of hypothesis tests using built-in functions and libraries like t.test() and chisq.test() . Remember to set your significance level appropriately, and interpret the results cautiously based on the p-value.

By mastering hypothesis testing in R, you’ll be better equipped to draw meaningful conclusions from your data and make informed decisions in your research and analysis.

Previous post.

How to Use A Log Transformation in R To Rescale Your Data

5 Best Practices for Database Partitioning in Cloud Environments

Horizontal and Vertical Partitioning in Databases

Beginner’s Guide to Tidying Up Your Datasets: Data Cleaning 101 with Proven Strategies

linearHypothesis: Test Linear Hypothesis

Description.

Generic function for testing a linear hypothesis, and methods for linear models, generalized linear models, multivariate linear models, linear and generalized linear mixed-effects models, generalized linear models fit with svyglm in the survey package, robust linear models fit with rlm in the MASS package, and other models that have methods for coef and vcov . For mixed-effects models, the tests are Wald chi-square tests for the fixed effects.

lht(model, ...)

# S3 method for default linearHypothesis(model, hypothesis.matrix, rhs=NULL, test=c("Chisq", "F"), vcov.=NULL, singular.ok=FALSE, verbose=FALSE, coef. = coef(model), suppress.vcov.msg=FALSE, error.df, ...)

# S3 method for lm linearHypothesis(model, hypothesis.matrix, rhs=NULL, test=c("F", "Chisq"), vcov.=NULL, white.adjust=c(FALSE, TRUE, "hc3", "hc0", "hc1", "hc2", "hc4"), singular.ok=FALSE, ...)

# S3 method for glm linearHypothesis(model, ...)

# S3 method for lmList linearHypothesis(model, ..., vcov.=vcov, coef.=coef)

# S3 method for nlsList linearHypothesis(model, ..., vcov.=vcov, coef.=coef)

# S3 method for mlm linearHypothesis(model, hypothesis.matrix, rhs=NULL, SSPE, V, test, idata, icontrasts=c("contr.sum", "contr.poly"), idesign, iterms, check.imatrix=TRUE, P=NULL, title="model hypothesis testing in r", singular.ok=FALSE, verbose=FALSE, ...) # S3 method for polr linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov., verbose=FALSE, ...) # S3 method for linearHypothesis.mlm print(x, SSP=TRUE, SSPE=SSP, digits=getOption("digits"), ...) # S3 method for lme linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov.=NULL, singular.ok=FALSE, verbose=FALSE, ...) # S3 method for mer linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov.=NULL, test=c("Chisq", "F"), singular.ok=FALSE, verbose=FALSE, ...) # S3 method for merMod linearHypothesis(model, hypothesis.matrix, rhs=NULL, vcov.=NULL, test=c("Chisq", "F"), singular.ok=FALSE, verbose=FALSE, ...) # S3 method for svyglm linearHypothesis(model, ...)

# S3 method for rlm linearHypothesis(model, ...)

# S3 method for survreg linearHypothesis(model, hypothesis.matrix, rhs=NULL, test=c("Chisq", "F"), vcov., verbose=FALSE, ...) matchCoefs(model, pattern, ...)

# S3 method for default matchCoefs(model, pattern, coef.=coef, ...)

# S3 method for lme matchCoefs(model, pattern, ...)

# S3 method for mer matchCoefs(model, pattern, ...)

# S3 method for merMod matchCoefs(model, pattern, ...)

# S3 method for mlm matchCoefs(model, pattern, ...)

# S3 method for lmList matchCoefs(model, pattern, ...)

For a univariate model, an object of class "anova"

which contains the residual degrees of freedom in the model, the difference in degrees of freedom, Wald statistic (either "F" or "Chisq" ), and corresponding p value. The value of the linear hypothesis and its covariance matrix are returned respectively as "value" and "vcov" attributes of the object (but not printed).

For a multivariate linear model, an object of class

"linearHypothesis.mlm" , which contains sums-of-squares-and-product matrices for the hypothesis and for error, degrees of freedom for the hypothesis and error, and some other information.

The returned object normally would be printed.

fitted model object. The default method of linearHypothesis works for models for which the estimated parameters can be retrieved by coef and the corresponding estimated covariance matrix by vcov . See the Details for more information.

matrix (or vector) giving linear combinations of coefficients by rows, or a character vector giving the hypothesis in symbolic form (see Details ).

right-hand-side vector for hypothesis, with as many entries as rows in the hypothesis matrix; can be omitted, in which case it defaults to a vector of zeroes. For a multivariate linear model, rhs is a matrix, defaulting to 0. This argument isn't available for F-tests for linear mixed models.

if FALSE (the default), a model with aliased coefficients produces an error; if TRUE , the aliased coefficients are ignored, and the hypothesis matrix should not have columns for them. For a multivariate linear model: will return the hypothesis and error SSP matrices even if the latter is singular; useful for computing univariate repeated-measures ANOVAs where there are fewer subjects than df for within-subject effects.

For the default linearHypothesis method, if an F-test is requested and if error.df is missing, the error degrees of freedom will be computed by applying the df.residual function to the model; if df.residual returns NULL or NA , then a chi-square test will be substituted for the F-test (with a message to that effect.

an optional data frame giving a factor or factors defining the intra-subject model for multivariate repeated-measures data. See Details for an explanation of the intra-subject design and for further explanation of the other arguments relating to intra-subject factors.

names of contrast-generating functions to be applied by default to factors and ordered factors, respectively, in the within-subject ``data''; the contrasts must produce an intra-subject model matrix in which different terms are orthogonal.

a one-sided model formula using the ``data'' in idata and specifying the intra-subject design.

the quoted name of a term, or a vector of quoted names of terms, in the intra-subject design to be tested.

check that columns of the intra-subject model matrix for different terms are mutually orthogonal (default, TRUE ). Set to FALSE only if you have already checked that the intra-subject model matrix is block-orthogonal.

transformation matrix to be applied to the repeated measures in multivariate repeated-measures data; if NULL and no intra-subject model is specified, no response-transformation is applied; if an intra-subject model is specified via the idata , idesign , and (optionally) icontrasts arguments, then P is generated automatically from the iterms argument.

in linearHypothesis method for mlm objects: optional error sum-of-squares-and-products matrix; if missing, it is computed from the model. In print method for linearHypothesis.mlm objects: if TRUE , print the sum-of-squares and cross-products matrix for error.

character string, "F" or "Chisq" , specifying whether to compute the finite-sample F statistic (with approximate F distribution) or the large-sample Chi-squared statistic (with asymptotic Chi-squared distribution). For a multivariate linear model, the multivariate test statistic to report --- one or more of "Pillai" , "Wilks" , "Hotelling-Lawley" , or "Roy" , with "Pillai" as the default.

an optional character string to label the output.

inverse of sum of squares and products of the model matrix; if missing it is computed from the model.

a function for estimating the covariance matrix of the regression coefficients, e.g., hccm , or an estimated covariance matrix for model . See also white.adjust . For the "lmList" and "nlsList" methods, vcov. must be a function (defaulting to vcov ) to be applied to each model in the list.

a vector of coefficient estimates. The default is to get the coefficient estimates from the model argument, but the user can input any vector of the correct length. For the "lmList" and "nlsList" methods, coef. must be a function (defaulting to coef ) to be applied to each model in the list.

logical or character. Convenience interface to hccm (instead of using the argument vcov. ). Can be set either to a character value specifying the type argument of hccm or TRUE , in which case "hc3" is used implicitly. The default is FALSE .

If TRUE , the hypothesis matrix, right-hand-side vector (or matrix), and estimated value of the hypothesis are printed to standard output; if FALSE (the default), the hypothesis is only printed in symbolic form and the value of the hypothesis is not printed.

an object produced by linearHypothesis.mlm .

if TRUE (the default), print the sum-of-squares and cross-products matrix for the hypothesis and the response-transformation matrix.

minimum number of signficiant digits to print.

a regular expression to be matched against coefficient names.

for internal use by methods that call the default method.

arguments to pass down.

Achim Zeileis and John Fox [email protected]

linearHypothesis computes either a finite-sample F statistic or asymptotic Chi-squared statistic for carrying out a Wald-test-based comparison between a model and a linearly restricted model. The default method will work with any model object for which the coefficient vector can be retrieved by coef and the coefficient-covariance matrix by vcov (otherwise the argument vcov. has to be set explicitly). For computing the F statistic (but not the Chi-squared statistic) a df.residual method needs to be available. If a formula method exists, it is used for pretty printing.

The method for "lm" objects calls the default method, but it changes the default test to "F" , supports the convenience argument white.adjust (for backwards compatibility), and enhances the output by the residual sums of squares. For "glm" objects just the default method is called (bypassing the "lm" method). The "svyglm" method also calls the default method.

Multinomial logit models fit by the multinom function in the nnet package invoke the default method, and the coefficient names are composed from the response-level names and conventional coefficient names, separated by a period ( "." ): see one of the examples below.

The function lht also dispatches to linearHypothesis .

The hypothesis matrix can be supplied as a numeric matrix (or vector), the rows of which specify linear combinations of the model coefficients, which are tested equal to the corresponding entries in the right-hand-side vector, which defaults to a vector of zeroes.

Alternatively, the hypothesis can be specified symbolically as a character vector with one or more elements, each of which gives either a linear combination of coefficients, or a linear equation in the coefficients (i.e., with both a left and right side separated by an equals sign). Components of a linear expression or linear equation can consist of numeric constants, or numeric constants multiplying coefficient names (in which case the number precedes the coefficient, and may be separated from it by spaces or an asterisk); constants of 1 or -1 may be omitted. Spaces are always optional. Components are separated by plus or minus signs. Newlines or tabs in hypotheses will be treated as spaces. See the examples below.

If the user sets the arguments coef. and vcov. , then the computations are done without reference to the model argument. This is like assuming that coef. is normally distibuted with estimated variance vcov. and the linearHypothesis will compute tests on the mean vector for coef. , without actually using the model argument.

A linear hypothesis for a multivariate linear model (i.e., an object of class "mlm" ) can optionally include an intra-subject transformation matrix for a repeated-measures design. If the intra-subject transformation is absent (the default), the multivariate test concerns all of the corresponding coefficients for the response variables. There are two ways to specify the transformation matrix for the repeated measures:

The transformation matrix can be specified directly via the P argument.

A data frame can be provided defining the repeated-measures factor or factors via idata , with default contrasts given by the icontrasts argument. An intra-subject model-matrix is generated from the one-sided formula specified by the idesign argument; columns of the model matrix corresponding to different terms in the intra-subject model must be orthogonal (as is insured by the default contrasts). Note that the contrasts given in icontrasts can be overridden by assigning specific contrasts to the factors in idata . The repeated-measures transformation matrix consists of the columns of the intra-subject model matrix corresponding to the term or terms in iterms . In most instances, this will be the simpler approach, and indeed, most tests of interests can be generated automatically via the Anova function.

matchCoefs is a convenience function that can sometimes help in formulating hypotheses; for example matchCoefs(mod, ":") will return the names of all interaction coefficients in the model mod .

Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models , Third Edition. Sage.

Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression , Third Edition, Sage.

Hand, D. J., and Taylor, C. C. (1987) Multivariate Analysis of Variance and Repeated Measures: A Practical Approach for Behavioural Scientists. Chapman and Hall.

O'Brien, R. G., and Kaiser, M. K. (1985) MANOVA method for analyzing repeated measures designs: An extensive primer. Psychological Bulletin 97 , 316--333.

anova , Anova , waldtest , hccm , vcovHC , vcovHAC , coef , vcov

Run the code above in your browser using DataLab

Data Visualization

Statistics in R
Machine Learning in R
Data Science in R

Packages in R

R Tutorial | Learn R Programming Language

Introduction

R Programming Language - Introduction
Interesting Facts about R Programming Language
R vs Python
Environments in R Programming
Introduction to R Studio
How to Install R and R Studio?
Creation and Execution of R File in R Studio
Clear the Console and the Environment in R Studio
Hello World in R Programming

Fundamentals of R

Basic Syntax in R Programming
Comments in R
R Operators
R - Keywords
R Data Types
R Variables - Creating, Naming and Using Variables in R
Scope of Variable in R
Dynamic Scoping in R Programming
Lexical Scoping in R Programming

Input/Output

Taking Input from User in R Programming
Printing Output of an R Program
Print the Argument to the Screen in R Programming - print() Function

Control Flow

Control Statements in R Programming
Decision Making in R Programming - if, if-else, if-else-if ladder, nested if-else, and switch
Switch case in R
For loop in R
R - while loop
R - Repeat loop
goto statement in R Programming
Break and Next statements in R
Functions in R Programming
Function Arguments in R Programming
Types of Functions in R Programming
Recursive Functions in R Programming
Conversion Functions in R Programming

Data Structures

Data Structures in R Programming
R - Matrices
R - Data Frames

Object Oriented Programming

R - Object Oriented Programming
Classes in R Programming
R - Objects
Encapsulation in R Programming
Polymorphism in R Programming
R - Inheritance
Abstraction in R Programming
Looping over Objects in R Programming
S3 class in R Programming
Explicit Coercion in R Programming

Error Handling

Handling Errors in R Programming
Condition Handling in R Programming
Debugging in R Programming

File Handling

File Handling in R Programming
Reading Files in R Programming
Writing to Files in R Programming
Working with Binary Files in R Programming
Packages in R Programming
Data visualization with R and ggplot2
dplyr Package in R Programming
Grid and Lattice Packages in R Programming
Shiny Package in R Programming
tidyr Package in R Programming
What Are the Tidyverse Packages in R Language?
Data Munging in R Programming

Data Interfaces

Data Handling in R Programming
Importing Data in R Script
Exporting Data from scripts in R Programming
Working with CSV files in R Programming
Working with XML Files in R Programming
Working with Excel Files in R Programming
Working with JSON Files in R Programming
Working with Databases in R Programming
Data Visualization in R
R - Line Graphs
R - Bar Charts
Histograms in R language
Scatter plots in R Language
R - Pie Charts
Boxplots in R Language
R - Statistics
Mean, Median and Mode in R Programming
Calculate the Average, Variance and Standard Deviation in R Programming
Descriptive Analysis in R Programming
Normal Distribution in R
Binomial Distribution in R Programming
ANOVA (Analysis of Variance) Test in R Programming
Covariance and Correlation in R Programming
Skewness and Kurtosis in R Programming

Hypothesis Testing in R Programming

Bootstrapping in R Programming
Time Series Analysis in R

Machine Learning

Introduction to Machine Learning in R
Setting up Environment for Machine Learning with R Programming
Supervised and Unsupervised Learning in R Programming
Regression and its Types in R Programming
Classification in R Programming
Naive Bayes Classifier in R Programming
KNN Classifier in R Programming
Clustering in R Programming
Decision Tree in R Programming
Random Forest Approach in R Programming
Hierarchical Clustering in R Programming
DBScan Clustering in R Programming
Deep Learning in R Programming

$\mu$

Four Step Process of Hypothesis Testing

There are 4 major steps in hypothesis testing:

State the hypothesis- This step is started by stating null and alternative hypothesis which is presumed as true.
Formulate an analysis plan and set the criteria for decision- In this step, a significance level of test is set. The significance level is the probability of a false rejection in a hypothesis test.
Analyze sample data- In this, a test statistic is used to formulate the statistical comparison between the sample mean and the mean of the population or standard deviation of the sample and standard deviation of the population.
Interpret decision- The value of the test statistic is used to make the decision based on the significance level. For example, if the significance level is set to 0.1 probability, then the sample mean less than 10% will be rejected. Otherwise, the hypothesis is retained to be true.

One Sample T-Testing

One sample T-Testing approach collects a huge amount of data and tests it on random samples. To perform T-Test in R, normally distributed data is required. This test is used to test the mean of the sample with the population. For example, the height of persons living in an area is different or identical to other persons living in other areas.

Syntax: t.test(x, mu) Parameters: x: represents numeric vector of data mu: represents true value of the mean

To know about more optional parameters of t.test() , try the below command:

Example:

Data: The dataset ‘x’ was used for the test.
The determined t-value is -49.504.
Degrees of Freedom (df): The t-test has 99 degrees of freedom.
The p-value is 2.2e-16, which indicates that there is substantial evidence refuting the null hypothesis.
Alternative hypothesis: The true mean is not equal to five, according to the alternative hypothesis.
95 percent confidence interval: (-0.1910645, 0.2090349) is the confidence interval’s value. This range denotes the values that, with 95% confidence, correspond to the genuine population mean.

Two Sample T-Testing

In two sample T-Testing, the sample vectors are compared. If var. equal = TRUE, the test assumes that the variances of both the samples are equal.

Syntax: t.test(x, y) Parameters: x and y: Numeric vectors

Directional Hypothesis

Using the directional hypothesis, the direction of the hypothesis can be specified like, if the user wants to know the sample mean is lower or greater than another mean sample of the data.

Syntax: t.test(x, mu, alternative) Parameters: x: represents numeric vector data mu: represents mean against which sample data has to be tested alternative: sets the alternative hypothesis

One Sample -Test

This type of test is used when comparison has to be computed on one sample and the data is non-parametric. It is performed using wilcox.test() function in R programming.

Syntax: wilcox.test(x, y, exact = NULL) Parameters: x and y: represents numeric vector exact: represents logical value which indicates whether p-value be computed

To know about more optional parameters of wilcox.test() , use below command:

The calculated test statistic or V value is 2555.
P-value: The null hypothesis is weakly supported by the p-value of 0.9192.
The alternative hypothesis asserts that the real location is not equal to 0. This indicates that there is a reasonable suspicion that the distribution’s median or location parameter is different from 0.

Two Sample -Test

This test is performed to compare two samples of data. Example:

Correlation Test

This test is used to compare the correlation of the two vectors provided in the function call or to test for the association between the paired samples.

Syntax: cor.test(x, y) Parameters: x and y: represents numeric data vectors

To know about more optional parameters in cor.test() function, use below command:

Data: The variables’mtcars$mpg’ and’mtcars$hp’ from the ‘mtcars’ dataset were subjected to a correlation test.
t-value: The t-value that was determined is -6.7424.
Degrees of Freedom (df): The test has 30 degrees of freedom.
The p-value is 1.788e-07, indicating that there is substantial evidence that rules out the null hypothesis.
The alternative hypothesis asserts that the true correlation is not equal to 0, indicating that “mtcars$mpg” and “mtcars$hp” are significantly correlated.
95 percent confidence interval: (-0.8852686, -0.5860994) is the confidence interval. This range denotes the values that, with a 95% level of confidence, represent the genuine population correlation coefficient.
Correlation coefficient sample estimate: The correlation coefficient sample estimate is -0.7761684.

Please Login to comment...

Improve your Coding Skills with Practice

What kind of Experience do you want to share?

Online Degree Explore Bachelor’s & Master’s degrees
MasterTrack™ Earn credit towards a Master’s degree
University Certificates Advance your career with graduate-level learning
Top Courses
Join for Free

Hypothesis Testing in R

Taught in English

Instructor: Arimoro Olayinka Imisioluwa

Guided Project

Recommended experience.

Intermediate level

Basic understanding of the theory of hypothesis testing

(23 reviews)

What you'll learn

Understand the basic concepts of hypothesis testing

Perform different hypothesis tests for one and two samples

Skills you'll practice

Student'S T-Test
Decision-Making
Alternative Hypothesis
Statistical Hypothesis Testing
Null Hypothesis

Details to know

Add to your LinkedIn profile

See how employees at top companies are mastering in-demand skills

Learn, practice, and apply job-ready skills in less than 2 hours

Receive training from industry experts
Gain hands-on experience solving real-world job tasks
Build confidence using the latest tools and technologies

About this Guided Project

Welcome to this project-based course Hypothesis Testing in R. In this project, you will learn how to perform extensive hypothesis tests for one and two samples in R.

By the end of this 2-hour long project, you will understand the rationale behind performing hypothesis testing. Also, you will learn how to perform hypothesis tests for proportions and means. By extension, you will learn how to perform a hypothesis test for means of matched or paired samples in R. Note, you do not need to be a statistical analyst or data scientist to be successful in this guided project, just a familiarity with basic statistics and using R suffice for this project. If you are not familiar with R and want to learn the basics, start with my previous guided project titled “Getting Started with R”, and "Calculating Descriptive Statistics in R". A fundamental prerequisite is having a good understanding of the theory of hypothesis test.

Learn step-by-step

In a video that plays in a split-screen with your work area, your instructor will walk you through these steps:

Getting Started

Test for proportions

Test for means

Two sample test for proportions

Two sample test for means

Matched samples

9 project images

Instructor ratings

We asked all learners to give feedback on our instructors based on the quality of their teaching style.

The Coursera Project Network is a select group of instructors who have demonstrated expertise in specific tools or skills through their industry experience or academic backgrounds in the topics of their projects. If you're interested in becoming a project instructor and creating Guided Projects to help millions of learners around the world, please apply today at teach.coursera.org.

How you'll learn

Skill-based, hands-on learning

Practice new skills by completing job-related tasks.

Expert guidance

Follow along with pre-recorded videos from experts using a unique side-by-side interface.

No downloads or installation required

Access the tools and resources you need in a pre-configured cloud workspace.

Available only on desktop

This Guided Project is designed for laptops or desktop computers with a reliable Internet connection, not mobile devices.

Why people choose Coursera for their career

New to Data Analysis? Start here.

Open new doors with Coursera Plus

Unlimited access to 7,000+ world-class courses, hands-on projects, and job-ready certificate programs - all included in your subscription

Advance your career with an online degree

Earn a degree from world-class universities - 100% online

Join over 3,400 global companies that choose Coursera for Business

Upskill your employees to excel in the digital economy

Frequently asked questions

What will i get if i purchase a guided project.

By purchasing a Guided Project, you'll get everything you need to complete the Guided Project including access to a cloud desktop workspace through your web browser that contains the files and software you need to get started, plus step-by-step video instruction from a subject matter expert.

Are Guided Projects available on desktop and mobile?

Because your workspace contains a cloud desktop that is sized for a laptop or desktop computer, Guided Projects are not available on your mobile device.

Who are the instructors for Guided Projects?

Guided Project instructors are subject matter experts who have experience in the skill, tool or domain of their project and are passionate about sharing their knowledge to impact millions of learners around the world.

Can I download the work from my Guided Project after I complete it?

You can download and keep any of your created files from the Guided Project. To do so, you can use the “File Browser” feature while you are accessing your cloud desktop.

What is the refund policy?

Guided Projects are not eligible for refunds. See our full refund policy Opens in a new tab .

Is financial aid available?

Financial aid is not available for Guided Projects.

Can I audit a Guided Project and watch the video portion for free?

Auditing is not available for Guided Projects.

How much experience do I need to do this Guided Project?

At the top of the page, you can press on the experience level for this Guided Project to view any knowledge prerequisites. For every level of Guided Project, your instructor will walk you through step-by-step.

Can I complete this Guided Project right through my web browser, instead of installing special software?

Yes, everything you need to complete your Guided Project will be available in a cloud desktop that is available in your browser.

What is the learning experience like with Guided Projects?

You'll learn by doing through completing tasks in a split-screen environment directly in your browser. On the left side of the screen, you'll complete the task in your workspace. On the right side of the screen, you'll watch an instructor walk you through the project, step-by-step.

15.5: Hypothesis Tests for Regression Models

Last updated
Save as PDF
Page ID 4039

Danielle Navarro
University of New South Wales

So far we’ve talked about what a regression model is, how the coefficients of a regression model are estimated, and how we quantify the performance of the model (the last of these, incidentally, is basically our measure of effect size). The next thing we need to talk about is hypothesis tests. There are two different (but related) kinds of hypothesis tests that we need to talk about: those in which we test whether the regression model as a whole is performing significantly better than a null model; and those in which we test whether a particular regression coefficient is significantly different from zero.

At this point, you’re probably groaning internally, thinking that I’m going to introduce a whole new collection of tests. You’re probably sick of hypothesis tests by now, and don’t want to learn any new ones. Me too. I’m so sick of hypothesis tests that I’m going to shamelessly reuse the F-test from Chapter 14 and the t-test from Chapter 13. In fact, all I’m going to do in this section is show you how those tests are imported wholesale into the regression framework.

Testing the model as a whole

Okay, suppose you’ve estimated your regression model. The first hypothesis test you might want to try is one in which the null hypothesis that there is no relationship between the predictors and the outcome, and the alternative hypothesis is that the data are distributed in exactly the way that the regression model predicts . Formally, our “null model” corresponds to the fairly trivial “regression” model in which we include 0 predictors, and only include the intercept term b 0

H 0 :Y i =b 0 +ϵ i

If our regression model has K predictors, the “alternative model” is described using the usual formula for a multiple regression model:

$H_{1}: Y_{i}=\left(\sum_{k=1}^{K} b_{k} X_{i k}\right)+b_{0}+\epsilon_{i}$

How can we test these two hypotheses against each other? The trick is to understand that just like we did with ANOVA, it’s possible to divide up the total variance SS tot into the sum of the residual variance SS res and the regression model variance SS mod . I’ll skip over the technicalities, since we covered most of them in the ANOVA chapter, and just note that:

SS mod =SS tot −SS res

And, just like we did with the ANOVA, we can convert the sums of squares in to mean squares by dividing by the degrees of freedom.

$\mathrm{MS}_{m o d}=\dfrac{\mathrm{SS}_{m o d}}{d f_{m o d}}$ $\mathrm{MS}_{r e s}=\dfrac{\mathrm{SS}_{r e s}}{d f_{r e s}}$

So, how many degrees of freedom do we have? As you might expect, the df associated with the model is closely tied to the number of predictors that we’ve included. In fact, it turns out that df mod =K. For the residuals, the total degrees of freedom is df res =N−K−1.

$\ F={MS_{mod} \over MS_{res}}$

and the degrees of freedom associated with this are K and N−K−1. This F statistic has exactly the same interpretation as the one we introduced in Chapter 14. Large F values indicate that the null hypothesis is performing poorly in comparison to the alternative hypothesis. And since we already did some tedious “do it the long way” calculations back then, I won’t waste your time repeating them. In a moment I’ll show you how to do the test in R the easy way, but first, let’s have a look at the tests for the individual regression coefficients.

Tests for individual coefficients

The F-test that we’ve just introduced is useful for checking that the model as a whole is performing better than chance. This is important: if your regression model doesn’t produce a significant result for the F-test then you probably don’t have a very good regression model (or, quite possibly, you don’t have very good data). However, while failing this test is a pretty strong indicator that the model has problems, passing the test (i.e., rejecting the null) doesn’t imply that the model is good! Why is that, you might be wondering? The answer to that can be found by looking at the coefficients for the regression.2 model:

I can’t help but notice that the estimated regression coefficient for the baby.sleep variable is tiny (0.01), relative to the value that we get for dan.sleep (-8.95). Given that these two variables are absolutely on the same scale (they’re both measured in “hours slept”), I find this suspicious. In fact, I’m beginning to suspect that it’s really only the amount of sleep that I get that matters in order to predict my grumpiness.

Once again, we can reuse a hypothesis test that we discussed earlier, this time the t-test. The test that we’re interested has a null hypothesis that the true regression coefficient is zero (b=0), which is to be tested against the alternative hypothesis that it isn’t (b≠0). That is:

H 1 : b≠0

How can we test this? Well, if the central limit theorem is kind to us, we might be able to guess that the sampling distribution of $\ \hat{b}$, the estimated regression coefficient, is a normal distribution with mean centred on b. What that would mean is that if the null hypothesis were true, then the sampling distribution of $\ \hat{b}$ has mean zero and unknown standard deviation. Assuming that we can come up with a good estimate for the standard error of the regression coefficient, SE ($\ \hat{b}$), then we’re in luck. That’s exactly the situation for which we introduced the one-sample t way back in Chapter 13. So let’s define a t-statistic like this,

$\ t = { \hat{b} \over SE(\hat{b})}$

I’ll skip over the reasons why, but our degrees of freedom in this case are df=N−K−1. Irritatingly, the estimate of the standard error of the regression coefficient, SE($\ \hat{b}$), is not as easy to calculate as the standard error of the mean that we used for the simpler t-tests in Chapter 13. In fact, the formula is somewhat ugly, and not terribly helpful to look at. For our purposes it’s sufficient to point out that the standard error of the estimated regression coefficient depends on both the predictor and outcome variables, and is somewhat sensitive to violations of the homogeneity of variance assumption (discussed shortly).

In any case, this t-statistic can be interpreted in the same way as the t-statistics that we discussed in Chapter 13. Assuming that you have a two-sided alternative (i.e., you don’t really care if b>0 or b<0), then it’s the extreme values of t (i.e., a lot less than zero or a lot greater than zero) that suggest that you should reject the null hypothesis.

Running the hypothesis tests in R

To compute all of the quantities that we have talked about so far, all you need to do is ask for a summary() of your regression model. Since I’ve been using regression.2 as my example, let’s do that:

The output that this command produces is pretty dense, but we’ve already discussed everything of interest in it, so what I’ll do is go through it line by line. The first line reminds us of what the actual regression model is:

You can see why this is handy, since it was a little while back when we actually created the regression.2 model, and so it’s nice to be reminded of what it was we were doing. The next part provides a quick summary of the residuals (i.e., the ϵi values),

which can be convenient as a quick and dirty check that the model is okay. Remember, we did assume that these residuals were normally distributed, with mean 0. In particular it’s worth quickly checking to see if the median is close to zero, and to see if the first quartile is about the same size as the third quartile. If they look badly off, there’s a good chance that the assumptions of regression are violated. These ones look pretty nice to me, so let’s move on to the interesting stuff. The next part of the R output looks at the coefficients of the regression model:

Each row in this table refers to one of the coefficients in the regression model. The first row is the intercept term, and the later ones look at each of the predictors. The columns give you all of the relevant information. The first column is the actual estimate of b (e.g., 125.96 for the intercept, and -8.9 for the dan.sleep predictor). The second column is the standard error estimate $\ \hat{\sigma_b}$. The third column gives you the t-statistic, and it’s worth noticing that in this table t= $\ \hat{b}$ /SE($\ \hat{b}$) every time. Finally, the fourth column gives you the actual p value for each of these tests. 217 The only thing that the table itself doesn’t list is the degrees of freedom used in the t-test, which is always N−K−1 and is listed immediately below, in this line:

The value of df=97 is equal to N−K−1, so that’s what we use for our t-tests. In the final part of the output we have the F-test and the R 2 values which assess the performance of the model as a whole

So in this case, the model performs significantly better than you’d expect by chance (F(2,97)=215.2, p<.001), which isn’t all that surprising: the R 2 =.812 value indicate that the regression model accounts for 81.2% of the variability in the outcome measure. However, when we look back up at the t-tests for each of the individual coefficients, we have pretty strong evidence that the baby.sleep variable has no significant effect; all the work is being done by the dan.sleep variable. Taken together, these results suggest that regression.2 is actually the wrong model for the data: you’d probably be better off dropping the baby.sleep predictor entirely. In other words, the regression.1 model that we started with is the better model.

An R companion to Statistics: data analysis and modelling

Chapter 9 linear mixed-effects models.

In this Chapter, we will look at how to estimate and perform hypothesis tests for linear mixed-effects models. The main workhorse for estimating linear mixed-effects models is the lme4 package ( Bates et al. 2023 ) . This package allows you to formulate a wide variety of mixed-effects and multilevel models through an extension of the R formula syntax. It is a really good package. But the main author of the package, Douglas Bates, has chosen not to provide $p$ -values for the model parameters. We will therefore also consider the afex package, which provides an interface to the two main approximations (Kenward-Roger and Satterthwaite) to provide the degrees of freedom to compute $p$ -values for $F$ tests. While the mixed function it provides is in principle all you need to estimate the models and get the estimates, I think it is useful to also understand the underlying lme4 package ( Bates et al. 2023 ) , so we will start with a discussion of this, and then move to the afex package ( Singmann et al. 2023 ) . If you install the afex package, it will install quite a few other packages on which it relies. So to get all the required packages for this chapter, you can just type

in the R console, and you should have everything you need.

9.1 Formulating and estimating linear mixed-effects models with lme4

The gold standard for fitting linear mixed-effects models in R is the lmer() (for l inear m ixed- e ffects r egression) in the lme4 package. This function takes the following arguments (amongst others, for the full list of arguments, see ?lmer ):

formula : a two-sided linear formula describing both the fixed-effects and random-effects part of the model, with the response on the left of the ~ operator and predictors and random effects on the right-hand side of the ~ operator.
data : A data.frame , which needs to be in the so-called “long” format, with a single row per observation. This may be different from what you might be used to when dealing with repeated-measures. A repeated-measures ANOVA in SPSS requires data in the “wide” format, where you use columns for the different repeated measures. Data in the “wide” format has a single row for each participants. In the “long” format, you will have multiple rows for the data from a single grouping level (e.g., participant, lab, etc.).
REML : A logical variable whether to estimate the model with restricted maximum likelihood ( REML = TRUE , the default) or with maximum likelihood ( REML = FALSE ).

As correct use of the formula interface is vital, let’s first consider again how the formula interface works in general. Formulas allow you to specify a linear model succinctly (and by default, any model created with a formula will include an intercept, unless explicitly removed). Here are some examples (adapted from Singmann and Kellen ( 2019 ) ):

The lme4 package extends the formula interface to specify random effects structures. Random effects are added to the formula by writing elements between parentheses () . Within these parentheses, you provide the specification of the random effects to be included on the left-hand side of a conditional sign | . On the right-hand side of this conditional sign, you specify the grouping factor, or grouping factors, on which these random effects depend. The grouping factors need to be of class factor (i.e., they can not be numeric variables). Here are some examples of such specifications (again mostly adapted from Singmann and Kellen ( 2019 ) ):

9.1.1 Random intercepts model

Now let’s try to define a relatively simple linear mixed-effects model for the anchoring data set in the sdamr package. We will use the data from all the referrers try a few variations to get acquainted with the lmer() function. First, let’s load the packages and the data:

Now let’s estimate a first linear mixed-effects model, with a fixed effect for anchor, and random intercepts, using everest_feet as the dependent variable. We will first ensure that anchor is a factor and associate a sum-to-zero contrast to it. We will also make referrer a factor; the contrast for this shouldn’t really matter, so we’ll leave it as a dummy code. We then set up the model, using (1|referrer) to specify that random intercept-effects should be included for each level of the referrer factor. Finally, we use the summary() function on the estimated model to obtain the estimates of the parameters.

The output first shows some information about the structure of the model, and the value of the optimized “-2 log REML” (the logarithm of the minus two restricted maximum likelihood). Then some summary statistics for the standardized residuals are shown (these are the “raw” residuals divided by the estimated standard deviation of the residuals).

Under the Random effects: header, you will find a table with estimates of the variance and standard deviation of the random effects terms for each grouping factor (just referrer in this case). So the estimated standard deviation of the random intercept: $\hat{\sigma}_{\gamma_0} = 1648$ . You will also find an estimate of the variance and standard deviation of the residuals in this table: $\hat{\sigma}_\epsilon = 9692$ }.

Under the Fixed effects: header, you will find a table with an estimate for each fixed effect, as well as the standard error of this estimate, and an associated $t$ statistic. This output is much like the output of calling the summary() function on a standard model estimated with the lm function. But you won’t find the $p$ -value for these estimates. This is because the author of the lme4 package, perhaps rightly, finds none of the approximations to the error degrees of freedom good enough for general usage. Opinions on this will vary. It is agreed that the true Type 1 error rate when using one of the approximations will not be exactly equal to the $\alpha$ level. In some cases, the difference may be substantial, but often the approximation will be reasonable enough to be useful in practice. For further information on this, there is a section in a very useful GLMM FAQ . You can also see ?lme4::pvalues for some information about various approaches to obtaining $p$ -values.

The final table, under the Correlation of Fixed Effects , shows the approximate correlation between the estimators of the fixed effects. You can think of it as the expected correlation between the estimates of the fixed effects over all different datasets that you might obtain (assuming that the predictors have the same values in each). It is not something you generally need to be concerned about.

9.1.2 Visually assessing model assumptions

You can use the predict and residuals function to obtain the predicted values and residuals for a linear mixed effects model. You can then plot these, using e.g. ggplot2 , as follows:

It might be cool to colour the residuals by referrer as follows:

If the legend gets in the way, you can remove it as follows:

The theme function allows for lots of functionality (check ?theme ). You can also get a quick predicted vs residual plot from base R by simply calling plot(mod) .

We can get a nice-looking histogram of the residuals, and a QQ plot, as follows:

9.1.3 Random intercepts and slopes

Now let’s estimate a model with random intercepts and random slopes for anchor . To do so, we can simply add anchor in the mixed effects structure specification, as follows:

As you can see, the model now estimates a variance of the random slopes effects, as well as a correlation between the random intercept and slope effects. We could try to get a model without the correlations as follows:

As you can see in the warning messages, this leads to various estimation issues. Moreover, the correlation is actually still there!

As it turns out, the || notation does not work well for factors !. It only works as expected with metric predictors. We can get the desired model by defining a contrast-coded predictor for anchor explicitly, as follows:

That is a little cumbersome, especially if you have a factor with lots of levels, in which case you would have to specify many contrast-coded predictors. The lmer_alt() function in the afex package will automatically generate the contrast-coded predictors needed, which will be convenient. You can try this by running:

Note the use of the additional set_data_arg = TRUE argument, which is necessary to later use the object for model comparisons with the likelihood ratio test in the next section. Also note that the parameters now do come with $p$ -values (using the Satterthwaite approximation).

9.1.4 Likelihood ratio test with the anova function

We now have two versions of our random intercepts + slopes model, one which estimates the correlation between the random intercept and slope, and one which sets this to 0. A likelihood-ratio test comparing these two models is easily obtained as:

Note the message refitting model(s) with ML (instead of REML) . The likelihood-ratio test requires that the models are estimated by maximum likelihood, rather than restricted maximum likelihood (REML). The lme4 package is clever enough to realize this, and first re-estimates the model before computing the likelihood ratio test. Also note that the test statistic is now called “Chisq”, for Chi-squared. This is the one we want. The test result is significant, and hence we can reject the null hypothesis that the correlation between the random intercept and slope is $\rho_{\gamma_0,\gamma_1} = 0$ .

9.1.5 Confidence intervals

While the lme4 package does not provide $p$ -values, it does have functionality to compute confidence intervals via the confint() function. The default option is to compute so-called profile likelihood confidence intervals for all (fixed and random) parameters:

Note that .sig01 refers to the standard deviation of the random intercept (i.e. $\sigma{\gamma_0}$ ), .sig02 refers to the correlation between the random intercept and random slope (i.e. $\rho_{\gamma_0,\gamma_1}$ ), and .sig03 to the standard deviation of the random slope (i.e. $\sigma_{\gamma_1}$ ). The value of .sig refers to the standard deviation of the error term (i.e. $\sigma_\epsilon$ ). Unfortunately, these are not the most informative labels, so it pays to check the values reported in summary(modg) to match them to the output here.

Parametric bootstrap confidence (via simulation) can be obtained by setting the argument method = "boot" . This is a very computationally intensive procedure, so you will have to wait some time for the results! Moreover, due to the random simulation involved, the results will vary (hopefully a little) every time you run the procedure:

Note the warning messages. By default, the bootstrap simulates nsim=500 datasets and re-estimates the model for each. In some of the simulated datasets, the estimation may fail, which provides the resulting warning messages. While confidence intervals and hypothesis tests, in the case of “standard” linear models give the same results, this is not necessarily the case for mixed models, as the $F$ tests for the fixed effects involve approximation of the error degrees of freedom ( $\text{df}_2$ ), whilst the computation of the confidence intervals rely on other forms of approximation (e.g. simulation for the parametric bootstrap). As confidence intervals are included by default in lme4 , it seems like the author of the package believes these are perhaps more principled than the $F$ tests for the fixed effects.

9.1.6 Plotting model predictions

It can be useful to plot the model predictions for each level of the random grouping factor. We can obtain such a plot by storing the model predictions with the data. By adding the group = and colour = arguments in the aes() function, you can then get separate results for all levels of the random effect factor ( referrer here). For instance, we can plot the predictions for the different levels of the anchor factor with connecting lines as follows:

This would also work if the variable of the x-axis is continuous, rather than categorical.

9.2 Obtaining p-values with afex::mixed

Despite some of the concerns about the validity of the $p$ -values for $F$ -tests of the fixed effects, they are often useful (if only to satisfy reviewers of your paper). Packages such as pbkrtest ( Højsgaard 2020 ) and lmerTest ( Kuznetsova, Bruun Brockhoff, and Haubo Bojesen Christensen 2020 ) have been developed to provide these for mixed-effects models estimated with the lmer() function, using the Kenward-Roger or parametric bootstrap, and Satterthwaite approximations, respectively. The afex package ( Singmann et al. 2023 ) provides a common interface to the functionality of these packages, via its mixed function. The mixed function offers some other convenient features, such as automatically using sum-to-zero contrasts (via contr.sum() ), although I prefer setting my own contrasts and turn this off.

Some of the main arguments to the mixed function (see ?mixed for the full overview) are:

type : Sums of Squares type to use (1, 2, or 3). Default is 3.
method : Character vector indicating which approximation method to use for obtaining the p-values. "KR" for the Kenward-Roger approximation, "S" for the Satterthwaite approximation (the default), "PB" for a parametric bootstrap, and "LRT" for the likelihood ratio test.
test_intercept : Logical variable indicating whether to obtain a test of the fixed intercept (only for Type 3 SS). Default is FALSE
check_contrasts : Logical variable indicating whether contrasts for factors should be checked and changed to contr.sum if they are not identical to contr.sum . Default is TRUE . You should set this to FALSE if you supply your own orthogonal contrasts.
expand_re : Logical variable indicating whether random effect terms should be expanded (i.e. factors transformed into contrast-coded numerical predictors) before fitting with lmer. This allows proper use of the || notation with factors.

Let’s try it out! First, let’s load the package:

Note that after loading the afex package, the lmer function from lme4 will be “masked” and the corresponding function from the afex namespace will be used (it is actually the same as the one from the lmerTest namespace), which is mostly the same, but expands the class of the returned object somewhat. Afterwards, you either have to use lme4::lmer whenever you explicitly want the function from the lme4 package, or avoid loading the afex package, and always type e.g. afex::mixed . Either is fine, and mostly you wouldn’t need to worry, but sometimes the overriding of function names in the global workspace can give confusion and unexpected results, so it is good to be aware if this behaviour.

In the code below, I use the mixed function to estimate the model and compute $p$ -values for the fixed effect of anchor and the intercept with the "KR" option (note that this takes some time!):

The class of the returned object saved as afmodg is different from the usual one returned by the lmer function. To get the $F$ tests, you can just type in the name of the object:

You can also use the summary() function to obtain the parameter estimates:

Note that we now also get $p$ -values for the effects (which are here derived from the Satterthwaite approximation to the degrees of freedom). That is, as mentioned before, because by loading the afex package, the lme4::lmer() function is masked and replaced by the lmerTest::lmer() function. This function uses the lme4::lmer() function to estimate the model, but then adds further results to compute the Satterthwaite degrees of freedom. If you like, you could force the use of the lme4::lmer() function, by specifying the appropriate namespace, as in e.g.

Continuing our example, we can also estimate the model without correlation between the random effects as follows:

and get the $F$ tests for this model:

and parameter estimates

Note that entering these two mixed models as is into the anova function will not provide the desired re-estimation of the models by maximum likelihood:

This is not clear from the output (apart from the missing refitting model(s) with ML (instead of REML) message). For the correct results, you will need to provide the lmer model, stored in the afmodr and afmodg objects under $full_model :

Although the differences are small, the first test compares the “-2 log REML”, instead of the desired “-2 log ML” values. The assumptions underlying the likelihood-ratio test require the latter.

9.3 Crossed random effects

The lme4::lmer() function (and the afex::mixed() function which that is built on top of that function) allows you to specify multiple sources for random effects. In the book, we discussed an example with crossed random effects, using the speeddate data. In this data, participants (identified by the iid variable) provide ratings of various attributes of different dating partners (identified by the pid variable). Here, iid refers to the Person factor, and pid to the Item factor.

To estimate a model with crossed random effects, you simply include two separate statements for the random effects in the model formula, one for the effects depending on iid , and one for the random effects depending on pid :

Note there are some warnings. The first states that, due to missing values, the number of observations is reduced. If a row in the dataset has a missing value on either the dependent variable to one of the predictors, it is not possible to compute an error or model prediction. Hence, such rows are removed from the dataset before fitting the model. The second warning, starting with Numerical variables NOT centered on 0 we can ignore, as it is erroneous (we have centered the numeric predictors). The checks performed by afex::mixed() sometimes miss the mark. The third warning regarding boundary singular fit ( boundary (singular) fit: see ?isSingular ), is something that will require attention. It signals that the model may be too complex to estimate properly. Such issues often (but not always!) show in estimated variances of random effects of 0 (i.e., no variance) or correlations between random effects of 1 or -1. Before doing anything else, we should therefore inspect the random-effects estimates. We can view these estimates (as well as those of the fixed effects) as usual with the summary() function:

In this case, none of the variances are estimated as 0, and none of the correlations are estimated as 1 or -1. But the estimated variances of the random other_attr_c:other_intel_c interactions are quite low, and close to 0. As such, it might make sense to remove these effects.

This does not resolve the issue, as we still get the boundary (singular) fit warning. Moreover, we now get another warning: Model failed to converge with 1 negative eigenvalue . Checking the parameter estimates:

shows a rather high negative correlation between the iid -wise random effects of other_attr_c and other_intel_c . Perhaps allowing the random effects to be correlated is too complex for this data. Hence, we can try a model with independent random effects:

This eliminates the boundary singular fit issue. Inspecting the random-effects estimates shows no clear signs of further problems, as all the variances are well-above 0:

Although the “keep it maximal” idea is good in theory, in practice, maximal models are often difficult or impossible to estimate. Determining an appropriate random-effects structure may require reducing the complexity of the model iteratively. Some guidelines for this process are provided by Matuschek et al. ( 2017 ) .

In this case, the results regarding the fixed effects are robust over the different random-effects structures. For instance, comparing the maximal model to the final model shows that the same effects are significant:

The $F$ -values of the last model are higher, which is consistent with the increased power that generally results from simpler random-effects structures (at a potential cost of increasing Type I error). In any case, knowing that the maximal and simpler random-effects structures both would lead to the same substantive conclusions provides some reassurance that the results are robust. The afex package also has a useful “vignette” (a longer document describing an analysis and package functionality) on mixed-effects models, which you can view through the command:

9.4 Automatically finding optimal random effects structures with the buildmer package

A common issue with “maximal” linear mixed-effects models is that not all random effects can be estimated. These estimation issues are noted by warnings from lme4 such as boundary (singular) fit or Model failed to converge . These warnings indicate that the model is too complex for the data at hand. Matuschek et al. ( 2017 ) propose to search for a model with a parsimonious random effects structure. Starting with the maximal model, they propose to eliminate random effects from the model which don’t lead to a significant reduction in the likelihood ratio. This backwards search procedure is implemented by the buildmer() function from the buildmer package. By default, the buildmer() function will find the maximal model that can be estimated without issues, and then reduces the complexity of this model by removing random effects through model comparisons via likelihood ratio tests, removing terms in the order of their contribution to the likelihood ratio until the test is significant (i.e. until removal leads to a significantly worse fitting model). The basic usage of the buildmer() function is similar to the lme4::lmer() and afex::mixed() functions: you supply a model formula with the intended maximal model and a data.frame containing the data. The buildmer() function will then estimate all possible subsets of this model and perform model comparisons between them. The procedure can be fine-tuned via the buildmerControl argument. For example, instead of the default backwards search procedure (starting from the most complex model and eliminating terms), the direction can be changed to a forwards procedure where the starting point is the simplest model, and random effects terms are added until the increase in the likelihood ratio is not significant. It is also possible to change the criterion for elimination from a likelihood-ratio test to a model selection criterion such as the Akaike Information Criterion (AIC). In the include argument, you can specify which terms will always be included. By default, the buildmer() function will also consider removing fixed effects. This can be avoided by specifying the fixed effects in the model formula provided as the include argument. Other arguments are used to specify further calculations on the final model. For example, whether to calculate an ANOVA table ( calc.anova = TRUE ) and how to approximate the degrees of freedom for this (e.g. ddf = "Satterthwaite" ). In the code below, we use the buildmer() function to find the parsimonious maximal model for the speeddate data. The (rather extensive) list of messages show all the models fitted by the buildmer() function, which includes models without random effects (estimated via the lm function) as well as models with random effects (estimated via the lmer function).

The final model contains random intercepts and slopes for other_attr_c and other_intel_c for pid . For the iid groupings, only random intercepts are included:

All Courses

Interview Questions
Free Courses
Career Guide
PGP in Data Science and Business Analytics
PG Program in Data Science and Business Analytics Classroom
PGP in Data Science and Engineering (Data Science Specialization)
PGP in Data Science and Engineering (Bootcamp)
PGP in Data Science & Engineering (Data Engineering Specialization)
Master of Data Science (Global) – Deakin University
MIT Data Science and Machine Learning Course Online
Master’s (MS) in Data Science Online Degree Programme
MTech in Data Science & Machine Learning by PES University
Data Analytics Essentials by UT Austin
Data Science & Business Analytics Program by McCombs School of Business
MTech In Big Data Analytics by SRM
M.Tech in Data Engineering Specialization by SRM University
M.Tech in Big Data Analytics by SRM University
PG in AI & Machine Learning Course
Weekend Classroom PG Program For AI & ML
AI for Leaders & Managers (PG Certificate Course)
Artificial Intelligence Course for School Students
IIIT Delhi: PG Diploma in Artificial Intelligence
Machine Learning PG Program
MIT No-Code AI and Machine Learning Course
Study Abroad: Masters Programs
MS in Information Science: Machine Learning From University of Arizon
SRM M Tech in AI and ML for Working Professionals Program
UT Austin Artificial Intelligence (AI) for Leaders & Managers
UT Austin Artificial Intelligence and Machine Learning Program Online
MS in Machine Learning
IIT Roorkee Full Stack Developer Course
IIT Madras Blockchain Course (Online Software Engineering)
IIIT Hyderabad Software Engg for Data Science Course (Comprehensive)
IIIT Hyderabad Software Engg for Data Science Course (Accelerated)
IIT Bombay UX Design Course – Online PG Certificate Program
Online MCA Degree Course by JAIN (Deemed-to-be University)
Cybersecurity PG Course
Online Post Graduate Executive Management Program
Product Management Course Online in India
NUS Future Leadership Program for Business Managers and Leaders
PES Executive MBA Degree Program for Working Professionals
Online BBA Degree Course by JAIN (Deemed-to-be University)
MBA in Digital Marketing or Data Science by JAIN (Deemed-to-be University)
Master of Business Administration- Shiva Nadar University
Post Graduate Diploma in Management (Online) by Great Lakes
Online MBA Programs
Cloud Computing PG Program by Great Lakes
University Programs
Stanford Design Thinking Course Online
Design Thinking : From Insights to Viability
PGP In Strategic Digital Marketing
Post Graduate Diploma in Management
Master of Business Administration Degree Program
MS in Business Analytics in USA
MS in Machine Learning in USA
Study MBA in Germany at FOM University
M.Sc in Big Data & Business Analytics in Germany
Study MBA in USA at Walsh College
MS Data Analytics
MS Artificial Intelligence and Machine Learning
MS in Data Analytics
Master of Business Administration (MBA)
MS in Information Science: Machine Learning
MS in Machine Learning Online
MIT Data Science Program
AI For Leaders Course
Data Science and Business Analytics Course
Cyber Security Course
PG Program Online Artificial Intelligence Machine Learning
PG Program Online Cloud Computing Course
Data Analytics Essentials Online Course
MIT Programa Ciencia De Dados Machine Learning
MIT Programa Ciencia De Datos Aprendizaje Automatico
Program PG Ciencia Datos Analitica Empresarial Curso Online
Mit Programa Ciencia De Datos Aprendizaje Automatico
Online Data Science Business Analytics Course
Online Ai Machine Learning Course
Online Full Stack Software Development Course
Online Cloud Computing Course
Cybersecurity Course Online
Online Data Analytics Essentials Course
Ai for Business Leaders Course
Mit Data Science Program
No Code Artificial Intelligence Machine Learning Program
MS Information Science Machine Learning University Arizona
Wharton Online Advanced Digital Marketing Program
Introduction to Hypothesis Testing in R
Testing of Hypothesis in R
p-value: An Alternative way of Hypothesis Testing:
t-test: Hypothesis Testing of Population Mean when Population Standard Deviation is Unknown:
Two Samples Tests: Hypothesis Testing for the difference between two population means
Hypothesis Testing for Equality of Population Variances
Let’s Look at some Case studies:
References:

Hypothesis Testing in R- Introduction Examples and Case Study

– By Dr. Masood H. Siddiqui, Professor & Dean (Research) at Jaipuria Institute of Management, Lucknow

The premise of Data Analytics is based on the philosophy of the “ Data-Driven Decision Making ” that univocally states that decision-making based on data has less probability of error than those based on subjective judgement and gut-feeling. So, we require data to make decisions and to answer the business/functional questions. Data may be collected from each and every unit/person, connected with the problem-situation (totality related to the situation). This is known as Census or Complete Enumeration and the ‘totality’ is known as Population . Obv.iously, this will generally give the most optimum results with maximum correctness but this may not be always possible. Actually, it is rare to have access to information from all the members connected with the situation. So, due to practical considerations, we take up a representative subset from the population, known as Sample . A sample is a representative in the sense that it is expected to exhibit the properties of the population, from where it has been drawn.

So, we have evidence (data) from the sample and we need to decide for the population on the basis of that data from the sample i.e. inferring about the population on the basis of a sample. This concept is known as Statistical Inference .

Before going into details, we should be clear about certain terms and concepts that will be useful:

Parameter and Statistic

Parameters are unknown constants that effectively define the population distribution , and in turn, the population , e.g. population mean (µ), population standard deviation (σ), population proportion (P) etc. Statistics are the values characterising the sample i.e. characteristics of the sample. They are actually functions of sample values e. g. sample mean (x̄), sample standard deviation (s), sample proportion (p) etc.

Sampling Distribution

A large number of samples may be drawn from a population. Each sample may provide a value of sample statistic, so there will be a distribution of sample statistic value from all the possible samples i.e. frequency distribution of sample statistic . This is better known as Sampling distribution of the sample statistic . Alternatively, the sample statistic is a random variable , being a function of sample values (which are random variables themselves). The probability distribution of the sample statistic is known as sampling distribution of sample statistic. Just like any other distribution, sampling distribution may partially be described by its mean and standard deviation . The standard deviation of sampling distribution of a sample statistic is better known as the Standard Error of the sample statistic.

Standard Error

It is a measure of the extent of variation among different values of statistics from different possible samples. Higher the standard error, higher is the variation among different possible values of statistics. Hence, less will be the confidence that we may place on the value of the statistic for estimation purposes. Hence, the sample statistic having a lower value of standard error is supposed to be better for estimation of the population parameter.

1(a). A sample of size ‘n’ has been drawn for a normal population N (µ, σ). We are considering sample mean (x̄) as the sample statistic. Then, the sampling distribution of sample statistic x̄ will follow Normal Distribution with mean µ x̄ = µ and standard error σ x̄ = σ/ √ n.

Even if the population is not following the Normal Distribution but for a large sample (n = large), the sampling distribution of x̄ will approach to (approximated by) normal distribution with mean µ x̄ = µ and standard error σ x̄ = σ/ √ n, as per the Central Limit Theorem .

(b). A sample of size ‘n’ has been drawn for a normal population N (µ, σ), but population standard deviation σ is unknown, so in this case σ will be estimated by sample standard deviation(s). Then, sampling distribution of sample statistic x̄ will follow the student’s t distribution (with degree of freedom = n-1) having mean µ x̄ = µ and standard error σ x̄ = s/ √ n.

2. When we consider proportions for categorical data. Sampling distribution of sample proportion p =x/n (where x = Number of success out of a total of n) will follow Normal Distribution with mean µ p = P and standard error σ p = √( PQ/n), (where Q = 1-P). This is under the condition that n is large such that both np and nq should be minimum 5.

Statistical Inference

Statistical Inference encompasses two different but related problems:

1. Knowing about the population-values on the basis of data from the sample. This is known as the problem of Estimation . This is a common problem in business decision-making because of lack of complete information and uncertainty but by using sample information, the estimate will be based on the concept of data based decision making. Here, the concept of probability is used through sampling distribution to deal with the uncertainty. If sample statistics is used to estimate the population parameter , then in that situation that is known as the Estimator; {like sample mean (x̄) to estimate population mean µ, sample proportion (p) to estimate population proportion (P) etc.}. A particular value of the estimator for a given sample is known as Estimate . For example, if we want to estimate average sales of 1000+ outlets of a retail chain and we have taken a sample of 40 outlets and sample mean ( estimator ) x̄ is 40000. Then the estimate will be 40000.

There are two types of estimation:

Point Estimation : Single value/number of the estimator is used to estimate unknown population parameters. The example is given above.
Confidence Interval/Interval Estimation : Interval Estimate gives two values of sample statistic/estimator, forming an interval or range, within which an unknown population is expected to lie. This interval estimate provides confidence with the interval vis-à-vis the population parameter. For example: 95% confidence interval for population mean sale is (35000, 45000) i.e. we are 95% confident that interval estimate will contain the population parameter.

2. Examining the declaration/perception/claim about the population for its correctness on the basis of sample data. This is known as the problem of Significant Testing or Testing of Hypothesis . This belongs to the Confirmatory Data Analysis , as to confirm or otherwise the hypothesis developed in the earlier Exploratory Data Analysis stage.

One Sample Tests

z-test – Hypothesis Testing of Population Mean when Population Standard Deviation is known:

Hypothesis testing in R starts with a claim or perception of the population. Hypothesis may be defined as a claim/ positive declaration/ conjecture about the population parameter. If hypothesis defines the distribution completely, it is known as Simple Hypothesis, otherwise Composite Hypothesis .

Hypothesis may be classified as:

Null Hypothesis (H 0 ): Hypothesis to be tested is known as Null Hypothesis (H 0 ). It is so known because it assumes no relationship or no difference from the hypothesized value of population parameter(s) or to be nullified.

Alternative Hypothesis (H 1 ): The hypothesis opposite/complementary to the Null Hypothesis .

Note: Here, two points are needed to be considered. First, both the hypotheses are to be constructed only for the population parameters. Second, since H 0 is to be tested so it is H 0 only that may be rejected or failed to be rejected (retained).

Hypothesis Testing: Hypothesis testing a rule or statistical process that may be resulted in either rejecting or failing to reject the null hypothesis (H 0 ).

The Five Steps Process of Hypothesis Testing

Here, we take an example of Testing of Mean:

1. Setting up the Hypothesis:

This step is used to define the problem after considering the business situation and deciding the relevant hypotheses H 0 and H 1 , after mentioning the hypotheses in the business language.

We are considering the random variable X = Quarterly sales of the sales executive working in a big FMCG company. Here, we assume that sales follow normal distribution with mean µ (unknown) and standard deviation σ (known) . The value of the population parameter (population mean) to be tested be µ 0 (Hypothesised Value).

Here the hypothesis may be:

H 0 : µ = µ 0 or µ ≤ µ 0 or µ ≥ µ 0 (here, the first one is Simple Hypothesis , rest two variants are composite hypotheses )

H 1 : µ > µ 0 or

H 1 : µ < µ 0 or

H 1 : µ ≠ µ 0

(Here, all three variants are Composite Hypothesis )

2. Defining Test and Test Statistic:

The test is the statistical rule/process of deciding to ‘reject’ or ‘fail to reject’ (retain) the H0. It consists of dividing the sample space (the totality of all the possible outcomes) into two complementary parts. One part, providing the rejection of H 0 , known as Critical Region . The other part, representing the failing to reject H 0 situation , is known as Acceptance Region .

The logic is, since we have evidence only from the sample, we use sample data to decide about the rejection/retaining of the hypothesised value. Sample, in principle, can never be a perfect replica of the population so we do expect that there will be variation in between population and sample values. So the issue is not the difference but actually the magnitude of difference . Suppose, we want to test the claim that the average quarterly sale of the executive is 75k vs sale is below 75k. Here, the hypothesised value for the population mean is µ 0 =75 i.e.

H 0 : µ = 75

H 1 : µ < 75.

Suppose from a sample, we get a value of sample mean x̄=73. Here, the difference is too small to reject the claim under H 0 since the chances (probability) of happening of such a random sample is quite large so we will retain H 0 . Suppose, in some other situation, we get a sample with a sample mean x̄=33. Here, the difference between the sample mean and hypothesised population mean is too large. So the claim under H 0 may be rejected as the chance of having such a sample for this population is quite low.

So, there must be some dividing value (s) that differentiates between the two decisions: rejection (critical region) and retention (acceptance region), this boundary value is known as the critical value .

Type I and Type II Error:

There are two types of situations (H 0 is true or false) which are complementary to each other and two types of complementary decisions (Reject H 0 or Failing to Reject H 0 ). So we have four types of cases:

So, the two possible errors in hypothesis testing can be:

Type I Error = [Reject H 0 when H 0 is true]

Type II Error = [Fails to reject H 0 when H 0 is false].

Type I Error is also known as False Positive and Type II Error is also known as False Negative in the language of Business Analytics.

Since these two are probabilistic events, so we measure them using probabilities:

α = Probability of committing Type I error = P [Reject H 0 / H 0 is true]

β = Probability of committing Type II error = P [Fails to reject H 0 / H 0 is false].

For a good testing procedure, both types of errors should be low (minimise α and β) but simultaneous minimisation of both the errors is not possible because they are interconnected. If we minimize one, the other will increase and vice versa. So, one error is fixed and another is tried to be minimised. Normally α is fixed and we try to minimise β. If Type I error is critical, α is fixed at a low value (allowing β to take relatively high value) otherwise at relatively high value (to minimise β to a low value, Type II error being critical).

Example: In Indian Judicial System we have H 0 : Under trial is innocent. Here, Type I Error = An innocent person is sentenced, while Type II Error = A guilty person is set free. Indian (Anglo Saxon) Judicial System considers type I error to be critical so it will have low α for this case.

Power of the test = 1- β = P [Reject H 0 / H 0 is false].

Higher the power of the test, better it is considered and we look for the Most Powerful Test since power of test can be taken as the probability that the test will detect a deviation from H 0 given that the deviation exists.

One Tailed and Two Tailed Tests of Hypothesis:

H 0 : µ ≤ µ 0

H 1 : µ > µ 0

When x̄ is significantly above the hypothesized population mean µ 0 then H 0 will be rejected and the test used will be right tailed test (upper tailed test) since the critical region (denoting rejection of H 0 will be in the right tail of the normal curve (representing sampling distribution of sample statistic x̄). (The critical region is shown as a shaded portion in the figure).

H 0 : µ ≥ µ 0

H 1 : µ < µ 0

In this case, if x̄ is significantly below the hypothesised population mean µ 0 then H 0 will be rejected and the test used will be the left tailed test (lower tailed test) since the critical region (denoting rejection of H 0 ) will be in the left tail of the normal curve (representing sampling distribution of sample statistic x̄). (The critical region is shown as a shaded portion in the figure).

These two tests are also known as One-tailed tests as there will be a critical region in only one tail of the sampling distribution.

H 0 : µ = µ 0

H 1 : µ ≠ µ 0

When x̄ is significantly different (significantly higher or lower than) from the hypothesised population mean µ 0 , then H 0 will be rejected. In this case, the two tailed test will be applicable because there will be two critical regions (denoting rejection of H 0 ) on both the tails of the normal curve (representing sampling distribution of sample statistic x̄). (The critical regions are shown as shaded portions in the figure).

Hypothesis Testing using Standardized Scale: Here, instead of measuring sample statistic (variable) in the original unit, standardised value is taken (better known as test statistic ). So, the comparison will be between observed value of test statistic (estimated from sample), and critical value of test statistic (obtained from relevant theoretical probability distribution).

Here, since population standard deviation (σ) is known, so the test statistics :

Z= (x- µx̄ x )/σ x̄ = (x- µ 0 )/(σ/√n) follows Standard Normal Distribution N (0, 1).

3.Deciding the Criteria for Rejection or otherwise:

As discussed, hypothesis testing means deciding a rule for rejection/retention of H 0 . Here, the critical region decides rejection of H 0 and there will be a value, known as Critical Value , to define the boundary of the critical region/acceptance region. The size (probability/area) of a critical region is taken as α . Here, α may be known as Significance Level , the level at which hypothesis testing is performed. It is equal to type I error , as discussed earlier.

Suppose, α has been decided as 5%, so the critical value of test statistic (Z) will be +1.645 (for right tail test), -1.645 (for left tail test). For the two tails test, the critical value will be -1.96 and +1.96 (as per the Standard Normal Distribution Z table). The value of α may be chosen as per the criticality of type I and type II. Normally, the value of α is taken as 5% in most of the analytical situations (Fisher, 1956).

4. Taking sample, data collection and estimating the observed value of test statistic:

In this stage, a proper sample of size n is taken and after collecting the data, the values of sample mean (x̄) and the observed value of test statistic Z obs is being estimated, as per the test statistic formula.

5. Taking the Decision to reject or otherwise:

On comparing the observed value of Test statistic with that of the critical value, we may identify whether the observed value lies in the critical region (reject H 0 ) or in the acceptance region (do not reject H 0 ) and decide accordingly.

Right Tailed Test: If Z obs > 1.645 : Reject H 0 at 5% Level of Significance.
Left Tailed Test: If Z obs < -1.645 : Reject H 0 at 5% Level of Significance.
Two Tailed Test: If Z obs > 1.96 or If Z obs < -1.96 : Reject H 0 at 5% Level of Significance.

There is an alternative approach for hypothesis testing, this approach is very much used in all the software packages. It is known as probability value/ prob. value/ p-value. It gives the probability of getting a value of statistic this far or farther from the hypothesised value if H0 is true. This denotes how likely is the result that we have observed. It may be further explained as the probability of observing the test statistic if H 0 is true i.e. what are the chances in support of occurrence of H 0 . If p-value is small, it means there are less chances (rare case) in favour of H 0 occuring, as the difference between a sample value and hypothesised value is significantly large so H 0 may be rejected, otherwise it may be retained.

If p-value < α : Reject H 0

If p-value ≥ α : Fails to Reject H 0

So, it may be mentioned that the level of significance (α) is the maximum threshold for p-value. It should be noted that p-value (two tailed test) = 2* p-value (one tailed test).

Note: Though the application of z-test requires the ‘Normality Assumption’ for the parent population with known standard deviation/ variance but if sample is large (n>30), the normality assumption for the parent population may be relaxed, provided population standard deviation/variance is known (as per Central Limit Theorem).

As we discussed in the previous case, for testing of population mean, we assume that sample has been drawn from the population following normal distribution mean µ and standard deviation σ. In this case test statistic Z = (x- µ 0 )/(σ/√n) ~ Standard Normal Distribution N (0, 1). But in the situations where population s.d. σ is not known (it is a very common situation in all the real life business situations), we estimate population s.d. (σ) by sample s.d. (s).

Hence the corresponding test statistic:

t= (x- µx̄ x )/σ x̄ = (x- µ 0 )/(s/√n) follows Student’s t distribution with (n-1) degrees of freedom. One degree of freedom has been sacrificed for estimating population s.d. (σ) by sample s.d. (s).

Everything else in the testing process remains the same.

t-test is not much affected if assumption of normality is violated provided data is slightly asymmetrical (near to symmetry) and data-set does not contain outliers.

t-distribution:

The Student’s t-distribution, is much similar to the normal distribution. It is a symmetric distribution (bell shaped distribution). In general Student’s t distribution is flatter i.e. having heavier tails. Shape of t distribution changes with degrees of freedom (exact distribution) and becomes approximately close to Normal distribution for large n.

In many business decision making situations, decision makers are interested in comparison of two populations i.e. interested in examining the difference between two population parameters. Example: comparing sales of rural and urban outlets, comparing sales before the advertisement and after advertisement, comparison of salaries in between male and female employees, comparison of salary before and after joining the data science courses etc.

Independent Samples and Dependent (Paired Samples):

Depending on method of collection data for the two samples, samples may be termed as independent or dependent samples. If two samples are drawn independently without any relation (may be from different units/respondents in the two samples), then it is said that samples are drawn independently . If samples are related or paired or having two observations at different points of time on the same unit/respondent, then the samples are said to be dependent or paired . This approach (paired samples) enables us to compare two populations after controlling the extraneous effect on them.

Testing the Difference Between Means: Independent Samples

Two samples z test:.

We have two populations, both following Normal populations as N (µ 1 , σ 1 ) and N (µ 2 , σ 2 ). We want to test the Null Hypothesis:

H 0 : µ 1 – µ 2 = θ or µ 1 – µ 2 ≤ θ or µ 1 – µ 2 ≥ θ

Alternative hypothesis:

H 1 : µ 1 – µ 2 > θ or

H 0 : µ 1 – µ 2 < θ or

H 1 : µ 1 – µ 2 ≠ θ

(where θ may take any value as per the situation or θ =0).

Two samples of size n 1 and n 2 have been taken randomly from the two normal populations respectively and the corresponding sample means are x̄ 1 and x̄ 2 .

Here, we are not interested in individual population parameters (means) but in the difference of population means (µ 1 – µ 2 ). So, the corresponding statistic is = (x̄ 1 – x̄ 2 ).

According, sampling distribution of the statistic (x̄ 1 – x̄ 2 ) will follow Normal distribution with mean µ x̄ = µ 1 – µ 2 and standard error σ x̄ = √ (σ² 1 / n 1 + σ² 2 / n 2 ). So, the corresponding Test Statistics will be:

Other things remaining the same as per the One Sample Tests (as explained earlier).

Two Independent Samples t-Test (when Population Standard Deviations are Unknown):

Here, for testing the difference of two population mean, we assume that samples have been drawn from populations following Normal Distributions, but it is a very common situation that population standard deviations (σ 1 and σ 2 ) are unknown. So they are estimated by sample standard deviations (s 1 and s 2 ) from the respective two samples.

Here, two situations are possible:

(a) Population Standard Deviations are unknown but equal:

In this situation (where σ 1 and σ 2 are unknown but assumed to be equal), sampling distribution of the statistic (x̄ 1 – x̄ 2 ) will follow Student’s t distribution with mean µ x̄ = µ 1 – µ 2 and standard error σ x̄ = √ Sp 2 (1/ n 1 + 1/ n 2 ). Where Sp 2 is the pooled estimate, given by:

Sp 2 = (n 1 -1) S 1 2 +(n 2 -1) S 2 2 /(n 1 +n 2 -2)

So, the corresponding Test Statistics will be:

t = {(x̄ 1 – x̄ 2 ) – (µ 1 – µ 2 )}/{√ Sp 2 (1/n 1 +1/n 2 )}

Here, t statistic will follow t distribution with d.f. (n 1 +n 2 -2).

(b) Population Standard Deviations are unknown but unequal:

In this situation (where σ 1 and σ 2 are unknown and unequal).

Then the sampling distribution of the statistic (x̄ 1 – x̄ 2 ) will follow Student’s t distribution with mean µ x̄ = µ 1 – µ 2 and standard error Se =√ (s² 1 / n 1 + s² 2 / n 2 ).

t = {(x̄ 1 – x̄ 2 ) – (µ 1 – µ 2 )}/{√ (s2 1 /n 1 +s2 2 /n 2 )}

The test statistic will follow Student’s t distribution with degrees of freedom (rounding down to nearest integers):

As discussed in the aforementioned two cases, it is important to figure out whether the two population variances are equal or otherwise. For this purpose, F test can be employed as:

H 0 : σ² 1 = σ² 2 and H 1 : σ² 1 ≠ σ² 2

Two samples of sizes n 1 and n 2 have been drawn from two populations respectively. They provide sample standard deviations s 1 and s 2 . The test statistic is F = s 1 ²/s 2 ²

The test statistic will follow F-distribution with (n 1 -1) df for numerator and (n 2 -1) df for denominator.

Note: There are many other tests that are applied for this purpose.

Paired Sample t-Test (Testing Difference between Means with Dependent Samples):

As discussed earlier, in the situation of Before-After Tests, to examine the impact of any intervention like a training program, health program, any campaign to change status, we have two set of observations (x i and y i ) on the same test unit (respondent or units) before and after the program. Each sample has “n” paired observations. The Samples are said to be dependent or paired.

Here, we consider a random variable: d i = x i – y i .

Accordingly, the sampling distribution of the sample statistic (sample mean of the differentces d i ’s) will follow Student’s t distribution with mean = θ and standard error = sd/ √ n, where sd is the sample standard deviation of d i ’s.

Hence, the corresponding test statistic: t = (d̅- θ)/sd/√n will follow t distribution with (n-1).

As we have observed, paired t-test is actually one sample test since two samples got converted into one sample of differences. If ‘Two Independent Samples t-Test’ and ‘Paired t-test’ are applied on the same data set then two tests will give much different results because in case of Paired t-Test, standard error will be quite low as compared to Two Independent Samples t-Test. The Paired t-Test is applied essentially on one sample while the earlier one is applied on two samples. The result of the difference in standard error is that t-statistic will take larger value in case of ‘Paired t-Test’ in comparison to the ‘Two Independent Samples t-Test and finally p-values get affected accordingly.

t-Test in SPSS:

One sample t-test.

Analyze => Compare Means => One-Sample T-Test to open relevant dialogue box.
Test variable (variable under consideration) in the Test variable(s) box and hypothesised value µ 0 = 75 (for example) in the Test Value box are to be entered.
Press Ok to have the output.

Here, we consider the example of Ventura Sales, and want to examine the perception that average sales in the first quarter is 75 (thousand) vs it is not. So, the Hypotheses:

Null Hypothesis H 0 : µ=75

Alternative Hypothesis H 1 : µ≠75

One-Sample Statistics

Descriptive table showing the sample size n = 60, sample mean x̄=72.02, sample sd s=9.724.

One-Sample Test

One Sample Test Table shows the result of the t-test. Here, test statistic value (from the sample) is t = -2.376 and the corresponding p-value (2 tailed) = 0.021 <0.05. So, H 0 got rejected and it can be said that the claim of average first quarterly sales being 75 (thousand) does not hold.

Two Independent Samples t-Test

Analyze => Compare Means => Independent-Samples T-Test to open the dialogue box.
Enter the Test variable (variable under consideration) in the Test Variable(s) box and variable categorising the groups in the Grouping Variable box.
Define the groups by clicking on Define Groups and enter the relevant numeric-codes into the relevant groups in the Define Groups sub-dialogue box. Press Continue to return back to the main dialogue box.

We continue with the example of Ventura Sales, and want to compare the average first quarter sales with respect to Urban Outlets and Rural Outlets (two independent samples/groups). Here, the claim is that urban outlets are giving lower sales as compared to rural outlets. So, the Hypotheses:

H 0 : µ 1 – µ 2 = 0 or µ 1 = µ 2 (Where, µ 1 = Population Mean Sale of Urban Outlets and µ 2 = Population Mean Sale of Rural Outlets)

H 1 : µ 1 < µ 2

Group Statistics

Descriptive table showing the sample sizes n 1 =37 and n 2 =23, sample means x̄ 1 =67.86 and x̄ 2 =78.70, sample standard deviations s 1 =8.570 and s 2 = 7.600.

The below table is the Independent Sample Test Table, proving all the relevant test statistics and p-values. Here, both the outputs for Equal Variance (assumed) and Unequal Variance (assumed) are presented.

Independent Samples Test

So, we have to figure out whether we should go for ‘equal variance’ case or for ‘unequal variances’ case.

Here, Levene’s Test for Equality of Variances has to be applied for this purpose with the hypotheses: H 0 : σ² 1 = σ² 2 and H 1 : σ² 1 ≠ σ² 2 . The p-value (Sig) = 0.460 >0.05, so we can’t reject (so retained) H 0 . Hence, variances can be assumed to be equal.

So, “Equal Variances assumed” case is to be taken up. Accordingly, the value of t statistic = -4.965 and the p-value (two tailed) = 0.000, so the p-value (one tailed) = 0.000/2 = 0.000 <0.05. Hence, H 0 got rejected and it can be said that urban outlets are giving lower sales in the first quarter. So, the claim stands.

Paired t-Test (Testing Difference between Means with Dependent Samples):

Analyze => Compare Means => Paired-Samples T-Test to open the dialogue box.
Enter the relevant pair of variables (paired samples) in the Paired Variables box.
After entering the paired samples, press Ok to have the output.

We continue with the example of Ventura Sales, and want to compare the average first quarter sales with the second quarter sales. Some sales promotion interventions were executed with an expectation of increasing sales in the second quarter. So, the Hypotheses:

H 0 : µ 1 = µ 2 (Where, µ 1 = Population Mean Sale of Quarter-I and µ 2 = Population Mean Sale of Quarter-II)

H 1 : µ 1 < µ 2 (representing the increase of sales i.e. implying the success of sales interventions)

Paired Samples Statistics

Descriptive table showing the sample size n=60, sample means x̄ 1 =72.02 and x̄ 2 =72.43.

As per the following output table (Paired Samples Test), sample mean of differences d̅ = -0.417 with standard deviation of differences sd = 8.011 and value of t statistic = -0.403. Accordingly, the p-value (two tailed) = 0.688, so the p-value (one tailed) = 0.688/2 = 0.344 > 0.05. So, there have not been sufficient reasons to Reject H 0 i.e. H 0 should be retained. So, the effectiveness (success) of the sales promotion interventions is doubtful i.e. it didn’t result in significant increase in sales, provided all other extraneous factors remain the same.

Paired Samples Test

t-Test Application One Sample

Experience Marketing Services reported that the typical American spends a mean of 144 minutes (2.4 hours) per day accessing the Internet via a mobile device. (Source: The 2014 Digital Marketer, available at ex.pn/1kXJifX.) To test the validity of this statement, you select a sample of 30 friends and family. The result for the time spent per day accessing the Internet via a mobile device (in minutes) are stored in Internet_Mobile_Time.csv file.

Is there evidence that the populations mean time spent per day accessing the Internet via a mobile device is different from 144 minutes? Use the p-value approach and a level of significance of 0.05

What assumption about the population distribution is needed to conduct the test in A?

Solution In R

[1] 1.224674

[1] 0.2305533

[1] “Accepted”

Independent t-test two sample

A hotel manager looks to enhance the initial impressions that hotel guests have when they check-in. Contributing to initial impressions is the time it takes to deliver a guest’s luggage to the room after check-in. A random sample of 20 deliveries on a particular day was selected each from Wing A and Wing B of the hotel. The data collated is given in Luggage.csv file. Analyze the data and determine whether there is a difference in the mean delivery times in the two wings of the hotel. (use alpha = 0.05).

Two Sample t-test data: WingA and WingB t = 5.1615, df = 38, p-value = 4.004e-06 alternative hypothesis: true difference in means is greater than 0 95 percent confidence interval: 1.531895 Inf sample estimates: mean of x mean of y 10.3975 8.1225 > t.test(WingA,WingB) Welch Two Sample t-test

t = 5.1615, df = 37.957, p-value = 8.031e-06 alternative hypothesis: true difference in means is not equal to 0 95 per cent confidence interval: 1.38269 3.16731 sample estimates: mean of x mean of y 10.3975 8.1225

Case Study- Titan Insurance Company

The Titan Insurance Company has just installed a new incentive payment scheme for its lift policy salesforce. It wants to have an early view of the success or failure of the new scheme. Indications are that the sales force is selling more policies, but sales always vary in an unpredictable pattern from month to month and it is not clear that the scheme has made a significant difference.

Life Insurance companies typically measure the monthly output of a salesperson as the total sum assured for the policies sold by that person during the month. For example, suppose salesperson X has, in the month, sold seven policies for which the sums assured are £1000, £2500, £3000, £5000, £10000, £35000. X’s output for the month is the total of these sums assured, £61,500.

Titan’s new scheme is that the sales force receives low regular salaries but are paid large bonuses related to their output (i.e. to the total sum assured of policies sold by them). The scheme is expensive for the company, but they are looking for sales increases which more than compensate. The agreement with the sales force is that if the scheme does not at least break even for the company, it will be abandoned after six months.

The scheme has now been in operation for four months. It has settled down after fluctuations in the first two months due to the changeover.

To test the effectiveness of the scheme, Titan has taken a random sample of 30 salespeople measured their output in the penultimate month before changeover and then measured it in the fourth month after the changeover (they have deliberately chosen months not too close to the changeover). Ta ble 1 shows t he outputs of the salespeople in Table 1

Data preparation

Since the given data are in 000, it will be better to convert them in thousands. Problem 1 Describe the five per cent significance test you would apply to these data to determine whether the new scheme has significantly raised outputs? What conclusion does the test lead to? Solution: It is asked that whether the new scheme has significantly raised the output, it is an example of the one-tailed t-test. Note: Two-tailed test could have been used if it was asked “new scheme has significantly changed the output” Mean of amount assured before the introduction of scheme = 68450 Mean of amount assured after the introduction of scheme = 72000 Difference in mean = 72000 – 68450 = 3550 Let, μ1 = Average sums assured by salesperson BEFORE changeover. μ2 = Average sums assured by salesperson AFTER changeover. H0: μ1 = μ2 ; μ2 – μ1 = 0 HA: μ1 < μ2 ; μ2 – μ1 > 0 ; true difference of means is greater than zero. Since population standard deviation is unknown, paired sample t-test will be used.

Since p-value (=0.06529) is higher than 0.05, we accept (fail to reject) NULL hypothesis. The new scheme has NOT significantly raised outputs .

Problem 2 Suppose it has been calculated that for Titan to break even, the average output must increase by £5000. If this figure is an alternative hypothesis, what is: (a) The probability of a type 1 error? (b) What is the p-value of the hypothesis test if we test for a difference of $5000? (c) Power of the test: Solution: 2.a. The probability of a type 1 error? Solution: Probability of Type I error = significant level = 0.05 or 5% 2.b. What is the p-value of the hypothesis test if we test for a difference of $5000? Solution: Let μ2 = Average sums assured by salesperson AFTER changeover. μ1 = Average sums assured by salesperson BEFORE changeover. μd = μ2 – μ1 H0: μd ≤ 5000 HA: μd > 5000 This is a right tail test.

P-value = 0.6499 2.c. Power of the test. Solution: Let μ2 = Average sums assured by salesperson AFTER changeover. μ1 = Average sums assured by salesperson BEFORE changeover. μd = μ2 – μ1 H0: μd = 4000 HA: μd > 0

H0 will be rejected if test statistics > t_critical. With α = 0.05 and df = 29, critical value for t statistic (or t_critical ) will be 1.699127. Hence, H0 will be rejected for test statistics ≥ 1.699127. Hence, H0 will be rejected if for 𝑥̅ ≥ 4368.176

Graphically,

Probability (type II error) is P(Do not reject H0 | H0 is false) Our NULL hypothesis is TRUE at μd = 0 so that H0: μd = 0 ; HA: μd > 0 Probability of type II error at μd = 5000

= P (Do not reject H0 | H0 is false) = P (Do not reject H0 | μd = 5000) = P (𝑥̅ < 4368.176 | μd = 5000) = P (t < | μd = 5000) = P (t < -0.245766) = 0.4037973

R Code: Now, β=0.5934752, Power of test = 1- β = 1- 0.5934752 = 0.4065248

While performing Hypothesis-Testing, Hypotheses can’t be proved or disproved since we have evidence from sample (s) only. At most, Hypotheses may be rejected or retained.
Use of the term “accept H 0 ” in place of “do not reject” should be avoided even if the test statistic falls in the Acceptance Region or p-value ≥ α. This simply means that the sample does not provide sufficient statistical evidence to reject the H 0 . Since we have tried to nullify (reject) H 0 but we haven’t found sufficient support to do so, we may retain it but it won’t be accepted.
Confidence Interval (Interval Estimation) can also be used for testing of hypotheses. If the hypothesis parameter falls within the confidence interval, we do not reject H 0 . Otherwise, if the hypothesised parameter falls outside the confidence interval i.e. confidence interval does not contain the hypothesized parameter, we reject H 0 .
Downey, A. B. (2014). Think Stat: Exploratory Data Analysis , 2 nd Edition, Sebastopol, CA: O’Reilly Media Inc
Fisher, R. A. (1956). Statistical Methods and Scientific Inference , New York: Hafner Publishing Company.
Hogg, R. V.; McKean, J. W. & Craig, A. T. (2013). Introduction to Mathematical Statistics , 7 th Edition, New Delhi: Pearson India.
IBM SPSS Statistics. (2020). IBM Corporation.
Levin, R. I.; Rubin, D. S; Siddiqui, M. H. & Rastogi, S. (2017). Statistics for Management , 8 th Edition, New Delhi: Pearson India.

If you want to get a detailed understanding of Hypothesis testing, you can take up this hypothesis testing in machine learning course. This course will also provide you with a certificate at the end of the course.

If you want to learn more about R programming and other concepts of Business Analytics or Data Science, sign up for Great Learning ’s PG program in Data Science and Business Analytics.

Top Free Courses

Top 30 Python Libraries To Know

Python Dictionary Append: How To Add Key/Value Pair?

¿Qué es la Ciencia de Datos? – Una Guía Completa [2024]

What is Data Science? – The Complete Guide

What is Time Complexity And Why Is It Essential?

Python NumPy Tutorial – 2024

R news and tutorials contributed by hundreds of R bloggers

Grammar as a biometric for authorship verification.

Posted on April 23, 2024 by Valerio Gherardi in R bloggers | 0 Comments

About a month ago we finally managed to drop (Nini et al. 2024) , “Authorship Verification based on the Likelihood Ratio of Grammar Models” , on the arXiv. Delving into topics such as authorship verification, grammar and forensics, was quite a detour for me, and I’d like to summarize here some of the ideas and learnings I got from working with all this new and interesting material.

The main qualitative idea put forward by Ref. (Nini et al. 2024) is that grammar is a fundamentally personal and unique trait of an individual , therefore providing a sort of “behavioural biometric”. One first goal of this work was to put this general principle under test, by applying it to the problem of Authorship Verification (AV): the process of validating whether a certain document was written by a claimed author. Concretely, we built an algorithm for AV that relies entirely on the grammatical features of the examined textual data, and compared it with the state-of-the-art methods for AV.

The results were very encouraging. In fact, our method actually turned out to be generally superior to the previous state-of-the-art on the benchmarks we examined. This is a notable result, keeping also into account that our method uses less textual information (only the grammar part) than other methods to perform its inferences.

The algorithm

I sketch here a pseudo-implementation of our method in R. For the fit of $k$ -gram models and perplexity computations, I use my package {kgrams} , which can be installed from CRAN. Model (hyper)parameters such as number of impostors, order of the $k$ -gram models, etc. are hardcoded, see (Nini et al. 2024) for details.

This is just for illustrating the essence of the method. For practical reasons, in the code chunk below I’m not reproducing the definition of the function extract_grammar() , which in our work is embodied by the POS-noise algorithm. This function should transform a regular sentence, such as “He wrote a sentence”, to its underlying grammatical structure, say “[Pronoun] [verb] a [noun]”.

To be used as follows:

The “score” computed by this algorithm turns out to be a good truthfulness predictor for the claimed authorship, higher scores being correlated with true attributions. If the impostor corpus is fixed once and for all, and if the pairs q_doc and auth_corpus are randomly sampled from a fixed joint distribution, we can set a threshold for score in such a way that the attribution criterion score > threshold maximizes some objective such as accuracy. This is, more or less, what we studied quantitatively in our paper.

Reflections on in silico Authorship Verification

The various ifs at the end of the previous sections led me to reflexionate on the applicability of machine-learning approaches, such as the one we discussed in our work, to real-life contexts.

As implied above, in order for a metric such as accuracy to represent a sensible measure of predictive performance, we should be able to regard the AV problems encountered in our favorite practical use case as random samples from some fixed population. In other words, we consider a stationary source of random authorship claims, and assume that our trained model is routinely used to verify claims from this source.

Now, while there are many circumstances in which the above assumptions make total sense, I think there are also interesting AV applications in which one is not interested in the long-run properties of the method but, rather, in a single inference. The real case of the poem “Shall I die?”, controversially attributed to Shakespeare in 1985, is an example of this kind. An approach to this case based on empirical Bayes is discussed in (Efron and Hastie 2021, vol. 6, sec. 6.2) . Although we may be able to build a reasonable impostor corpus to be used with this problem, it is not clear how one should come up with a relevant testing dataset of AV problems to empirically quantify uncertainty.

For cases such as the “Shall I die?” controversy, the machine-learning setting considered in our study is just an in silico model of real AV. As such, I believe it still provides useful indications on what could be good authorship indicators and work in general, but we must acknowledge the practical limitations in our way to quantify uncertainty. Other approaches, such as classical null hypothesis testing, may be more suited to this specific kind of AV problems.

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

More From Forbes

2024 rivian r1t only pickup truck to qualify for highest safety rating.

Share to Facebook
Share to Twitter
Share to Linkedin

The 2024 Rivian R1T electric pickup truck was designated a Top Safety Pick+, the highest award from ... [+] the nonprofit.

Earlier this month, the 2024 Rivian R1T electric pickup truck continued its safety legacy.

The battery-powered truck from the U.S.-based EV startup was once again the only pickup truck (and only electric truck) to earn a Top Safety Pick+ award from the Insurance Institute for Highway Safety (IIHS). The award recognizes vehicles that perform well in a series of safety tests, which have been updated with more stringent criteria for 2024.

In an email about the designation last week, Rivian wrote the company was “proud” to be the only pickup truck to earn the highest rating so far this year. It earned the award in 2022 and 2023. The only other truck that qualifies for the lower-level Top Safety Pick (no Plus) is the 2024 Toyota Tundra . Other Top Safety Pick+ winners in other categories include EVs like the 2024 Tesla Model Y, 2024 Genesis GV60, 2024 Genesis Electrified G80 and 2024 Hyundai Ioniq 6.

A spokesperson for the IIHS said in an email this week, “As a newer automaker, it’s encouraging to see that Rivian is clearly prioritizing good performance in IIHS safety tests.”

To qualify for a Top Safety Pick+ designation a vehicle must achieve “good” ratings in the small overlap front test, “acceptable” or “good” in an updated moderate overlap front test, “good” in the updated side test, “acceptable” or “good” in the pedestrian front crash prevention test and have “acceptable” or “good” headlights.

The updated moderate overlap test now has a dummy in the back seat to encourage second row safety. The R1T received a “good” rating in this test with sensors determining low risk of head, neck and chest injury during a crash. Seat belts stayed properly positioned and the dummy’s head stayed away from the front seatback.

An updated front crash test involving motorcycles and other vehicles offered weak results for most ... [+] small SUVs.

Apple iPhone 16: Unique All-New Design Promised In New Report

Huawei s pura 70 ultra beats iphone with pioneering new feature, meet the fintech billionaire making a fortune rewarding home renters, tougher testing.

On Wednesday night, the IIHS released results from a tougher front crash test for 2024. Initial findings concluded most small SUVs don’t pass the updated vehicle-to-vehicle test, which examines high-speed crashes and those involving a motorcycle or large truck. The tougher front crash prevention test puts cars at 31, 37 and 43 mph instead of the previous 12 and 25 mph speeds.

Of the first 10 small SUVs that underwent the updated testing, only one received a “good” rating: the Subaru Forester. The Honda CR-V and Toyota RAV4 were rated as “acceptable.” The Ford Escape, Hyundai Tucson and Jeep Compass earned “marginal” ratings, while the Chevrolet Equinox, Mazda CX-5, Mitsubishi Outlander and Volkswagen Taos all rated “poor.”

The R1T hasn’t undergone this new test, but in its pedestrian front crash prevention test it earned an Acceptable rating for 2024.

The other American testing agency, the National Highway Traffic Safety Administration (NHTSA), has yet to rate the 2024 Rivian R1T. Most pickups have weak safety ratings.

Editorial Standards
Reprints & Permissions

IMAGES

Introduction to Hypothesis Testing in R
How to Perform Hypothesis Testing in R using T-tests and μ-Tests
Hypothesis Tests in R
Hypothesis testing in R
Introduction to Hypothesis Testing in R
regression

VIDEO

Geostatistics 4
Proportion Hypothesis Testing, example 2
Hypothesis Testing
ISS Syllabus
One Sample t-test in R Programming
Introduction to Hypothesis Testing Part 2

COMMENTS

The Complete Guide: Hypothesis Testing in R
A hypothesis test is a formal statistical test we use to reject or fail to reject some statistical hypothesis. This tutorial explains how to perform the following hypothesis tests in R: One sample t-test. Two sample t-test. Paired samples t-test. We can use the t.test () function in R to perform each type of test:
Hypothesis Tests in R
This tutorial covers basic hypothesis testing in R. Normality tests. Shapiro-Wilk normality test. Kolmogorov-Smirnov test. Comparing central tendencies: Tests with continuous / discrete data. One-sample t-test : Normally-distributed sample vs. expected mean. Two-sample t-test: Two normally-distributed samples.
15.5: Hypothesis Tests for Regression Models
Testing the model as a whole. Okay, suppose you've estimated your regression model. The first hypothesis test you might want to try is one in which the null hypothesis that there is no relationship between the predictors and the outcome, and the alternative hypothesis is that the data are distributed in exactly the way that the regression model predicts.
The Complete Guide: Hypothesis Testing in R
A hypothesis test is a formal statistical test we use to reject or fail to reject some statistical hypothesis. This tutorial explains how to perform the following hypothesis tests in R: One sample t-test. Two sample t-test. Paired samples t-test. We can use the t.test () function in R to perform each type of test:
Hypothesis Testing
In the following tutorials, we demonstrate the procedure of hypothesis testing in R first with the intuitive critical value approach. Then we discuss the popular p-value approach as alternative. Lower Tail Test of Population Mean with Known Variance; ... Hierarchical Linear Model;
Hypothesis testing
11. Hypothesis testing. The process of induction is the process of assuming the simplest law that can be made to harmonize with our experience. This process, however, has no logical foundation but only a psychological one. It is clear that there are no grounds for believing that the simplest course of events will really happen.
Introduction to Hypothesis Testing in R
With this R hypothesis testing tutorial, learn about the decision errors, two-sample T-test with unequal variance, one-sample T-testing, formula syntax and subsetting samples in T-test and μ test in R. ... Goodness of Fit Tests in R. While fitting a statistical model for observed data, an analyst must identify how accurately the model analysis ...
Hypothesis testing in R
HYPOTHESIS TESTING IN R. Hypothesis testing is a statistical procedure used to make decisions or draw conclusions about the characteristics of a population based on information provided by a sample. ... allowing to determine whether a theoretical model accurately represents the observed data distribution. Pearson's Chi-squared test with chisq ...
R Handbook: Hypothesis Testing and p-values
Using a binomial test, the p -value is < 0.0001. (Actually, R reports it as < 2.2e-16, which is shorthand for the number in scientific notation, 2.2 x 10 -16, which is 0.00000000000000022, with 15 zeros after the decimal point.) Assuming an alpha of 0.05, since the p -value is less than alpha, we reject the null hypothesis.
Hypothesis Testing in R Course
Discover Hypothesis Testing in R. Hypothesis testing lets you ask questions about your datasets and answer them in a statistically rigorous way. In this course, you'll learn how and when to use common tests like t-tests, proportion tests, and chi-square tests. You'll gain a deep understanding of how they work and the assumptions that underlie them.
Hypothesis Testing in R Programming
Hypothesis Testing in R Programming. Hypothesis testing is a statistical method used to determine whether there is enough evidence to reject a null hypothesis in favor of an alternative hypothesis. In R programming, you can perform various types of hypothesis tests, such as t-tests, chi-squared tests, and ANOVA tests, among others.
Hypothesis Testing in R: A Comprehensive Guide with Code and Examples
Understanding the Hypothesis Testing Process. Before diving into code examples, let's grasp the key concepts of hypothesis testing: Null Hypothesis (H0): This is the default assumption that there is no significant effect, difference, or relationship in the population. It's often denoted as H0. Alternative Hypothesis (Ha): This is the ...
17: Hypothesis Testing in R
This page titled 17: Hypothesis Testing in R is shared under a CC BY-NC 2.0 license and was authored, remixed, and/or curated by Russell A. Poldrack via source content that was edited to the style and standards of the LibreTexts platform; a detailed edit history is available upon request.
linearHypothesis function
rhs. right-hand-side vector for hypothesis, with as many entries as rows in the hypothesis matrix; can be omitted, in which case it defaults to a vector of zeroes. For a multivariate linear model, rhs is a matrix, defaulting to 0. This argument isn't available for F-tests for linear mixed models. singular.ok.
Hypothesis Testing in R Programming
Based on the results of the testing, the hypothesis is either selected or rejected. This concept is known as Statistical Inference. In this article, we'll discuss the four-step process of hypothesis testing, One sample T-Testing, Two-sample T-Testing, Directional Hypothesis, one sample -test, two samples -test and correlation test in R ...
Hypothesis Testing in R
Hypothesis Testing in R. In this course, you'll learn advanced statistical concepts like significance testing and multi-category chi-square testing, which will help you perform more advanced data analysis using R. Enroll for free. Part of the Data Analyst (R), and Probability and Statistics with R paths. 4.8 (359 reviews)
Hypothesis Testing in R
About this Guided Project. Welcome to this project-based course Hypothesis Testing in R. In this project, you will learn how to perform extensive hypothesis tests for one and two samples in R. By the end of this 2-hour long project, you will understand the rationale behind performing hypothesis testing. Also, you will learn how to perform ...
How to Compare Nested Models in R
Notice the the P-value of the F-Test of the fit1 model is 0.0008866 which actually actually tests the Null Hypothesis that "all the beta coefficients are zero" versus the alternative hypothesis that "at least one beta coefficient is not zero". Since we have only one beta coefficient, the pop15 the p-value of the F-Test is the same with the p-value of the T-Test as we can see above.
15.5: Hypothesis Tests for Regression Models
Testing the model as a whole; Tests for individual coefficients; Running the hypothesis tests in R; So far we've talked about what a regression model is, how the coefficients of a regression model are estimated, and how we quantify the performance of the model (the last of these, incidentally, is basically our measure of effect size).
Chapter 9 Linear mixed-effects models
Chapter 9 Linear mixed-effects models. In this Chapter, we will look at how to estimate and perform hypothesis tests for linear mixed-effects models. The main workhorse for estimating linear mixed-effects models is the lme4 package (Bates et al. 2023).This package allows you to formulate a wide variety of mixed-effects and multilevel models through an extension of the R formula syntax.
Hypothesis Testing in R- Introduction Examples and Case Study
Here, we take an example of Testing of Mean: 1. Setting up the Hypothesis: This step is used to define the problem after considering the business situation and deciding the relevant hypotheses H 0 and H 1, after mentioning the hypotheses in the business language.
Hypothesis Testing in R
Now we can perform a one-sample t-test. t.test (x = weights, mu = 310) One Sample t-test data: weights t = 0.045145, df = 12, p-value = 0.9647 alternative hypothesis: true mean is not equal to 310 95 percent confidence interval: 306.3644 313.7895 sample estimates: mean of x 310.0769. From the output we can see:
Grammar as a biometric for Authorship Verification
The algorithm I sketch here a pseudo-implementation of our method in R. For the fit of $k$-gram models and perplexity computations, I use my package {kgrams}, which can be installed from CRAN. Model (hyper)parameters such as number of impostors, order of the $k$-gram models, etc. are hardcoded, see (Nini et al. 2024) for details.
2024 Rivian R1T Only Pickup Truck To Qualify For Highest ...
The tougher front crash prevention test puts cars at 31, 37 and 43 mph instead of the previous 12 and 25 mph speeds. Of the first 10 small SUVs that underwent the updated testing, only one ...

Hypothesis Tests in R

Hypothesis Testing

The Problem of Induction

Falsification

Null and Alternative Hypotheses

Type I vs. Type II Errors

Statistical Significance vs. Importance

Science vs. Non-science

Example Data

Variable Types

Normality Tests

The Shapiro-Wilk Normality Test

The Kolmogorov-Smirnov Test

Modality Tests of Samples

One Sample T-Test (One-Sided)

Box-and-Whisker Chart

Two-Sample T-Test

Wilcoxen Rank Sum Test (Mann-Whitney U-Test)

Weighted Two-Sample T-Test

Comparing Proportions: Tests with Categorical Data

Chi-Squared Contingency Analysis / Test of Independence

Weighted Chi-Squared Contingency Analysis

Comparing Categorical and Continuous Variables

Kruskal-Wallace One-Way Analysis of Variance

The Complete Guide: Hypothesis Testing in R

Example 1: One Sample t-test in R

Example 2: Two Sample t-test in R

Example 3: Paired Samples t-test in R

Additional Resources

How to Calculate Mode from Frequency Table (With Examples)

An R Introduction to Statistics

Hypothesis Testing

R Tutorial eBook

R Tutorials

11 Hypothesis testing

11.1 A menagerie of hypotheses

11.1.1 Research hypotheses versus statistical hypotheses

11.1.2 Null hypotheses and alternative hypotheses

11.2 Two types of errors

11.3 Test statistics and sampling distributions

11.4 Making decisions

11.4.1 Critical regions and critical values

11.4.2 A note on statistical “significance”

11.4.3 The difference between one sided and two sided tests

11.5 The \(p\) value of a test

11.5.1 A softer view of decision making

11.5.2 The probability of extreme data

11.5.3 A common mistake

11.6 Reporting the results of a hypothesis test

11.6.1 The issue

11.6.2 Two proposed solutions

11.7 Running the hypothesis test in practice

11.8 Effect size, sample size and power

11.8.1 The power function

11.8.2 Effect size

11.8.3 Increasing the power of your study

11.9 Some issues to consider

11.9.1 Neyman versus Fisher

11.9.2 Bayesians versus frequentists

11.9.3 Traps

11.10 Summary

Share This Book

HYPOTHESIS TESTING IN R

NORMALITY TESTS

Shapiro Wilk normality test

Lilliefors normality test

GOODNESS OF FIT TESTS

Pearson's Chi-squared test with chisq.test()

Kolmogorov-Smirnov test with ks.test()

Wilcoxon signed rank test

Wilcoxon rank sum test (Mann-Whitney U test)

Kruskal Wallis rank sum test (H test)

OTHER TYPES OF TESTS

T-test to compare means

F test with var.test() to compare two variances

Test for proportions with prop.test()

Summary and Analysis of Extension Program Evaluation in R

Hypothesis Testing and p-values

Initial comments

Statistical inference