Reliability vs. Validity in Research | Difference, Types and Examples

Published on July 3, 2019 by Fiona Middleton. Revised on June 22, 2023.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method, technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research design, planning your methods, and writing up your results, especially in quantitative research. Failing to do so can lead to several types of research bias and seriously affect your work.

Table of contents

  • Understanding reliability vs validity
  • How are reliability and validity assessed
  • How to ensure validity and reliability in your research
  • Where to write about reliability and validity in a thesis
  • Other interesting articles

Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.

What is reliability?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.

What is validity?

Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.

For example, suppose you use a thermometer to measure the temperature of a liquid sample under carefully controlled conditions. If the thermometer shows different temperatures each time, even though the sample’s temperature stays the same, it is probably malfunctioning, and therefore its measurements are not valid.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.

Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.


Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability, such as test-retest, interrater, parallel forms, and internal consistency reliability, can be estimated through various statistical methods.
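For example, test-retest reliability can be estimated by giving the same group the same questionnaire twice and correlating the two sets of scores. The short sketch below is purely illustrative (the scores are invented and Python with NumPy is assumed); it is not part of the original article.

```python
# Illustrative only: estimating test-retest reliability as the correlation
# between two administrations of the same questionnaire (made-up scores).
import numpy as np

time_1 = np.array([24, 31, 28, 35, 22, 30, 27, 33])  # first administration
time_2 = np.array([25, 30, 29, 34, 21, 31, 26, 35])  # same participants, two weeks later

r = np.corrcoef(time_1, time_2)[0, 1]
print(f"Test-retest reliability estimate: r = {r:.2f}")
# A correlation close to 1 suggests the measure is stable over time.
```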

Types of validity

The validity of a measurement can be estimated based on three main types of evidence: content, construct, and criterion validity. Each type can be evaluated through expert judgement or statistical methods.

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment) and external validity (the generalizability of the results).

The reliability and validity of your results depend on creating a strong research design, choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data.

  • Choose appropriate methods of measurement

Ensure that your method and measurement technique are high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardized questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or findings of previous studies, and the questions should be carefully and precisely worded.

  • Use appropriate sampling methods to select your subjects

To produce valid and generalizable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population. Failing to do so can lead to sampling bias and selection bias.

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible.

  • Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific behaviors or responses will be counted, and make sure questions are phrased the same way each time. Failing to do so can lead to errors such as omitted variable bias or information bias.

  • Standardize the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions, preferably in a properly randomized setting. Failing to do so can lead to a placebo effect, Hawthorne effect, or other demand characteristics. If participants can guess the aims or objectives of a study, they may attempt to act in more socially desirable ways.

It’s appropriate to discuss reliability and validity in various sections of your thesis, dissertation, or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

If you want to know more about statistics, methodology, or research bias, make sure to check out some of our other articles with explanations and examples.

  • Normal distribution
  • Degrees of freedom
  • Null hypothesis
  • Discourse analysis
  • Control groups
  • Mixed methods research
  • Non-probability sampling
  • Quantitative research
  • Ecological validity

Research bias

  • Rosenthal effect
  • Implicit bias
  • Cognitive bias
  • Selection bias
  • Negativity bias
  • Status quo bias

Grad Coach

Validity & Reliability In Research

A Plain-Language Explanation (With Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Kerryn Warren (PhD) | September 2023

Validity and reliability are two related but distinctly different concepts within research. Understanding what they are and how to achieve them is critically important to any research project. In this post, we’ll unpack these two concepts as simply as possible.

This post is based on our popular online course, Research Methodology Bootcamp. In the course, we unpack the basics of methodology using straightforward language and loads of examples. If you’re new to academic research, you definitely want to use this link to get 50% off the course (limited-time offer).

Overview: Validity & Reliability

  • The big picture
  • Validity 101
  • Reliability 101 
  • Key takeaways

First, The Basics…

First, let’s start with a big-picture view and then we can zoom in to the finer details.

Validity and reliability are two incredibly important concepts in research, especially within the social sciences. Both validity and reliability have to do with the measurement of variables and/or constructs – for example, job satisfaction, intelligence, productivity, etc. When undertaking research, you’ll often want to measure these types of constructs and variables and, at the simplest level, validity and reliability are about ensuring the quality and accuracy of those measurements .

As you can probably imagine, if your measurements aren’t accurate or there are quality issues at play when you’re collecting your data, your entire study will be at risk. Therefore, validity and reliability are very important concepts to understand (and to get right). So, let’s unpack each of them.


What Is Validity?

In simple terms, validity (also called “construct validity”) is all about whether a research instrument accurately measures what it’s supposed to measure .

For example, let’s say you have a set of Likert scales that are supposed to quantify someone’s level of overall job satisfaction. If this set of scales focused purely on only one dimension of job satisfaction, say pay satisfaction, this would not be a valid measurement, as it only captures one aspect of the multidimensional construct. In other words, pay satisfaction alone is only one contributing factor toward overall job satisfaction, and therefore it’s not a valid way to measure someone’s job satisfaction.


Oftentimes in quantitative studies, the way in which the researcher or survey designer interprets a question or statement can differ from how the study participants interpret it. Given that respondents don’t have the opportunity to ask clarifying questions when taking a survey, it’s easy for these sorts of misunderstandings to crop up. Naturally, if the respondents are interpreting the question in the wrong way, the data they provide will be pretty useless. Therefore, ensuring that a study’s measurement instruments are valid – in other words, that they are measuring what they intend to measure – is incredibly important.

There are various types of validity and we’re not going to go down that rabbit hole in this post, but it’s worth quickly highlighting the importance of making sure that your research instrument is tightly aligned with the theoretical construct you’re trying to measure. In other words, you need to pay careful attention to how the key theories within your study define the thing you’re trying to measure – and then make sure that your survey presents it in the same way.

For example, sticking with the “job satisfaction” construct we looked at earlier, you’d need to clearly define what you mean by job satisfaction within your study (and this definition would of course need to be underpinned by the relevant theory). You’d then need to make sure that your chosen definition is reflected in the types of questions or scales you’re using in your survey. Simply put, you need to make sure that your survey respondents are perceiving your key constructs in the same way you are. Or, even if they’re not, that your measurement instrument is capturing the necessary information that reflects your definition of the construct at hand.

If all of this talk about constructs sounds a bit fluffy, be sure to check out Research Methodology Bootcamp, which will provide you with a rock-solid foundational understanding of all things methodology-related. Remember, you can take advantage of our 60% discount offer using this link.


What Is Reliability?

As with validity, reliability is an attribute of a measurement instrument – for example, a survey, a weight scale or even a blood pressure monitor. But while validity is concerned with whether the instrument is measuring the “thing” it’s supposed to be measuring, reliability is concerned with consistency and stability. In other words, reliability reflects the degree to which a measurement instrument produces consistent results when applied repeatedly to the same phenomenon, under the same conditions.

As you can probably imagine, a measurement instrument that achieves a high level of consistency is naturally more dependable (or reliable) than one that doesn’t – in other words, it can be trusted to provide consistent measurements. And that, of course, is what you want when undertaking empirical research. If you think about it within a more domestic context, just imagine if you found that your bathroom scale gave you a different number every time you hopped on and off of it – you wouldn’t feel too confident in its ability to measure the variable that is your body weight 🙂

It’s worth mentioning that reliability also extends to the person using the measurement instrument. For example, if two researchers use the same instrument (let’s say a measuring tape) and they get different measurements, there’s likely an issue in terms of how one (or both) of them are using the measuring tape. So, when you think about reliability, consider both the instrument and the researcher as part of the equation.

As with validity, there are various types of reliability and various tests that can be used to assess the reliability of an instrument. A popular one that you’ll likely come across for survey instruments is Cronbach’s alpha, which is a statistical measure that quantifies the degree to which items within an instrument (for example, a set of Likert scales) measure the same underlying construct. In other words, Cronbach’s alpha indicates how closely related the items are and whether they consistently capture the same concept.
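As a minimal sketch of what such a calculation can look like in practice (the item scores below are invented and this is not code from the original post), Cronbach’s alpha can be computed directly from raw item responses using its standard formula:

```python
# Cronbach's alpha from raw item scores, using the standard formula
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array with rows = respondents and columns = Likert items."""
    k = items.shape[1]                              # number of items in the scale
    item_variances = items.var(axis=0, ddof=1)      # sample variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses: 6 respondents answering 4 job-satisfaction items (1-5).
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
# Values above roughly 0.7 are conventionally read as acceptable internal
# consistency, though cut-offs vary by field and purpose.
```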

Reliability reflects whether an instrument produces consistent results when applied to the same phenomenon, under the same conditions.

Recap: Key Takeaways

Alright, let’s quickly recap to cement your understanding of validity and reliability:

  • Validity is concerned with whether an instrument (e.g., a set of Likert scales) is measuring what it’s supposed to measure
  • Reliability is concerned with whether that measurement is consistent and stable when measuring the same phenomenon under the same conditions.

In short, validity and reliability are both essential to ensuring that your data collection efforts deliver high-quality, accurate data that help you answer your research questions. So, be sure to always pay careful attention to the validity and reliability of your measurement instruments when collecting and analysing data. As the adage goes, “rubbish in, rubbish out” – make sure that your data inputs are rock-solid.


How to Determine the Validity and Reliability of an Instrument By: Yue Li

Validity and reliability are two important factors to consider when developing and testing any instrument (e.g., content assessment test, questionnaire) for use in a study. Attention to these considerations helps to ensure the quality of your measurement and of the data collected for your study.

Understanding and Testing Validity

Validity refers to the degree to which an instrument accurately measures what it intends to measure. Three common types of validity for researchers and evaluators to consider are content, construct, and criterion validities.

  • Content validity indicates the extent to which items adequately measure or represent the content of the property or trait that the researcher wishes to measure. Subject matter expert review is often a good first step in instrument development to assess content validity, in relation to the area or field you are studying.
  • Construct validity indicates the extent to which a measurement method accurately represents a construct (e.g., a latent variable or phenomenon that can’t be measured directly, such as a person’s attitude or belief) and produces observations distinct from those produced by a measure of another construct. Common methods to assess construct validity include, but are not limited to, factor analysis, correlation tests, and item response theory models (including the Rasch model).
  • Criterion-related validity indicates the extent to which the instrument’s scores correlate with an external criterion (i.e., usually another measurement from a different instrument), either at present (concurrent validity) or in the future (predictive validity). A common measurement of this type of validity is the correlation coefficient between two measures, as illustrated in the sketch below.
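The hypothetical sketch below (invented scores, not from the original post; SciPy is assumed to be available) estimates criterion-related validity by correlating a new instrument’s scores with an established criterion measure:

```python
# Illustrative only: criterion-related validity as the correlation between a
# new instrument's scores and an established external criterion measure.
import numpy as np
from scipy.stats import pearsonr

new_instrument = np.array([12, 18, 9, 22, 15, 20, 11, 17])   # new screening score
criterion      = np.array([40, 55, 33, 68, 47, 61, 38, 52])  # established measure

r, p = pearsonr(new_instrument, criterion)
print(f"Criterion-related validity coefficient: r = {r:.2f} (p = {p:.3f})")
# Measuring the criterion at the same time gives concurrent validity evidence;
# measuring it later gives predictive validity evidence.
```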

Oftentimes, when developing, modifying, and interpreting the validity of a given instrument, rather than view or test each type of validity individually, researchers and evaluators test for evidence of several different forms of validity collectively (e.g., see Samuel Messick’s work regarding validity).

Understanding and Testing Reliability

Reliability refers to the degree to which an instrument yields consistent results. Common measures of reliability include internal consistency, test-retest, and inter-rater reliabilities.

  • Internal consistency reliability looks at the consistency of scores on individual items of an instrument against the scores on the set of items, or subscale, to which they belong, where a subscale typically consists of several items measuring a single construct. Cronbach’s alpha is one of the most common methods for checking internal consistency reliability. Group variability, score reliability, number of items, sample size, and difficulty level of the instrument can also affect the Cronbach’s alpha value.
  • Test-retest measures the correlation between scores from one administration of an instrument to another, usually within an interval of 2 to 3 weeks. Unlike pre-post tests, no treatment occurs between the first and second administrations of the instrument when assessing test-retest reliability. A similar type of reliability, called alternate forms, involves using slightly different forms or versions of an instrument to see if different versions yield consistent results.
  • Inter-rater reliability checks the degree of agreement among raters (i.e., those scoring or completing items on an instrument). Common situations involving more than one rater occur when more than one person conducts classroom observations using an observation protocol, or scores an open-ended test using a rubric or other standard protocol. Kappa statistics, correlation coefficients, and intra-class correlation (ICC) coefficients are some of the commonly reported measures of inter-rater reliability; a brief illustration of one of these follows.
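The following minimal sketch (invented ratings, not from the original post; scikit-learn is assumed to be available) computes Cohen’s kappa for two raters scoring the same ten open-ended answers on a 0-3 rubric:

```python
# Illustrative only: Cohen's kappa for agreement between two raters scoring
# the same answers with a rubric (levels 0-3). Kappa corrects raw agreement
# for the agreement expected by chance alone.
from sklearn.metrics import cohen_kappa_score

rater_a = [2, 3, 1, 0, 2, 3, 1, 2, 0, 3]
rater_b = [2, 3, 1, 1, 2, 3, 0, 2, 0, 3]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa = {kappa:.2f}")
# Values near 1 indicate strong inter-rater reliability; values near 0 indicate
# agreement no better than chance.
```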

Developing a valid and reliable instrument usually requires multiple iterations of piloting and testing, which can be resource intensive. Therefore, when available, I suggest using already established valid and reliable instruments, such as those published in peer-reviewed journal articles. However, even when using these instruments, you should re-check validity and reliability using the methods of your study and your own participants’ data before running additional statistical analyses. This process will confirm that the instrument performs as intended in your study with the population you are studying, even if your purpose and population are not identical to those for which the instrument was initially developed. Below are a few additional, useful readings to further inform your understanding of validity and reliability.

Resources for Understanding and Testing Reliability

  • American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: Authors.
  • Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.
  • Cronbach, L. (1990). Essentials of psychological testing. New York, NY: Harper & Row.
  • Carmines, E., & Zeller, R. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage Publications.
  • Messick, S. (1987). Validity. ETS Research Report Series, 1987: i–208. doi: 10.1002/j.2330-8516.1987.tb00244.x
  • Liu, X. (2010). Using and developing measurement instruments in science education: A Rasch modeling approach. Charlotte, NC: Information Age.

Uncomplicated Reviews of Educational Research Methods

Instrument, Validity, Reliability


Part I: The Instrument

Instrument is the general term that researchers use for a measurement device (survey, test, questionnaire, etc.). To help distinguish between instrument and instrumentation, consider that the instrument is the device and instrumentation is the course of action (the process of developing, testing, and using the device).

Instruments fall into two broad categories, researcher-completed and subject-completed, distinguished by those instruments that researchers administer versus those that are completed by participants. Researchers choose which type of instrument, or instruments, to use based on the research question.

Usability refers to the ease with which an instrument can be administered, interpreted by the participant, and scored/interpreted by the researcher. Example usability problems include:

  • Students are asked to rate a lesson immediately after class, but there are only a few minutes before the next class begins (problem with administration).
  • Students are asked to keep self-checklists of their after school activities, but the directions are complicated and the item descriptions confusing (problem with interpretation).
  • Teachers are asked about their attitudes regarding school policy, but some questions are worded poorly which results in low completion rates (problem with scoring/interpretation).

Validity and reliability concerns (discussed below) will help alleviate usability issues. For now, we can identify five usability considerations:

  • How long will it take to administer?
  • Are the directions clear?
  • How easy is it to score?
  • Do equivalent forms exist?
  • Have any problems been reported by others who used it?

It is best to use an existing instrument, one that has been developed and tested numerous times, such as can be found in the Mental Measurements Yearbook. We will turn to why next.

Part II: Validity

Validity is the extent to which an instrument measures what it is supposed to measure and performs as it is designed to perform. It is rare, if not impossible, for an instrument to be 100% valid, so validity is generally measured in degrees. As a process, validation involves collecting and analyzing data to assess the accuracy of an instrument. There are numerous statistical tests and measures to assess the validity of quantitative instruments, which generally involves pilot testing. The remainder of this discussion focuses on external validity and content validity.

External validity is the extent to which the results of a study can be generalized from a sample to a population. Establishing external validity for an instrument, then, follows directly from sampling. Recall that a sample should be an accurate representation of a population, because the total population may not be available. An instrument that is externally valid helps obtain population generalizability, or the degree to which a sample represents the population.

Content validity refers to the appropriateness of the content of an instrument. In other words, do the measures (questions, observation logs, etc.) accurately assess what you want to know? This is particularly important with achievement tests. Consider that a test developer wants to maximize the validity of a unit test for 7th grade mathematics. This would involve taking representative questions from each of the sections of the unit and evaluating them against the desired outcomes.

Part III: Reliability

Reliability can be thought of as consistency. Does the instrument consistently measure what it is intended to measure? It is not possible to calculate reliability exactly; however, there are four general estimators that you may encounter in reading research:

  • Inter-Rater/Observer Reliability : The degree to which different raters/observers give consistent answers or estimates.
  • Test-Retest Reliability : The consistency of a measure evaluated over time.
  • Parallel-Forms Reliability: The reliability of two tests constructed the same way, from the same content.
  • Internal Consistency Reliability: The consistency of results across items, often measured with Cronbach’s Alpha.

Relating Reliability and Validity

Reliability is directly related to the validity of the measure. There are several important principles. First, a test can be considered reliable, but not valid. Consider the SAT, used as a predictor of success in college. It is a reliable test (high scores relate to high GPA), though only a moderately valid indicator of success (due to the lack of structured environment – class attendance, parent-regulated study, and sleeping habits – each holistically related to success).

Second, validity is more important than reliability. Using the above example, college admissions may consider the SAT a reliable test, but not necessarily a valid measure of other quantities colleges seek, such as leadership capability, altruism, and civic involvement. The combination of these aspects, alongside the SAT, is a more valid measure of the applicant’s potential for graduation, later social involvement, and generosity (alumni giving) toward the alma mater.

Finally, the most useful instrument is both valid and reliable. Proponents of the SAT argue that it is both. It is a moderately reliable predictor of future success and a moderately valid measure of a student’s knowledge in Mathematics, Critical Reading, and Writing.

Part IV: Validity and Reliability in Qualitative Research

Thus far, we have discussed instrumentation as related to mostly quantitative measurement. Establishing validity and reliability in qualitative research can be less precise, though participant/member checks, peer evaluation (another researcher checks the researcher’s inferences based on the instrument; Denzin & Lincoln, 2005), and multiple methods (keyword: triangulation) are convincingly used. Some qualitative researchers reject the concept of validity due to the constructivist viewpoint that reality is unique to the individual, and cannot be generalized. These researchers argue for a different standard for judging research quality. For a more complete discussion of trustworthiness, see Lincoln and Guba’s (1985) chapter.


Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021. Revised on October 26, 2023.

A researcher must test the collected data before drawing any conclusions. Every research design needs to be concerned with reliability and validity to measure the quality of the research.

What is Reliability?

Reliability refers to the consistency of the measurement. Reliability shows how trustworthy the score of the test is. If the collected data shows the same results after being tested using various methods and sample groups, the information is reliable. Note, however, that a reliable method is not automatically a valid one.

Example: If you weigh yourself on a weighing scale throughout the day, you’ll get the same results. These are considered reliable results obtained through repeated measures.

Example: If a teacher conducts the same math test with her students and repeats it the next week with the same questions, and she gets the same scores, then the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. Validity shows how a specific test is suitable for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid. 

If the method of measuring is accurate, then it’ll produce accurate results. If a method is not reliable, it is probably not valid either. However, a reliable method is not necessarily a valid one, as the examples below show.

Example:  Your weighing scale shows different results each time you weigh yourself within a day even after handling it carefully, and weighing before and after meals. Your weighing machine might be malfunctioning. It means your method had low reliability. Hence you are getting inaccurate or inconsistent results that are not valid.

Example: Suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is repeated with many groups. If you get the same responses from the various participants, the reliability of the questionnaire is high, which supports the validity of the conclusions drawn about the product.

Most of the time, validity is difficult to measure even though the process of measurement is reliable. It isn’t easy to interpret the real situation.

Example: If the weighing scale shows the same result, let’s say 70 kg each time, even though your actual weight is 55 kg, then the weighing scale is malfunctioning. It shows consistent results, so its reliability is high, but the measurements do not reflect your true weight, so the method has low validity.

Internal Vs. External Validity

One of the key features of randomised designs is that they have significantly high internal and external validity.

Internal validity is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and any external factor should not influence the variables.

Example: age, level, height, and grade.

External validity is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.


Threats to Internal Validity

Threats to External Validity

How to Assess Reliability and Validity

Reliability can be measured by comparing the consistency of the procedure and its results. There are various methods to measure validity and reliability. Reliability can be measured through various statistical methods depending on the type of reliability, as explained below:

Types of Reliability

Types of Validity

As we discussed above, the reliability of a measurement alone cannot determine its validity. Validity is difficult to measure even if the method is reliable. The following types of tests are conducted to measure validity.


How to Increase Reliability?

  • Use an appropriate questionnaire to measure the competency level.
  • Ensure a consistent environment for participants
  • Make the participants familiar with the criteria of assessment.
  • Train the participants appropriately.
  • Analyse the research items regularly to avoid poor performance.

How to Increase Validity?

Ensuring validity is also not an easy job. Proper methods to help ensure validity are given below:

  • Reactivity should be minimised as a first concern.
  • The Hawthorne effect should be reduced.
  • The respondents should be motivated.
  • The intervals between the pre-test and post-test should not be lengthy.
  • Dropout rates should be avoided.
  • The inter-rater reliability should be ensured.
  • Control and experimental groups should be matched with each other.

How to Implement Reliability and Validity in your Thesis?

According to experts, it is helpful to implement the concepts of reliability and validity throughout your research; they are especially important to address in a thesis or dissertation.

Frequently Asked Questions

What is reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.

How to improve reliability of an experiment?

  • Standardise procedures and instructions.
  • Use consistent and precise measurement tools.
  • Train observers or raters to reduce subjective judgments.
  • Increase sample size to reduce random errors.
  • Conduct pilot studies to refine methods.
  • Repeat measurements or use multiple methods.
  • Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.



Reliability Vs Validity


Reliability and validity are two important concepts in research that are used to evaluate the quality of measurement instruments or research studies.

Reliability

Reliability refers to the degree to which a measurement instrument or research study produces consistent and stable results over time, across different observers or raters, or under different conditions.

In other words, reliability is the extent to which a measurement instrument or research study produces results that are free from random error. A reliable measurement instrument or research study should produce similar results each time it is used or conducted, regardless of who is using it or conducting it.

Validity

Validity, on the other hand, refers to the degree to which a measurement instrument or research study accurately measures what it is supposed to measure or tests what it is supposed to test.

In other words, validity is the extent to which a measurement instrument or research study measures or tests what it claims to measure or test. A valid measurement instrument or research study should produce results that accurately reflect the concept or construct being measured or tested.

Difference Between Reliability Vs Validity

Here’s a comparison table that highlights the differences between reliability and validity:



J Family Med Prim Care, v.4(3); Jul-Sep 2015

Validity, reliability, and generalizability in qualitative research

Lawrence Leung

1 Department of Family Medicine, Queen's University, Kingston, Ontario, Canada

2 Centre of Studies in Primary Care, Queen's University, Kingston, Ontario, Canada

In general practice, qualitative research contributes as significantly as quantitative research, in particular regarding psycho-social aspects of patient-care, health services provision, policy setting, and health administrations. In contrast to quantitative research, qualitative research as a whole has been constantly critiqued, if not disparaged, by the lack of consensus for assessing its quality and robustness. This article illustrates with five published studies how qualitative research can impact and reshape the discipline of primary care, spiraling out from clinic-based health screening to community-based disease monitoring, evaluation of out-of-hours triage services to provincial psychiatric care pathways model and finally, national legislation of core measures for children's healthcare insurance. Fundamental concepts of validity, reliability, and generalizability as applicable to qualitative research are then addressed with an update on the current views and controversies.

Nature of Qualitative Research versus Quantitative Research

The essence of qualitative research is to make sense of and recognize patterns among words in order to build up a meaningful picture without compromising its richness and dimensionality. Like quantitative research, qualitative research aims to seek answers for questions of “how, where, when, who and why” with a perspective to build a theory or refute an existing theory. Unlike quantitative research which deals primarily with numerical data and their statistical interpretations under a reductionist, logical and strictly objective paradigm, qualitative research handles nonnumerical information and their phenomenological interpretation, which inextricably tie in with human senses and subjectivity. While human emotions and perspectives from both subjects and researchers are considered undesirable biases confounding results in quantitative research, the same elements are considered essential and inevitable, if not treasurable, in qualitative research as they invariably add extra dimensions and colors to enrich the corpus of findings. However, the issue of subjectivity and contextual ramifications has fueled incessant controversies regarding yardsticks for quality and trustworthiness of qualitative research results for healthcare.

Impact of Qualitative Research upon Primary Care

In many ways, qualitative research contributes significantly, if not more so than quantitative research, to the field of primary care at various levels. Five qualitative studies are chosen to illustrate how various methodologies of qualitative research helped in advancing primary healthcare, from novel monitoring of chronic obstructive pulmonary disease (COPD) via mobile-health technology,[ 1 ] informed decision for colorectal cancer screening,[ 2 ] triaging out-of-hours GP services,[ 3 ] evaluating care pathways for community psychiatry[ 4 ] and finally prioritization of healthcare initiatives for legislation purposes at national levels.[ 5 ] With the recent advances of information technology and mobile connecting devices, self-monitoring and management of chronic diseases via tele-health technology may seem beneficial to both the patient and healthcare provider. Recruiting COPD patients who were given tele-health devices that monitored lung functions, Williams et al. [ 1 ] conducted phone interviews, analyzed their transcripts via a grounded theory approach, and identified themes which enabled them to conclude that such mobile-health setup and application helped to engage patients with better adherence to treatment and overall improvement in mood. Such positive findings were in contrast to previous studies, which opined that elderly patients were often challenged by operating computer tablets,[ 6 ] or, conversing with the tele-health software.[ 7 ] To explore the content of recommendations for colorectal cancer screening given out by family physicians, Wackerbarth et al. [ 2 ] conducted semi-structured interviews with subsequent content analysis and found that most physicians delivered information to enrich patient knowledge with little regard to patients’ true understanding, ideas, and preferences in the matter. These findings suggested room for improvement for family physicians to better engage their patients in recommending preventative care. Faced with various models of out-of-hours triage services for GP consultations, Egbunike et al. [ 3 ] conducted thematic analysis on semi-structured telephone interviews with patients and doctors in various urban, rural and mixed settings. They found that the efficiency of triage services remained a prime concern from both users and providers, among issues of access to doctors and unfulfilled/mismatched expectations from users, which could arouse dissatisfaction and legal implications. In the UK, a care pathways model for community psychiatry had been introduced but its benefits were unclear. Khandaker et al. [ 4 ] hence conducted a qualitative study using semi-structured interviews with medical staff and other stakeholders; adopting a grounded-theory approach, major themes emerged which included improved equality of access, more focused logistics, increased work throughput and better accountability for community psychiatry provided under the care pathway model. Finally, at the US national level, Mangione-Smith et al. [ 5 ] employed a modified Delphi method to gather consensus from a panel of nominators which were recognized experts and stakeholders in their disciplines, and identified a core set of quality measures for children's healthcare under the Medicaid and Children's Health Insurance Program. These core measures were made transparent for public opinion and later passed on for full legislation, hence illustrating the impact of qualitative research upon social welfare and policy improvement.

Overall Criteria for Quality in Qualitative Research

Given the diverse genera and forms of qualitative research, there is no consensus for assessing any piece of qualitative research work. Various approaches have been suggested, the two leading schools of thoughts being the school of Dixon-Woods et al. [ 8 ] which emphasizes on methodology, and that of Lincoln et al. [ 9 ] which stresses the rigor of interpretation of results. By identifying commonalities of qualitative research, Dixon-Woods produced a checklist of questions for assessing clarity and appropriateness of the research question; the description and appropriateness for sampling, data collection and data analysis; levels of support and evidence for claims; coherence between data, interpretation and conclusions, and finally level of contribution of the paper. These criteria foster the 10 questions for the Critical Appraisal Skills Program checklist for qualitative studies.[ 10 ] However, these methodology-weighted criteria may not do justice to qualitative studies that differ in epistemological and philosophical paradigms,[ 11 , 12 ] one classic example will be positivistic versus interpretivistic.[ 13 ] Equally, without a robust methodological layout, rigorous interpretation of results advocated by Lincoln et al. [ 9 ] will not be good either. Meyrick[ 14 ] argued from a different angle and proposed fulfillment of the dual core criteria of “transparency” and “systematicity” for good quality qualitative research. In brief, every step of the research logistics (from theory formation, design of study, sampling, data acquisition and analysis to results and conclusions) has to be validated if it is transparent or systematic enough. In this manner, both the research process and results can be assured of high rigor and robustness.[ 14 ] Finally, Kitto et al. [ 15 ] epitomized six criteria for assessing overall quality of qualitative research: (i) Clarification and justification, (ii) procedural rigor, (iii) sample representativeness, (iv) interpretative rigor, (v) reflexive and evaluative rigor and (vi) transferability/generalizability, which also double as evaluative landmarks for manuscript review to the Medical Journal of Australia. Same for quantitative research, quality for qualitative research can be assessed in terms of validity, reliability, and generalizability.

Validity

Validity in qualitative research means “appropriateness” of the tools, processes, and data. Whether the research question is valid for the desired outcome, the choice of methodology is appropriate for answering the research question, the design is valid for the methodology, the sampling and data analysis is appropriate, and finally the results and conclusions are valid for the sample and context. In assessing validity of qualitative research, the challenge can start from the ontology and epistemology of the issue being studied, e.g. the concept of “individual” is seen differently between humanistic and positive psychologists due to differing philosophical perspectives:[ 16 ] Where humanistic psychologists believe “individual” is a product of existential awareness and social interaction, positive psychologists think the “individual” exists side-by-side with formation of any human being. Set off in different pathways, qualitative research regarding the individual's wellbeing will be concluded with varying validity. Choice of methodology must enable detection of findings/phenomena in the appropriate context for it to be valid, with due regard to cultural and contextual variability. For sampling, procedures and methods must be appropriate for the research paradigm and be distinctive between systematic,[ 17 ] purposeful[ 18 ] or theoretical (adaptive) sampling[ 19 , 20 ] where the systematic sampling has no a priori theory, purposeful sampling often has a certain aim or framework and theoretical sampling is molded by the ongoing process of data collection and theory in evolution. For data extraction and analysis, several methods were adopted to enhance validity, including 1st-tier triangulation (of researchers) and 2nd-tier triangulation (of resources and theories),[ 17 , 21 ] well-documented audit trail of materials and processes,[ 22 , 23 , 24 ] multidimensional analysis as concept- or case-orientated[ 25 , 26 ] and respondent verification.[ 21 , 27 ]

Reliability

In quantitative research, reliability refers to exact replicability of the processes and the results. In qualitative research with diverse paradigms, such a definition of reliability is challenging and epistemologically counter-intuitive. Hence, the essence of reliability for qualitative research lies with consistency.[ 24 , 28 ] A margin of variability for results is tolerated in qualitative research provided the methodology and epistemological logistics consistently yield data that are ontologically similar but may differ in richness and ambience within similar dimensions. Silverman[ 29 ] proposed five approaches in enhancing the reliability of process and results: Refutational analysis, constant data comparison, comprehensive data use, inclusive of the deviant case and use of tables. As data were extracted from the original sources, researchers must verify their accuracy in terms of form and context with constant comparison,[ 27 ] either alone or with peers (a form of triangulation).[ 30 ] The scope and analysis of data included should be as comprehensive and inclusive with reference to quantitative aspects if possible.[ 30 ] Adopting the Popperian dictum of falsifiability as the essence of truth and science, attempts to refute the qualitative data and analyses should be performed to assess reliability.[ 31 ]

Generalizability

Most qualitative research studies, if not all, are meant to study a specific issue or phenomenon in a certain population or ethnic group, of a focused locality in a particular context, hence generalizability of qualitative research findings is usually not an expected attribute. However, with the rising trend of knowledge synthesis from qualitative research via meta-synthesis, meta-narrative or meta-ethnography, evaluation of generalizability becomes pertinent. A pragmatic approach to assessing generalizability for qualitative studies is to adopt the same criteria for validity: That is, use of systematic sampling, triangulation and constant comparison, proper audit and documentation, and multi-dimensional theory.[ 17 ] However, some researchers espouse the approach of analytical generalization[ 32 ] where one judges the extent to which the findings in one study can be generalized to another setting under a similar theoretical framework, and the proximal similarity model, where generalizability of one study to another is judged by similarities between the time, place, people and other social contexts.[ 33 ] Thus said, Zimmer[ 34 ] questioned the suitability of meta-synthesis in view of the basic tenets of grounded theory,[ 35 ] phenomenology[ 36 ] and ethnography.[ 37 ] He concluded that any valid meta-synthesis must retain the other two goals of theory development and higher-level abstraction while in search of generalizability, and must be executed as a third level interpretation using Gadamer's concepts of the hermeneutic circle,[ 38 , 39 ] dialogic process[ 38 ] and fusion of horizons.[ 39 ] Finally, Toye et al. [ 40 ] reported the practicality of using “conceptual clarity” and “interpretative rigor” as intuitive criteria for assessing quality in meta-ethnography, which somehow echoed Rolfe's controversial aesthetic theory of research reports.[ 41 ]

Food for Thought

Despite various measures to enhance or ensure quality of qualitative studies, some researchers opined from a purist ontological and epistemological angle that qualitative research is not a unified, but ipso facto diverse field,[ 8 ] hence any attempt to synthesize or appraise different studies under one system is impossible and conceptually wrong. Barbour argued from a philosophical angle that these special measures or “technical fixes” (like purposive sampling, multiple-coding, triangulation, and respondent validation) can never confer the rigor as conceived.[ 11 ] In extremis, Rolfe et al. opined from the field of nursing research, that any set of formal criteria used to judge the quality of qualitative research are futile and without validity, and suggested that any qualitative report should be judged by the form it is written (aesthetic) and not by the contents (epistemic).[ 41 ] Rolfe's novel view is rebutted by Porter,[ 42 ] who argued via logical premises that two of Rolfe's fundamental statements were flawed: (i) “The content of research report is determined by their forms” may not be a fact, and (ii) that research appraisal being “subject to individual judgment based on insight and experience” will mean those without sufficient experience of performing research will be unable to judge adequately – hence an elitist's principle. From a realism standpoint, Porter then proposes multiple and open approaches for validity in qualitative research that incorporate parallel perspectives[ 43 , 44 ] and diversification of meanings.[ 44 ] Any work of qualitative research, when read by the readers, is always a two-way interactive process, such that validity and quality has to be judged by the receiving end too and not by the researcher end alone.

In summary, the three gold criteria of validity, reliability and generalizability apply in principle to assessing quality in both quantitative and qualitative research; what differs is the nature and type of processes that ontologically and epistemologically distinguish the two.

Source of Support: Nil.

Conflict of Interest: None declared.

Validity and reliability in quantitative studies

Roberta Heale1, Alison Twycross2

1School of Nursing, Laurentian University, Sudbury, Ontario, Canada
2Faculty of Health and Social Care, London South Bank University, London, UK

Correspondence to: Dr Roberta Heale, School of Nursing, Laurentian University, Ramsey Lake Road, Sudbury, Ontario, Canada P3E2C6; rheale@laurentian.ca

https://doi.org/10.1136/eb-2015-102129


Evidence-based practice includes, in part, implementation of the findings of well-conducted, quality research studies, so being able to critique quantitative research is an important skill for nurses. Consideration must be given not only to the results of the study but also to the rigour of the research. Rigour refers to the extent to which the researchers worked to enhance the quality of the studies. In quantitative research, this is achieved through measurement of validity and reliability. 1


Types of validity

The first category is content validity . This category looks at whether the instrument adequately covers all the content that it should with respect to the variable. In other words, does the instrument cover the entire domain related to the variable or construct it was designed to measure? In an undergraduate nursing course with instruction about public health, an examination with content validity would cover all the content in the course, with greater emphasis on the topics that had received greater coverage or more depth. A subset of content validity is face validity , where experts are asked their opinion about whether an instrument measures the concept intended.

Construct validity refers to whether you can draw inferences about test scores related to the concept being studied. For example, if a person has a high score on a survey that measures anxiety, does this person truly have a high degree of anxiety? In another example, a test of knowledge of medications that requires dosage calculations may instead be testing maths knowledge.

There are three types of evidence that can be used to demonstrate a research instrument has construct validity:

Homogeneity—meaning that the instrument measures one construct.

Convergence—this occurs when the instrument measures concepts similar to those measured by other instruments. If no similar instruments are available, however, this evidence cannot be gathered.

Theory evidence—this is evident when behaviour is similar to theoretical propositions of the construct measured in the instrument. For example, when an instrument measures anxiety, one would expect to see that participants who score high on the instrument for anxiety also demonstrate symptoms of anxiety in their day-to-day lives. 2

The final measure of validity is criterion validity . A criterion is any other instrument that measures the same variable. Correlations can be conducted to determine the extent to which the different instruments measure the same variable (a short computational sketch follows the list below). Criterion validity is measured in three ways:

Convergent validity—shows that an instrument is highly correlated with instruments measuring similar variables.

Divergent validity—shows that an instrument is poorly correlated to instruments that measure different variables. In this case, for example, there should be a low correlation between an instrument that measures motivation and one that measures self-efficacy.

Predictive validity—means that the instrument should have high correlations with future criteria. 2 For example, a high self-efficacy score related to performing a task should predict the likelihood of a participant completing that task.
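As a rough illustration of the correlational checks just described, the sketch below (Python, with invented instrument names and scores) correlates a hypothetical new motivation scale with a convergent instrument, a divergent instrument, and a later behavioural criterion; it is a minimal example rather than a prescribed procedure.

```python
import pandas as pd

# Hypothetical scores for six participants on a new motivation scale, an
# established motivation scale (convergent), a self-efficacy scale (divergent),
# and a later behavioural criterion (proportion of an assigned task completed).
scores = pd.DataFrame({
    "new_motivation":         [12, 18, 25, 31, 36, 40],
    "established_motivation": [14, 17, 27, 30, 38, 41],
    "self_efficacy":          [22, 35, 18, 29, 26, 31],
    "task_completion":        [0.40, 0.55, 0.70, 0.75, 0.85, 0.95],
})

# Pearson correlations of the new instrument with each comparator.
print(scores.corr(method="pearson")["new_motivation"].round(2))
# Convergent validity: high r with established_motivation.
# Divergent validity:  low r with self_efficacy.
# Predictive validity: high r with the later task_completion criterion.
```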

Reliability

Reliability relates to the consistency of a measure. A participant completing an instrument meant to measure motivation should have approximately the same responses each time the test is completed. Although it is not possible to give an exact calculation of reliability, an estimate of reliability can be achieved through different measures. The three attributes of reliability are outlined in table 2 . How each attribute is tested for is described below.

Table 2. Attributes of reliability (homogeneity, stability and equivalence)

Homogeneity (internal consistency) is assessed using item-to-total correlation, split-half reliability, the Kuder-Richardson coefficient and Cronbach's α. In split-half reliability, the results of a test or instrument are divided in half and the two halves are correlated. Strong correlations indicate high reliability, while weak correlations indicate the instrument may not be reliable. The Kuder-Richardson test is a more elaborate version of the split-half test: the average of all possible split-half combinations is determined and a coefficient between 0 and 1 is generated. This test is more accurate than the split-half test, but can only be completed on questions with two answers (eg, yes or no, 0 or 1). 3
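The following sketch (an illustration, not taken from the article) computes a split-half estimate with the Spearman-Brown correction and the KR-20 coefficient on simulated dichotomous items; the data are random and the helper names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
ability = rng.normal(size=(50, 1))                             # simulated respondent ability
items = (rng.normal(size=(50, 10)) + ability > 0).astype(int)  # 50 people, 10 yes/no items

# Split-half: correlate total scores on odd-numbered vs even-numbered items,
# then apply the Spearman-Brown correction to estimate full-test reliability.
odd, even = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_half / (1 + r_half)

# KR-20 for dichotomous items: (k/(k-1)) * (1 - sum(p*q) / total score variance).
p = items.mean(axis=0)
k = items.shape[1]
kr20 = (k / (k - 1)) * (1 - (p * (1 - p)).sum() / items.sum(axis=1).var(ddof=1))

print(round(split_half, 2), round(kr20, 2))
```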

Cronbach's α is the most commonly used test of the internal consistency of an instrument. In this test, the average of all correlations in every combination of split halves is determined. Instruments with questions that have more than two response options can be used in this test. The Cronbach's α result is a number between 0 and 1, and a score of 0.7 or higher is generally considered acceptable. 1 , 3
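A minimal sketch of the calculation is shown below, assuming a (respondents × items) matrix of Likert responses; the data are invented and not tied to any published instrument.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 1-5 Likert responses from six respondents to four items.
likert = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(likert), 2))  # values of 0.7 or higher are usually read as acceptable
```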

Stability is tested using test–retest and parallel- or alternate-form reliability testing. Test–retest reliability is assessed when an instrument is given to the same participants more than once under similar circumstances; a statistical comparison is made between participants' scores for each administration, which provides an indication of the reliability of the instrument. Parallel-form reliability (or alternate-form reliability) is similar to test–retest reliability except that a different form of the original instrument is given to participants in subsequent tests: the domain or concepts being tested are the same in both versions of the instrument, but the wording of the items differs. 2 For an instrument to demonstrate stability there should be a high correlation between the scores each time a participant completes the test. Generally speaking, a correlation coefficient of less than 0.3 signifies a weak correlation, 0.3–0.5 is moderate and greater than 0.5 is strong. 4
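To make the test–retest idea concrete, the short sketch below correlates two administrations of the same hypothetical instrument; the scores are invented.

```python
from scipy.stats import pearsonr

# Hypothetical scores from the same eight participants at two administrations.
time_1 = [22, 30, 27, 35, 18, 25, 33, 29]
time_2 = [24, 29, 28, 36, 17, 27, 31, 30]

r, p_value = pearsonr(time_1, time_2)
print(f"test-retest r = {r:.2f} (p = {p_value:.3f})")
# By the rough guide above, r greater than 0.5 would be read as a strong correlation.
```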

Equivalence is assessed through inter-rater reliability. This test includes a process for qualitatively determining the level of agreement between two or more observers. A good example of the process used in assessing inter-rater reliability is the scores of judges for a skating competition. The level of consistency across all judges in the scores given to skating participants is the measure of inter-rater reliability. An example in research is when researchers are asked to give a score for the relevancy of each item on an instrument. Consistency in their scores relates to the level of inter-rater reliability of the instrument.
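As a sketch of how such agreement might be quantified for continuous ratings (the judges' scores in the example above), the snippet below uses the third-party pingouin package's intraclass correlation; the judges, skaters, and scores are invented, and other agreement statistics (eg, Cohen's kappa for categorical ratings) are equally common choices.

```python
import pandas as pd
import pingouin as pg  # third-party package providing intraclass_corr

# Hypothetical scores given by three judges to the same six skaters (long format).
ratings = pd.DataFrame({
    "skater": list(range(6)) * 3,
    "judge":  ["A"] * 6 + ["B"] * 6 + ["C"] * 6,
    "score":  [5.2, 5.8, 4.9, 5.5, 6.0, 5.1,
               5.4, 5.9, 5.0, 5.3, 6.1, 5.2,
               5.1, 5.7, 4.8, 5.6, 5.9, 5.0],
})

icc = pg.intraclass_corr(data=ratings, targets="skater", raters="judge", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # higher ICC values indicate stronger inter-rater agreement
```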

Determining how rigorously the issues of reliability and validity have been addressed in a study is an essential component of the critique of research, and it also influences the decision about whether to implement the study findings in nursing practice. In quantitative studies, rigour is determined through an evaluation of the validity and reliability of the tools or instruments used in the study. A good quality research study will provide evidence of how all these factors have been addressed, which will help you to assess the validity and reliability of the research and to decide whether or not to apply the findings in your area of clinical practice.

References

  • Lobiondo-Wood G
  • Shuttleworth M
  • Laerd Statistics. Determining the correlation coefficient. 2013. https://statistics.laerd.com/premium/pc/pearson-correlation-in-spss-8.php

Competing interests: None declared.

Validity and reliability of measurement instruments used in research

Affiliation.

  • 1 Department of Pharmaceutical Outcomes and Policy, College of Pharmacy, University of Florida, Gainesville, FL 32610, USA. [email protected]
  • PMID: 19020196
  • DOI: 10.2146/ajhp070364

Purpose: Issues related to the validity and reliability of measurement instruments used in research are reviewed.

Summary: Key indicators of the quality of a measuring instrument are the reliability and validity of the measures. The process of developing and validating an instrument is in large part focused on reducing error in the measurement process. Reliability estimates evaluate the stability of measures, internal consistency of measurement instruments, and interrater reliability of instrument scores. Validity is the extent to which the interpretations of the results of a test are warranted, which depends on the particular use the test is intended to serve. The responsiveness of the measure to change is of interest in many of the applications in health care where improvement in outcomes as a result of treatment is a primary goal of research. Several issues may affect the accuracy of data collected, such as those related to self-report and secondary data sources. Self-report of patients or subjects is required for many of the measurements conducted in health care, but self-reports of behavior are particularly subject to problems with social desirability biases. Data that were originally gathered for a different purpose are often used to answer a research question, which can affect the applicability to the study at hand.

Conclusion: In health care and social science research, many of the variables of interest and outcomes that are important are abstract concepts known as theoretical constructs. Using tests or instruments that are valid and reliable to measure such constructs is a crucial component of research quality.



Measuring Reliability and Validity of Evaluation Instruments

How do you know if your evaluation instrument is “good”? Or if the instrument you find on CSEdResearch.org is a decent one to use in your study?

Evaluation instruments (like surveys, questionnaires, and interview protocols) can go through their own evaluation to assess whether or not they have evidence of reliability or validity. In the Filters section on the Evaluation Instruments page, you can find a category called Assessed where you can include instruments in your search that have been previously shown to have evidence of reliability and validity. So, what do these measures mean? And, what is the difference between them?

Evaluation instruments are often designed to measure the impact of outreach activities, curriculum, and other interventions in computing education. But how do you know if these evaluation instruments actually measure what they say they are measuring? We gain confidence in these instruments by assessing evidence of their reliability and validity.

Instruments with evidence of reliability yield the same results each time they are administered. Let’s say that you created an evaluation instrument in computing education research, and you gave it to the same group of high school students four times at (nearly) the same time. If the instrument were reliable, you would expect the results of these tests to be the same, statistically speaking.

Instruments with evidence of validity are those that have been checked in one or more ways to determine whether or not the instrument measures what it is supposed to measure. So, if your instrument is designed to measure whether or not parental support of high school students taking computer science courses is positively correlated with their grades in these courses, then statistical tests and other steps can be taken to ensure that the instrument does exactly that.

Those are still very broad definitions. Let’s break it down some more. But before we do, there is one very important caveat.

Evidence of reliability and/or validity is assessed for a particular demographic in a particular setting. Using an instrument that has evidence of reliability and/or validity does not mean that the evidence applies to your usage of the instrument. It can, however, provide a greater measure of confidence than an instrument that has no evidence of validity or reliability. And if you are able to find an instrument that has evidence of validity with a population similar to your own (e.g., Hispanic students in an urban middle school), this can provide even greater confidence.

Now, let’s take a look at what each of these terms means and how it can be measured.

Validity and Reliability of Research Instruments

The questionnaire is one of the most widely used tools for collecting data, especially in social science research. The main objective of a questionnaire in research is to obtain relevant information in the most reliable and valid manner. The accuracy and consistency of a survey or questionnaire thus form a significant aspect of research methodology, known as validity and reliability. New researchers are often confused about selecting and conducting the proper type of validity test for their research instrument (questionnaire/survey).

1. Introduction

Validity explains how well the collected data covers the actual area of investigation [ 1 ] . Validity basically means “measure what is intended to be measured” [ 2 ] .

2. Face Validity

Face validity is a subjective judgment on the operationalization of a construct. Face validity is the degree to which a measure appears to be related to a specific construct, in the judgment of non-experts such as test takers and representatives of the legal system. That is, a test has face validity if its content simply looks relevant to the person taking the test. It evaluates the appearance of the questionnaire in terms of feasibility, readability, consistency of style and formatting, and the clarity of the language used.

In other words, face validity refers to researchers’ subjective assessments of the presentation and relevance of the measuring instrument as to whether the items in the instrument appear to be relevant, reasonable, unambiguous and clear [ 3 ] .

In order to examine face validity, a dichotomous scale can be used with the categorical options "Yes" and "No", indicating a favourable and an unfavourable item respectively; a favourable item is one that is objectively structured and can be positively classified under the thematic category. The collected data are then analysed using Cohen’s Kappa Index (CKI) to determine the face validity of the instrument. Gelfand et al. [ 4 ] recommended a minimally acceptable kappa of 0.60 for inter-rater agreement. Unfortunately, face validity is arguably the weakest form of validity, and many would suggest that it is not a form of validity in the strictest sense of the word.
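A minimal sketch of the kappa check described above, assuming two raters giving "Yes"/"No" judgements on the same items; the ratings are invented and scikit-learn's cohen_kappa_score is used for the calculation.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical "Yes"/"No" favourability judgements from two raters on 12 items.
rater_1 = ["Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes"]
rater_2 = ["Yes", "Yes", "No", "Yes", "No",  "No", "Yes", "Yes", "Yes", "No", "Yes", "No"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
# Following the threshold cited above, kappa of at least 0.60 would be read as acceptable agreement.
```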

3. Content Validity

Content validity is defined as “the degree to which items in an instrument reflect the content universe to which the instrument will be generalized” (Straub, Boudreau et al. [ 5 ] ). In the field of IS, it is highly recommended to apply content validity when a new instrument is developed. In general, content validity involves evaluating a new survey instrument to ensure that it includes all the items that are essential to a particular construct domain and eliminates undesirable items [ 6 ] . The judgemental approach to establishing content validity involves literature review followed by evaluation by expert judges or panels. This judgemental procedure requires researchers to be present with the experts in order to facilitate validation, yet it is not always possible to gather many experts on a particular research topic in one location, which limits the ability to conduct validity assessment when experts are located in different geographical areas (Choudrie and Dwivedi [ 7 ] ). In contrast, a quantitative approach allows researchers to send content validity questionnaires to experts working at different locations, so that distance is no longer a limitation. Content validity is applied through the following steps:

1. An exhaustive literature review is conducted to extract the related items.

2. A content validity survey is generated, in which each item is assessed on a three-point scale (not necessary; useful but not essential; essential).

3. The survey is sent to experts in the same field as the research.

4. The content validity ratio (CVR) is then calculated for each item using Lawshe’s [ 8 ] (1975) method.

5. Items that do not reach the critical value for the panel size are eliminated. A minimal computation of the CVR is sketched below.
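The following sketch computes Lawshe's CVR for a hypothetical expert panel; the panel size, ratings, and the approximate 0.62 critical value for a 10-expert panel (taken from commonly cited reproductions of Lawshe's table) are illustrative only.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2), where n_e is the number of experts
    rating the item as 'essential' and N is the total number of experts."""
    return (n_essential - n_experts / 2) / (n_experts / 2)

# Hypothetical panel of 10 experts rating three candidate items.
essential_counts = {"item_1": 9, "item_2": 6, "item_3": 3}
for item, n_e in essential_counts.items():
    print(item, round(content_validity_ratio(n_e, 10), 2))
# Items whose CVR falls below the critical value for the panel size
# (about 0.62 for 10 experts) would be eliminated.
```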

4. Construct Validity

If a relationship is causal, what are the particular cause-and-effect behaviours or constructs involved in the relationship? Construct validity refers to how well a concept, idea, or behaviour (that is, a construct) has been translated or transformed into a functioning and operating reality: the operationalization. Construct validity has two components: convergent and discriminant validity.

4.1 Discriminant Validity

Discriminant validity is the extent to which latent variable A discriminates from other latent variables (e.g., B, C, D). Discriminant validity means that a latent variable is able to account for more variance in the observed variables associated with it than a) measurement error or similar external, unmeasured influences; or b) other constructs within the conceptual framework. If this is not the case, then the validity of the individual indicators and of the construct is questionable (Fornell and Larcker [ 9 ] ). In brief, Discriminant validity (or divergent validity) tests that constructs that should have no relationship do, in fact, not have any relationship.

4.2 Convergent Validity

Convergent validity, a parameter often used in sociology, psychology, and other behavioural sciences, refers to the degree to which two measures of constructs that theoretically should be related, are in fact related.  In brief, Convergent validity tests that constructs that are expected to be related are, in fact, related.

To verify construct validity (discriminant and convergent validity), a factor analysis can be conducted using principal component analysis (PCA) with the varimax rotation method (Koh and Nam [ 9 ] , Wee and Quazi [ 10 ] ). Items loading above 0.40, the minimum recommended value in research, are considered for further analysis, and items cross-loading above 0.40 should be deleted. The factor analysis results then satisfy the criteria of construct validity, including both discriminant validity (loadings of at least 0.40, no cross-loading of items above 0.40) and convergent validity (eigenvalues of at least 1, loadings of at least 0.40, items that load on the posited constructs) (Straub et al. [ 11 ] ). There are also other methods for testing convergent and discriminant validity.
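The sketch below illustrates this kind of check with the third-party factor_analyzer package, running a principal-component extraction with varimax rotation on simulated item data; the items, loadings, and sample size are invented, and a real analysis would also inspect eigenvalues and sampling adequacy.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package

rng = np.random.default_rng(1)
# Simulate 200 respondents answering 6 items driven by two latent constructs.
latent = rng.normal(size=(200, 2))
true_loadings = np.array([[0.8, 0.0], [0.7, 0.1], [0.75, 0.05],
                          [0.05, 0.8], [0.1, 0.7], [0.0, 0.75]])
responses = latent @ true_loadings.T + rng.normal(scale=0.5, size=(200, 6))
items = pd.DataFrame(responses, columns=[f"q{i}" for i in range(1, 7)])

# Principal-component extraction with varimax rotation.
fa = FactorAnalyzer(n_factors=2, method="principal", rotation="varimax")
fa.fit(items)
loadings = pd.DataFrame(fa.loadings_, index=items.columns, columns=["F1", "F2"])
print(loadings.round(2))
# Items loading at or above 0.40 on their intended factor, with no cross-loading
# at or above 0.40, would be read as evidence of convergent and discriminant validity.
```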

5. Criterion Validity

Criterion or concrete validity is the extent to which a measure is related to an outcome.  It measures how well one measure predicts an outcome for another measure. A test has this type of validity if it is useful for predicting performance or behavior in another situation (past, present, or future).

Criterion validity is an alternative perspective that de-emphasizes the conceptual meaning or interpretation of test scores. Test users might simply wish to use a test to differentiate between groups of people or to make predictions about future outcomes. For example, a human resources director might need to use a test to help predict which applicants are most likely to perform well as employees. From a very practical standpoint, she focuses on the test’s ability to differentiate good employees from poor employees. If the test does this well, then the test is “valid” enough for her purposes. From the traditional three-faceted view of validity, criterion validity refers to the degree to which test scores can predict specific criterion variables. The key to validity is the empirical association between test scores and scores on the relevant criterion variable, such as “job performance.”

Messick [ 12 ] suggests that “even for purposes of applied decision making, reliance on criterion validity or content coverage is not enough. The meaning of the measure, and hence its construct validity, must always be pursued – not only to support test interpretation but also to justify test use”. There are three types of criterion validity, namely concurrent, predictive and postdictive validity.

6. Reliability

Reliability concerns the extent to which a measurement of a phenomenon provides a stable and consistent result (Carmines and Zeller [ 13 ] ). Reliability is also concerned with repeatability: for example, a scale or test is said to be reliable if repeated measurements made with it under constant conditions give the same result (Moser and Kalton [ 14 ] ).

Testing for reliability is important as it refers to consistency across the parts of a measuring instrument (Huck [ 15 ] ). A scale is said to have high internal consistency reliability if the items of the scale “hang together” and measure the same construct (Huck [ 16 ] , Robinson [ 17 ] ). The most commonly used internal consistency measure is the Cronbach alpha coefficient, which is viewed as the most appropriate measure of reliability when using Likert scales (Whitley [ 18 ] , Robinson [ 19 ] ). No absolute rules exist for internal consistency; however, most agree on a minimum internal consistency coefficient of 0.70 (Whitley [ 20 ] , Robinson [ 21 ] ).

For an exploratory or pilot study, it is suggested that reliability should be equal to or above 0.60 (Straub et al. [ 22 ] ). Hinton et al. [ 23 ] have suggested four cut-off points for reliability: excellent reliability (0.90 and above), high reliability (0.70–0.90), moderate reliability (0.50–0.70) and low reliability (0.50 and below) (Hinton et al. [ 24 ] ). Although reliability is important for a study, it is not sufficient unless combined with validity; in other words, a test needs to be not only reliable but also valid [ 25 ] .

  • ACKOFF, R. L. 1953. The Design of Social Research, Chicago, University of Chicago Press.
  • BARTLETT, J. E., KOTRLIK, J. W. & HIGGINS, C. C. 2001. Organizational research: determining appropriate sample size in survey research. Learning and Performance Journal, 19, 43-50.
  • BOUDREAU, M., GEFEN, D. & STRAUB, D. 2001. Validation in IS research: A state-of-the-art assessment. MIS Quarterly, 25, 1-24.
  • BREWETON, P. & MILLWARD, L. 2001. Organizational Research Methods, London, SAGE.
  • BROWN, G. H. 1947. A comparison of sampling methods. Journal of Marketing, 6, 331-337.
  • BRYMAN, A. & BELL, E. 2003. Business research methods, Oxford, Oxford University Press.
  • CARMINES, E. G. & ZELLER, R. A. 1979. Reliability and Validity Assessment, Newbury Park, CA, SAGE.
  • CHOUDRIE, J. & DWIVEDI, Y. K. Investigating Broadband Diffusion in the Household: Towards Content Validity and Pre-Test of the Survey Instrument. Proceedings of the 13th European Conference on Information Systems (ECIS 2005), May 26-28, 2005 2005 Regensburg, Germany.
  • DAVIS, D. 2005. Business Research for Decision Making, Australia, Thomson South-Western.
  • GELFAND, D. M., HARTMANN, D. P., CROMER, C. C., SMITH, C. L. & PAGE, B. C. 1975. The effects of instructional prompts and praise on children's donation rates. Child Development, 46, 980-983.
  • ENGELLANT, K., HOLLAND, D. & PIPER, R. 2016. Assessing Convergent and Discriminant Validity of the Motivation Construct for the Technology Integration Education (TIE) Model. Journal of Higher Education Theory and Practice 16, 37-50.
  • FIELD, A. P. 2005. Discovering Statistics Using SPSS, Sage Publications Inc.
  • FORNELL, C. & LARCKER, D. F. 1981. Evaluating structural equation models with unobservable variables and measurement error. Journal of Marketing Research, 18, 39-50.
  • FOWLER, F. J. 2002. Survey research methods, Newbury Park, CA, SAGE.
  • GHAURI, P. & GRONHAUG, K. 2005. Research Methods in Business Studies, Harlow, FT/Prentice Hall.
  • GILL, J., JOHNSON, P. & CLARK, M. 2010. Research Methods for Managers, SAGE Publications.
  • HINTON, P. R., BROWNLOW, C., MCMURRAY, I. & COZENS, B. 2004. SPSS explained, East Sussex, England, Routledge Inc.
  • HUCK, S. W. 2007. Reading Statistics and Research, United States of America, Allyn & Bacon.
  • KOH, C. E. & NAM, K. T. 2005. Business use of the internet: a longitudinal study from a value chain perspective. Industrial Management & Data Systems, 105 85-95.
  • LAWSHE, C. H. 1975. A quantitative approach to content validity. Personnel Psychology, 28, 563-575.
  • LEWIS, B. R., SNYDER, C. A. & RAINER, K. R. 1995. An empirical assessment of the Information Resources Management construct. Journal of Management Information Systems, 12, 199-223.
  • MALHOTRA, N. K. & BIRKS, D. F. 2006. Marketing Research: An Applied Approach, Harlow, FT/Prentice Hall.
  • MAXWELL, J. A. 1996. Qualitative Research Design: An Intractive Approach London, Applied Social Research Methods Series.
  • MESSICK, S. 1989. Validity. In: LINN, R. L. (ed.) Educational measurement. New York: Macmillan.
  • MOSER, C. A. & KALTON, G. 1989. Survey methods in social investigation, Aldershot, Gower.


Figure 1. Screenshots of the smartphone cognitive tasks developed by Datacubed Health and included in the ALLFTD Mobile App. Details about the task design and instructions are included in the eMethods in Supplement 1. A, Flanker (Ducks in a Pond) is a task of cognitive control requiring participants to select the direction of the center duck. B, Go/no-go (Go Sushi Go!) requires participants to quickly tap on pieces of sushi (go) but not to tap when they see a fish skeleton (no-go). C, Card sort (Card Shuffle) is a task of cognitive flexibility requiring participants to learn rules that change during the task. D, The adaptive, associative memory task (Humi’s Bistro) requires participants to learn the food orders of several restaurant tables. E, Stroop (Color Clash) is a cognitive inhibition paradigm requiring participants to inhibit their tendency to read words and instead respond based on the color of the word. F, The 2-back task (Animal Parade) requires participants to determine whether animals on a parade float match the animals they saw 2 stimuli previously. G, Participants are asked to complete 3 testing sessions over 2 weeks; they have 3 days to complete each testing session, with a washout day between sessions on which no tests are available. Session 2 always begins on day 5 and session 3 on day 9. Screenshots are provided with permission from Datacubed Health.

Figure 2. Forest plots present internal consistency and test-retest reliability results in the discovery and validation cohorts, as well as an estimate in a combined sample of discovery and validation participants. ICC indicates intraclass correlation coefficient.

Figure 3. A and B, Correlation matrices display associations of in-clinic criterion standard measures and ALLFTD mobile App (mApp) test scores in the discovery and validation cohorts. Below the horizontal dashed lines, the associations among app tests and between app tests and demographic characteristics, convergent clinical measures, divergent cognitive tests, and neuroimaging regions of interest can be viewed. Most app tests show strong correlations with each other and with age, convergent clinical measures, and brain volume. The measures show weaker correlations with divergent measures of visuospatial (Benson Figure Copy) and language (Multilingual Naming Test [MINT]) abilities. The strength of convergent correlations between app measures and outcomes is similar to the correlations between criterion standard neuropsychological scores and these outcomes, which can be viewed by looking across the rows above the horizontal black line. C and D, In the discovery and validation cohorts, receiver operating characteristic curves were calculated to determine how well a composite of app tests, the Uniform Data Set, version 3.0, Executive Functioning Composite (UDS3-EF), and the Montreal Cognitive Assessment (MoCA) discriminate individuals without symptoms (Clinical Dementia Rating Scale plus National Alzheimer’s Coordinating Center FTLD module sum of boxes [CDR plus NACC-FTLD-SB] score = 0) from individuals with the mildest symptoms of FTLD (CDR plus NACC-FTLD-SB score = 0.5). AUC indicates area under the curve; CVLT, California Verbal Learning Test.

Supplemental material:

eMethods. Instruments and Statistical Analysis

eResults. Participants

eTable 1. Participant Characteristics and Test Scores in Original and Validation Cohorts

eTable 2. Comparison of Diagnostic Accuracy for ALLFTD Mobile App Composite Score Across Cohorts

eTable 3. Number of Distractions Reported During the Remote Smartphone Testing Sessions

eTable 4. Qualitative Description of the Distractions Reported During Remote Testing Sessions

eFigure 1. Scatterplots of Test-Retest Reliability in a Mixed Sample of Adults Without Functional Impairment and Participants With FTLD

eFigure 2. Comparison of Test-Retest Reliability Estimates by Endorsement of Distractions

eFigure 3. Comparison of Test-Retest Reliability Estimates by Operating System

eFigure 4. Correlation Matrix in the Combined Cohort

eFigure 5. Neural Correlates of Smartphone Cognitive Test Performance

eReferences

Nonauthor Collaborators

Data Sharing Statement


Staffaroni AM , Clark AL , Taylor JC, et al. Reliability and Validity of Smartphone Cognitive Testing for Frontotemporal Lobar Degeneration. JAMA Netw Open. 2024;7(4):e244266. doi:10.1001/jamanetworkopen.2024.4266


Reliability and Validity of Smartphone Cognitive Testing for Frontotemporal Lobar Degeneration

  • 1 Department of Neurology, Memory and Aging Center, Weill Institute for Neurosciences, University of California, San Francisco
  • 2 Department of Neurology, Columbia University, New York, New York
  • 3 Department of Neurology, Mayo Clinic, Rochester, Minnesota
  • 4 Department of Quantitative Health Sciences, Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota
  • 5 Department of Neurology, Case Western Reserve University, Cleveland, Ohio
  • 6 Department of Neurosciences, University of California, San Diego, La Jolla
  • 7 Department of Radiology, University of North Carolina, Chapel Hill
  • 8 Department of Neurology, Indiana University, Indianapolis
  • 9 Department of Neurology, Vanderbilt University, Nashville, Tennessee
  • 10 Department of Neurology, University of Washington, Seattle
  • 11 Department of Psychiatry and Psychology, Mayo Clinic, Rochester, Minnesota
  • 12 Department of Neurology, Institute for Precision Health, University of California, Los Angeles
  • 13 Department of Neurology, Knight Alzheimer Disease Research Center, Washington University, Saint Louis, Missouri
  • 14 Department of Psychiatry, Knight Alzheimer Disease Research Center, Washington University, Saint Louis, Missouri
  • 15 Department of Neuroscience, Mayo Clinic, Jacksonville, Florida
  • 16 Department of Neurology, University of Pennsylvania Perelman School of Medicine, Philadelphia
  • 17 Division of Neurology, University of British Columbia, Musqueam, Squamish & Tsleil-Waututh Traditional Territory, Vancouver, Canada
  • 18 Department of Neurosciences, University of California, San Diego, La Jolla
  • 19 Department of Neurology, Nantz National Alzheimer Center, Houston Methodist and Weill Cornell Medicine, Houston Methodist, Houston, Texas
  • 20 Department of Neurology, UCLA (University of California, Los Angeles)
  • 21 Department of Neurology, University of Colorado, Aurora
  • 22 Department of Neurology, David Geffen School of Medicine, UCLA
  • 23 Department of Neurology, University of Alabama, Birmingham
  • 24 Tanz Centre for Research in Neurodegenerative Diseases, Division of Neurology, University of Toronto, Toronto, Ontario, Canada
  • 25 Department of Neurology, Massachusetts General Hospital and Harvard Medical School, Boston
  • 26 Department of Epidemiology and Biostatistics, University of California, San Francisco
  • 27 Department of Psychological & Brain Sciences, Washington University, Saint Louis, Missouri

Question   Can remote cognitive testing via smartphones yield reliable and valid data for frontotemporal lobar degeneration (FTLD)?

Findings   In this cohort study of 360 patients, remotely deployed smartphone cognitive tests showed moderate to excellent reliability and were associated with criterion standard measures (in-person disease severity assessments and neuropsychological tests) and brain volumes. Smartphone tests accurately detected dementia and were more sensitive to the earliest stages of familial FTLD than standard neuropsychological tests.

Meaning   These findings suggest that remotely deployed smartphone-based assessments may be reliable and valid tools for evaluating FTLD and may enhance early detection, supporting the inclusion of digital assessments in clinical trials for neurodegeneration.

Importance   Frontotemporal lobar degeneration (FTLD) is relatively rare, behavioral and motor symptoms increase travel burden, and standard neuropsychological tests are not sensitive to early-stage disease. Remote smartphone-based cognitive assessments could mitigate these barriers to trial recruitment and success, but no such tools are validated for FTLD.

Objective   To evaluate the reliability and validity of smartphone-based cognitive measures for remote FTLD evaluations.

Design, Setting, and Participants   In this cohort study conducted from January 10, 2019, to July 31, 2023, controls and participants with FTLD performed smartphone application (app)–based executive functioning tasks and an associative memory task 3 times over 2 weeks. Observational research participants were enrolled through 18 centers of a North American FTLD research consortium (ALLFTD) and were asked to complete the tests remotely using their own smartphones. Of 1163 eligible individuals (enrolled in parent studies), 360 were enrolled in the present study; 364 refused and 439 were excluded. Participants were divided into discovery (n = 258) and validation (n = 102) cohorts. Among 329 participants with data available on disease stage, 195 were asymptomatic or had preclinical FTLD (59.3%), 66 had prodromal FTLD (20.1%), and 68 had symptomatic FTLD (20.7%) with a range of clinical syndromes.

Exposure   Participants completed standard in-clinic measures and remotely administered ALLFTD mobile app (ALLFTD-mApp) smartphone tests.

Main Outcomes and Measures   Internal consistency, test-retest reliability, association of smartphone tests with criterion standard clinical measures, and diagnostic accuracy.

Results   In the 360 participants (mean [SD] age, 54.0 [15.4] years; 209 [58.1%] women), smartphone tests showed moderate-to-excellent reliability (intraclass correlation coefficients, 0.77-0.95). Validity was supported by association of smartphone tests with disease severity ( r range, 0.38-0.59), criterion-standard neuropsychological tests ( r range, 0.40-0.66), and brain volume (standardized β range, 0.34-0.50). Smartphone tests accurately differentiated individuals with dementia from controls (area under the curve [AUC], 0.93 [95% CI, 0.90-0.96]) and were more sensitive to early symptoms (AUC, 0.82 [95% CI, 0.76-0.88]) than the Montreal Cognitive Assessment (AUC, 0.68 [95% CI, 0.59-0.78]) ( z of comparison, −2.49 [95% CI, −0.19 to −0.02]; P  = .01). Reliability and validity findings were highly similar in the discovery and validation cohorts. Preclinical participants who carried pathogenic variants performed significantly worse than noncarrier family controls on 3 app tasks (eg, 2-back β = −0.49 [95% CI, −0.72 to −0.25]; P  < .001) but not on a composite of traditional neuropsychological measures (β = −0.14 [95% CI, −0.42 to 0.14]; P  = .32).

Conclusions and Relevance   The findings of this cohort study suggest that smartphones could offer a feasible, reliable, valid, and scalable solution for remote evaluations of FTLD and may improve early detection. Smartphone assessments should be considered as a complementary approach to traditional in-person trial designs. Future research should validate these results in diverse populations and evaluate the utility of these tests for longitudinal monitoring.

Frontotemporal lobar degeneration (FTLD) is a neurodegenerative pathology causing early-onset dementia syndromes with impaired behavior, cognition, language, and/or motor functioning. 1 Although over 30 FTLD trials are planned or in progress, there are several barriers to conducting FTLD trials. Clinical trials for neurodegenerative disease are expensive, 2 and frequent in-person trial visits are burdensome for patients, caregivers, and clinicians, 3 a concern magnified in FTLD by behavioral and motor impairments. Given the rarity and geographical dispersion of eligible participants, FTLD trials require global recruitment, 4 particularly for those that are far from expert FTLD clinical trial centers. Furthermore, criterion standard neuropsychological tests are not adequately sensitive until symptoms are already noticeable to families, limiting their usefulness as outcomes in early-stage FTLD treatment trials. 4

Reliable, valid, and scalable remote data collection methods may help surmount these barriers to FTLD clinical trials. Smartphones are garnering interest across neurological conditions as a method for administering remote cognitive and motor evaluations. Preliminary evidence supports the feasibility, reliability, and/or validity of unsupervised smartphone cognitive and motor testing in older adults at risk for Alzheimer disease, 5 - 8 Parkinson disease, 9 and Huntington disease. 10 The clinical heterogeneity of FTLD necessitates a uniquely comprehensive smartphone battery. In the ALLFTD Consortium (Advancing Research and Treatment in Frontotemporal Lobar Degeneration [ARTFLD] and Longitudinal Evaluation of Familial Frontotemporal Dementia Subjects [LEFFTDS]), the ALLFTD mobile Application (ALLFTD-mApp) was designed to remotely monitor cognitive, behavioral, language, and motor functioning in FTLD research. Taylor et al 11 recently reported that unsupervised ALLFTD-mApp data collection through a multicenter North American FTLD research network was feasible and acceptable to participants. Herein, we extend that work by investigating the reliability and validity of unsupervised remote smartphone tests of executive functioning and memory in a cohort with FTLD that has undergone extensive phenotyping.

Participants were enrolled from ongoing FTLD studies requiring in-person assessment, including participants from 18 centers of the ALLFTD study 12 and University of California, San Francisco (UCSF) FTLD studies. To study the app in older individuals, a small group of older adults without functional impairment was recruited from the UCSF Brain Aging Network for Cognitive Health. All study procedures were approved by the UCSF or Johns Hopkins Central Institutional Review Board. All participants or legally authorized representatives provided written informed consent. The study followed the Strengthening the Reporting of Observational Studies in Epidemiology ( STROBE ) reporting guideline.

Inclusion criteria were age 18 years or older, having access to a smartphone, and reporting English as the primary language. Race and ethnicity were self-reported by participants using options consistent with the National Alzheimer’s Coordinating Center (NACC) Uniform Data Set (UDS) and were collected to contextualize the generalizability of these results. Participants were asked to complete tests on their own smartphones. Informants were encouraged for all participants and required for those with symptomatic FTLD (Clinical Dementia Rating Scale plus NACC FTLD module [CDR plus NACC-FTLD] global score ≥1). Recruitment targeted individuals with CDR plus NACC-FTLD global scores less than 2, but sites had discretion to enroll more severely impaired participants. Exclusion criteria were consistent with the parent ALLFTD study. 12

Participants were enrolled in the ALLFTD-mApp study within 90 days of annual ALLFTD study visits (including neuropsychological and neuroimaging data collection). Site research coordinators (including J.C.T., A.B.W., S.D., and M.M.) assisted participants with app download, setup, and orientation and observed participants completing the first questionnaire. All cognitive tasks were self-administered without supervision (except pilot participants, discussed below) in a predefined order with minor adjustments throughout the study. Study partners of participants with symptomatic FTLD were asked to remain nearby during participation to help navigate the ALLFTD-mApp but were asked not to assist with testing.

The baseline participation window was divided into three 25- to 35-minute assessment sessions occurring over 11 days. All cognitive tests were repeated in every session to enhance task reliability 6 , 13 and enable assessment of test-retest reliability, except for card sort, which was administered once every 6 months due to expected practice effects. Adherence was defined as the percentage of all available tasks that were completed. Participants were asked to complete the triplicate of sessions every 6 months for the duration of the app study. Only the baseline triplicate was analyzed in this study.

Replicability was tested by dividing the sample into a discovery cohort (n = 258) comprising all participants enrolled until the initial data freeze (October 1, 2022) and a validation cohort (n = 102) comprising participants enrolled after October 1, 2022, and 18 pilot participants 11 who completed the first session in person with an examiner present during cognitive pretesting. Sensitivity analyses excluded this small pilot cohort.

ALLFTD investigators partnered with Datacubed Health 14 to develop the ALLFTD-mApp on Datacubed Health’s Linkt platform. The app includes cognitive, motor, and speech tasks. This study focuses on 6 cognitive tests developed by Datacubed Health 11 comprising an adaptive associative memory task (Humi’s Bistro) and gamified versions of classic executive functioning paradigms: flanker (Ducks in a Pond), Stroop (Color Clash), 2-back (Animal Parade), go/no-go (Go Sushi Go!), and card sort (Card Shuffle) ( Figure 1 and eMethods in Supplement 1 ). Most participants with symptomatic FTLD (49 [72.1%]) were not administered Stroop or 2-back, as pilot studies identified these as too difficult. 11 The app test results were summarized as a composite score (eMethods in Supplement 1 ). Participants completed surveys to assess technological familiarity (daily or less than daily use of a smartphone) and distractions (present or absent).

Criterion standard clinical data were collected during parent project visits. Syndromic diagnoses were made according to published criteria 15 - 19 based on multidisciplinary conferences that considered neurological history, neurological examination results, and collateral interview. 20

The CDR plus NACC-FTLD module is an 8-domain rating scale based on informant and participant report. 21 A global score was calculated to categorize disease severity as asymptomatic or preclinical if a pathogenic variant carrier (0), prodromal (0.5), or symptomatic (1.0-3.0). 22 A sum of the 8 domain box scores (CDR plus NACC-FTLD sum of boxes) was also calculated. 22

Participants completed the UDS Neuropsychological Battery, version 3.0 23 (eMethods in Supplement 1 ), which includes traditional neuropsychological measures and the Montreal Cognitive Assessment (MoCA), a global cognitive screen. Executive functioning and processing speed measures were summarized into a composite score (UDS3-EF). 24 Participants also completed a 9-item list-learning memory test (California Verbal Learning Test, 2nd edition, Short Form). 25 Most (339 [94.2%]) neuropsychological evaluations were conducted in person. In a subsample (n = 270), motor speed and dexterity were assessed using the Movement Disorder Society Uniform Parkinson Disease Rating Scale 26 Finger Tapping subscale (0 indicates no deficits [n = 240]).

We acquired T1-weighted brain magnetic resonance imaging for 199 participants. Details of image acquisition, harmonization, preprocessing, and processing are provided in eMethods in Supplement 1 and prior publications. 27 Briefly, SPM12 (Statistical Parametric Mapping) was used for segmentation 28 and Large Deformation Diffeomorphic Metric Mapping for generating group templates. 29 Gray matter volumes were calculated in template space by integrating voxels and dividing by total intracranial volume in 2 regions of interest (ROIs) 30 : a frontoparietal and subcortical ROI and a hippocampal ROI. Voxel-based morphometry was used to test unbiased voxel-wise associations of volume with smartphone tests (eMethods in Supplement 1 ). 31 , 32

Participants in the ALLFTD study underwent genetic testing 33 at the University of California, Los Angeles. DNA samples were screened using targeted sequencing of a custom panel of genes previously implicated in neurodegenerative diseases, including GRN ( 138945 ) and MAPT ( 157140 ). Hexanucleotide repeat expansions in C9orf72 ( 614260 ) were detected using both fluorescent and repeat-primed polymerase chain reaction analysis. 34

Statistical analyses were conducted using Stata, version 17.0 (StataCorp LLC), and R, version 4.4.2 (R Project for Statistical Computing). All tests were 2 sided, with a statistical significance threshold of P < .05.

Psychometric properties of the smartphone tests were explored using descriptive statistics. Comparisons between CDR plus NACC-FTLD groups (ie, asymptomatic or preclinical, prodromal, and symptomatic) for continuous variables, including demographic characteristics and cognitive task scores (first exposure to each measure), were analyzed by fitting linear regressions. We used χ 2 difference tests for frequency data (eg, sex and race and ethnicity).

Internal consistency, which measures reliability within a task, was estimated for participants’ first exposure to each test using Cronbach α (details in eMethods in Supplement 1 ). Test-retest reliability was estimated using intraclass correlation coefficients for participants who completed a task at least twice; all exposures were included. Reliability estimates are described as poor (<0.500), moderate (0.500-0.749), good (0.750-0.890), and excellent (≥0.900) 35 ; these are reporting rules of thumb, and clinical interpretation should consider raw estimates. We calculated 95% CIs via bootstrapping with 1000 samples.
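The bootstrapped confidence intervals described here can be pictured with a generic percentile-bootstrap sketch like the one below; it operates on invented two-session data and applies the idea to a simple test-retest correlation, and it is not the authors' actual analysis code or their ICC model.

```python
import numpy as np

rng = np.random.default_rng(42)

def bootstrap_ci(x, y, statistic, n_boot=1000, alpha=0.05):
    """Percentile bootstrap CI for a paired statistic, resampling participants."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    estimates = [statistic(x[idx], y[idx])
                 for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))]
    return np.percentile(estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Invented scores from 30 participants at two remote testing sessions.
session_1 = rng.normal(50, 10, size=30)
session_2 = session_1 + rng.normal(0, 5, size=30)  # correlated retest scores

retest_r = lambda a, b: np.corrcoef(a, b)[0, 1]
point_estimate = retest_r(session_1, session_2)
low, high = bootstrap_ci(session_1, session_2, retest_r)
print(f"r = {point_estimate:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```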

Validity analyses used participants’ first exposure to each test. Linear regressions were fitted in participants without symptoms with age, sex, and educational level as independent variables to understand the unique contribution of each demographic factor to cognitive test scores. Correlations and linear regression between the app-based tasks and disease severity (CDR plus NACC-FTLD sum of boxes score), neuropsychological test scores, and gray matter ROIs were used to investigate construct validity in the full sample. Demographic characteristics were not entered as covariates because the primary goal was to assess associations between app-based measures and criterion standards, rather than understand the incremental predictive value of app measures. To address potential motor confounds, associations with disease severity were evaluated in a subsample without finger dexterity deficits on motor examination (using the Movement Disorder Society Uniform Parkinson Disease Rating Scale Finger Tapping subscale). To complement ROI-based neuroimaging analysis based on a priori hypotheses, we conducted voxel-based morphometry (eMethods in Supplement 1 ) to uncover other potential neural correlates of test performance. 31 , 32 Finally, we evaluated the association of the number of distractions and operating system with reliability and validity, controlling for age and disease severity, which are predictive factors associated with test performance in correlation analyses.

To evaluate the app’s ability to select participants with prodromal or symptomatic FTLD for trial enrollment, we tested discrimination of participants without symptoms from those with prodromal and symptomatic FTLD. To understand the app’s utility for screening early cognitive impairment, we fit receiver operating characteristics curves testing the predictive value of the app composite, UDS3-EF, and MoCA for differentiating participants without symptoms and those with preclinical FTLD from those with prodromal FTLD; areas under the curves (AUC) for the app and MoCA were compared using the DeLong test in participants with results for both predictive factors.
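For readers unfamiliar with the AUC metric used here, the sketch below computes a receiver operating characteristic AUC on simulated screening scores with scikit-learn; the data are invented and this is not the study's analysis code. Paired AUC comparisons such as the DeLong test are not provided by scikit-learn and would need another implementation (for example, R's pROC package).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(3)
# Invented composite scores: 60 unimpaired participants and 40 with prodromal
# impairment, where lower scores indicate worse cognitive performance.
labels = np.r_[np.zeros(60), np.ones(40)]                        # 1 = prodromal
scores = np.r_[rng.normal(0.0, 1.0, 60), rng.normal(-1.0, 1.0, 40)]

# Negate the scores so that larger predictor values indicate impairment.
auc = roc_auc_score(labels, -scores)
fpr, tpr, _ = roc_curve(labels, -scores)
print(f"AUC = {auc:.2f}")  # 0.5 is chance discrimination; 1.0 is perfect discrimination
```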

We compared app performance in preclinical participants who carried pathogenic variants with that in noncarrier controls using linear regression adjusted for age (a predictive factor in earlier models). For this analysis, we excluded those younger than 45 years to remove participants likely to be years from symptom onset based on natural history studies. 4 We analyzed memory performance in participants who carried MAPT pathogenic variants, as early executive deficits may be less prominent. 34 , 36

Of 1163 eligible participants, 360 were enrolled, 439 were excluded, and 364 refused to participate (additional details are provided in the eResults in Supplement 1 ). Participant characteristics are reported in Table 1 for the full sample. The discovery and validation cohorts did not significantly differ in terms of demographic characteristics, disease severity, or cognition (eTable 1 in Supplement 1 ). In the full sample, there were 209 women (58.1%) and 151 men (41.9%), and the mean (SD) age was 54.0 (15.4) years (range, 18-89 years). The mean (SD) educational level was 16.5 (2.3) years (range, 12-20 years). Among the 358 participants with racial and ethnic data available, 340 (95.0%) identified as White. For the 18 participants self-identifying as being of other race or ethnicity, the specific group was not provided to protect participant anonymity. Among the 329 participants with available CDR plus NACC-FTLD scores ( Table 1 ), 195 (59.3%) were asymptomatic or preclinical (global score, 0), 66 (20.1%) were prodromal (global score, 0.5), and 68 (20.7%) were symptomatic (global score, 1.0 or 2.0). Of those with available genetic testing results (n = 222), 100 (45.0%) carried a familial FTLD pathogenic variant, including 63 of 120 participants without symptoms and with available results. On average, participants completed 78% of available smartphone measures over a mean (SD) of 2.6 (0.6) sessions.

Descriptive statistics for each task are presented in Table 2 . Ceiling effects were not observed for any tests. A small percentage of participants were at the floor for flanker (19 [5.3%]), go/no-go (13 [4.0%]), and card sort (9 [3.3%]) scores. Floor effects were only observed in participants with prodromal or symptomatic FTLD.

Except for go/no-go, internal consistency estimates ranged from good to excellent (Cronbach α range, 0.84 [95% CI, 0.81-0.87] to 0.99 [95% CI, 0.99-0.99]), and test-retest reliabilities were moderate to excellent (intraclass correlation coefficient [ICC] range, 0.77 [95% CI, 0.69-0.83] to 0.95 [95% CI, 0.93-0.96]), with slightly higher estimates in participants with prodromal or symptomatic FTLD (Table 2, Figure 2, and eFigure 1 in Supplement 1). Go/no-go reliability was particularly poor in participants without symptoms (ICC, 0.10 [95% CI, −0.37 to 0.48]), so this task was removed from subsequent validation analyses except the correlation matrix (Figure 3A and B). The 95% CIs for reliability estimates overlapped in the discovery and validation cohorts (Figure 2). Reliability estimates showed overlapping 95% CIs regardless of distractions (eFigure 2 in Supplement 1) or operating system (eFigure 3 in Supplement 1), with slightly lower estimates when distractions were endorsed for all comparisons except Stroop (Cronbach α).
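
For readers unfamiliar with these statistics, a minimal sketch of both reliability estimates in Python using the pingouin package; the file names, data frames, and column names are hypothetical:

    # Internal consistency (Cronbach alpha) and test-retest reliability (ICC).
    import pandas as pd
    import pingouin as pg

    # Wide format: one row per participant, one column per item or trial
    items = pd.read_csv("flanker_items.csv")
    alpha, ci = pg.cronbach_alpha(data=items)

    # Long format: one score per participant per session
    sessions = pd.read_csv("flanker_sessions.csv")
    icc = pg.intraclass_corr(data=sessions, targets="participant",
                             raters="session", ratings="score")
    print(alpha, ci)
    print(icc[["Type", "ICC", "CI95%"]])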

In 57 participants without symptoms who did not carry pathogenic variants, older age was associated with worse performance on all measures (β range, −0.40 [95% CI, −0.68 to −0.13] to −0.78 [95% CI, −0.89 to −0.52]; P ≤ .03), except card sort (β = −0.22 [95% CI, −0.54 to 0.09]; P = .16) and go/no-go (β = −0.15 [95% CI, −0.44 to 0.14]; P = .31), although these associations were in the expected direction. Associations with sex and educational level were not statistically significant.

Cognitive tests administered using the app showed evidence of convergent and divergent validity (eFigure 4 in Supplement 1), with very similar findings in the discovery (Figure 3A) and validation (Figure 3B) cohorts. App-based measures of executive functioning were generally correlated with criterion standard in-person measures of the same domain and less strongly with measures of other cognitive domains (r range, 0.40-0.66). For example, the flanker task was most strongly associated with the UDS3-EF composite (β = 0.58 [95% CI, 0.48-0.68]; P < .001) and less strongly associated with measures of visuoconstruction (β for Benson Figure Copy, 0.43 [95% CI, 0.32-0.54]; P = .01) and naming (β for Multilingual Naming Test, 0.25 [95% CI, 0.14-0.37]; P < .001). The app memory test was associated with criterion standard memory and executive functioning tests.

Worse performance on all app measures was associated with greater disease severity on CDR plus NACC-FTLD ( r range, 0.38-0.59) ( Table 1 , Figure 3 , and eFigure 4 in Supplement 1 ). The same pattern of results was observed after excluding those with finger dexterity issues. Except for go/no-go, performance of participants with prodromal FTLD was statistically significantly worse than that of participants without symptoms on all measures ( P  < .001).

The AUC for the app composite to distinguish participants without symptoms from those with dementia was 0.93 (95% CI, 0.90-0.96). The app also accurately differentiated participants without symptoms from those with prodromal or symptomatic FTLD (AUC, 0.87 [95% CI, 0.84-0.92]). Compared with the MoCA (AUC, 0.68 [95% CI, 0.59-0.78]), the app composite (AUC, 0.82 [95% CI, 0.76-0.88]) more accurately differentiated participants without symptoms from those with prodromal FTLD (z of comparison, −2.49 [95% CI, −0.19 to −0.02]; P = .01), with accuracy similar to that of the UDS3-EF (AUC, 0.81 [95% CI, 0.73-0.88]); highly similar results (eTable 2 in Supplement 1) were observed in the discovery (Figure 3C) and validation (Figure 3D) cohorts.

In 56 participants without symptoms who were older than 45 years, those carrying GRN, C9orf72, or other rare pathogenic variants performed significantly worse than noncarrier controls on 3 of 4 executive tests: flanker (β = −0.26 [95% CI, −0.46 to −0.05]; P = .02), card sort (β = −0.28 [95% CI, −0.54 to −0.30]; P = .03), and 2-back (β = −0.49 [95% CI, −0.72 to −0.25]; P < .001). The estimated scores of participants who carried pathogenic variants were on average lower than those of noncarriers on a composite of criterion standard in-person tests, but the difference was not statistically significant (UDS3-EF β = −0.14 [95% CI, −0.42 to 0.14]; P = .32). Participants who carried preclinical MAPT pathogenic variants scored higher than noncarriers on the app memory test, though the difference was not statistically significant (β = 0.21 [95% CI, −0.50 to 0.58]; P = .19).

In prespecified ROI analyses, worse app executive functioning scores were associated with lower frontoparietal and/or subcortical volumes (Figure 3A and B) (β range, 0.34 [95% CI, 0.22-0.46] to 0.50 [95% CI, 0.40-0.60]; P < .001 for all), and worse memory scores were associated with smaller hippocampal volume (β = 0.45 [95% CI, 0.34-0.56]; P < .001). Voxel-based morphometry (eFigure 5 in Supplement 1) suggested that worse app performance was associated with widespread atrophy, particularly in frontotemporal cortices.

Only for card sort were distractions (eTables 3 and 4 in Supplement 1 ) associated with task performance; those experiencing distractions unexpectedly performed better (β = 0.16 [95% CI, 0.05-0.28]; P  = .005). The iPhone operating system was associated with better performance on 2 speeded tasks: flanker (β = 0.16 [95% CI, 0.07-0.24]; P  < .001) and go/no-go (β = 0.16 [95% CI, 0.06-0.26]; P  = .002). In a sensitivity analysis, associations of all app tests with disease severity, UDS3-EF, and regional brain volumes remained after covarying for distractions and operating system, as did the models differentiating participants who carried preclinical pathogenic variants and noncarrier controls.

There is an urgent need to identify reliable and valid digital tools for remote neurobehavioral measurement in neurodegenerative diseases, including FTLD. Prior studies provided preliminary evidence that smartphones can collect reliable and valid cognitive data across a variety of age-related and neurodegenerative illnesses. This is the first study, to our knowledge, to provide analogous support for the reliability and validity of remote smartphone-based cognitive testing in FTLD, along with preliminary evidence that this approach improves early detection relative to traditional in-person measures.

Reliability, a prerequisite for a valid clinical trial end point, indicates measurements are consistent. In 2 cohorts, we found smartphone cognitive tests were reliable within a single administration (ie, internally consistent) and across repeated assessments (ie, test-retest reliability) with no apparent differences by operating system. For all measures except go/no-go, reliability estimates were moderate to excellent and on par with other remote digital assessments 5 , 6 , 10 , 37 , 38 and in-clinic criterion standards. 39 - 41 Go/no-go showed similar within- and between-person variability in participants without symptoms (ie, poor reliability), and participant feedback suggested instructions were confusing and the stimuli disappeared too quickly. Those endorsing distractions tended to have lower reliability, though 95% CIs largely overlapped; future research detailing the effect of the home environment on test performance is warranted.

Construct validity was supported by strong associations of smartphone tests with demographics, disease severity, neuroimaging, and criterion standard neuropsychological measures that replicated in a validation sample. These associations were similar to those observed among the criterion standard measures and similar to associations reported in other validation studies of smartphone cognitive tests. 5 , 6 , 10 Associations with disease severity were not explained by motor impairments. The iPhone operating system was associated with better performance on 2 time-based measures, consistent with prior findings. 6

A composite of brief smartphone tests was accurate in distinguishing dementia from cognitively unimpaired participants, screening out participants without symptoms, and detecting prodromal FTLD with greater sensitivity than the MoCA. Moreover, carriers of preclinical C9orf72 and GRN pathogenic variants performed significantly worse than noncarrier controls on 3 tests, whereas they did not significantly differ on criterion standard measures. These findings are consistent with previous studies showing digital executive functioning paradigms may be more sensitive to early FTLD than traditional measures. 42 , 43

This study has some limitations. Validation analyses focused on participants’ initial task exposure. Future studies will explore whether repeated measurements and more sophisticated approaches to composite building (the current composite assumes equal weighting of tests) improve reliability and sensitivity, and a normative sample is being collected to better adjust for demographic effects on testing. 24 Longitudinal analyses will explore whether floor effects in participants with symptomatic FTLD affect the app’s utility for monitoring. The generalizability of the findings is limited by the study cohort, which comprised participants who were, on average, college educated, mostly White, and primarily English speakers who owned smartphones and had participated in the referring in-person research study. Equity in access to research is a priority in FTLD research 44,45; translations of the ALLFTD-mApp are in progress, cultural adaptations are being considered, and devices have been purchased for provisioning to improve the diversity of our sample.

The findings of this cohort study, coupled with prior reports indicating that smartphone testing is feasible and acceptable to patients with FTLD, 11 suggest that smartphones may complement traditional in-person research paradigms. More broadly, the scalability, ease of use, reliability, and validity of the ALLFTD-mApp suggest the feasibility and utility of remote digital assessments in dementia clinical trials. Future research should validate these results in diverse populations and evaluate the utility of these tests for longitudinal monitoring.

Accepted for Publication: February 2, 2024.

Published: April 1, 2024. doi:10.1001/jamanetworkopen.2024.4266

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2024 Staffaroni AM et al. JAMA Network Open.

Corresponding Author: Adam M. Staffaroni, PhD, Weill Institute for Neurosciences, Department of Neurology, Memory and Aging Center, University of California, San Francisco, 675 Nelson Rising Ln, Ste 190, San Francisco, CA 94158 ( [email protected] ).

Author Contributions: Dr Staffaroni had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Staffaroni, A. Clark, Taylor, Heuer, Wise, Forsberg, Miller, Hassenstab, Rosen, Boxer.

Acquisition, analysis, or interpretation of data: Staffaroni, A. Clark, Taylor, Heuer, Sanderson-Cimino, Wise, Dhanam, Cobigo, Wolf, Manoochehri, Mester, Rankin, Appleby, Bayram, Bozoki, D. Clark, Darby, Domoto-Reilly, Fields, Galasko, Geschwind, Ghoshal, Graff-Radford, Hsiung, Huey, Jones, Lapid, Litvan, Masdeu, Massimo, Mendez, Miyagawa, Pascual, Pressman, Ramanan, Ramos, Rascovsky, Roberson, Tartaglia, Wong, Kornak, Kremers, Kramer, Boeve, Boxer.

Drafting of the manuscript: Staffaroni, A. Clark, Taylor, Heuer, Wolf, Lapid.

Critical review of the manuscript for important intellectual content: Staffaroni, Taylor, Heuer, Sanderson-Cimino, Wise, Dhanam, Cobigo, Manoochehri, Forsberg, Mester, Rankin, Appleby, Bayram, Bozoki, D. Clark, Darby, Domoto-Reilly, Fields, Galasko, Geschwind, Ghoshal, Graff-Radford, Hsiung, Huey, Jones, Lapid, Litvan, Masdeu, Massimo, Mendez, Miyagawa, Pascual, Pressman, Ramanan, Ramos, Rascovsky, Roberson, Tartaglia, Wong, Miller, Kornak, Kremers, Hassenstab, Kramer, Boeve, Rosen, Boxer.

Statistical analysis: Staffaroni, A. Clark, Taylor, Heuer, Sanderson-Cimino, Cobigo, Kornak, Kremers.

Obtained funding: Staffaroni, Rosen, Boxer.

Administrative, technical, or material support: A. Clark, Taylor, Heuer, Wise, Dhanam, Wolf, Manoochehri, Forsberg, Darby, Domoto-Reilly, Ghoshal, Hsiung, Huey, Jones, Litvan, Massimo, Mendez, Miyagawa, Pascual, Pressman, Ramanan, Kramer, Boeve, Boxer.

Supervision: Geschwind, Miyagawa, Roberson, Kramer, Boxer.

Conflict of Interest Disclosures: Dr Staffaroni reported being a coinventor of 4 ALLFTD mobile application tasks (not analyzed in the present study) and receiving licensing fees from Datacubed Health; receiving research support from the National Institute on Aging (NIA) of the National Institutes of Health (NIH), Bluefield Project to Cure FTD, the Alzheimer’s Association, the Larry L. Hillblom Foundation, and the Rainwater Charitable Foundation; and consulting for Alector Inc, Eli Lilly and Company/Prevail Therapeutics, Passage Bio Inc, and Takeda Pharmaceutical Company. Dr Forsberg reported receiving research support from the NIH. Dr Rankin reported receiving research support from the NIH and the National Science Foundation and serving on the medical advisory board for Eli Lilly and Company. Dr Appleby reported receiving research support from the Centers for Disease Control and Prevention (CDC), the NIH, Ionis Pharmaceuticals Inc, Alector Inc, and the CJD Foundation and consulting for Acadia Pharmaceuticals Inc, Ionis Pharmaceuticals Inc, and Sangamo Therapeutics Inc. Dr Bayram reported receiving research support from the NIH. Dr Domoto-Reilly reported receiving research support from the NIH and serving as an investigator for a clinical trial sponsored by Lawson Health Research Institute. Dr Bozoki reported receiving research funding from the NIH, Alector Inc, Cognition Therapeutics Inc, EIP Pharma, and Transposon Therapeutics Inc; consulting for Eisai and Creative Bio-Peptides Inc; and serving on the data safety monitoring board for AviadoBio. Dr Fields reported receiving research support from the NIH. Dr Galasko reported receiving research funding from the NIH; clinical trial funding from Alector Inc and Eisai; consulting for Eisai, General Electric Health Care, and Fujirebio; and serving on the data safety monitoring board of Cyclo Therapeutics Inc. Dr Geschwind reported consulting for Biogen Inc and receiving research support from Roche and Takeda Pharmaceutical Company for work in dementia. Dr Ghoshal reported participating in clinical trials of antidementia drugs sponsored by Bristol Myers Squibb, Eli Lilly and Company/Avid Radiopharmaceuticals, Janssen Immunotherapy, Novartis AG, Pfizer Inc, Wyeth Pharmaceuticals, SNIFF (The Study of Nasal Insulin to Fight Forgetfulness) study, and A4 (The Anti-Amyloid Treatment in Asymptomatic Alzheimer’s Disease) trial; receiving research support from Tau Consortium and the Association for Frontotemporal Dementia; and receiving funding from the NIH. Dr Graff-Radford reported receiving royalties from UpToDate; participating in multicenter therapy studies sponsored by Biogen Inc, TauRx Therapeutics Ltd, AbbVie Inc, Novartis AG, and Eli Lilly and Company; and receiving research support from the NIH. Dr Grossman reported receiving grant support from the NIH, Avid Radiopharmaceuticals, and Piramal Pharma Ltd; participating in clinical trials sponsored by Biogen Inc, TauRx Therapeutics Ltd, and Alector Inc; consulting for Bracco and UCB; and serving on the editorial board of Neurology. Dr Hsiung reported receiving grant support from the Canadian Institutes of Health Research, the NIH, and the Alzheimer Society of British Columbia; participating in clinical trials sponsored by Anavex Life Sciences Corp, Biogen Inc, Cassava Sciences, Eli Lilly and Company, and Roche; and consulting for Biogen Inc, Novo Nordisk A/S, and Roche. Dr Huey reported receiving research support from the NIH. Dr Jones reported receiving research support from the NIH.
Dr Litvan reported receiving research support from the NIH, the Michael J Fox Foundation, the Parkinson Foundation, the Lewy Body Association, CurePSP, Roche, AbbVie Inc, H Lundbeck A/S, Novartis AG, Transposon Therapeutics Inc, and UCB; serving as a member of the scientific advisory board for the Rossy PSP Program at the University of Toronto and for Amydis; and serving as chief editor of Frontiers in Neurology . Dr Masdeu reported consulting for and receiving research funding from Eli Lilly and Company; receiving personal fees from GE Healthcare; receiving grant funding and personal fees from Eli Lilly and Company; and receiving grant funding from Acadia Pharmaceutical Inc, Avanir Pharmaceuticals Inc, Biogen Inc, Eisai, Janssen Global Services LLC, the NIH, and Novartis AG outside the submitted work. Dr Mendez reported receiving research support from the NIH. Dr Miyagawa reported receiving research support from the Zander Family Foundation. Dr Pascual reported receiving research support from the NIH. Dr Pressman reported receiving research support from the NIH. Dr Ramos reported receiving research support from the NIH. Dr Roberson reported receiving research support from the NIA of the NIH, the Bluefield Project, and the Alzheimer’s Drug Discovery Foundation; serving on a data monitoring committee for Eli Lilly and Company; receiving licensing fees from Genentech Inc; and consulting for Applied Genetic Technologies Corp. Dr Tartaglia reported serving as an investigator for clinical trials sponsored by Biogen Inc, Avanex Corp, Green Valley, Roche/Genentech Inc, Bristol Myers Squibb, Eli Lilly and Company/Avid Radiopharmaceuticals, and Janssen Global Services LLC and receiving research support from the Canadian Institutes of Health Research (CIHR). Dr Wong reported receiving research support from the NIH. Dr Kornak reported providing expert witness testimony for Teva Pharmaceuticals Industries Ltd, Apotex Inc, and Puma Biotechnology and receiving research support from the NIH. Dr Kremers reported receiving research funding from NIH. Dr Kramer reported receiving research support from the NIH and royalties from Pearson Inc. Dr Boeve reported serving as an investigator for clinical trials sponsored by Alector Inc, Biogen Inc, and Transposon Therapeutics Inc; receiving royalties from Cambridge Medicine; serving on the Scientific Advisory Board of the Tau Consortium; and receiving research support from NIH, the Mayo Clinic Dorothy and Harry T. Mangurian Jr. Lewy Body Dementia Program, and the Little Family Foundation. Dr Rosen reported receiving research support from Biogen Inc, consulting for Wave Neuroscience and Ionis Pharmaceuticals, and receiving research support from the NIH. 
Dr Boxer reported being a coinventor of 4 of the ALLFTD mobile application tasks (not the focus of the present study) and previously receiving licensing fees; receiving research support from the NIH, the Tau Research Consortium, the Association for Frontotemporal Degeneration, Bluefield Project to Cure Frontotemporal Dementia, Corticobasal Degeneration Solutions, the Alzheimer’s Drug Discovery Foundation, and the Alzheimer’s Association; consulting for Aeovian Pharmaceuticals Inc, Applied Genetic Technologies Corp, Alector Inc, Arkuda Therapeutics, Arvinas Inc, AviadoBio, Boehringer Ingelheim, Denali Therapeutics Inc, GSK, Life Edit Therapeutics Inc, Humana Inc, Oligomerix, Oscotec Inc, Roche, Transposon Therapeutics Inc, TrueBinding Inc, and Wave Life Sciences; and receiving research support from Biogen Inc, Eisai, and Regeneron Pharmaceuticals Inc. No other disclosures were reported.

Funding/Support: This work was supported by grants AG063911, AG077557, AG62677, AG045390, NS092089, AG032306, AG016976, AG058233, AG038791, AG02350, AG019724, AG062422, NS050915, AG032289-11, AG077557, K23AG061253, and K24AG045333 from the NIH; the Association for Frontotemporal Degeneration; the Bluefield Project to Cure FTD; the Rainwater Charitable Foundation; and grant 2014-A-004-NET from the Larry L. Hillblom Foundation. Samples from the National Centralized Repository for Alzheimer’s Disease and Related Dementias, which receives government support under cooperative agreement grant U24 AG21886 from the NIA, were used in this study.

Role of the Funder/Sponsor: The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Group Information: A complete list of the members of the ALLFTD Consortium appears in Supplement 2 .

Data Sharing Statement: See Supplement 3 .

Additional Contributions: We thank the participants and study partners for dedicating their time and effort, and for providing invaluable feedback as we learn how to incorporate digital technologies into FTLD research.

Additional Information: Dr Grossman passed away on April 4, 2023. We want to acknowledge his many contributions to this study, including data acquisition, and design and conduct of the study. He was an ALLFTD site principal investigator and contributed during the development of the ALLFTD mobile app.
