Validity and Reliability of the Research Instrument; How to Test the Validation of a Questionnaire/Survey in a Research

Hamed Taherdoost

Hamta Group

Date Written: August 10, 2016

The questionnaire is one of the most widely used tools for collecting data, especially in social science research. The main objective of a questionnaire is to obtain relevant information in the most reliable and valid manner possible. The accuracy and consistency of a survey/questionnaire therefore form a significant aspect of research methodology, known respectively as validity and reliability. New researchers are often confused about selecting and conducting the proper type of validity test for their research instrument (questionnaire/survey). This review article explores and describes the validity and reliability of a questionnaire/survey and discusses the various forms of validity and reliability tests.

Keywords: Research Instrument, Questionnaire, Survey, Survey Validity, Questionnaire Reliability, Content Validity, Face Validity, Construct Validity, Criterion Validity



  • DOI: 10.2139/SSRN.3205040


Validity and reliability of measurement instruments used in research

Affiliation.

  • 1 Department of Pharmaceutical Outcomes and Policy, College of Pharmacy, University of Florida, Gainesville, FL 32610, USA. [email protected]
  • PMID: 19020196
  • DOI: 10.2146/ajhp070364

Purpose: Issues related to the validity and reliability of measurement instruments used in research are reviewed.

Summary: Key indicators of the quality of a measuring instrument are the reliability and validity of the measures. The process of developing and validating an instrument is in large part focused on reducing error in the measurement process. Reliability estimates evaluate the stability of measures, internal consistency of measurement instruments, and interrater reliability of instrument scores. Validity is the extent to which the interpretations of the results of a test are warranted, which depends on the particular use the test is intended to serve. The responsiveness of the measure to change is of interest in many of the applications in health care where improvement in outcomes as a result of treatment is a primary goal of research. Several issues may affect the accuracy of data collected, such as those related to self-report and secondary data sources. Self-report of patients or subjects is required for many of the measurements conducted in health care, but self-reports of behavior are particularly subject to problems with social desirability biases. Data that were originally gathered for a different purpose are often used to answer a research question, which can affect the applicability to the study at hand.

Conclusion: In health care and social science research, many of the variables of interest and outcomes that are important are abstract concepts known as theoretical constructs. Using tests or instruments that are valid and reliable to measure such constructs is a crucial component of research quality.

J Grad Med Educ. 2011 Jun;3(2).

A Primer on the Validity of Assessment Instruments

1. What is reliability? 1

Reliability refers to whether an assessment instrument gives the same results each time it is used in the same setting with the same type of subjects. Reliability essentially means consistent or dependable results. Reliability is a part of the assessment of validity.

2. What is validity? 1

Validity in research refers to how accurately a study answers the study question or the strength of the study conclusions. For outcome measures such as surveys or tests, validity refers to the accuracy of measurement. Here validity refers to how well the assessment tool actually measures the underlying outcome of interest. Validity is not a property of the tool itself, but rather of the interpretation or specific purpose of the assessment tool with particular settings and learners.

Assessment instruments must be both reliable and valid for study results to be credible. Thus, reliability and validity must be examined and reported, or references cited, for each assessment instrument used to measure study outcomes. Examples of assessments include resident feedback survey, course evaluation, written test, clinical simulation observer ratings, needs assessment survey, and teacher evaluation. Using an instrument with high reliability is not sufficient; other measures of validity are needed to establish the credibility of your study.

3. How is reliability measured? 2 – 4

Reliability can be estimated in several ways; the method will depend upon the type of assessment instrument. Sometimes reliability is referred to as internal validity or internal structure of the assessment tool.

For internal consistency, 2 to 3 questions or items that measure the same concept are created, and the correlation among the answers is calculated.

Cronbach alpha is a test of internal consistency and is frequently used to calculate the correlation values among the answers on your assessment tool. 5 Cronbach alpha calculates correlation among all the variables, in every combination; a high reliability estimate should be as close to 1 as possible.
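
To make the calculation concrete, here is a minimal sketch in Python/NumPy of the standard variance-based form of Cronbach alpha; the function name and the small item-score matrix are hypothetical, and a real analysis would normally use a dedicated statistics package.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach alpha for an (n_respondents x n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses from 5 people to 3 items written to measure the same concept
scores = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [4, 4, 5],
]
print(round(cronbach_alpha(scores), 2))  # values closer to 1 indicate higher internal consistency
```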

For test/retest reliability, the test should give the same results each time it is administered, assuming there are no interval changes in what you are measuring; the agreement between administrations is usually measured as a correlation, often with Pearson r.

Test/retest is a more conservative estimate of reliability than Cronbach alpha, but it takes at least 2 administrations of the tool, whereas Cronbach alpha can be calculated after a single administration. To perform a test/retest, you must be able to minimize or eliminate any change (ie, learning) in the condition you are measuring, between the 2 measurement times. Administer the assessment instrument at 2 separate times for each subject and calculate the correlation between the 2 different measurements.

Interrater reliability is used to study the effect of different raters or observers using the same tool and is generally estimated by percent agreement, kappa (for binary outcomes), or Kendall tau.
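
As an illustration, the sketch below computes raw percent agreement and Cohen's kappa for two raters scoring a binary outcome; the ratings are hypothetical, and scikit-learn's cohen_kappa_score is used here simply as one convenient implementation.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary ratings (1 = competent, 0 = not yet competent) from two
# observers scoring the same 10 residents with the same checklist
rater_a = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
rater_b = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

percent_agreement = np.mean(rater_a == rater_b)   # raw agreement
kappa = cohen_kappa_score(rater_a, rater_b)       # chance-corrected agreement
print(f"agreement = {percent_agreement:.0%}, kappa = {kappa:.2f}")
```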

Another method uses analysis of variance (ANOVA) to generate a generalizability coefficient, to quantify how much measurement error can be attributed to each potential factor, such as different test items, subjects, raters, dates of administration, and so forth. This model looks at the overall reliability of the results. 6

5. How is the validity of an assessment instrument determined? 4–8

Validity of assessment instruments requires several sources of evidence to build the case that the instrument measures what it is supposed to measure. 9,10 Determining validity can be viewed as constructing an evidence-based argument regarding how well a tool measures what it is supposed to do. Evidence can be assembled to support, or not support, a specific use of the assessment tool. Evidence can be found in content, response process, relationships to other variables, and consequences.

Content includes a description of the steps used to develop the instrument. Provide information such as who created the instrument (national experts would confer greater validity than local experts, who in turn would confer more than nonexperts) and other steps that support the claim that the instrument has the appropriate content.

Response process includes information about whether the actions or thoughts of the subjects actually match the test and also information regarding training for the raters/observers, instructions for the test-takers, instructions for scoring, and clarity of these materials.

Relationship to other variables includes correlation of the new assessment instrument results with other performance outcomes that would likely be the same. If there is a previously accepted “gold standard” of measurement, correlate the instrument results to the subject's performance on the “gold standard.” In many cases, no “gold standard” exists and comparison is made to other assessments that appear reasonable (eg, in-training examinations, objective structured clinical examinations, rotation “grades,” similar surveys).

Consequences means that if there are pass/fail or cut-off performance scores, those grouped in each category should tend to perform similarly in other settings. Also, if lower performers receive additional training and their scores improve, this adds to the validity of the instrument.

Different types of instruments need an emphasis on different sources of validity evidence. 7 For example, for observer ratings of resident performance, interrater agreement may be key, whereas for a survey measuring resident stress, relationship to other variables may be more important. For a multiple choice examination, content and consequences may be essential sources of validity evidence. For high-stakes assessments (eg, board examinations), substantial evidence to support the case for validity will be required. 9

There are also other types of validity evidence, which are not discussed here.

6. How can researchers enhance the validity of their assessment instruments?

First, do a literature search and use previously developed outcome measures. If the instrument must be modified for use with your subjects or setting, modify and describe how, in a transparent way. Include sufficient detail to allow readers to understand the potential limitations of this approach.

If no assessment instruments are available, use content experts to create your own and pilot the instrument prior to using it in your study. Test reliability and include as many sources of validity evidence as are possible in your paper. Discuss the limitations of this approach openly.

7. What are the expectations of JGME editors regarding assessment instruments used in graduate medical education research?

JGME editors expect that discussions of the validity of your assessment tools will be explicitly mentioned in your manuscript, in the methods section. If you are using a previously studied tool in the same setting, with the same subjects, and for the same purpose, citing the reference(s) is sufficient. Additional discussion about your adaptation is needed if you (1) have modified previously studied instruments; (2) are using the instrument for different settings, subjects, or purposes; or (3) are using different interpretation or cut-off points. Discuss whether the changes are likely to affect the reliability or validity of the instrument.

Researchers who create novel assessment instruments need to state the development process, reliability measures, pilot results, and any other information that may lend credibility to the use of homegrown instruments. Transparency enhances credibility.

In general, little information can be gleaned from single-site studies using untested assessment instruments; these studies are unlikely to be accepted for publication.

8. What are useful resources for reliability and validity of assessment instruments?

The references for this editorial are a good starting point.

Gail M. Sullivan, MD, MPH, is Editor-in-Chief, Journal of Graduate Medical Education .


Principles and Methods of Validity and Reliability Testing of Questionnaires Used in Social and Health Science Researches

Bolarinwa, Oladimeji Akeem

From the Department of Epidemiology and Community Health, University of Ilorin and University of Ilorin Teaching Hospital, Ilorin, Nigeria

Address for correspondence: Dr. Oladimeji Akeem Bolarinwa, E-mail: [email protected]

The importance of measuring the accuracy and consistency of research instruments (especially questionnaires), known respectively as validity and reliability, has been documented in several studies, but these tests are not commonly carried out among health and social science researchers in developing countries. This has been linked to a dearth of knowledge of these tests. This is a review article which comprehensively explores and describes the validity and reliability of a research instrument (with special reference to the questionnaire). It further discusses various forms of validity and reliability tests with concise examples and finally explains the various methods of analysing these tests, with the scientific principles guiding such analysis.

INTRODUCTION

The different measurements in social science research require the quantification of abstract, intangible constructs that may not be directly observable. [1] These quantifications come in different forms of inference, and the inferences made depend on the type of measurement, [1] which can be observational, self-report, interview or record review. The various measurements ultimately require measurement tools through which the values are captured. One of the most common tasks encountered in social science research is ascertaining the validity and reliability of a measurement tool. [2] Researchers always wish to know whether the measurement tool employed actually measures the intended research concept or construct (is it valid, i.e., a true measure?) and whether it provides stable or consistent responses (is it reliable, i.e., repeatable?). As simple as this may seem, these checks are often omitted or only mentioned in passing in research proposals or reports. [2] This has been attributed to a dearth of skills and knowledge of validity and reliability test analysis among social and health science researchers. From the author's personal observation among researchers in developing countries, most students and young researchers are unable to distinguish validity from reliability, and they lack the prerequisite knowledge to understand the principles that underlie validity and reliability testing of a research measurement tool.

This article therefore sets out to review the principles and methods of validity and reliability testing of measurement tools used in social and health science research. To achieve this goal, the author reviewed current articles (both print and online), scientific textbooks, lecture notes/presentations and health programme papers, with a view to critically reviewing current principles and methods of reliability and validity testing as they apply to questionnaires used in social and health research.

Validity expresses the degree to which a measurement measures what it purports to measure. Several varieties have been described, including face validity, construct validity, content validity and criterion validity (which may be concurrent or predictive validity). These validity tests are categorised into two broad components, namely internal and external validity. [3-5] Internal validity refers to how accurately the measures obtained from the research actually quantify what they were designed to measure, whereas external validity refers to how accurately the measures obtained from the study sample describe the reference population from which the sample was drawn. [5]

Reliability refers to the degree to which the results obtained by a measurement and procedure can be replicated. [3-5] Though reliability contributes importantly to the validity of a questionnaire, it is not a sufficient condition for validity. [6] Lack of reliability may arise from divergence between observers or instruments of measurement, such as a questionnaire, or from instability of the attribute being measured, [3,4] and this will invariably affect the validity of the questionnaire. There are three aspects of reliability, namely equivalence, stability and internal consistency (homogeneity). [5] It is important to understand the distinction between these three aspects, as it will guide the researcher in the proper assessment of the reliability of a research tool such as a questionnaire. [7] Figure 1 shows a graphical presentation of the possible combinations of validity and reliability. [8]

[Figure 1: Possible combinations of validity and reliability]

A questionnaire is a predetermined set of questions used to collect data. [2] Questionnaires come in different formats, such as those collecting clinical data, social status or occupational group. [3] It is a data collection 'tool' for collecting and recording information about a particular issue of interest. [2,5] It should always have a definite purpose that is related to the objectives of the research, and it needs to be clear from the outset how the findings will be used. [2,5] Structured questionnaires are usually associated with quantitative research, that is, research concerned with numbers (how many? how often? how satisfied?). The questionnaire is the most widely used data collection instrument in health and social science research. [9]

In the context of health and social science research, questionnaires can be used in a variety of survey situations, such as postal, electronic, face-to-face (F2F) and telephone surveys. [9] Postal and electronic questionnaires are known as self-completion questionnaires, i.e., respondents complete them by themselves in their own time. F2F and telephone questionnaires are used by interviewers to ask a standard set of questions and record the responses that people give to them; [9] questionnaires used by interviewers in this way are sometimes known as interview schedules. [9] A questionnaire may be adapted from one that has already been tested or developed as a new data tool specific to measuring or quantifying a particular attribute. These conditions warrant the need to test the validity and reliability of the questionnaire. [2,5,9]

METHODS USED FOR VALIDITY TEST OF A QUESTIONNAIRE

A drafted questionnaire should always be ready for establishing validity. Validity is the amount of systematic or built-in error in a questionnaire. [5,9] The validity of a questionnaire can be established using a panel of experts who explore the theoretical construct, as shown in Figure 2. This form of validity examines how well the idea of a theoretical construct is represented in an operational measure (the questionnaire); it is called translational or representational validity, and two subtypes of validity belong to it, namely face validity and content validity. [10] On the other hand, the validity of a questionnaire can be established with the use of another survey in the form of a field test, which examines how well a given measure relates to one or more external criteria, based on empirical constructs, as shown in Figure 2. These forms are criterion-related validity [10,11] and construct validity. [11] While some authors believe that criterion-related validity encompasses construct validity, [10] others treat the two as separate entities. [11] According to the authors who treat them as separate entities, predictive validity and concurrent validity are subtypes of criterion-related validity, while convergent validity, discriminant validity, known-group validity and factorial validity are subtypes of construct validity [Figure 2]. [10] In addition, some authors include hypothesis-testing validity as a form of construct validity. [12] The subtypes are described in detail in the following paragraphs.

[Figure 2: Forms and subtypes of questionnaire validity]

FACE VALIDITY

Some authors [7,13] are of the opinion that face validity is a component of content validity, while others believe it is not. [2,14,15] Face validity is established when an individual (and/or researcher) who is an expert on the research subject reviews the questionnaire (instrument) and concludes that it measures the characteristic or trait of interest. [7,13] Face validity involves the expert looking at the items in the questionnaire and agreeing that the test is a valid measure of the concept being measured, simply on the face of it. [15] This means that the expert evaluates whether each measuring item matches a given conceptual domain of the concept. Face validity is often said to be very casual and soft, and many researchers do not consider it an active measure of validity. [11] However, it is the most widely used form of validity in developing countries. [15]

CONTENT VALIDITY

Content validity pertains to the degree to which the instrument fully assesses or measures the construct of interest. [7,15-17] For example, a researcher interested in evaluating employees' attitudes towards a training programme on hazard prevention within an organisation wants to ensure that the questions in the questionnaire fully represent the domain of attitudes towards occupational hazard prevention. The development of a content-valid instrument is typically achieved by a rational analysis of the instrument by raters (experts) familiar with the construct of interest or experts on the research subject. [15-17] Specifically, raters review all of the questionnaire items for readability, clarity and comprehensiveness and come to some level of agreement as to which items should be included in the final questionnaire. [15] The rating can be dichotomous, where the rater indicates whether an item is 'favourable' (assigned a score of +1) or 'unfavourable' (assigned a score of 0). [15] Over the years, however, different rating schemes have been proposed and developed; these may use Likert scaling or absolute number ratings. [18-21] Both item-level and scale-level ratings have been proposed for content validity. The item-rated content validity indices are usually denoted I-CVI, [15] while the scale-level index, termed S-CVI, is calculated from the I-CVIs and represents the level of agreement between raters. [15] Sangoseni et al. [15] proposed an S-CVI of ≥0.78 as the threshold for inclusion of an item in the study. The Fog Index, Flesch Reading Ease, Flesch-Kincaid readability formula and Gunning-Fog Index are formulas that have also been used to determine readability during validation. [7,12] A major drawback of content validity is that, like face validity, it is adjudged to be highly subjective. However, in some cases researchers can combine more than one form of validity to increase the validity strength of the questionnaire; for instance, face validity has been combined with content validity [15,22,23] and with criterion validity. [13]
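
As a concrete illustration of these indices, the sketch below computes an I-CVI for each item as the proportion of raters scoring it +1, and an S-CVI taken here as the average of the I-CVIs (one common convention); the rating matrix, the number of raters and the application of the 0.78 cut-off per item are hypothetical choices made for the example.

```python
import numpy as np

# Hypothetical relevance ratings from 5 expert raters for 4 questionnaire items:
# +1 = 'favourable' (relevant), 0 = 'unfavourable' (not relevant)
ratings = np.array([
    # item1 item2 item3 item4
    [1, 1, 0, 1],   # rater 1
    [1, 1, 1, 1],   # rater 2
    [1, 0, 1, 1],   # rater 3
    [1, 1, 1, 1],   # rater 4
    [1, 1, 0, 1],   # rater 5
])

i_cvi = ratings.mean(axis=0)   # item-level CVI: proportion of raters endorsing each item
s_cvi = i_cvi.mean()           # scale-level CVI, taken here as the average of the I-CVIs
retained = i_cvi >= 0.78       # threshold suggested by Sangoseni et al.

print("I-CVI per item:", i_cvi)        # [1.0, 0.8, 0.6, 1.0]
print("S-CVI:", round(s_cvi, 2))       # 0.85
print("items retained:", retained)
```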

CRITERION-RELATED VALIDITY

Criterion-related validity is assessed when one is interested in determining the relationship of scores on a test to a specific criterion. [24,25] It is a measure of how well the questionnaire findings stack up against another instrument or predictor. [5,25] Its major disadvantage is that such a predictor may not be available or easy to establish. There are two variants of this validity type, as follows:

Concurrent validity

This assesses the newly developed questionnaire against a highly rated existing standard (gold standard). When the criterion exists at the same time as the measure, we speak of concurrent validity. [24-27] Concurrent validity refers to the ability of a test to predict an event in the present. For instance, in its simplest form, a researcher may use a questionnaire to elicit diabetic patients' blood sugar readings at their last hospital follow-up visit and compare these responses to the laboratory blood glucose readings for the same patients.

Predictive validity

This assesses the ability of the questionnaire (instrument) to forecast future events, behaviour, attitudes or outcomes, and it is assessed using a correlation coefficient. Predictive validity is the ability of a test to measure some event or outcome in the future. [24,28] A good example of predictive validity is the use of a medication-adherence questionnaire in hypertensive patients to predict a future medical outcome such as systolic blood pressure control. [28,29]

CONSTRUCT VALIDITY

Construct validity is the degree to which an instrument measures the trait or theoretical construct that it is intended to measure. [5,16,30-34] It does not have a criterion for comparison; rather, it utilises a hypothetical construct for comparison. [5,11,30-34] It is the most valuable and most difficult measure of validity. Basically, it is a measure of how meaningful the scale or instrument is when it is in practical use. [5,24] There are four main types of evidence that can be obtained for the purpose of construct validity, depending on the research problem, as discussed below:

Convergent validity

Convergent validity is shown by evidence that the same concept measured in different ways yields similar results. In this case, one could include two different tests. Where different measures of the same concept are expected to yield similar results, a researcher may, for example, use self-report versus observation (different measures). [12,33-36] The two scenarios given below illustrate this concept.

Scenario one

A researcher could place meters on respondents' television (TV) sets to record the time that people spend on certain health programmes on TV. This record can then be compared with survey results on 'exposure to health programmes on TV' obtained using a questionnaire.

Scenario two

The researcher could send someone to observe respondents' TV use in their homes and compare the observation results with the survey results obtained using a questionnaire.

Discriminant validity

Discriminant validity is shown by evidence that one concept is different from other closely related concepts. [12,34,36] Using the TV health programme exposure scenarios above, the researcher can also measure exposure to TV entertainment programmes and determine whether those measures differ from the TV health programme exposure measures. In this case, the measures of exposure to TV health programmes should not be highly related to the measures of exposure to TV entertainment programmes.
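
The sketch below illustrates both ideas on simulated data for the TV scenario: the self-reported health-programme exposure should correlate highly with the meter record of the same construct (convergent evidence) and only weakly with exposure to entertainment programmes (discriminant evidence). All of the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weekly hours of TV health-programme viewing for 50 respondents
meter_health = rng.normal(5, 2, 50)                     # set-top meter record (observation)
survey_health = meter_health + rng.normal(0, 0.8, 50)   # questionnaire self-report of the same construct
survey_entertainment = rng.normal(10, 3, 50)            # self-report of a different construct

convergent_r = np.corrcoef(survey_health, meter_health)[0, 1]            # expected to be high
discriminant_r = np.corrcoef(survey_health, survey_entertainment)[0, 1]  # expected to be low

print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")
```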

Known-group validity

In known-group validity, a group with an already established attribute of the outcome or construct is compared with a group in whom the attribute has not been established. [11,37] Since the attribute of the two groups of respondents is known, it is expected that the measured construct will be higher in the group with the related attribute and lower in the group without it. [11,36-38] For example, consider a survey that uses a questionnaire to explore depression in two groups of patients, one with a clinical diagnosis of depression and one without. In known-group validity, it is expected that the depression construct in the questionnaire will be scored higher among the patients with clinically diagnosed depression than among those without the diagnosis. Another example is the study by Singh et al., [38] in which a cognitive interview study was conducted among school pupils in six European countries.
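
A simple way to examine known-group evidence is to compare questionnaire scores between the two groups, for instance with an independent-samples t-test; the depression scores below are hypothetical and serve only to illustrate the comparison.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical questionnaire depression scores (higher = more depressive symptoms)
diagnosed = np.array([28, 31, 25, 34, 29, 33, 27, 30])       # clinically diagnosed group
not_diagnosed = np.array([12, 15, 10, 18, 14, 11, 16, 13])   # comparison group

t_stat, p_value = ttest_ind(diagnosed, not_diagnosed)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Known-group validity is supported if the diagnosed group scores significantly higher.
```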

Factorial validity

This is an empirical extension of content validity, because it validates the contents of the construct using the statistical model called factor analysis. [11,39-42] It is usually employed when the construct of interest has many dimensions that form different domains of a general attribute. In the analysis of factorial validity, the several items put forward to measure a particular dimension within a construct of interest are expected to be more highly related to one another than to items measuring other dimensions. [11,39-42] For instance, consider the health-related quality of life questionnaire Short Form-36 version 2 (SF-36v2). This tool has 8 dimensions, and it is therefore expected that all the SF-36v2 items measuring social function (SF), which is one of the 8 dimensions, will be more highly related to one another than to the items measuring the mental health domain, which captures a different dimension. [43]
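
The sketch below illustrates the idea with a small exploratory factor analysis on simulated data in which the first three items are driven by one latent dimension and the last three by another; the data, the item grouping and the use of scikit-learn's FactorAnalysis are illustrative assumptions, and a formal validation would typically also apply factor rotation and fit criteria.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: items 1-3 driven by one latent dimension (e.g. social function),
# items 4-6 by another (e.g. mental health), plus item-specific noise.
social = rng.normal(size=n)
mental = rng.normal(size=n)
items = np.column_stack([
    social + rng.normal(0, 0.5, n),
    social + rng.normal(0, 0.5, n),
    social + rng.normal(0, 0.5, n),
    mental + rng.normal(0, 0.5, n),
    mental + rng.normal(0, 0.5, n),
    mental + rng.normal(0, 0.5, n),
])

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))
# Each row is a factor; items written for the same dimension should load heavily
# (large absolute values) on the same factor and only weakly on the other.
```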

Hypothesis-testing validity

Hypothesis-testing validity is shown by evidence that a research hypothesis about the relationship between the measured concept (variable) and other concepts (variables), derived from a theory, is supported. [12,44] In the case of TV viewing, for example, social learning theory describes how violent behaviour can be learned from observing and modelling televised physical violence. From this theory, we could derive a hypothesis stating a positive correlation between physical aggression and the amount of televised physical violence viewed. If the evidence collected supports the hypothesis, we can conclude that there is a high degree of construct validity in the measurements of physical aggression and viewing of televised physical violence, since the two theoretical concepts are measured and examined in the hypothesis-testing process.

METHODS USED FOR RELIABILITY TEST OF A QUESTIONNAIRE

Reliability is the extent to which a questionnaire, test, observation or any measurement procedure produces the same results on repeated trials. In short, it is the stability or consistency of scores over time or across raters. [7] Keep in mind that reliability pertains to scores, not people; thus, in research one would never say that a person was reliable. As an example, consider judges in a platform diving competition: the extent to which they agree on the scores for each contestant is an indication of reliability. Similarly, the degree to which an individual's responses (i.e., their scores) on a survey would stay the same over time is also a sign of reliability. [7] It is worth noting that lack of reliability may arise from divergences between observers or instruments of measurement, or from instability of the attribute being measured. [3] The reliability of a questionnaire is usually assessed using a pilot test. Reliability can be assessed in three major forms: test-retest reliability, alternate-form reliability and internal consistency reliability. These are discussed below.

TEST-RETEST RELIABILITY (OR STABILITY)

Test-retest correlation provides an indication of stability over time. [5,12,27,37] This aspect of reliability, or stability, is said to occur when the same or similar scores are obtained with repeated testing of the same group of respondents. [5,25,35,37] In other words, the scores are consistent from one time to the next. Stability is assessed through a test-retest procedure that involves administering the same measurement instrument, such as a questionnaire, to the same individuals under the same conditions after some period of time. It is the most common form of reliability testing of questionnaires in surveys.

Test-retest reliability is estimated with correlations between the scores at time 1 and those at time 2 (up to time x). Two assumptions underlie the use of the test-retest procedure: [12]

  • The first required assumption is that the characteristic being measured does not change over the time period (the so-called 'testing effect'). [11]
  • The second assumption is that the interval is long enough for respondents' memories of taking the test at time 1 to fade, so that those memories do not influence their scores at time 2 and at subsequent test administrations (the so-called 'memory effect').

It is measured by having the same respondents complete a survey at two different points in time to see how stable the responses are. In general, correlation coefficient ( r ) values are considered good if r ≥ 0.70. [ 38 45 ]
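
A minimal sketch of this computation in Python is shown below, using hypothetical total scores from two administrations of the same questionnaire and the Pearson correlation from SciPy.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical total questionnaire scores for 10 respondents at two administrations
time1 = np.array([34, 28, 41, 25, 37, 30, 44, 27, 33, 39])
time2 = np.array([32, 29, 40, 27, 36, 31, 45, 25, 34, 38])

r, p = pearsonr(time1, time2)
print(f"test-retest r = {r:.2f}")   # r >= 0.70 is generally considered good
```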

If data are recorded by an observer, one can have the same observer make two separate measurements; the comparison between the two measurements is intra-observer reliability. In using this form of reliability, one needs to be careful with questionnaires or scales that measure variables likely to change over a short period of time, such as energy, happiness and anxiety, because of the maturation effect. [24] If the researcher has to use such variables, then the test-retest must be done over very short periods of time. A potential problem with test-retest is the practice effect, in which individuals become familiar with the items and simply answer based on their memory of their last answers. [45]

ALTERNATE-FORM RELIABILITY (OR EQUIVALENCE)

Alternate form refers to the amount of agreement between two or more research instruments, such as two different questionnaires on the same research construct, that are administered at nearly the same point in time. [7] It is measured through a parallel-form procedure in which one administers alternative forms of the same measure to either the same group or a different group of respondents. It uses differently worded questionnaires to measure the same attribute or construct: [45] questions or responses are reworded, or their order is changed, to produce two items that are similar but not identical. The administration of the various forms occurs at the same time or after some time delay. The higher the degree of correlation between the two forms, the more equivalent they are. In practice, the parallel-forms procedure is seldom implemented, as it is difficult, if not impossible, to verify that two tests are indeed parallel (i.e., have equal means, variances and correlations with other measures). Indeed, it is difficult enough to develop one good instrument or questionnaire to measure the construct of interest, let alone two. [7]

Another situation in which equivalence is important is when the measurement process entails subjective judgements or ratings made by more than one person. [5,7] Say, for example, that we are part of a research team whose purpose is to interview people concerning their attitudes towards a health education curriculum for children. It should be self-evident to the researcher that each rater should apply the same standards to the assessment of the responses. The same can be said for a situation in which multiple individuals are observing health behaviour: the observers should agree as to what constitutes the presence or absence of a particular health behaviour, as well as the level to which the behaviour is exhibited. In these scenarios, equivalence is demonstrated by assessing inter-observer reliability, which refers to the consistency with which observers or raters make judgements. [7]

The procedure for determining inter-observer reliability is:

Inter-observer reliability (%) = (number of agreements / number of opportunities for agreement) × 100.

Thus, a situation in which raters agree a total of 75 times out of 90 opportunities (i.e., unique observations or ratings) produces 83% agreement: 75/90 = 0.83, and 0.83 × 100 = 83%.
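
The same calculation can be expressed in a few lines of Python, either from the agreement counts or directly from two observers' ratings; the helper function name and the example ratings are hypothetical.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percent agreement between two observers rating the same opportunities."""
    agreements = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return agreements / len(ratings_a) * 100

# Worked example from the text: 75 agreements out of 90 opportunities
print(75 / 90 * 100)                     # 83.3%

# Or computed directly from two observers' hypothetical binary ratings
obs_a = [1, 1, 0, 1, 0, 1]
obs_b = [1, 0, 0, 1, 0, 1]
print(percent_agreement(obs_a, obs_b))   # 83.3%
```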

INTERNAL CONSISTENCY RELIABILITY (OR HOMOGENEITY)

Internal consistency concerns the extent to which the items on a test or instrument measure the same thing. The appeal of an internal consistency index of reliability is that it is estimated after only one test administration and therefore avoids the problems associated with testing over multiple time periods. [5] Internal consistency is estimated via the split-half reliability index [5] and the coefficient alpha index, [22,23,25,37,42,46-49] which is the most commonly used form of internal consistency reliability. Sometimes the Kuder-Richardson formula 20 (KR-20) index is used. [7,50]

The split-half estimate entails dividing up the test into two parts (e.g. odd/even items or first half of the items/second half of the items), administering the two forms to the same group of individuals and correlating the responses. [ 7 10 ] Coefficient alpha and KR-20 both represent the average of all possible split-half estimates. The difference between the two is when they would be used to assess reliability. Specifically, coefficient alpha is typically used during scale development with items that have several response options (i.e., 1 = strongly disagree to 5 = strongly agree) whereas KR-20 is used to estimate reliability for dichotomous (i.e., yes/no; true/false) response scales. [ 7 ]

The formula to compute KR-20 is:

KR-20 = [n/(n − 1)] × [1 − Sum(piqi)/Var(X)]

where:

n = total number of items

Sum(piqi) = the sum, over items, of pi × qi, where pi is the proportion of respondents endorsing (or answering correctly) item i and qi = 1 − pi

Var(X) = composite (total score) variance.
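
A minimal sketch of this formula in Python, using a small matrix of hypothetical yes/no (1/0) responses, is shown below; note that texts differ slightly on whether the sample or population variance is used for Var(X), and the Cronbach alpha sketch given earlier in this document follows the same pattern for polytomous items.

```python
import numpy as np

def kr20(items):
    """KR-20 = [n/(n-1)] * [1 - sum(pi*qi)/Var(X)] for items scored 0/1."""
    items = np.asarray(items, dtype=float)
    n = items.shape[1]                      # number of items
    p = items.mean(axis=0)                  # proportion answering each item 'yes'/correctly
    q = 1 - p
    var_x = items.sum(axis=1).var(ddof=1)   # composite (total score) variance
    return (n / (n - 1)) * (1 - (p * q).sum() / var_x)

# Hypothetical yes/no (1/0) answers from 6 respondents to 4 items
responses = [
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(responses), 2))  # about 0.91 for this toy data
```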

Coefficient alpha (α), as given by Allen and Yen (1979), [51] is calculated as:

α = [n/(n − 1)] × [1 − Sum Var(Yi)/Var(X)]

where:

n = number of items

Sum Var(Yi) = sum of the individual item variances

Var(X) = composite (total score) variance.

It should be noted that KR-20 and Cronbach alpha can easily be estimated using several statistical analysis software these days. Therefore, researchers do not have to go through the laborious exercise of memorising the mathematical formula given above. As a rule of thumb, the higher the reliability value, the more reliable the measure. The general convention in research has been prescribed by Nunnally and Bernstein, [ 52 ] which states that one should strive for reliability values of 0.70 or higher. It is worthy of note that reliability values increase as test length increases. [ 53 ] That is, the more items we have in our scale to measure the construct of interest, the more reliable our scale will become. However, the problem with simply increasing the number of scale items when performing applied research is that respondents are less likely to participate and answer completely when confronted with the prospect of replying to a lengthy questionnaire. [ 7 ] Therefore, the best approach is to develop a scale that completely measures the construct of interest and yet does so in as parsimonious or economical manner as is possible. A well-developed yet brief scale may lead to higher levels of respondent participation and comprehensiveness of responses so that one acquires a rich pool of data with which to answer the research question.

SHORT NOTE ON SPSS AND RELIABILITY TEST

Reliability can be established using a pilot test, collecting data from 20 to 30 subjects who will not be included in the main sample. Data collected from the pilot test can be analysed using SPSS (Statistical Package for the Social Sciences, IBM) or any other related software. SPSS provides two key pieces of information in the output viewer: the 'correlation matrix' and the 'alpha if item deleted' columns. [54,55] Cronbach alpha (α) is the most commonly used measure of internal consistency reliability, [45] so it is discussed here. Conditions that can affect Cronbach alpha values are: [54,55]

  • Number of items: a scale with fewer than 10 items can cause Cronbach alpha to be low
  • Distribution of scores: normally distributed scores increase the Cronbach alpha value, while skewed data reduce it
  • Timing: Cronbach alpha does not indicate the stability or consistency of the test over time
  • Wording of the items: negatively worded items should be reverse-scored before analysis (see the sketch after this list)
  • Items with 0, 1 and negative scores: ensure that items/statements scored with 0s, 1s and negative values are eliminated before the analysis.
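
Reverse-scoring is straightforward to do before running the reliability analysis; the sketch below assumes a 5-point Likert scale and one hypothetical negatively worded item, and the recoded matrix (not the raw one) would then be passed to the alpha calculation (for example, the cronbach_alpha() sketch shown earlier).

```python
import numpy as np

# Hypothetical 5-point Likert responses (1 = strongly disagree ... 5 = strongly agree)
# Columns: item1, item2, item3, where item3 is negatively worded.
responses = np.array([
    [4, 5, 2],
    [2, 2, 4],
    [5, 4, 1],
    [3, 3, 3],
])

scale_min, scale_max = 1, 5
negatively_worded = [2]   # zero-based column index of item3 (hypothetical)

recoded = responses.copy()
# Reverse-score: a response x becomes (min + max) - x, so 1 <-> 5, 2 <-> 4, etc.
recoded[:, negatively_worded] = (scale_min + scale_max) - recoded[:, negatively_worded]
print(recoded)
```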

The detailed step-by-step procedure for the reliability analysis using SPSS can be found on the internet and in standard texts. [54,55] Note that the reliability coefficient (alpha) can range from 0 to 1, with 0 representing a questionnaire that is not reliable and 1 representing an absolutely reliable questionnaire. A reliability coefficient (alpha) of 0.70 or higher is generally considered acceptable.

This article has reviewed the validity and reliability of the questionnaire as an important research tool in social and health science research. It highlighted the importance of validity and reliability tests in research and gave both literary and technical meanings of these tests. Various forms and methods of analysing the validity and reliability of questionnaires were discussed, with the main aim of improving the skills and knowledge of these tests among researchers in developing countries.

Financial support and sponsorship

Conflicts of interest.

There are no conflicts of interest.


Keywords: Questionnaire; reliability; social and health; validity



The 4 Types of Validity in Research | Definitions & Examples

Published on September 6, 2019 by Fiona Middleton. Revised on June 22, 2023.

Validity tells you how accurately a method measures something. If a method measures what it claims to measure, and the results closely correspond to real-world values, then it can be considered valid. There are four main types of validity:

  • Construct validity : Does the test measure the concept that it’s intended to measure?
  • Content validity : Is the test fully representative of what it aims to measure?
  • Face validity : Does the content of the test appear to be suitable to its aims?
  • Criterion validity : Do the results accurately measure the concrete outcome they are designed to measure?

In quantitative research, you have to consider the reliability and validity of your methods and measurements.

Note that this article deals with types of test validity, which determine the accuracy of the actual components of a measure. If you are doing experimental research, you also need to consider internal and external validity, which deal with the experimental design and the generalizability of results.


Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method.

What is a construct?

A construct refers to a concept or characteristic that can’t be directly observed, but can be measured by observing other indicators that are associated with it.

Constructs can be characteristics of individuals, such as intelligence, obesity, job satisfaction, or depression; they can also be broader concepts applied to organizations or social groups, such as gender equality, corporate social responsibility, or freedom of speech.

There is no objective, observable entity called “depression” that we can measure directly. But based on existing psychological research and theory, we can measure depression based on a collection of symptoms and indicators, such as low self-confidence and low energy levels.

What is construct validity?

Construct validity is about ensuring that the method of measurement matches the construct you want to measure. If you develop a questionnaire to diagnose depression, you need to know: does the questionnaire really measure the construct of depression? Or is it actually measuring the respondent’s mood, self-esteem, or some other construct?

To achieve construct validity, you have to ensure that your indicators and measurements are carefully developed based on relevant existing knowledge. The questionnaire must include only relevant questions that measure known indicators of depression.

The other types of validity described below can all be considered as forms of evidence for construct validity.


Content validity assesses whether a test is representative of all aspects of the construct.

To produce valid results, the content of a test, survey or measurement method must cover all relevant parts of the subject it aims to measure. If some aspects are missing from the measurement (or if irrelevant aspects are included), the validity is threatened and the research is likely suffering from omitted variable bias.

A mathematics teacher develops an end-of-semester algebra test for her class. The test should cover every form of algebra that was taught in the class. If some types of algebra are left out, then the results may not be an accurate indication of students’ understanding of the subject. Similarly, if she includes questions that are not related to algebra, the results are no longer a valid measure of algebra knowledge.

Face validity considers how suitable the content of a test seems to be on the surface. It’s similar to content validity, but face validity is a more informal and subjective assessment.

You create a survey to measure the regularity of people’s dietary habits. You review the survey items, which ask questions about every meal of the day and snacks eaten in between for every day of the week. On its surface, the survey seems like a good representation of what you want to test, so you consider it to have high face validity.

As face validity is a subjective measure, it’s often considered the weakest form of validity. However, it can be useful in the initial stages of developing a method.

Criterion validity evaluates how well a test can predict a concrete outcome, or how well the results of your test approximate the results of another test.

What is a criterion variable?

A criterion variable is an established and effective measurement that is widely considered valid, sometimes referred to as a “gold standard” measurement. Criterion variables can be very difficult to find.

What is criterion validity?

To evaluate criterion validity, you calculate the correlation between the results of your measurement and the results of the criterion measurement. If there is a high correlation, this gives a good indication that your test is measuring what it intends to measure.

A university professor creates a new test to measure applicants’ English writing ability. To assess how well the test really does measure students’ writing ability, she finds an existing test that is considered a valid measurement of English writing ability, and compares the results when the same group of students take both tests. If the outcomes are very similar, the new test has high criterion validity.



Face validity and content validity are similar in that they both evaluate how suitable the content of a test is. The difference is that face validity is subjective, and assesses content at surface level.

When a test has strong face validity, anyone would agree that the test’s questions appear to measure what they are intended to measure.

For example, looking at a 4th grade math test consisting of problems in which students have to add and multiply, most people would agree that it has strong face validity (i.e., it looks like a math test).

On the other hand, content validity evaluates how well a test represents all the aspects of a topic. Assessing content validity is more systematic and relies on expert evaluation of each question, analyzing whether each one covers the aspects that the test was designed to cover.

A 4th grade math test would have high content validity if it covered all the skills taught in that grade. Experts (in this case, math teachers) would have to evaluate the content validity by comparing the test to the learning objectives.

Criterion validity evaluates how well a test measures the outcome it was designed to measure. An outcome can be, for example, the onset of a disease.

Criterion validity consists of two subtypes depending on the time at which the two measures (the criterion and your test) are obtained:

  • Concurrent validity is a validation strategy where the scores of a test and the criterion are obtained at the same time.
  • Predictive validity is a validation strategy where the criterion variables are measured after the scores of the test.

Convergent validity and discriminant validity are both subtypes of construct validity . Together, they help you evaluate whether a test measures the concept it was designed to measure.

  • Convergent validity indicates whether a test that is designed to measure a particular construct correlates with other tests that assess the same or similar construct.
  • Discriminant validity indicates whether two tests that should not be highly related to each other are indeed not related. This type of validity is also called divergent validity.

You need to assess both in order to demonstrate construct validity. Neither one alone is sufficient for establishing construct validity.

The purpose of theory-testing mode is to find evidence in order to disprove, refine, or support a theory. As such, generalizability is not the aim of theory-testing mode.

Due to this, the priority of researchers in theory-testing mode is to eliminate alternative causes for relationships between variables. In other words, they prioritize internal validity over external validity, including ecological validity.

It’s often best to ask a variety of people to review your measurements. You can ask experts, such as other researchers, or laypeople, such as potential participants, to judge the face validity of tests.

While experts have a deep understanding of research methods, the people you’re studying can provide you with valuable insights you may have missed otherwise.

Cite this Scribbr article


Middleton, F. (2023, June 22). The 4 Types of Validity in Research | Definitions & Examples. Scribbr. Retrieved September 14, 2024, from https://www.scribbr.com/methodology/types-of-validity/



Open Access

Study Protocol

Systematic review and meta-analysis of developmental assets scales: A study protocol for psychometric properties


Background and aims

Application of developmental assets, one of the existing Positive Youth Development (PYD) frameworks, has gained momentum in research, policy formulations, and interventions, necessitating the introduction of the most efficient scales for this framework. The present study protocol aims to conduct a systematic review and meta-analysis of developmental assets scales to document the underlying logic, objectives, and methodologies earmarked for the identification, selection, and critical evaluation of these scales.

Methods and materials

In accordance with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P), the intended search will encompass the PubMed, Scopus, Web of Knowledge, and PsycINFO databases, spanning from the framework’s inception in 1998 to the 1st of April 2024. The review will include articles published in the English language, focusing on individuals aged 10 to 29 years and reporting at least one type of reliability or validity of developmental assets scales. The review process will be in compliance with the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN), and the overall quality of evidence will be determined using the Grading of Recommendations, Assessment, Development and Evaluations (GRADE) guidelines.

This comprehensive assessment aims to identify potential biases in prior research and offer guidance to scholars regarding the optimal scales for developmental assets in terms of validity, reliability, responsiveness, and interpretability. The evidence-based appraisal of the scales’ strengths and limitations is imperative in shaping future research, enhancing their methodological rigor, and proposing refinements to existing instruments for developmental assets.

Citation: Habibi Asgarabad M, Salehi Yegaei P, Trejos-Castillo E, Seyed Yaghoubi Pour N, Wiium N (2024) Systematic review and meta-analysis of developmental assets scales: A study protocol for psychometric properties. PLoS ONE 19(9): e0309909. https://doi.org/10.1371/journal.pone.0309909

Editor: Chrysanthi Lioupi, University of Nicosia, CYPRUS

Received: December 8, 2023; Accepted: August 18, 2024; Published: September 10, 2024

Copyright: © 2024 Habibi Asgarabad et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Deidentified research data will be made publicly available when the study is completed and published.

Funding: Thanks go to the Norwegian University of Science and Technology (NTNU) for providing financial support for the publication of this article as an open-access article. The funders played no role in the design of the study, the collection of data, the analysis, the decision to publish, or the preparation of the manuscript.

Competing interests: The authors have no commercial or financial relationships that could be construed as a potential conflict of interest.

1. Introduction

Positive youth development (PYD) is a strength-based developmental perspective that puts emphasis on meaningful and constructive involvement of young individuals in their communities, educational institutions, social circles, and families with the aim of empowering and enabling them to achieve their maximum capabilities [ 1 ]. PYD concentrates on boosting young people’s strengths, establishing supportive contexts, and promoting reciprocal and constructive youth ↔ context interactions [ 2 , 3 ].

PYD has gained ground in organizations, community-based services, and youth-serving programs providing opportunities for young people to foster their competencies [ 4 ]. Thus, an array of frameworks has been conceptualized and designed over the past decades to define and capture PYD. The Search Institute, in collaboration with Benson et al. [ 5 ], proposed the conceptual framework of PYD that includes 40 developmental assets, comprising 20 individual assets and 20 contextual assets (e.g., home, school, community). To assess this model, Benson et al. [ 6 ] developed the 58-item developmental assets profile (DAP), which takes two perspectives: a) the “asset category” perspective, which organizes items into measures representing eight internal and external developmental asset categories ( Table 1 ); and b) the “asset context” perspective, which regroups items based on how young individuals experience these assets in various ecological contexts (i.e., personal, family, school, social, and community). DAP items are scored using a scale ranging from “ not at all/rarely ” to “ extremely/almost always .” In terms of psychometric properties, the DAP has been found to possess invariance over time [ 7 ] and to have acceptable to good reliability and validity in both individual projects and group aggregates [ 6 ]. For instance, higher scores on the developmental assets have been linked to adolescent achievement [ 8 ], avoidance of risky behaviors [ 9 ], and improvement of pro-social behavior, resiliency, and leadership [ 7 , 10 ].

Table 1. https://doi.org/10.1371/journal.pone.0309909.t001

Another scale, the Youth Asset Survey (YAS; [ 11 ]), was designed to assess the associations of developmental assets with risky behaviors in a prospective study on adolescents and their parents. This 37-item survey encompasses eight subscales corresponding to eight developmental assets: family communication, peer role models, general future aspirations, responsible choices, community involvement, non-parental role models, use of time on groups/sports, and use of time on religion, along with two one-item subscales of cultural respect and good health practices (exercise/nutrition). Although the construct validity and internal consistency of the YAS have been supported, the family communication and future aspirations subscales have shown low alpha coefficients (< .68). Additionally, the number of items was limited for two additional assets: cultural respect and good health practices, including exercise and nutrition [ 11 ]. In an attempt to modify this scale, Oman, Lensch [ 12 ] conducted a longitudinal cohort study and provided an improved 68-item scale, the Youth Asset Survey–Revised (YAS-R). The YAS-R appraises seven additional developmental assets, namely, religiosity, school connectedness, relationship with father, relationship with mother, general self-confidence, parental monitoring, and educational aspirations for the future.

Taken together, as PYD programs and frameworks grow in popularity, there is a compelling need to globally and culturally adapt appropriate and relevant measures of developmental assets that are psychometrically sound across various contexts [ 13 ]. The extensive utilization of these scales warrants a systematic review to examine the caliber of their psychometric properties, describe their plausible psychometric shortcomings and strengths, and identify the best measures for researchers. Despite the availability of numerous developmental assets scales, the comparison of the psychometric characteristics of the most widely used scales has remained largely understudied. Therefore, there is a pressing need for a summary of available developmental assets scales and their psychometric robustness to serve as a guide for choosing the right measurement tool when conducting investigations and implementing youth programs.

Furthermore, a growing body of evidence is concerned with the prevalence of publication bias in the context of systematic reviews [e.g., 14 ], suggesting that the failure to publish completed studies might pose a challenge for systematic reviews. For instance, the study by Silagy, Middleton [ 15 ] underscored the potential bias towards favoring positive findings in published systematic reviews. Given these observations, it becomes paramount that a predefined protocol is established before the review, articulating whether the review outcomes align with the original study plan and enhancing transparency in both the execution and eventual reporting of the systematic review [ 16 ]. Thus, a review protocol that is subjected to peer review contributes to preventing ad hoc decisions in the review process, as well as reducing publication bias and selective reporting [ 17 , 18 ].

The proposed systematic review is the first-of-its-kind study that will fill the existing gap on the characteristics and psychometric properties of measures of developmental assets. In particular, our objectives are to: 1) prepare a comprehensive list of available tools developed for developmental assets, 2) summarize the characteristics of these tools/questionnaires (e.g., number of components/items, assessment method, language, and scoring type), 3) identify the most commonly used psychometric indexes for the evaluation of these tools/measures (e.g., reliability, validity, measurement error, responsiveness, and interpretability), 4) appraise the extent to which the measurement properties of these tools/questionnaires possess the methodological quality in accordance with the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) criteria [ 19 ], and 5) compare the quality of the measurement properties related to these tools based on the results of COSMIN (if applicable).

Reporting the proposed review aligns with the guidelines of the Preferred Reporting Items for Systematic Review and Meta-Analyses (PRISMA) [ 20 ]. See S1 Table for a completed PRISMA-P-checklist.

2.1. Eligibility criteria

The three main eligibility criteria for papers containing developmental assets scales are as follows: a) being published in English, b) targeting young people aged 10–29, in line with the suggestion of Catalano, Skinner [ 21 ], and c) being published after 1998 (when developmental assets were first conceptualized by Benson, Leffert [ 5 ]). Besides these primary criteria, studies must have pursued at least one of the following aims: 1) reporting quantitative information on the appropriateness or acceptability of the measurement tools for developmental assets, 2) providing at least one of the reliability indexes of these tools, 3) containing information on the validity of these tools, or 4) using these tools to evaluate risk factors (i.e., predictor variables) and/or study outcomes. Papers with different methodological designs, such as cohort studies, randomized/non-randomized controlled trials, cross-sectional, post-intervention, and case-control studies, as well as grey literature, will be included if they meet the eligibility criteria. Studies will be excluded if they: 1) contain no empirical evidence (i.e., theoretical framework discussions and editorials) or 2) are literature reviews.

2.2. Information sources and strategy for search

A quick literature review based on Medical Subject Headings (MeSH) [ 22 ] will be used to identify keywords in two domains, namely, “developmental assets” and “tools/questionnaire” [the extended keywords are presented in the syntax search in Table 2 ]. A preliminary search strategy will be developed with the aid of a senior librarian. A sample search strategy for PubMed is presented in Table 2 (this syntax is subject to change based on the final search strategy). This search strategy has been adjusted according to the second version of the COSMIN initiative’s search filter [ 23 ]. Each domain will be searched individually to launch pilot searches. Following that, a comprehensive search will be conducted by combining all domains to ensure that an appropriate search strategy is implemented. Subsequently, the databases Scopus, PubMed, PsycINFO, and Web of Science will be searched from the establishment date of the developmental assets framework (1998) through the 1st of April 2024 (this date is subject to change based on the final date of coverage). To include additional studies and explore references further, we will check the included studies’ reference lists and conduct a broader search across scientific journals covering related fields. Furthermore, we will contact specialists in the field to obtain unpublished or under-review papers, where possible. Ultimately, we will search grey literature through the Healthcare Management Information Consortium (HMIC) and the European Association for Grey Literature Exploitation (EAGLE).

Table 2. https://doi.org/10.1371/journal.pone.0309909.t002

2.3. Data screening

The Rayyan QCRI online software [ 24 ] will be utilized to organize the references, titles, and abstracts of the papers and to identify duplicates. The titles and abstracts will be examined in the screening stage; papers that are not compatible with the proposed study’s purpose will be excluded. In the eligibility phase, the full texts of the articles will be reviewed to determine if they fit the inclusion criteria to be considered for the meta-analysis. An independent reviewer (PSY) will assess the manuscripts, and a senior researcher (MHA) will review the results. The details of the screening procedure are displayed in the flowchart of the PRISMA extension for systematic reviews ( Fig 1 ).

Fig 1. https://doi.org/10.1371/journal.pone.0309909.g001

2.4. Data extraction

A data extraction form will be designed in Microsoft Excel 2016 (a sample form is presented in Table 3 ). In the following step, data from three papers will be extracted to identify and modify possible flaws and deficiencies in the data extraction form. Two expert researchers (PSY and MHA) will extract the information from chosen papers individually, and, in case of any ambiguity, a senior researcher will further evaluate the extraction process (NW). In cases where data in the papers are missing, we will contact the authors and ask for original data via Email. The final form of extracted data will contain details on authors, year of publication, country, language, sample size, participant’s age and gender, classification of country and family by income level (ranging from low to high income), minority group, study setting (e.g., family/home, community, or school), measurement tool title, measurement tool development (with objectives) or adaptation, how initial questions were generated (e.g., theory- and literature review-derived, expert panel, focus group discussion, combining previous measurement tools), assessment method (self/proxy/teen/teacher-report questionnaire, interview, observation), scoring type (multiple-choice, Likert, etc.), number of subscales and items, reliability (i.e., internal consistency and measurement error), validity (i.e., face, content, construct, structural, cross-cultural, criterion, known-group, and longitudinal validity), responsiveness, and interpretability.

Table 3. https://doi.org/10.1371/journal.pone.0309909.t003

2.5. Outcomes

Expected outcomes of this study include an exhaustive and clear description of the measurement tools available for developmental assets and the identification of possible shortcomings and strengths of these measurement tools. Additional outcomes may include: a) aiding researchers in selecting appropriate measurement tools in future investigations, and b) assisting researchers in selecting an appropriate tool by assessing the utility and adaptability of the chosen instrument in their region and cultural context.

2.6. Potential biases in single studies

To facilitate the evaluation of risk of bias in every research study, we will apply the COSMIN criteria for systematic reviews of PROMs [ 19 , 25 ]. The 116-item checklist of COSMIN’s Risk of Bias consists of ten criteria, as detailed in Table 4 , including validity (e.g., content, structural, and criterion validity), reliability (e.g., stability and measurement error), responsiveness, and interpretability. The assessment uses a four-point rating scale: “ very good ”, “ adequate ”, “ doubtful ”, and “ inadequate ”. The quality of measurement properties will be graded as sufficient (+), insufficient (-), or indeterminate (?) based on the COSMIN criteria for good measurement properties.

Table 4. https://doi.org/10.1371/journal.pone.0309909.t004

2.7. Data synthesis and meta-analysis

Prior to data synthesis and if feasible, data pooling will be conducted. A standard psychometric meta-analysis approach, posited by Hunter and Schmidt [ 28 , 29 ], will be performed based on psychometric principles. This approach suggests that measurement errors (caused by unreliable measures), errors in the sampling process, and range limitations are some of the sources that cause artifact variability and account for a large portion of the observed variation in the relationship between two variables in original studies. Consequently, it is crucial to conduct meta-analyses to identify potential moderating factors influencing these relationships and address artifact variability across studies. By identifying these factors, researchers can control artifact variability by either selecting appropriate study designs or subtracting artifact variability from the overall observed variability.
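For illustration only, the following Python sketch shows the basic attenuation correction that underlies the Hunter–Schmidt treatment of measurement-error artifacts; the observed correlation and reliability values are assumptions, not figures from the protocol.

```python
import math

def correct_for_attenuation(r_observed, rel_x, rel_y):
    """Hunter-Schmidt style correction of an observed correlation for measurement error
    in both variables: r_true = r_observed / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Hypothetical observed correlation and reliabilities of the two measures
r_obs, rel_predictor, rel_outcome = 0.35, 0.80, 0.75
print(f"Corrected correlation = {correct_for_attenuation(r_obs, rel_predictor, rel_outcome):.2f}")
```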

Meta-analysis will be based on the Fisher’s Z transformed partial correlation coefficient, which is known to have the lowest root mean square error and bias [ 30 ] in comparison with meta-analysis of untransformed partial correlation coefficients [ 31 ]. The standardized effect size will be Fisher’s Z, which ranges from −∞ to +∞; the standards used to interpret it are similar to those used for a correlation coefficient. If intraclass, Pearson, or Spearman correlations are provided, we will apply Fisher’s variance-stabilizing transformation [ 32 , 33 ] to convert them into Fisher’s Z scores. If unstandardized beta coefficients or F-ratios are provided, we will first convert them to r and then to Fisher’s Z scores [ 32 , 33 ]. If only p values are provided, we will convert them to a Z-score, then to r, and then to Fisher’s Z [ 33 ]. We will extract the overall effect size for each psychometric property and the effect sizes for each follow-up interval from studies that include follow-up assessments.
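As a rough sketch of the conversions described above, the snippet below applies Fisher's variance-stabilizing transformation to hypothetical correlation coefficients and back-transforms a pooled value; the study values, the simple inverse-variance weighting, and the n − 3 variance approximation are illustrative assumptions rather than the protocol's actual analysis.

```python
import numpy as np

def fisher_z(r):
    """Fisher's variance-stabilizing transformation of a correlation coefficient."""
    return 0.5 * np.log((1 + r) / (1 - r))  # equivalently np.arctanh(r)

def fisher_z_variance(n):
    """Approximate sampling variance of Fisher's Z for a simple correlation."""
    return 1.0 / (n - 3)

# Hypothetical correlations and sample sizes from three studies
r_values = np.array([0.45, 0.62, 0.38])
n_values = np.array([120, 85, 200])

z_scores = fisher_z(r_values)
variances = fisher_z_variance(n_values)
print("Fisher's Z:", np.round(z_scores, 3))
print("Variances:", np.round(variances, 4))

# Back-transformation of a pooled value to r for reporting
pooled_z = np.average(z_scores, weights=1 / variances)  # simple inverse-variance weights for illustration
pooled_r = np.tanh(pooled_z)
print(f"Pooled estimate (back-transformed): r = {pooled_r:.3f}")
```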

Data analysis will be carried out utilizing Comprehensive Meta-Analysis v.3 software [ 34 ]. With the presumption that heterogeneity is probable and that the mean effect size is not stable across studies, random-effects models will be applied. To assess heterogeneity, Cochran’s Q test (the presence of heterogeneity) and the I² statistic (the proportion of variability attributable to heterogeneity) will be utilized [ 35 ]. Based on the standard interpretation [ 36 ], the I² statistic will be classified as “not important” (0–40%), “moderate” (30–60%), “substantial” (50–90%), or “considerable” (75–100%). Furthermore, where appropriate, funnel plots will be provided to identify reporting bias and small-study effects [ 36 ]. To ensure that the meta-analysis’s findings are robust in the case of considerable heterogeneity, a sensitivity analysis will be conducted.
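Although the protocol will rely on Comprehensive Meta-Analysis software, the heterogeneity statistics it mentions can be sketched directly from study-level effect sizes; the effect sizes and variances below are hypothetical.

```python
import numpy as np

# Hypothetical Fisher's Z effect sizes and their within-study variances
effects = np.array([0.42, 0.55, 0.30, 0.61])
variances = np.array([0.010, 0.015, 0.008, 0.020])

weights = 1 / variances                                   # inverse-variance weights
pooled = np.sum(weights * effects) / np.sum(weights)      # weighted mean effect

# Cochran's Q: weighted squared deviations from the pooled effect
Q = np.sum(weights * (effects - pooled) ** 2)
df = len(effects) - 1

# I^2: share of total variation attributable to between-study heterogeneity
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0

print(f"Q = {Q:.2f} on {df} df, I^2 = {I2:.1f}%")
```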

When it is not feasible to pool the data, we will create a qualitative summary of the studies’ outcomes concerning the measurement properties of each measurement tool. If there are discrepancies between the findings across various studies, possible explanations will be provided. If a consistent pattern appears, we will consolidate the results for each subgroup with consistent findings. If no clear justifications or discernible patterns emerge, the majority of the findings will be employed to assess the results.

2.8. Quality assessment of individual studies

As described in Table 5 , the quality of the results of every study will be evaluated via the COSMIN criteria for good measurement properties. These properties will be rated as “ insufficient (-)”, “ indeterminate (?)”, or “ sufficient (+)”. The total score of a property will be equal to the lowest score it obtains, and its interpretation will be on the basis of the COSMIN criteria: more than 50% (high quality), 30–50% (moderate quality), and less than 30% (low quality) [ 19 , 37 , 38 ]. The four domains of reliability, validity, responsiveness, and interpretability will be included in the quality assessment taxonomy. This study will assess the quality of Exploratory Factor Analysis (EFA) using the guidelines outlined by Terwee, Bot [ 39 ]. According to these guidelines, in the absence of a theoretical or empirically emerged structural model, EFA is preferable. In contrast, when a model has already been theoretically proposed and/or has empirically emerged in the literature, Confirmatory Factor Analysis (CFA) should be tested [ 40 , 41 ]. The results of the EFA quality assessment will be interpreted as follows: (+) the chosen factors can explain at least 50% of the variance, OR they explain less than 50% of the variance but a justification for this selection is proposed by the authors; (?) vague or incomplete information (e.g., failure to mention the explained variance) prevents scoring the EFA’s quality; and (-) the criteria for a “plus” rating were not met [ 39 ].

Table 5. https://doi.org/10.1371/journal.pone.0309909.t005

2.9. Confidence in consolidated findings

The Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) working group approach [ 43 ] will be used to rate the credibility of the evidence after a summary of the general ratings on each psychometric property has been provided. The quality of evidence will be examined using five categories: risk of bias, publication bias, imprecision, inconsistency, and indirectness. Two researchers (PSY and MHA) will independently appraise the overall quality of the summarized findings. In the event of any disagreement, a third researcher (ETC) will further review the findings. The quality of outcomes will be classified as high (indicating a high degree of confidence in the measurement property’s estimate being close to the true value), moderate (indicating a reasonable belief that the true estimate of the measurement property is likely close to the estimated value), low (suggesting a substantial potential for a significant difference between the true estimate and the estimated property), or very low (indicating a high likelihood of a substantial deviation between the actual measurement property and its estimated value).

2.10. Recommendations for instrument choice

The assessment of the psychometric soundness of the measurement tools for developmental assets, along with providing recommendations for their forthcoming applications will be carried out in accordance with a composite of general ratings for each psychometric property and the grading outcomes, as outlined by Prinsen, Mokkink [ 37 ]. The outcomes derived from all included studies pertaining to each measurement property will be stratified into three distinct recommendation classifications, as articulated by Mokkink, Prinsen [ 42 ] and Mokkink, De Vet [ 19 ]: (a) the developmental assets scale has the potential to be introduced as the most suitable instrument for evaluating its intended theoretical model; (b) the scale may be tentatively endorsed, though further investigations into its validation are imperative; and (c) the scale should not be endorsed. The rationale underlying the assignment of each measurement tool to the aforementioned categories will be further explained. Subsequently, prospective trajectories for research will be delineated, where pertinent.

2.11. Ethics and dissemination

As this project does not recruit participants, obtaining ethical approval is not a prerequisite. The findings of the intended review will be reported in a peer-reviewed journal.

3. Discussion

The present study protocol aims to systematically review the underlying logic and aims of developmental assets scales, and to specify the methods that will be employed to critically evaluate primary studies on these scales. The main objective of the proposed systematic review and meta-analysis is to compile, summarize, and critically evaluate the psychometric properties (i.e., validity, reliability, measurement error, interpretability, and responsiveness) of measurement tools including but not limited to: a) the 58-item developmental assets profile (DAP) [ 6 ], derived from the developmental assets model proposed by Benson, Leffert [ 5 ]; b) the Youth Asset Survey (YAS; [ 11 ]); and c) the Youth Asset Survey–Revised (YAS-R) [ 12 ].

Given that PYD frameworks, including developmental assets, have gained traction in various areas, such as youth research, policy formulations [ 44 ], school-based organization programs [ 45 ], adolescent behavior interventions [ 46 ], community-based health services [ 47 ], socializing systems [ 48 ], and cross-sectoral interventions in education [ 49 ], it is crucial to identify the most efficient and preferred measures for developmental assets. To the best of our knowledge, no systematic review of the psychometric properties of developmental assets scales has been carried out to date. Thus, the proposed systematic review will be the first. In general, we strive to provide an in-depth summary and collection of the most used measurement tools in the developmental assets framework, the purpose of their development, their characteristics, and their psychometric indexes. In addition, we aim to delve into the methodological quality of their measurement properties based on the COSMIN criteria. Consequently, by discerning the strength of those measurement tools, we wish to assist researchers, clinicians, and community-health specialists in making informed choices and selecting optimal measures aligned with their desired objectives. Besides, through the detection of potential shortcomings, researchers may underscore the need for enhancing current measures. Furthermore, considering the rigorous quality assessments of the aforementioned scales, the levels of validity, reliability, feasibility, and productivity, as well as the potential risk of bias of earlier published studies, will become evident.

The present study protocol has several implications. First, the rigor and reliability of a systematic review on the developmental assets scales are contingent upon meticulous pre-planning that preemptively identifies potential challenges [ 18 ]. In the process of publication, a documented protocol undergoes rigorous peer review that scrutinizes it and ensures its appropriateness prior to its publication [ 16 ]. The second factor that ensures the trustworthiness of a systematic review is the thorough documentation of the methodology used in the review process prior to commencing the review. This documentation permits other researchers to compare the protocol with the finalized review, thus identifying instances of selective reporting or deviations and assessing the validity of the proposed methodology [ 17 ]. Third, a study protocol prevents arbitrary decisions regarding the inclusion of studies and data extraction, and mitigates publication bias in favor of only reporting “positive” findings (or those findings in line with the authors’ hypotheses) [ 50 ]. Finally, a properly conducted research protocol allows for the potential replication of (revised) review methods. Given that the current protocol does not include measurement tools and studies in other languages, local researchers will receive a published protocol, which they can adapt for use in their context and language, with some modifications.

The present study protocol is not without potential limitations. The first drawback is the exclusion of locally standardized/developed developmental assets scales that are not accessible in English. As a result, we will possibly miss significant related research published in other languages. Second, given that existing studies using developmental assets scales have been carried out in different locations and periods, the accuracy of the assessment may be inadvertently affected by external/environmental factors. Third, since research incorporates a wide range of study methods and samples, the assessment of statistical heterogeneity is frequently overlooked or reported insufficiently. Hence, data synthesis and meta-analysis may be impacted by high rates of heterogeneity.

Supporting information

S1 Table. PRISMA-P (Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols) 2015 checklist: Recommended items to address in a systematic review protocol.

https://doi.org/10.1371/journal.pone.0309909.s001

Acknowledgments

The authors gratefully thank those who kindly participated in this research.

  • 1. Lerner RM, editor Promoting positive youth development: Theoretical and empirical bases. White paper prepared for the workshop on the science of adolescent health and development, national research council/institute of medicine Washington, DC: National Academies of Science; 2005: Citeseer.
  • 2. Lerner JV, Phelps E, Forman Y, Bowers EP. Positive youth development: John Wiley & Sons Inc; 2009.
  • 13. Dukakis K, London RA, McLaughlin M, Williamson D. Positive youth development: Individual, setting and system level indicators. Issue Brief. 2009.
  • 16. Higgins J, Green S. Rationale for protocols. Cochrane Handbook for Systematic Reviews of Interventions: The Cochrane Collaboration. www.handbook.cochrane.org. ; 2011.
  • 28. Hunter JE, Schmidt FL. Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage; 2004.
  • 29. Schmidt FL, Hunter JE. Methods of meta-analysis: Correcting error and bias in research findings: Sage publications; 2014.
  • 30. van Aert RC. Meta-analyzing partial correlation coefficients using Fisher’s z transformation. 2023.
  • 33. Rosenberg MJ, Adams DC, Gurevitch J. Metawin 2.0 User’s manual: statistical software for meta-analysis. 2000.
  • 36. Higgins JP, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, et al. Cochrane handbook for systematic reviews of interventions: John Wiley & Sons; 2019.
  • 40. Bollen KA. Structural equations with latent variables: John Wiley & Sons; 1989.
  • 44. Dimitrova R, Wiium N. Handbook of positive youth development: Advancing the next generation of research, policy and practice in global contexts: Springer; 2021.
Study Protocol
Open access
Published: 11 September 2024

A new patient-reported outcome measure for the evaluation of ankle instability: description of the development process and validation protocol

  • Pietro Spennacchio 1 , 2 ,
  • Eric Hamrin Senorski 3 ,
  • Caroline Mouton 1 , 2 ,
  • Jan Cabri 2 ,
  • Romain Seil 1 , 2 &
  • Jon Karlsson 4  

Journal of Orthopaedic Surgery and Research, volume 19, Article number: 557 (2024)


Acute ankle sprains represent one of the most common traumatic injuries to the musculoskeletal system. Many individuals with these injuries experience unresolved symptoms such as instability and recurrent sprains, leading to chronic ankle instability (CAI), which affects their ability to maintain an active lifestyle. While rehabilitation programs focusing on sensorimotor, neuromuscular, strength and balance training are primary treatments, some patients require surgery when rehabilitation fails. A critical analysis of the patient-reported outcome tools (PROs) used to assess CAI surgical outcomes raises some concerns about their measurement properties in CAI patients, which may ultimately affect the quality of evidence supporting current surgical practice. The aim of this research is to develop and validate a new PRO for the assessment of ankle instability and CAI treatment outcomes, following recent methodological guidelines, with the implicit aim of contributing to the generation of scientifically meaningful evidence for clinical practice in patients with ankle instability.

Following the COnsensus-based Standards for the selection of Health Measurement Instruments (COSMIN), an Ankle Instability Treatment Index (AITI) will be developed and validated. The process begins with qualitative research based on face‒to‒face interviews with CAI individuals to explore the subjective experience of living with ankle instability. The data from the interviews will be coded following an inductive approach and used to develop the AITI content. The preliminary version of the scale will be refined through an additional round of face‒to‒face interviews with a new set of CAI subjects to define the AITI content coverage, relevance and clarity. Once content validity has been examined, the AITI will be subjected to quantitative analysis of different measurement properties: construct validity, reliability and responsiveness.

The development of AITI aims to address the limitations of existing instruments for evaluating surgical outcomes in patients with CAI. By incorporating patient input and adhering to contemporary standards for validity and reliability, this tool seeks to provide a reliable and meaningful assessment of treatment effects.

Trial registration

Not applicable.

Acute ankle ligament injuries are among the most common musculoskeletal injuries in both the general and sports populations [ 14 ]. A significant number of people who have suffered a first ankle ligament injury have unresolved posttraumatic symptoms lasting more than 1 year, such as feelings of instability (33–55%), recurrent episodes of ‘giving way’ and sprains (3–35%), and, in some cases, pain [ 17 , 39 ]. This condition is referred to as chronic ankle instability (CAI), a multifaceted syndrome that is associated with functional and/or structural deficiencies and impaired quality of life and decreased physical activity [ 2 , 15 ].

CAI patients are primarily treated with a comprehensive rehabilitation program that emphasizes ankle sensorimotor, strength and balance training. Rehabilitation has been reported to improve subjective symptoms and functional limitations and reduce the risk of ankle reinjury in CAI patients [ 1 , 9 ]. However, despite prolonged functional rehabilitation, some patients with CAI continue to experience significant activity restrictions due to ankle problems. When rehabilitation fails, surgery appears to be a viable therapeutic option for restoring joint function by targeting and correcting the mechanical deficiency of the injured ankle–ligament complex [ 5 ]. The current scientific literature supports the use of different surgical strategies for treating ankle instability, ranging from anatomical repair of the native ligamentous complex to the use of different graft reconstruction techniques, performed via open surgery, minimally invasive and arthroscopic techniques [ 13 , 22 , 37 ]. Unfortunately, there is poor agreement on the surgical standard of care for CAI, and guidelines for determining the surgeon’s choice are still lacking [ 23 , 41 ].

To compare and select appropriate surgical options for treating CAI properly, a combination of reliable tools, including both patient-related and clinician-generated parameters, must be considered [ 35 ]. Patient-reported outcome measures (PROs) are recognized modalities for accounting for the patient’s perspective on his/her current condition. This subjective view is of primary importance for the evaluation of any given treatment and should also ideally contribute in a positive manner to the clinical decision-making process. The ability of a PRO to produce clinically meaningful data is embodied in the multifaceted concept of validity, which can generally be defined as the ability of the instrument to measure the construct it purports to measure [ 6 ]. However, a critical analysis of the literature reveals that validity is an issue for current PROs used to assess CAI surgical outcomes [ 7 , 20 ], raising some concerns about the quality of the evidence supporting clinical practice in patients with ankle instability.

The primary aim of this research is to address this knowledge gap by developing a new patient-reported outcome tool, following methodological guidelines, specifically designed to assess ankle instability and changes following therapeutic interventions. This study protocol describes the process of developing and validating a new tool to evaluate ankle instability, the Ankle Instability Treatment Index (AITI).

Why a new scale?

The best available evidence about the clinimetric properties of PROs in the specific CAI population suggests the use of the Foot and Ankle Ability Measure (FAAM), the Foot and Ankle Outcome Score (FAOS), and the Karlsson score as the most appropriate PROs for evaluating surgical outcomes in CAI patients [ 7 , 12 , 16 ]. The FAAM and the FAOS were originally conceived as region-specific scores to evaluate functional limitations associated with a variety of foot and ankle problems [ 21 , 31 ]. Only retrospective evidence of validation has been obtained for patients suffering from ankle instability [ 3 , 11 , 30 ]. Moreover, neither PRO specifically assesses symptoms of ankle joint instability, which raises concerns about their ability to tap an essential disease-specific feature representing a primary target of any ankle stabilization procedure [ 40 ].

The Karlsson score was developed in 1991 to assess joint function after treatment for lateral ankle ligament injuries [ 16 ]. Since its inception, the scale has served as a useful tool in research dealing with the treatment of ankle instability, as evidenced by the frequent use of the scale to report the results of CAI surgery [ 34 ]. However, a systematic review published in 2007 on the available PROs in the foot and ankle research area highlighted that the scale lacked evidence on important aspects of validity, such as content validity, reliability and responsiveness [ 20 ]. Since this observation, to the best of the authors’ knowledge, there has been no further analysis of the scale’s validity.

On the basis of these observations, the authors believe that the development of a new PRO for the evaluation of CAI surgical outcome, following the most recent guidelines on PRO properties, is justified by the current state of knowledge.

What should the scale measure?

A focus group consisting of all the authors of this publication (Dr. Pietro Spennacchio, Professor Jon Karlsson, Professor Romain Seil, Dr. Caroline Mouton and Dr. Eric Hamrin Senorski) with recognized expertise and previous publications in the field of ankle instability and outcome tools met initially to discuss the purpose and basic concepts of the new scale. The experts agreed that the main purpose of the project would be to develop an evaluative tool capable of assessing, through direct patient feedback, the symptomatic state of the CAI subject as well as its modification with treatment, according to what is most important to the patient.

The described development procedure adheres to the minimum requirements of validity and reliability as set forth by the latest version of the COnsensus based Standards for the selection of Health Status Measurements INstruments [COSMIN] [ 24 ]. The process of developing the new rating scale is shown in Fig  1 . It is a multistage process that iteratively and interactively involves experts and patients in various qualitative and quantitative stages of development to produce a clinically meaningful scale [ 4 ]. To ensure the development of an instrument with high content validity, the process begins with a qualitative research phase aimed at exploring the subjective feelings and formulations of CAI subjects through individual face-to-face interviews. The qualitative part of the research belongs to the “phenomenology” design type and can be related to the following question: “What do people with chronic ankle instability experience?”, with the aim of allowing participants to provide an insightful perspective on their subjective experience of living with ankle instability [ 18 , 33 ]. The subjective feedback from the CAI subjects will then be used to support the definition of the construct to be assessed by the new scale.

figure 1

Flow diagram showing the multiphase process of AITI development. AITI: Ankle Instability Treatment Index. CAI: Chronic ankle instability. PROs: patient-reported outcomes

Participants and recruitment

The inclusion criteria for participation in the development and validation of the new score are detailed in Table  1 . The clinical diagnosis of chronic ankle instability reported in this study is consistent with the Position Statement on Selection Criteria for CAI subjects in Research defined by the International Ankle Consortium [ 12 ]. Recruitment will be conducted in a single center by a member of the focus group, who is an experienced foot and ankle surgeon (PS). In line with the stated phenomenological qualitative study design, sampling will be carried out via a criterion sampling strategy, with the most prominent criterion being the participant’s experience of the phenomenon of ankle instability, as supported by the diagnostic criteria outlined in Table  1 [ 18 ].

Patient interviews and data collection

Written informed consent will be obtained from all participants before the face-to-face interviews begin. A preliminary list of clinical features of ankle instability, derived from the experience of the developers and the content of PROs commonly used in research dealing with ankle instability, will be defined to provide prior theoretical knowledge that will serve as a testing ground for the information emerging from the interviews. The interviews follow a framework of open-ended questions designed to encourage discussion of the patient’s subjective experience of the different dimensions of the pathology, as well as the change expected from a treatment designed to improve their current condition (Table  2 ). The interviewer will take special care to avoid any specific guidance or influence on the answers, to allow the participant to express his/her own feelings, perceptions and thoughts, using his/her own words as freely as possible.

The qualitative interviews will be conducted, transcribed verbatim and progressively coded by one researcher (PS). The raw data will be analyzed repeatedly from the first interviews onwards via an inductive coding scheme [ 8 ]. The aim is to label the data so that they can be grouped into preliminary categories, allowing the progressive coding of all the content collected during the interviews. The emerging categories will be analyzed for similarities in content and finally grouped into higher categories to establish a preliminary framework of the phenomenon of ankle instability, which comprises the different dimensions of the condition experienced by CAI patients [ 18 ]. The emerging categories and their content will be reported to the focus group. Any missing points suggested by comparisons with existing knowledge and the developers’ experience in treating ankle instability will be explored further with additional questions in subsequent interviews to iteratively configure the conceptual framework of ankle instability of the new scale.

The interviews will continue until saturation is reached, defined as the point at which no additional codes or insights emerge in three consecutive interviews, confirming clear data redundancy. On the basis of practical guidelines and estimates from previous qualitative phenomenological studies, a minimum of 10 face‒to‒face interviews are expected [ 24 , 32 ].

Item generation, scale refinement and content validity

The conceptual framework developed will be used to design the domains and items of the new scale in its preliminary version. The information from the previous interviews will be used to generate relevant items, paying particular attention to the wording spontaneously evoked by the patient to ensure clarity and the patient-reported nature of the instrument.

The preliminary scale will be tested through a new round of face‒to‒face interviews with a minimum of 30 new participants not involved in the previous qualitative interviews who meet the same inclusion criteria, as described in Table  1 [ 24 ]. The purpose of the interviews will be to confirm the clarity of each instruction, item and response option. In the case of any unclear item or wording, the participant will be asked to explain his uncertainties and to suggest modifications that are able to improve the clarity of the question. Any possible missed aspects of the ankle instability construct will be further investigated through dedicated probing to explore the patient’s perspective on the content coverage of the scale. During the interview, a quantitative assessment of the content relevance of the scale will be carried out to confirm the instrument’s ability to analyze what matters to patients diagnosed with CAI [ 24 , 28 ]. The respondents will be asked to rate the relevance of the items on a 4-point scale to calculate the content validity ratio for the item’s relevance and appropriateness of the scaling options [ 19 ].
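As an illustration of the content validity ratio mentioned above, the following sketch applies Lawshe-style scoring to hypothetical rater data; the threshold used to count a rating as "relevant" on the 4-point scale is an assumption, since the protocol does not specify the exact computation.

```python
def content_validity_ratio(ratings, relevant_threshold=3):
    """Lawshe-style CVR: (n_e - N/2) / (N/2), where n_e is the number of raters
    who judge the item relevant (rating >= relevant_threshold on a 4-point scale)."""
    n_total = len(ratings)
    n_relevant = sum(1 for r in ratings if r >= relevant_threshold)
    return (n_relevant - n_total / 2) / (n_total / 2)

# Hypothetical relevance ratings from ten raters for one item (1 = not relevant ... 4 = highly relevant)
item_ratings = [4, 3, 4, 2, 4, 3, 3, 4, 2, 4]
print(f"CVR = {content_validity_ratio(item_ratings):.2f}")  # ranges from -1 to +1
```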

The relevance of the items and the comprehensiveness of the instrument will be further investigated from the perspective of professionals (orthopaedic surgeons and physiotherapists) with established experience in the treatment of ankle instability outside the development team. The AITI, with a dedicated rating form, will be emailed to these professionals, and they will be asked to rate the relevance of each item to the construct of ankle instability. The raters will also be asked to comment on whether any aspects of the construct of instability have been omitted.

How does the instrument work?

After content validity has been examined, the AITI will be subjected to an analysis of different measurement properties, as outlined below.

Construct validity

The construct validity of the AITI will be examined by defining its internal consistency, which is the extent to which the scale items are correlated with each other, thus measuring the same construct and supporting the derivation of a composite score from the sum of the items [ 38 ]. The cohort size required to adequately determine the construct validity will be further defined when preliminary data are available on a sample of 20 CAI patients, to ensure statistical power for each analysis.

The correlation between items will be quantified by calculating Cronbach’s α. Internal consistency between 0.70 and 0.95 is considered acceptable [ 26 ]. To ensure a clear interpretation of the internal consistency statistics, the dimensionality of the scale will be tested with a confirmatory factor analysis [ 25 ]. The number of items making up the scale will also determine the appropriate recruitment size for the internal consistency analysis, with a sample size of at least six times the number of items retained [ 24 ].
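A minimal sketch of the Cronbach's α computation described above, using an invented item-response matrix; the item scores are purely illustrative.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                              # number of items
    item_variances = items.var(axis=0, ddof=1)      # variance of each item across respondents
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of the summed scale score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of six participants to four scale items (1-5 Likert)
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
]
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```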

The construct validity of the AITI will be further explored by testing the hypothesis of an expected relationship with two scores commonly selected by researchers to assess CAI surgical outcomes in a minimum of 50 CAI patients [ 24 ]: the Karlsson scale [ 16 ] and the FAAM sports subscale [ 3 ]. The available evidence for the validation of the comparator instruments in the CAI population supports an expected relationship in the midrange of 0.4–0.8, as defined by the calculation of Pearson’s product-moment correlation coefficients (parametric data) or Spearman’s rank correlation coefficients (nonparametric data) [ 36 ].

Reliability

In addition to the definition of internal consistency described above, the reliability of the AITI will be further investigated by determining test reproducibility and measurement error in a sample size of CAI participants, which will be further defined once preliminary data are available with the new instrument. In accordance with the COSMIN guidelines, a minimum of 50 CAI subjects will be included in this analysis [ 24 ]. Reproducibility (test‒retest reliability) is the extent to which repeated measurements in stable individuals yield similar responses [ 38 ]. Patients participating in this step of validation will complete the new outcome scale twice, with a 10–14-day interval between the two administrations. In line with the definition of a PRO as information that comes directly from patients without interpretation by a clinician [ 27 ], the questionnaire will be administered in a strict self-administered mode without external support, which may introduce bias related to caregiver interpretation.

Evaluation of the test–retest reliability of the scale will be performed by calculating the intraclass correlation coefficient (ICC-agreement) with 95% confidence intervals (CI) [ 10 ]. On the basis of the ICC values, the standard error of measurement (SEM) and the minimal detectable change (MDC) will be calculated.
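The SEM and MDC follow from the ICC and the score variability by standard formulas; the sketch below assumes hypothetical values, since the AITI's actual score distribution is not yet known.

```python
import math

def sem_from_icc(sd, icc):
    """Standard error of measurement: SD of the scores times sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)

def mdc95(sem):
    """Minimal detectable change at the 95% confidence level: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem

# Hypothetical test-retest results: pooled standard deviation and ICC-agreement
sd_scores = 8.5
icc_agreement = 0.88

sem = sem_from_icc(sd_scores, icc_agreement)
print(f"SEM = {sem:.2f} points, MDC95 = {mdc95(sem):.2f} points")
```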

Responsiveness

Responsiveness is defined as the ability of a questionnaire to detect clinically important changes over time, even if these changes are small [ 36 ]. This is a fundamental property for any instrument purporting to evaluate the effect of a therapeutic intervention (evaluative instrument). The instrument’s responsiveness will be the last property to be analyzed, only after all the facets of validity outlined above have been shown to be adequate [ 24 ]. A new group of at least 30 CAI patients [ 24 ] will be assessed with the instrument before and after an ankle stabilization procedure at a minimum follow-up of 1 year, a time point that is expected to show a modification of the patient’s preoperative health state. The effect size (ES) and the standardized response mean (SRM) will be determined as indicators of the ability of the new instrument to detect real changes [ 36 ].
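For illustration, the ES and SRM mentioned above can be computed from paired pre- and postoperative scores as in the following sketch; the scores are hypothetical and do not come from the protocol.

```python
import numpy as np

# Hypothetical pre- and postoperative scores for the same eight patients
pre = np.array([38, 45, 30, 50, 42, 36, 48, 40])
post = np.array([70, 78, 62, 85, 74, 66, 80, 72])
change = post - pre

effect_size = change.mean() / pre.std(ddof=1)                     # ES: mean change / SD of baseline scores
standardized_response_mean = change.mean() / change.std(ddof=1)   # SRM: mean change / SD of change scores

print(f"ES = {effect_size:.2f}, SRM = {standardized_response_mean:.2f}")
```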

The direct patient perspective on a given treatment, captured through valid and reliable PROs, is considered an essential outcome for generating the data necessary to incorporate effective and meaningful treatment strategies into clinical practice [ 29 ]. The authors note that the existing evidence on CAI surgical outcomes is mainly based on PROs with limited evidence of validity, which casts doubt on the consistency and reliability of the data supporting current treatment algorithms [ 20 , 34 ].

This study protocol describes the process of developing and validating a new disease-specific patient-reported tool for the evaluation of ankle instability treatment, the AITI. The focus on patient input in defining scale content and adherence to the latest consensus-based standards for PRO validity and reliability represent the strategy for developing an instrument with appropriate measurement properties in CAI patients. The authors believe that this process is a necessary step in the search for scientifically sound data ensuring a reliable, evidence-based standard of care for patients suffering from ankle instability.

Data availability

No datasets were generated or analysed during the current study.

Abbreviations

AITI: Ankle Instability Treatment Index
CAI: Chronic Ankle Instability
ES: Effect Size
FAAM: Foot and Ankle Ability Measure
FAOS: Foot and Ankle Outcome Score
ICC: Intraclass Correlation Coefficient
MDC: Minimal Detectable Change
PROs: Patient-reported Outcome Measures
SEM: Standard Error of Measurement
SRM: Standardized Response Mean

Al Attar WSA, Bakhsh JM, Khaledi EH, Ghulam H, Sanders RH. Injury prevention programs that include plyometric exercises reduce the incidence of anterior cruciate ligament injury: a systematic review of cluster randomised trials. J Physiother. 2022;68:255–61.


Arnold BL, Wright CJ, Ross SE. Functional ankle instability and health-related quality of life. J Athl Train. 2011;46:634–41.


Carcia CR, Martin RL, Drouin JM. Validity of the foot and ankle ability measure in athletes with chronic ankle instability. J Athl Train. 2008;43:179–83.

Creswell JW, Plano Clark VL. Designing and conducting mixed methods research. 2nd edn. Los Angeles (CA): Sage; 2011.

de Vries JS, Krips R, Sierevelt IN, Blankevoort L, van Dijk CN. Interventions for treating chronic ankle instability. Cochrane Database Syst Rev. 2011. https://doi.org/10.1002/14651858.CD004124.pub3

Deshpande PR, Rajan S, Sudeepthi BL, Abdul Nazir CP. Patient-reported outcomes: a new era in clinical research. Perspect Clin Res. 2011;2:137–44.

Eechaute C, Vaes P, Van Aerschot L, Asman S, Duquet W. The clinimetric qualities of patient-assessed instruments for measuring chronic ankle instability: a systematic review. BMC Musculoskelet Disord. 2007;8:6.

Elo S, Kyngas H. The qualitative content analysis process. J Adv Nurs. 2008;62:107–15.

Fakontis C, Iakovidis P, Kasimis K, Lytras D, Koutras G, Fetlis A, et al. Efficacy of resistance training with elastic bands compared to proprioceptive training on balance and self-report measures in patients with chronic ankle instability: a systematic review and meta-analysis. Phys Ther Sport. 2023;64:74–84.

Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Evaluating patient-based outcome measures for use in clinical trials. Health Technol Assess. 1998;2:i–iv.


Goulart Neto AM, Maffulli N, Migliorini F, de Menezes FS, Okubo R. Validation of foot and ankle ability measure (FAAM) and the foot and ankle outcome score (FAOS) in individuals with chronic ankle instability: a cross-sectional observational study. J Orthop Surg Res. 2022;17:38.

Gribble PA, Delahunt E, Bleakley CM, Caulfield B, Docherty CL, Fong DT, et al. Selection criteria for patients with chronic ankle instability in controlled research: a position statement of the International Ankle Consortium. J Athl Train. 2014;49:121–7.

Guelfi M, Zamperetti M, Pantalone A, Usuelli FG, Salini V, Oliva XM. Open and arthroscopic lateral ligament repair for treatment of chronic ankle instability: a systematic review. Foot Ankle Surg. 2018;24:11–8.

Herzog MM, Kerr ZY, Marshall SW, Wikstrom EA. Epidemiology of ankle sprains and chronic ankle instability. J Athl Train. 2019;54:603–10.

Houston MN, Hoch JM, Hoch MC. Patient-reported outcome measures in individuals with chronic ankle instability: a systematic review. J Athl Train. 2015;50:1019–33.

Karlsson J, Peterson L. Evaluation of the ankle joint function: the use of a scoring scale. The Foot. 1991:15–9.

Kemler E, Thijs KM, Badenbroek I, van de Port IG, Hoes AW, Backx FJ. Long-term prognosis of acute lateral ankle ligamentous sprains: high incidence of recurrences and residual symptoms. Fam Pract. 2016;33:596–600.

Korstjens I, Moser A. Series: practical guidance to qualitative research. Part 2: context, research questions and designs. Eur J Gen Pract. 2017;23:274–9.

Lynn MR. Determination and quantification of content validity. Nurs Res. 1986;35:382–5.

Martin RL, Irrgang JJ. A survey of self-reported outcome instruments for the foot and ankle. J Orthop Sports Phys Ther. 2007;37:72–84.

Martin RL, Irrgang JJ, Burdett RG, Conti SF, Van Swearingen JM. Evidence of validity for the foot and ankle ability measure (FAAM). Foot Ankle Int. 2005;26:968–83.

Matsui K, Burgesson B, Takao M, Stone J, Guillo S, Glazebrook M, et al. Minimally invasive surgical treatment for chronic ankle instability: a systematic review. Knee Surg Sports Traumatol Arthrosc. 2016;24:1040–8.

Michels F, Pereira H, Calder J, Matricali G, Glazebrook M, Guillo S, et al. Searching for consensus in the approach to patients with chronic lateral ankle instability: ask the expert. Knee Surg Sports Traumatol Arthrosc. 2018;26:2095–102.

Mokkink LB, Prinsen CA, Patrick DL, Alonso J, Bouter LM, de Vet HC, Terwee CB. COSMIN study design checklist for patient-reported outcome measurement instruments. Amsterdam, The Netherlands; 2019. p. 1–32.

Mokkink LB, Terwee CB, Knol DL, Stratford PW, Alonso J, Patrick DL, et al. The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: a clarification of its content. BMC Med Res Methodol. 2010;10:22.

Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, et al. The COSMIN study reached international consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes. J Clin Epidemiol. 2010;63:737–45.

Patrick DL, Burke LB, Powers JH, Scott JA, Rock EP, Dawisha S, et al. Patient-reported outcomes to support medical product labeling claims: FDA perspective. Value Health. 2007;10(Suppl 2):S125–137.

Polit DF, Beck CT. The content validity index: are you sure you know what’s being reported? Critique and recommendations. Res Nurs Health. 2006;29:489–97.

Porter ME. What is value in health care? N Engl J Med. 2010;363:2477–81.

Roos EM, Brandsson S, Karlsson J. Validation of the foot and ankle outcome score for ankle ligament reconstruction. Foot Ankle Int. 2001;22:788–94.

Roos EM, Roos HP, Lohmander LS, Ekdahl C, Beynnon BD. Knee Injury and Osteoarthritis Outcome Score (KOOS) – development of a self-administered outcome measure. J Orthop Sports Phys Ther. 1998;28:88–96.

Sandelowski M. Theoretical Saturation. 2008.

Sierakowski K, Dean NR, Pusic AL, Cano SJ, Griffin PA, Bain GI, et al. International multiphase mixed methods study protocol to develop a cross-cultural patient-reported outcome and experience measure for hand conditions (HAND-Q). BMJ Open. 2019;9:e025822.

Spennacchio P, Meyer C, Karlsson J, Seil R, Mouton C, Senorski EH. Evaluation modalities for the anatomical repair of chronic ankle instability. Knee Surg Sports Traumatol Arthrosc. 2020;28:163–76.

Spennacchio P, Seil R, Mouton C, Scheidt S, Cucchi D. Anatomic reconstruction of lateral ankle ligaments: is there an optimal graft option? Knee Surg Sports Traumatol Arthrosc. 2022. https://doi.org/10.1007/s00167-022-07071-7 .

Streiner DL, Norman GR, Cairney J. Health measurement scales: a practical guide to their development and use. 5th edn. Oxford: Oxford University Press; 2015.

Takao M, Oae K, Uchio Y, Ochi M, Yamamoto H. Anatomical reconstruction of the lateral ligaments of the ankle with a Gracilis autograft: a new technique using an interference fit anchoring system. Am J Sports Med. 2005;33:814–23.

Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, et al. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol. 2007;60:34–42.

van Rijn RM, van Os AG, Bernsen RM, Luijsterburg PA, Koes BW, Bierma-Zeinstra SM. What is the clinical course of acute ankle sprains? A systematic literature review. Am J Med. 2008;121:324–31. e326.

Vuurberg G, Kluit L, van Dijk CN. The Cumberland Ankle Instability Tool (CAIT) in the Dutch population with and without complaints of ankle instability. Knee Surg Sports Traumatol Arthrosc. 2018;26:882–91.

Wilke AJ, Martin R, Bates NA, Jastifer JR, Martin KD. Technique variation in the surgical treatment of lateral ankle instability. Foot Ankle Spec. 2023. https://doi.org/10.1177/19386400231202029.


Acknowledgements

Author information

Authors and Affiliations

Department of Orthopaedic Surgery, Centre Hospitalier de Luxembourg – Clinique d’Eich, 78 Rue d’Eich, Luxembourg, L-1460, Luxembourg

Pietro Spennacchio, Caroline Mouton & Romain Seil

Luxembourg Institute of Research in Orthopaedics, Sports Medicine and Science (LIROMS), Luxembourg, Luxembourg

Pietro Spennacchio, Caroline Mouton, Jan Cabri & Romain Seil

Unit of Physiotherapy, Department of Health and Rehabilitation, Institute of Neuroscience and Physiology, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden

Eric Hamrin Senorski

Department of Orthopaedics, Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden

Jon Karlsson


Contributions

All authors contributed to the study protocol. PS wrote this study protocol manuscript with assistance from JK. All authors read and approved the final version.

Corresponding author

Correspondence to Pietro Spennacchio.

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

Reprints and permissions

About this article

Cite this article.

Spennacchio, P., Senorski, E.H., Mouton, C. et al. A new patient-reported outcome measure for the evaluation of ankle instability: description of the development process and validation protocol. J Orthop Surg Res 19 , 557 (2024). https://doi.org/10.1186/s13018-024-05057-4

Download citation

Received: 24 June 2024

Accepted: 03 September 2024

Published: 11 September 2024

DOI: https://doi.org/10.1186/s13018-024-05057-4


Keywords

  • Chronic ankle instability
  • Patient-reported outcomes
  • Surgical treatment
  • Scale development

Journal of Orthopaedic Surgery and Research

ISSN: 1749-799X

Measuring the Validity and Reliability of Research Instruments

Nor Lisa Sulaiman

2015, Procedia - Social and Behavioral Sciences

Related Papers

Procedia - Social and Behavioral Sciences

Othman Jaafar


Journal of Counseling and Educational Technology

Izwah Ismail

Questionnaire II (Student) was developed to obtain feedback from student respondents for the evaluation of the Diploma in Mechatronics Engineering programme at Malaysian polytechnics against industrial requirements. This study was conducted to produce empirical evidence of the validity and reliability of Questionnaire II (Student) using the Rasch measurement model. A pilot study was conducted at the Department of Mechanical Engineering, Polytechnic Kota Kinabalu, Sabah, on 38 students in the final semester of the Diploma in Mechatronic Engineering programme. Validity and reliability were assessed with the Rasch measurement model in Winsteps version 3.69.1.11. The Rasch analysis showed a respondent reliability index of 0.97 and an item reliability index of 0.91. In terms of item polarity, every item contributes to the measurement, since each item's PTMEA CORR is above 0.30, ranging from 0.30 to 0.81. The appropriateness test shows...
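
As a hedged illustration of the polarity check described in this abstract, the sketch below approximates the point-measure correlation (PTMEA CORR) by correlating each item's responses with a person-score proxy; the real statistic is computed from the Rasch person measures estimated in Winsteps, so the simulated data and the total-score proxy used here are assumptions for demonstration only.

```python
# Rough illustration of an item-polarity screen: correlate each item with a
# proxy for the person measure and flag items below the 0.30 threshold.
# Winsteps uses the estimated Rasch measures; total scores stand in here.
import numpy as np

rng = np.random.default_rng(3)
responses = rng.integers(1, 6, size=(38, 20)).astype(float)  # 38 students, 20 Likert items (simulated)
person_proxy = responses.sum(axis=1)                          # stand-in for the Rasch person measure

ptmea = np.array([
    np.corrcoef(responses[:, j], person_proxy)[0, 1] for j in range(responses.shape[1])
])
flagged = np.where(ptmea < 0.30)[0]        # items below the 0.30 polarity threshold
print(f"lowest PTMEA CORR = {ptmea.min():.2f}; items flagged: {flagged.tolist()}")
```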

Journal of Educational Research and Evaluation

Habib M M Adi

NOR ZATUL-IFFA ISMAIL

Abdul Rahim

This study was conducted to analyze the test instrument used to measure students' ability on the odd-semester final examination in mathematics. A purposive sampling technique was used, yielding 67 students. The test consisted of 40 multiple-choice items covering the odd-semester final examination material, and the data were analyzed with quantitative descriptive analysis. The Rasch model, applied with Winsteps 3.73 software, was used to identify fitting items. From the Winsteps output, 35 items fit the Rasch model, with average Outfit MNSQ values for persons and items of 1.09 and 1.09, respectively, and Outfit ZSTD values of -0.1 and -0.2. The instrument reliability, expressed as Cronbach's alpha, was 0.77.
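
Of the statistics quoted in this abstract, only Cronbach's alpha is straightforward to reproduce without Rasch software; the sketch below (simulated dichotomous responses for 67 examinees and 35 items, hypothetical names) shows the standard alpha formula, while the Outfit MNSQ/ZSTD fit statistics would require a dedicated Rasch package such as Winsteps.

```python
# Minimal sketch of the Cronbach's alpha calculation mentioned above.
# The score matrix is simulated; only the alpha formula is demonstrated.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: one row per respondent, one column per test item (scored 0/1 here)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()     # sum of the item variances
    total_var = items.sum(axis=1).var(ddof=1)       # variance of the total scores
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(2)
ability = rng.normal(size=67)                                   # 67 examinees (simulated)
difficulty = rng.normal(size=35)                                # 35 retained items (simulated)
p_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty)))  # Rasch-style probabilities
responses = (rng.random((67, 35)) < p_correct).astype(float)    # dichotomous responses

print(f"alpha = {cronbach_alpha(responses):.2f}")
```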

Dr. Mamun Albnaa

Measurement theories are important to practice in educational measurement because they provide a background for addressing measurement problems. One of the most important problems is dealing with measurement error. A good theory helps in understanding the role errors play in measurement, both (a) in evaluating an examinee's ability while minimizing error and (b) in estimating correlations between variables. Two theories address measurement problems such as test construction and the identification of biased test items: Classical Test Theory (CTT) and Item Response Theory (IRT). Because of a number of problems associated with classical test theory that cause inaccuracy in results, a need arose to develop methods of measuring behaviour in a manner consistent with physical measurement, based on a philosophy and set of assumptions that ensure the quality of these methods and the acceptance of their results with a high degree of confidence. Many studies by professionals interested in behavioural measurement aimed to overcome some of these problems, and these efforts resulted in the emergence of Item Response Theory. IRT is a statistical theory about items, test performance, and the abilities measured by items. Item responses can be discrete or continuous and can be dichotomous, and item score categories can be ordered or unordered. One ability may underlie a test, and there are many models in which the relationship between item responses and the underlying ability can be specified. Within IRT, many models have been applied to test data, the best known being the Rasch model. In this paper, both Classical Test Theory and Item Response Theory are described in relation to approaches for measuring validity and reliability, with the intent of providing a comparison of the two theories.

Asia Proceedings of Social Sciences

Mazlili Suhaini

Malaysia is a developing country that has undergone rapid economic development over the past five decades. For a developing country with a rapidly growing population, providing citizens with comprehensive and up-to-date knowledge is crucial, particularly through vocational training. A number of vocational and technical training programmes have been developed; however, the success of vocational education relies on the instructors' or teachers' approach to achieving its goals. It is important to create appropriate methods that take students' learning styles into consideration in order to obtain better outcomes. The purpose of this paper is therefore to develop a vocational learning styles instrument. Empirical evidence on the validity and reliability of the modified items is provided. A survey was distributed to 57 Electrical Technology students. The Rasch measurement model was used to examine the functioning of the items and to determine item and respondent reliability and index...

Zuhaira Zain

Examinations are used extensively as an assessment tool to measure students' academic performance in most higher education institutions in the KSA. A well-constructed set of items on midterm and final examinations can measure both students' academic performance and their cognitive skills. We adopt the Rasch model to evaluate the reliability and quality of the first midterm examination questions for an Object-Oriented Design course. The results showed that the reliability and quality of the constructed examination questions were relatively good and calibrated with the students' learned ability. Keywords: Rasch model, item construction, reliability, quality, students' academic performance, information systems, Bloom's taxonomy

Education Research International

Amir Mohamed Talib

This paper describes a measurement model used to measure student performance in the final examination of the Information Technology (IT) Fundamentals (IT280) course in the IT Department, College of Computer & Information Sciences (CCIS), Al-Imam Mohammad Ibn Saud Islamic University (IMSIU). The assessment model is developed from the final examination marks of second-year IT students, which are compiled and tabulated for evaluation using the Rasch measurement model, and it can be used to measure students' performance in the course's final examination. A study of 150 second-year students (male = 52; female = 98) was conducted to measure students' knowledge and understanding of the IT280 course according to the three levels of Bloom's taxonomy. The results showed that students can be categorized as poor (10%), moderate (42%), good (18%), and successful (24%) in achieving Level 3 of Bloom's taxonomy. This study shows that...

zunaira fatima

The specific purpose of the research was to construct an achievement test in the area of philosophy of education for master's-level students in universities of Punjab (Bahauddin Zakariya University Multan, The Islamia University of Bahawalpur and University of Sargodha, Sargodha). The test comprises 60 multiple-choice items selected from an item bank constructed by the researcher. It was administered to 231 randomly selected male and female M.A. Education and M.Ed. students. Data were analysed with the Rasch model. As a result of the Rasch calibration, three items were discarded from the test. A figure of the latent continuum showing the positions of items and persons was produced. The study suggests that the number of items should be increased to cover the whole syllabus, and that a larger sample should be taken so that item analysis through the Rasch model can show its capability.



