Validation & Norming
Have the NIH Toolbox tests been validated?
Though tests are often described as being “valid” or having been “validated,” what matters most is understanding how the test will be used (its purpose for the user). It is the use of the scores, not the test itself, that must be considered when evaluating whether a test is valid for its intended purpose. As a simple example, scores on a test of math computation would not likely be valid for diagnosing a reading disability, but they could be valid for identifying students who could benefit from math tutoring. The math test has been “validated,” but only for one of the two purposes listed. The same can be said for NIH Toolbox tests. The role of validity analyses is to collect a body of evidence that reveals for what purposes and with whom the scores on tests are valid.
What evidence is there for the validity of NIH Toolbox tests?
Substantial qualitative and quantitative evidence has been gathered that supports the validity of NIH Toolbox tests. It is important to remember that the NIH Toolbox encompasses a broad set of tests across many functional domains, including Cognition, Motor, Sensation, and Emotion. Validity evidence for these tests has been gathered in ways that are appropriate for the type and mode of measurement.
Content Validity
As Anastasi stated, “content validity is built into a test from the outset through the choice of appropriate items.”1 Although the NIH Toolbox comprises a variety of tests covering a broad range of functioning and using many unique types of questions and performance tasks, a common approach to establishing content validity was used. For each NIH Toolbox test across every domain and content area, panels of experts were convened to make recommendations about the appropriate content to be assessed. Moreover, a systematic review of the literature was conducted in each domain and content area to identify relevant and appropriate constructs for measurement with the NIH Toolbox. Once this content “blueprint” was in place, the expert panels developed new (or identified existing) items or tasks to build each test. Additional external subject matter experts were sought during the test development process to vet the items and tasks for quality, for their match to the intended constructs, and for their appropriateness for important subgroups (e.g., young children, older adults), and to ensure that items did not unintentionally disadvantage any population subgroups or raise sensitivity concerns. Once items/tasks for tests were selected or developed, each test underwent significant additional evaluation for validity. Details on content validity for each NIH Toolbox test have been published.
1 Anastasi A, 1988. Psychological Testing, New York, Macmillan Publishing Company, pp. 122-127.
Construct Validity Evidence
In selected cases, NIH Toolbox adopted an existing test for a specific content area. In such cases, strong evidence of the test’s validity for use in large-scale research or clinical trials had been established and published. In other cases, where NIH Toolbox content was newly developed or significantly adapted from existing items, formal studies were conducted to evaluate each test’s construct validity. Concurrent validity is typically established by comparing a carefully drawn sample’s performance on a new test (“experimental” test) with the same sample’s performance on other, well-established tests of the same construct(s) (sometimes referred to as “gold standard” tests because of their common use and acceptance). If one can establish that the new test is sufficiently correlated with a “gold standard,” one can reasonably assume that the new test can also be used effectively with the population(s) on which the well-established test was used. NIH Toolbox tests performed well in such validity evaluations. Additional construct validity evidence for NIH Toolbox tests varies by test and domain, but includes factor analytic studies and comparisons of group performance by age (to ensure expected trajectories of performance). Detailed descriptions of NIH Toolbox validation studies have been published for all domains and content areas. In addition, a number of NIH Toolbox tests have even more published documentation of utility and validation for different ages in the normal, community-dwelling population. Examples of tests that have collected such evidence include the NIH Toolbox Grip Strength Test, NIH Toolbox Flanker Inhibitory Control and Attention Test and Dimensional Change Card Sort Test, and the NIH Toolbox Dynamic Visual Acuity Test.
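The comparison against a “gold standard” described above usually comes down to correlating two sets of scores from the same sample. The following is a minimal sketch of that calculation, not the procedure used in the NIH Toolbox studies. The function name pearson_r and the score lists are hypothetical and invented for illustration only. What counts as “sufficiently correlated” depends on the construct and the intended use, so no cutoff is assumed here.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical, made-up scores for the same eight participants.
# These are NOT NIH Toolbox data.
new_test = [98, 105, 87, 112, 94, 101, 120, 90]
gold_standard = [100, 108, 85, 110, 97, 99, 118, 92]

r = pearson_r(new_test, gold_standard)
print(f"Correlation between new test and gold standard: r = {r:.2f}")
```

A real study would use a far larger, carefully drawn sample and would report the correlation alongside other evidence, such as the factor analytic and age-trajectory analyses mentioned above.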
Clinical Validity Evidence
Some initial studies have gathered additional validity evidence with clinical or rehabilitation population samples for a number of NIH Toolbox tests. These studies are important to ensure that scores can be reasonably interpreted in these target populations as well. For example, significant work has been done to validate the NIH Toolbox for use with people who have sustained a traumatic brain injury. A number of studies have been published that establish the validity of specific NIH Toolbox tests in special, targeted groups.
V2 – Validation for iPad
Validation studies were conducted for all NIH Toolbox tests to ensure that the iPad app met rigorous scientific standards. Studies were conducted across the entire age range and typically included 450-500 subjects, and results were statistically compared against “gold standard” tests whenever available. For tests using Item Response Theory approaches to scoring, calibration samples generally included several thousand participants, ensuring robust models. In total, data were collected from more than 16,000 subjects as part of field-test, calibration, and validation activities.
V3 – Renorming and validation of select tests; addition of new tests
The NIH Toolbox Cognition tests and the Balance test have been re-normed against the 2020 U.S. Census. A validation study has been conducted against “gold standard” tests and with clinical groups.
Making the Case for Using the NIH Toolbox
If you are considering one or more NIH Toolbox tests (or domain batteries) for use in a clinic or a study, there is substantial evidence to evaluate. The questions you ask about that evidence can also serve as the framework for justifying your choice to others (e.g., administrators, granting agencies). Here are some questions you should consider:
- Why is it important to test this construct or these constructs in my study or clinic? Describe the relevance of the symptom or outcome to the population of interest.
- What psychometric evidence has accumulated from use of this test in my target population? If you are not able to find a study in your population, weigh the evidence that exists for the test across populations.
- What are the alternatives to the NIH Toolbox tests? State clearly why you believe an NIH Toolbox test is a good choice, particularly for your population and for your particular purpose. Remember, validity resides in the use of the scores.