Why IQ test scores can differ: Applied Psychometrics 101 Report #1--Understanding global IQ test correlations

Announcing Applied Psychometrics 101: IQ Test Score Difference Series--#1 Understanding global IQ test correlations. (click here to view and/or download)

Today I'm announcing the first in what I hope will be a series of brief applied psychometrics reports. The goal of this project is to explain basic psychometric issues to help professionals and the public better understand psychological measurement, IQ testing, etc. Above is the title of the first report (and a link where it can be accessed). Below is the abstract, followed by some thoughts and questions the report might generate. This report and future reports are accessible via the "Applied Psychometrics 101 (AP101) Reports" section on the sidebar of this blog.

Despite reported evidence of strong concurrent correlations among IQ tests (concurrent validity), different IQ tests often produce different IQ scores for the same individual. This may be due to a number of factors. Prior to discussing the various factors, one must first understand the basic language of typical IQ-IQ comparison research. In the first of this series, IQ-IQ test correlations are explained. Statistically significant high correlations between different IQ tests, although providing strong concurrent validity evidence for tests, do not guarantee similar or identical IQ scores for all individuals tested.
Blogmaster comments

After reading the report, I would encourage readers to come back and reflect on the comments below. I would like to thank Dr. Dale Watson for comments on an earlier draft of the report. Almost all of the ideas below are thoughts he shared (and that I had been contemplating) after reading the report. I will shortly be adding Dr. Watson to the "experts" blog roll on the blog sidebar.

Some thoughts for consideration after reading "AP101: IQ Score Difference Series--#1 Understanding global IQ test correlations"

I (the blogmaster) assume that most laypersons and, more importantly, agencies that have developed strict prescriptive guidelines for IQ cut scores for service eligibility and/or life-or-death decisions (e.g., the U.S. Supreme Court Atkins ruling that no one with intellectual disabilities/mental retardation can be executed) are unaware of the variability in IQ scores that can arise simply from using different IQ tests (see report). Based on the IAP AP101 report, one should conclude that the selection of which IQ test to administer (to determine if an individual is mentally retarded--esp. mild MR) can be a life-or-death decision (i.e., Atkins death penalty cases)! Furthermore, given the adversarial nature of court hearings/trials and due process hearings, it is clear that a wide variety of questions could arise regarding how to determine which test battery is the "best" measure of intelligence (when different IQ tests used by different psychologists and experts produce significantly different scores). A few examples are listed below:

1. What does “best” mean? Is “best” relative to the purpose for the testing (e.g., best for service eligibility; best for developing instructional education programs; best for making a formal legal diagnosis, etc.)? Might certain IQ tests be “best” for certain purposes and other IQ tests “best” for other purposes? Is it possible for one test to be “best” for all purposes, all ages, all cultures, etc.?

2. Could a scenario occur where the courts request that a standard be used to identify the potentially “best” test when IQ-IQ differences are reported by experts? This raises extremely complex questions. For example:
  • Does the popularity of an IQ measure determine which IQ test battery is “best”? Historically the Wechsler series of tests has been considered the “gold standard of IQ tests” largely because of its popularity. Does popularity + more sales = “best”?
  • Can (should) an empirical standard be developed? Is it even possible?
  • Some in the field of intelligence testing have suggested that an IQ test's g (general intelligence) saturation (“g-ness”: the amount of variance attributed to the first principal component extracted in a principal component analysis, or PCA) is a good criterion.
  • Or, is the “best” IQ indicator a composite score that differentially weights the tests in the global IQ score as determined by PCA?
  • Or, is the “best” IQ battery one that only includes tests that have high g-ness?
  • Or, is the “best” IQ battery the one that provides the broadest coverage of the major cognitive abilities established by the most accepted psychometric model of intelligence?

3. Is the amount of g-ness measured in an individual central to the definition of intelligence and/or different diagnostic categories (MR, LD, gifted, etc.)? Is g-ness more central to a diagnosis of mental retardation and less (or equally) relevant to a diagnosis of specific learning disability?

4. If it were even possible (which the current author doubts) to establish a consensus on a “best-ness” criterion, would assessment personnel be required to administer the so-designated test? Who would make the judgment regarding which test battery (or batteries) is the best: would it be in the hands of individual psychologists, professional associations, a judge, or…..?
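
To make the g-saturation criterion raised in question 2 more concrete, here is a minimal sketch of how the variance attributed to the first principal component might be computed. The correlation matrix is entirely hypothetical (it is not taken from any real IQ battery), and power iteration is used as a simple stand-in for a full PCA routine so the example needs only the Python standard library.

```python
import math

# Hypothetical correlations among four subtests (illustrative values only).
R = [
    [1.00, 0.60, 0.50, 0.40],
    [0.60, 1.00, 0.55, 0.45],
    [0.50, 0.55, 1.00, 0.35],
    [0.40, 0.45, 0.35, 1.00],
]

def multiply(matrix, vector):
    """Matrix-vector product for plain Python lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def first_principal_component(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = multiply(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    # The Rayleigh quotient gives the eigenvalue for the converged unit vector.
    eigenvalue = sum(v * p for v, p in zip(vector, multiply(matrix, vector)))
    return eigenvalue, vector

eigenvalue, vector = first_principal_component(R)

# A correlation matrix has total variance equal to the number of variables.
variance_share = eigenvalue / len(R)
# Loading of each subtest on the first component.
loadings = [v * math.sqrt(eigenvalue) for v in vector]

print(f"Variance attributed to first component: {variance_share:.1%}")
print("Subtest loadings:", [round(x, 2) for x in loadings])
```

Two batteries could be compared on `variance_share`, but whether that number should define “best” is exactly the open question posed above.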

It is hoped the above-cited IAP AP101 report has clarified the reality that different IQ tests will often provide different IQ scores for the same individual. IQ-IQ score differences will occur, and if the IQ tests are properly administered to a cooperative individual, the resulting differences are reliable and valid. Psychometrically sound IQ-IQ score differences can arise from a number of factors, which will be explored in future reports in the Applied Psychometrics 101 series. The potential policy implications, as briefly illustrated by the above set of hypothetical questions, are many and complex, and will not have easy answers. There may not be a suitable answer, and the use of IQ scores in legal and/or adversarial settings may need to become more nuanced (i.e., allow for more expert interpretation of the meaning of IQ test scores and IQ-IQ difference scores) and less rigid and prescriptive.
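
As a rough illustration of why a high correlation still leaves room for sizable score differences, the sketch below uses the standard result that the difference between two scores with the same standard deviation (SD) and correlation r has an SD of SD × sqrt(2 × (1 − r)). It assumes both tests use the familiar mean-100, SD-15 metric, that they have the same mean, and that differences are normally distributed. The correlations shown are illustrative values, not figures from the report.

```python
import math

def difference_sd(r, sd=15.0):
    """SD of the difference between two scores that share the same SD and correlate at r."""
    return sd * math.sqrt(2.0 * (1.0 - r))

def share_differing_by(points, r, sd=15.0):
    """Expected share of people whose two scores differ by at least `points`,
    assuming the difference is normally distributed with mean zero."""
    z = points / difference_sd(r, sd)
    # Two-tailed normal probability: P(|D| >= points) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2.0))

# Illustrative correlations only; not values taken from the report.
for r in (0.70, 0.80, 0.90):
    print(f"r = {r:.2f}: SD of difference = {difference_sd(r):.1f}, "
          f"share differing by 10+ points = {share_differing_by(10, r):.0%}")
```

Under these assumptions, even at r = .80 the SD of the difference scores is about 9.5 points, so differences of several points in either direction would be expected for many individuals.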

The issues raised in the report do not reflect problems in the state-of-the-art of psychometrically sound IQ tests, but in the use (and misuse) of IQ test scores to make important decisions about individuals and to create public policy and law.

  1. More a question than a comment, but to what extent does considering the confidence interval (CI) of each IQ score (at a rigorous level, like 95%) address the problem? That is (if my understanding is correct), where the two bands overlap, are the scores within the bands statistically the same and not interpretable as separate entities?

    Karis Post, CAGS
    School Psychologist
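
To make the question in the comment above concrete, here is a minimal sketch of how two 95% bands might be computed and checked for overlap. It assumes the common textbook approach of centering the band on the obtained score and using the standard error of measurement, SEM = SD × sqrt(1 − reliability). The scores and reliability values are hypothetical, and the sketch does not settle whether overlapping bands mean the two scores should be treated as equivalent.

```python
import math

def confidence_band(score, reliability, sd=15.0, z=1.96):
    """Approximate 95% band around an obtained score, using the SEM."""
    sem = sd * math.sqrt(1.0 - reliability)
    return (score - z * sem, score + z * sem)

def bands_overlap(band_a, band_b):
    """True if the two (low, high) intervals share at least one point."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

# Hypothetical scores and reliabilities for two different IQ tests.
band_a = confidence_band(68, reliability=0.97)
band_b = confidence_band(76, reliability=0.95)

print(f"Test A band: {band_a[0]:.1f} to {band_a[1]:.1f}")
print(f"Test B band: {band_b[0]:.1f} to {band_b[1]:.1f}")
print("Bands overlap:", bands_overlap(band_a, band_b))
```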