PSYCHOMETRIKA, VOL. 11, NO. 2
JUNE, 1946

THE TEST-RETEST RELIABILITY OF QUALITATIVE DATA

LOUIS GUTTMAN
CORNELL UNIVERSITY

The test-retest reliability of qualitative items, such as occur in achievement tests, attitude questionnaires, public opinion surveys, and elsewhere, requires a different technique of analysis from that of quantitative variables. Definitions appropriate to the qualitative case are made both for the reliability coefficient of an individual on an item and for the reliability coefficient of a population on the item. From but a single trial of a large population on the item, it is possible to compute a lower bound to the group reliability coefficient. Two kinds of lower bounds are presented. From two experimentally independent trials of the population on the item, it is possible to compute an upper bound to the group reliability coefficient. Two upper bounds are presented. The computations for the lower and upper bounds are all very simple. Numerical examples are given.

1. Introduction

An item is said to be unreliable for a person to the extent to which his response to it would vary in repeated experiments under the same conditions. For quantitative items, the method of studying the variation has been to use arithmetic means as the "true" responses and variances as measures of dispersion. But if an item is qualitative, the only average (of those commonly in use) that it can have is the mode, and a measure of variation is the relative frequency of the non-modal values.* Therefore, the study of test-retest reliability of qualitative items requires a different technique from that of quantitative items. The present paper is devoted to presenting appropriate definitions and technique for the qualitative case.

Qualitative items are of considerable importance for the social and psychological sciences. Questions in an achievement test are qualitative dichotomies when answers are either "right" or "wrong."
If numerical values are assigned these two categories and are added up over a set of items to yield a total test score,† then the reliability of the total score should, of course, be studied by the technique of quantitative variables; but this does not change the fact that the original items were qualitative and can have their separate reliabilities studied by the technique to be developed here. In attitude and opinion research, the reliability of each qualitative item by itself is often of interest, so that the qualitative technique is especially appropriate in this field. In general, the technique is appropriate in any study where qualitative data are enumerated, be they census returns, the eye colors of fruit flies, commercial commodities, etc.

* For a discussion of the distinction between qualitative and quantitative variables with respect to prediction and correlation, see (2). An adaptation of (2) is also available in Guilford's textbook (1, Chap. 10).
† The problem of whether a total score is meaningful is not raised here; that is the problem of scale analysis (3). The analysis of test-retest reliability remains the same, regardless of the meaning of items or total scores.

In Part I of this paper is given a description of the definitions used, and illustrations of applications of the qualitative technique are presented. The actual derivations and proofs are developed in Part II.

PART I: APPLICATIONS

2. Definitions and Terminology

Consider a universe of indefinitely many trials of a person on an item which has $m$ categories. By the probability that the person will fall in a particular category, we mean the proportion of trials in which this will happen for him. The category with the highest probability will be called the modal value for the person, and the highest probability itself will be called the modal probability.
If the modal probability is attained by two or more categories, then any one of these can arbitrarily be selected as the modal value. The modal value plays the role in the qualitative theory that the expected or mean value plays in the quantitative theory; and the complement of the modal probability plays the role of the error variance.

If all the categories have the same probability for the person (this common probability is then the modal probability and is equal to $1/m$), then the item will be said to have zero reliability for that person. If the modal probability is unity (the non-modal categories each having zero probability), then the item is said to have perfect reliability for the person, or a reliability of unity.

Let $P_i$ be the modal probability for the $i$-th person. The reliability coefficient of the item for that person will be defined as

$$R_i = \frac{m}{m-1}\left(P_i - \frac{1}{m}\right).$$

$R_i$ varies between zero and unity according as $P_i$ varies between $1/m$ and unity.

The mean of $P_i$ over all people in the population will be denoted by $\alpha$; this is the mean modal probability for the population. The reliability coefficient of the item for the population is defined to be

$$\rho = \frac{m}{m-1}\left(\alpha - \frac{1}{m}\right).$$

$\rho$ is actually the mean of the $R_i$; that is, the population reliability of the item is the mean of the individual reliabilities. If $\rho$ is zero, then all* of the $R_i$ must be zero; and if $\rho$ is unity, then all* of the $R_i$ must be unity. Zero group reliability implies zero individual reliabilities, and perfect group reliability implies perfect individual reliabilities. An intermediate group reliability coefficient is a value around which the individual reliabilities can cluster with various degrees of dispersion; it is not assumed that all individuals are equally reliable. It should be clear, though, that the closer $\rho$ is to zero or to unity, the greater the restraint to the variance of the $R_i$ about $\rho$.

The experimental problem is to obtain information about the $R_i$.
To obtain this directly would require actual repeated experiments under the same conditions, which is a difficult, if not impossible, thing to do empirically in many cases. But fortunately it is possible to obtain information about $\rho$ from only a single trial. Lower bounds to $\rho$ can be determined from only a single trial, so that we can say the group reliability of the item is at least a certain amount.

The notion of a lower bound to a reliability coefficient was introduced in (4) for the case of quantitative variables. To derive a lower bound from a single trial for the quantitative case requires either that (a) the variable be the sum of at least two experimentally independent items,† or that (b) the item be correlated with one or more experimentally independent items.‡ Condition (a) of course cannot be met by a qualitative item and has no analogue for the qualitative case. Instead, we have the far simpler phenomenon that only the frequency distribution of the population on the item itself is needed to determine a lower bound, as is described in the next section. An analogue to condition (b) does hold and provides the lower bound based on sub-frequencies described in §4.

If two experimentally independent trials are actually made for an item, then also an upper bound can be computed for the reliability coefficient, as is illustrated in §5.

* Except possibly for an infinitesimal proportion of the people.
† This is true for the first five bounds in (4).
‡ See the theorem in §17 of (4) on which the sixth bound is based.

In order to obtain a lower bound from a single trial, or an upper bound from two trials, it is assumed here that the population of people is indefinitely large. If the population is finite or if a sample from an indefinitely large population is used for the computations, then the bounds derived here will of course be subject to sampling error.
The sampling theory of the bounds has not yet been completely worked out, but it seems clear that they should be quite reliable when computed from samples of the size ordinarily used in public opinion polls.

3. The Marginal Lower Bound

Let $A_1, A_2, \ldots, A_m$ be the relative frequencies of the respective $m$ categories of the item on a single trial of the population. In §7 and §10 below it is proved that* $\alpha$ is not less than the largest of the $A$'s. For example, suppose each person in a large population were asked if he agreed to a certain political proposition, and suppose the results were

    yes           60%
    undecided     15
    no            25
                 100%

Here, $m = 3$; the largest $A$ is .60; so we can state that $\alpha \geq .60$; and we can further state that

$$\rho \geq \frac{3}{2}\left(.60 - \frac{1}{3}\right),$$

or $\rho \geq .40$. If the responses had been:

    yes           33 1/3%
    undecided     33 1/3
    no            33 1/3
                 100%

then we would state that $\alpha \geq .33\tfrac{1}{3}$ and that $\rho \geq 0$, which provides no information about $\rho$. The responses can have anything from perfect reliability to zero reliability if the population is equally divided over the $m$ categories. But if there are unequal frequencies, then there must be at least some group reliability.

* Except possibly for an infinitesimal proportion of the trials.

Such a lower bound, based on marginal frequencies, cannot approach unity, of course, unless the population piles up in one category. For example, if, on a true-false question, 90% of a large class of students answered correctly (or incorrectly), then for that question we can state that $.8 \leq \rho \leq 1$.

It is of interest to notice that, for a fixed maximum marginal, the lower bound increases with the number of categories. If 90% of a large class of students answered a multiple-choice question correctly, where there were four choices, then for that question the marginal lower bound would be .87, which is larger than the bound of .8 for the true-false question in the preceding paragraph.
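The arithmetic of the marginal lower bound can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function name `marginal_lower_bound` is hypothetical, the three distractor frequencies in the four-choice example are invented (only the largest marginal matters), and the frequencies are assumed to come from a single trial of a population large enough for sampling error to be ignored.

```python
def marginal_lower_bound(freqs):
    """Return (lower bound to alpha, lower bound to rho) from single-trial marginals.

    `freqs` holds the relative frequencies of the m categories (summing to 1).
    The name and interface are illustrative assumptions, not part of the paper.
    """
    m = len(freqs)
    if m < 2:
        raise ValueError("an item needs at least two categories")
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("relative frequencies must sum to 1")
    alpha_lb = max(freqs)                          # alpha >= largest marginal
    rho_lb = (m / (m - 1)) * (alpha_lb - 1.0 / m)  # rho = m/(m-1) * (alpha - 1/m)
    return alpha_lb, rho_lb


# Worked examples from the text above.
print(marginal_lower_bound([0.60, 0.15, 0.25]))        # yes / undecided / no
print(marginal_lower_bound([1 / 3, 1 / 3, 1 / 3]))     # equal split
print(marginal_lower_bound([0.90, 0.10]))              # true-false, 90% correct
print(marginal_lower_bound([0.90, 0.04, 0.03, 0.03]))  # four choices (distractor split assumed)
```

Up to floating-point rounding, the four bounds on $\rho$ come out as .40, 0, .8, and .87, in agreement with the figures quoted above.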
Similarly, a multiple-choice question which was answered correctly by 50% of the population can be said to have some reliability, whereas nothing can be said about the reliability of a true-false question which 50% answered correctly, if this per cent is the only information available.

Many an item will actually have a reliability coefficient that is far greater than its marginal lower bound, so that it is desirable to seek a more efficient lower bound. A better bound is described in the next section.

4. The Lower Bound From Joint Occurrences

By relating the item to another experimentally independent item or set of items, it is possible to improve on the marginal lower bound. It is proved in §8 and §10 below that $\alpha$ for an item is* not smaller than the sum of its largest sub-frequencies (expressed as proportions) in the joint occurrence table of that item with any other item obtained from a single trial, provided the table is based on a large population of individuals and the items are experimentally independent.

As an example, consider the joint occurrences in a single trial of two items, $U$ and $V$, for a large population. $U$ has three categories: $U_1$, $U_2$, and $U_3$; and $V$ has two categories: $V_1$ and $V_2$. Suppose the joint occurrences in a single trial are as follows:

               U_1    U_2    U_3    Total
    V_1        .40    .20    --      .60
    V_2        --     .10    .30     .40
    Total      .40    .30    .30    1.00

$V$ has three columns of sub-frequencies, one for each value of $U$. The largest sub-frequencies in each column are, respectively, .40, .20, and .30. Then we can state that $\alpha$ for $V$ is not less than .40 + .20 + .30 = .90, and that

$$\rho \geq 2\left(.90 - \frac{1}{2}\right),$$

or $\rho \geq .80$. This is a great improvement over the marginal lower bound for $V$, which is

$$\rho \geq 2\left(.60 - \frac{1}{2}\right) = .20.$$

Similarly, we obtain a better lower bound for $U$ from the same data. From the table of joint frequencies, we can state that $\alpha$ for $U$ is not less than the sum of the largest sub-frequencies of $U$: .40 + .30 = .70, so that for $U$,

$$\rho \geq \frac{3}{2}\left(.70 - \frac{1}{3}\right),$$

or $\rho \geq .55$. This is not extremely high, but still it is an improvement over the marginal lower bound for $U$, which is

$$\rho \geq \frac{3}{2}\left(.40 - \frac{1}{3}\right) = .10.$$

If a lower bound obtained from a bivariate table is not very high and if it is believed that an item is much more reliable than this lower bound, then it may be worth while to seek a better lower bound by considering relationships with more than one other item, provided the additional items are also experimentally independent of the one whose reliability is sought. The sum of the highest sub-frequencies of the item in the multivariate table is* again a lower bound to $\alpha$. In particular, if an item is perfectly related to any set whatsoever of items experimentally independent of it, then we can state that the item is perfectly reliable.

* Except possibly for an infinitesimal proportion of the trials.

5. The Reliability Coefficient and Two Independent Trials

From a single trial, only a lower bound can be established for the group reliability coefficient. But if it is possible to make two experimentally independent trials of the same item, then it is possible also to set an upper bound to the group reliability coefficient. In practice, the lower and upper bounds will often be very close to each other, so that the coefficient will be determined within a relatively small interval. [In the quantitative case, it may be remembered, two trials suffice to determine* the reliability coefficient exactly. See (4), 267-268.] Two upper bounds are established in §9 below, the second being somewhat better than the first, but involving slightly more arithmetic.
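The two-trial example that follows reuses the joint-occurrence lower bound of §4, so a small sketch of that computation may be useful at this point. The helper names (`rho_from_alpha`, `joint_lower_bound`, `transpose`) are hypothetical, and the cell entries of the table are those implied by the marginals and largest sub-frequencies quoted in §4 for items $U$ and $V$; the sketch assumes a large population and experimentally independent items.

```python
def rho_from_alpha(alpha, m):
    """Convert a bound on alpha into the matching bound on rho: m/(m-1) * (alpha - 1/m)."""
    return (m / (m - 1)) * (alpha - 1.0 / m)


def joint_lower_bound(table):
    """Lower bound to alpha for the item whose categories index the ROWS of `table`.

    table[j][g] is the proportion of the population in row category j and
    column category g. For each column, take the largest sub-frequency; then sum.
    """
    n_cols = len(table[0])
    return sum(max(row[g] for row in table) for g in range(n_cols))


def transpose(table):
    """Swap rows and columns so the other item becomes the row item."""
    return [list(col) for col in zip(*table)]


# Rows are V_1, V_2; columns are U_1, U_2, U_3 (empty cells entered as 0).
uv = [
    [0.40, 0.20, 0.00],
    [0.00, 0.10, 0.30],
]

alpha_v = joint_lower_bound(uv)             # item V: one maximum per column of U
alpha_u = joint_lower_bound(transpose(uv))  # item U: one maximum per row of V
print(alpha_v, rho_from_alpha(alpha_v, 2))
print(alpha_u, rho_from_alpha(alpha_u, 3))
```

Apart from floating-point rounding, this gives .90 and .80 for $V$, and .70 and .55 for $U$, as in §4. Passing a multivariate table flattened so that each column is one combination of the other items' categories would give the multivariate bound in the same way.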
As an example of the use of the bounds, consider a trichotomous item repeated twice, but experimentally independently, on a large population, with the following results:

                          Second Trial
    First Trial    U_1    U_2    U_3    Total
    U_1            .10    .15    .05     .30
    U_2            .15    .25    .05     .45
    U_3            .05    .05    .15     .25
    Total          .30    .45    .25    1.00

The joint occurrence table must* be symmetric (indeed, it must* be Gramian) for two independent trials, if the population is very large. A lower bound for $\alpha$ is obtained as in §4, by adding up the highest sub-frequency in each row (column):

$$\alpha \geq .15 + .25 + .15 = .55.$$

* Except possibly for an infinitesimal proportion of the pairs of trials.

It is important to notice that $\alpha$ is not the same as the average probability of remaining in the same category on two trials. If $\gamma$ is the average probability of remaining in the same category on two trials, then, as is shown in §9 and §10 below, $\gamma$ can be set equal to the sum of the principal diagonal elements:

$$\gamma = .10 + .25 + .15 = .50.$$

$\alpha$ is never less than $\gamma$; the two are equal only in special cases, such as when both are equal to unity (in which case $\rho = 1$). This inequality is true even if all the largest sub-frequencies are in the principal diagonal.

A simple upper bound for $\alpha$ is the square root of $\gamma$:

$$\alpha \leq \sqrt{\gamma}.$$

Hence, for our example we can state that $\alpha \leq .71$. A better upper bound is

$$\alpha \leq \frac{1 + \sqrt{(m-1)(m\gamma - 1)}}{m},$$

where $m$ is the number of categories in the item. For the present example $m = 3$ and we can set $\gamma = .50$, so we can state that $\alpha \leq .67$. Hence, combining both the lower and upper bounds into one statement:

$$.55 \leq \alpha \leq .67,$$

which implies

$$.33 \leq \rho \leq .50.$$

As a final comment, it should be noted that $\rho$ is bounded from two trials, even though an individual $R_i$ cannot be estimated very well. A larger series of trials on a particular person would be necessary to estimate his $R_i$.

PART II: DERIVATIONS

6. Analytical Definitions

Consider an item $U$ with the $m$ categories $U_1, U_2, \ldots, U_m$. An indefinitely large population of individuals is supposed to be given indefinitely many trials on the item.
In this section and in §7, we assume nothing about experimental independence of the trials. The definition of, and lower bounds for, the reliability coefficient of $U$ will be established in terms of parameters defined over all persons and trials. That the lower bounds can be observed from a single trial is shown in §10, using the hypothesis of experimental independence.

The response of each person to the item on each trial can be represented by $a_{ijk}$, where

$$a_{ijk} = \begin{cases} 1 & \text{if individual } i \text{ is in } U_j \text{ on trial } k, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The probability over trials that individual $i$ will be in $U_j$ is

$$p_{ij} = E_k\, a_{ijk}, \tag{2}$$

where $E$ signifies the expected (i.e., mean) value over the indicated subscript. Let $P_i$ be the largest of the $p_{ij}$ for individual $i$; this will be called his modal probability. Then

$$p_{ij} \leq P_i \quad (j = 1, 2, \ldots, m;\ i = 1, 2, \ldots) \tag{3}$$

and, since $\sum_{j=1}^{m} p_{ij} = 1$, we have

$$P_i \geq \frac{1}{m} \quad (i = 1, 2, \ldots). \tag{4}$$

For a given individual, if for a certain $j$ it is true that $p_{ij} = P_i$, then $U_j$ is a modal category for that person. When there are two or more modal categories for a person, one of these can be arbitrarily designated to be the modal category to represent a "true" value for the person. $P_i$ is the probability that a person will be at his modal value.

The reliability coefficient for the $i$-th person is defined to be

$$R_i = \frac{m}{m-1}\left(P_i - \frac{1}{m}\right). \tag{5}$$

From (4) and the fact that $P_i \leq 1$, we have

$$0 \leq R_i \leq 1. \tag{6}$$

The reliability of the item for the population is defined as follows. Let $\alpha$ be the mean modal probability for the population:

$$\alpha = E_i\, P_i. \tag{7}$$

The reliability coefficient of $U$ for the population is then defined as:

$$\rho = \frac{m}{m-1}\left(\alpha - \frac{1}{m}\right). \tag{8}$$

From (4) and (7),

$$\frac{1}{m} \leq \alpha \leq 1, \tag{9}$$

so that

$$0 \leq \rho \leq 1. \tag{10}$$

Taking expectations over $i$ for both members of (5) and using (7), we find that

$$\rho = E_i\, R_i, \tag{11}$$

that is, the reliability coefficient for the population is the mean of the individual reliability coefficients.

7. Derivation of the Marginal Lower Bound

The marginal relative frequency of a category is the proportion of observations falling into that category. Let $A_j$ be the marginal relative frequency of $U_j$ over all trials and persons:

$$A_j = E_i\, E_k\, a_{ijk}. \tag{12}$$

From (2), (12) can be written

$$A_j = E_i\, p_{ij}, \tag{13}$$

so that from (3), $A_j \leq E_i\, P_i$ for all $j$; or, from (7),

$$A_j \leq \alpha \quad (j = 1, 2, \ldots, m). \tag{14}$$

Inequality (14) is the basis for our first lower bound. It states that $\alpha$ is not less than any of the marginals, so that in particular $\alpha$ is not less than the largest marginal. Let $A$ be the largest of the $A_j$. Then from (8) and (14), we have a lower bound to the group reliability coefficient:

$$\rho \geq \frac{m}{m-1}\left(A - \frac{1}{m}\right). \tag{15}$$

Each $A_j$ is defined over all trials as well as persons, and it is desirable to bound $\alpha$ from but a single trial. In §10 it is shown that the marginal for $U_j$ on a single trial is actually equal to $A_j$ (except possibly for an infinitesimal proportion of the trials), provided that the variation from trial to trial of each person is independent of the variation of everyone else. In practice, then, the marginal from a single trial can be used as being equal to $A_j$; and the largest marginal from a single trial can be used for $A$.

8. Derivation of the Lower Bound from Joint Occurrences

Let $V$ be an item with the $n$ categories $V_1, V_2, \ldots, V_n$, and let

$$b_{igk} = \begin{cases} 1 & \text{if individual } i \text{ is in } V_g \text{ on trial } k, \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$

Let $q_{ig}$ be the probability over trials that individual $i$ will be in $V_g$:

$$q_{ig} = E_k\, b_{igk}. \tag{17}$$

Then clearly

$$\sum_{g=1}^{n} q_{ig} = 1. \tag{18}$$

The proportion of joint occurrences of $U_j$ and $V_g$ over all trials and persons is

$$C_{jg} = E_i\, E_k\, a_{ijk}\, b_{igk}. \tag{19}$$

The two items are said to be experimentally independent* for the $i$-th person if, for the given $i$,

$$E_k\, a_{ijk}\, b_{igk} = p_{ij}\, q_{ig} \quad (j = 1, 2, \ldots, m;\ g = 1, 2, \ldots, n). \tag{20}$$

Equation (20) states that, for individual $i$, the probability of the joint occurrence of $U_j$ and $V_g$ (over the trials) is the product of the separate probabilities.
If (20) holds for all $i$, then taking expectations over $i$ and using (19), we have

$$C_{jg} = E_i\, p_{ij}\, q_{ig}. \tag{21}$$

From (3) and (21),

$$C_{jg} \leq E_i\, P_i\, q_{ig} \quad (j = 1, 2, \ldots, m;\ g = 1, 2, \ldots, n). \tag{22}$$

Let $C_g$ be the largest value of $C_{jg}$ for fixed $g$, so that

$$C_{jg} \leq C_g \quad (j = 1, 2, \ldots, m;\ g = 1, 2, \ldots, n). \tag{23}$$

Now, (22) states in particular that

$$C_g \leq E_i\, P_i\, q_{ig} \quad (g = 1, 2, \ldots, n). \tag{24}$$

Summing both members of (24) over $g$ and using (18), we have

$$\sum_{g=1}^{n} C_g \leq E_i\, P_i = \alpha, \tag{25}$$

which, with (8), provides the desired lower bound:

$$\rho \geq \frac{m}{m-1}\left(\sum_{g=1}^{n} C_g - \frac{1}{m}\right). \tag{26}$$

That each of the $C_g$, and hence their sum, is observable from but a single trial is proved in §10.

The lower bound in (26) is at least as good as the marginal lower bound in (15), a fact which can be seen as follows. Summing both members of (19) over $g$ and using (18) and (12), we have

$$A_j = \sum_{g=1}^{n} C_{jg}. \tag{27}$$

Hence, from (23),

$$A_j \leq \sum_{g=1}^{n} C_g \quad (j = 1, 2, \ldots, m), \tag{28}$$

so that in particular

$$A \leq \sum_{g=1}^{n} C_g. \tag{29}$$

* For further discussion of experimental independence, see (4, 263-265).

9. The Upper Bounds From Two Trials

The problem of two trials of a single item can be handled by regarding the item $V$ of the preceding section to be a retrial of $U$, so that expectations over $k$ are interpreted as over a universe of pairs of trials. Then $m = n$, $U_j = V_j$, and $p_{ij} = q_{ij}$, so that (21) becomes

$$C_{jg} = E_i\, p_{ij}\, p_{ig}. \tag{30}$$

Equation (30) shows that the table of joint occurrences for the test-retest of an item must be symmetric and Gramian.

The probability that person $i$ will be in $U_j$ on both trials is, by the multiplicative law of independent probabilities, $p_{ij}^2$. Let $\gamma_i$ be the probability that person $i$ will be in some one category on both trials. Then, by the additive law for mutually exclusive events,

$$\gamma_i = \sum_{j=1}^{m} p_{ij}^2. \tag{31}$$

Let $\gamma$ be the mean probability that a person will remain in the same category on both trials:

$$\gamma = E_i\, \gamma_i. \tag{32}$$

From (30) and (31),

$$\gamma = \sum_{j=1}^{m} C_{jj}. \tag{33}$$

Hence, $\gamma$ is observable from a single pair of trials, since the diagonal elements $C_{jj}$ are observable, as is shown in the next section. $\gamma$ is the basis for our two upper bounds.
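Equations (30) to (33), together with the bounds quoted in §5, can be checked numerically on a small made-up case. The sketch below is an illustration only: the three probability vectors in `persons` are hypothetical, and the table `C` is the expected test-retest table they would generate under experimental independence, not an observed one. Because `C` is an average of outer products of the $p_{ij}$ vectors it is Gramian by construction; the code itself verifies only the symmetry.

```python
import math

# Hypothetical population: each row is one person's probabilities p_ij over m = 3 categories.
persons = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]
m = len(persons[0])
n = len(persons)

# Equation (30): C_jg = E_i p_ij p_ig, the expected test-retest table.
C = [[sum(p[j] * p[g] for p in persons) / n for g in range(m)] for j in range(m)]

alpha = sum(max(p) for p in persons) / n                 # (7): mean modal probability
gamma = sum(sum(x * x for x in p) for p in persons) / n  # (31) and (32)
trace = sum(C[j][j] for j in range(m))                   # (33): sum of diagonal elements

# Lower bound as in section 4: largest sub-frequency in each column, summed.
lower = sum(max(C[j][g] for j in range(m)) for g in range(m))
upper_1 = math.sqrt(gamma)                                # first upper bound
upper_2 = (1 + math.sqrt((m - 1) * (m * gamma - 1))) / m  # second upper bound

symmetric = all(abs(C[j][g] - C[g][j]) < 1e-12 for j in range(m) for g in range(m))
print(symmetric, round(gamma, 4), round(trace, 4))
print(round(lower, 4), round(alpha, 4), round(upper_2, 4), round(upper_1, 4))
```

For these made-up numbers the table is symmetric, $\gamma$ equals the diagonal sum (.48), and the true $\alpha$ of .60 lies between the lower bound (about .49) and the two upper bounds (about .65 and .69), with the second upper bound the tighter of the two.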
From (31), since one of the $p_{ij}$ is $P_i$, we have

$$\gamma_i \geq P_i^2 \quad (i = 1, 2, \ldots). \tag{34}$$

Taking expectations over $i$, and using (32),

$$\gamma \geq E_i\, P_i^2. \tag{35}$$

Since the mean of squares is not less than the square of the mean,

$$E_i\, P_i^2 \geq \left(E_i\, P_i\right)^2 = \alpha^2. \tag{36}$$

Hence, from (35) and (36), we have the first upper bound:

$$\alpha \leq \sqrt{\gamma}. \tag{37}$$

To improve on (37), we write (31) as

$$\gamma_i = P_i^2 + {\sum_j}' p_{ij}^2, \tag{38}$$

where $\sum'$ indicates summation over the $m - 1$ non-modal categories. Since the mean of squares is not less than the square of the mean,

$${\sum_j}' p_{ij}^2 \geq \frac{\left({\sum_j}' p_{ij}\right)^2}{m-1} \quad (i = 1, 2, \ldots). \tag{39}$$

Using (38) and (39) and the fact that ${\sum_j}' p_{ij} = 1 - P_i$, we have

$$\gamma_i \geq P_i^2 + \frac{(1 - P_i)^2}{m-1} \quad (i = 1, 2, \ldots). \tag{40}$$

Taking expectations over $i$ in (40) and using (36) and (32), we have

$$\gamma \geq \frac{m\alpha^2 - 2\alpha + 1}{m-1}, \tag{41}$$

whence

$$m\alpha^2 - 2\alpha + \left[1 - (m-1)\gamma\right] \leq 0. \tag{42}$$

From (42) we obtain the second upper bound for $\alpha$:

$$\alpha \leq \frac{1 + \sqrt{(m-1)(m\gamma - 1)}}{m}. \tag{43}$$

10. Observability from a Single Trial

The proportion of the population observed in $U_j$ on trial $k$ is

$$A_{j,k} = E_i\, a_{ijk}. \tag{44}$$

We shall prove that $A_{j,k} = A_j$ except possibly for an infinitesimal proportion of the trials, so that each of the total marginals is observable from but a single trial. In particular, then, $A$ will be equal to the largest marginal observed on a single trial.

The proof consists of showing the variance of $A_{j,k}$ is zero over all trials. Taking expectations over $k$ for both members of (44) and using (12), we have

$$E_k\, A_{j,k} = A_j. \tag{45}$$

Then the variance over trials of $A_{j,k}$ is

$$\sigma_j^2 = E_k\, A_{j,k}^2 - A_j^2. \tag{46}$$

To evaluate the right member, it is convenient first to consider a finite population of $N$ individuals and then to take the limit as $N \to \infty$. For finite $N$, the operation $E_i$ is the operation $\frac{1}{N}\sum_{i=1}^{N}$. Then from (44), we can write

$$A_{j,k}^2 = \frac{1}{N^2} \sum_{h=1}^{N} \sum_{i=1}^{N} a_{hjk}\, a_{ijk}. \tag{47}$$

We assume that what one person does on the item is experimentally independent of what anybody else does on the item:

$$E_k\, a_{hjk}\, a_{ijk} = \left(E_k\, a_{hjk}\right)\left(E_k\, a_{ijk}\right) \quad (h \neq i;\ h, i = 1, 2, \ldots, N;\ j = 1, 2, \ldots, m). \tag{48}$$

If $h = i$, then of course, since $a_{ijk}^2 = a_{ijk}$,

$$E_k\, a_{ijk}^2 = p_{ij}. \tag{49}$$

Taking expectations over $k$ in (47) and using (48) and (49), we have

$$E_k\, A_{j,k}^2 = \frac{1}{N^2} \sum_{h=1}^{N} \sum_{i=1}^{N} p_{hj}\, p_{ij} - \frac{1}{N^2} \sum_{i=1}^{N} p_{ij}^2 + \frac{1}{N^2} \sum_{i=1}^{N} p_{ij},$$

or, using (13) and combining the last two terms,

$$E_k\, A_{j,k}^2 = A_j^2 + \frac{1}{N}\, E_i\, p_{ij}\,(1 - p_{ij}). \tag{50}$$

The expectation in the second term on the right is always finite, and in fact is nonnegative and not greater than 1/4, so that the second term has the limit zero as $N \to \infty$. Hence, $\lim_{N \to \infty} E_k\, A_{j,k}^2 = A_j^2$, so that from (46),

$$\sigma_j^2 = 0. \tag{51}$$

Therefore, $A_{j,k}$ is equal to its expectation $A_j$ except possibly in an infinitesimal proportion of the trials.

This observability from a single trial implies that the marginal proportions of any item based on an indefinitely large population have perfect test-retest reliability, regardless of the reliability of the individuals in the population. More generally, in the joint occurrence table of two or more experimentally independent items, each of the joint frequencies observed on a single trial of a large population is perfectly reliable. That is, if

$$C_{jg,k} = E_i\, a_{ijk}\, b_{igk}, \tag{52}$$

and if $\sigma_{jg}^2$ is the variance of $C_{jg,k}$ over trials, then

$$\sigma_{jg}^2 = E_k\, C_{jg,k}^2 - C_{jg}^2 = 0, \tag{53}$$

and

$$C_{jg,k} = C_{jg} \tag{54}$$

except possibly for an infinitesimal proportion of the trials. The proof is identical in form to that of the marginals.

Therefore, both the marginal lower bound and the lower bound based on joint occurrences are observable from but a single trial; and the upper bounds are observable from but a single pair of trials.

REFERENCES

1. Guilford, J. P. Fundamental statistics in psychology and education. New York: McGraw-Hill, 1942.
2. Guttman, Louis. An outline of the statistical theory of prediction, in P. Horst, et al., The prediction of personal adjustment. New York: Social Science Research Council, 1941.
3. Guttman, Louis. A basis for scaling qualitative data. Amer. Sociol. Rev., 1944, 9, 139-150.
4. Guttman, Louis. A basis for analyzing test-retest reliability. Psychometrika, 1945, 10, 255-282.