11, NO. 2
THE TEST-RETEST RELIABILITY OF QUALITATIVE DATA
The test-retest reliability of qualitative items, such as occur in
achievement tests, attitude questionnaires, public opinion surveys,
and elsewhere, requires a different technique of analysis from that
of quantitative variables. Definitions appropriate to the qualitative
case are made both for the reliability coefficient of an individual on
an item and for the reliability coefficient of a population on the item.
From but a single trial of a large population on the item, it is possible to compute a lower bound to the group reliability coefficient.
Two kinds of lower bounds are presented. From two experimentally
independent trials of the population on the item, it is possible to compute an upper bound to the group reliability coefficient. Two upper
bounds are presented. The computations for the lower and upper
bounds are all very simple. Numerical examples are given.
An item is said to be unreliable for a person to the extent to
which his response to it would vary in repeated experiments under
the same conditions. For quantitative items, the method of studying
the variation has been to use arithmetic means as the "true" responses and variances as measures of dispersion. But if an item is
qualitative, the only average (of those commonly in use) that it can
have is the mode, and a measure of variation is the relative frequency
of the non-modal values.* Therefore, the study of test-retest
reliability of qualitative items requires a different technique from that
of quantitative items. The present paper is devoted to presenting
appropriate definitions and technique for the qualitative case.
Qualitative items are of considerable importance for the social
and psychological sciences. Questions in an achievement test are qualitative dichotomies when answers are either "right" or "wrong." If
numerical values are assigned these two categories and are added up
over a set of items to yield a total test score,† then the reliability of
the total score should, of course, be studied by the technique of quantitative variables; but this does not change the fact that the original
items were qualitative and can have their separate reliabilities
studied by the technique to be developed here. In attitude and opinion
research, the reliability of each qualitative item by itself is often of
interest, so that the qualitative technique is especially appropriate
in this field. In general, the technique is appropriate in any study
where qualitative data are enumerated, be they census returns, the
eye colors of fruit flies, commercial commodities, etc., etc.
In Part I of this paper is given a description of the definitions
used, and illustrations of applications of the qualitative technique are
presented. The actual derivations and proofs are developed in Part II.
* For a discussion of the distinction between qualitative and quantitative
variables with respect to prediction and correlation, see (2). An adaptation
of (2) is also available in Guilford's textbook (1, Chap. 10).
† The problem of whether a total score is meaningful is not raised here; that
is the problem of scale analysis (3). The analysis of test-retest reliability
remains the same, regardless of the meaning of items or total scores.
2. Definitions and Terminology
Consider a universe of indefinitely many trials of a person on an
item which has m categories. By the probability that the person will
fall in a particular category, we mean the proportion of trials in
which this will happen for him. The category with the highest probability will be called the modal value for the person, and the highest
probability itself will be called the modal probability. If the modal
probability is attained by two or more categories, then any one of
these can arbitrarily be selected as the modal value. The modal value
plays the role in the qualitative theory that the expected or mean
value plays in the quantitative theory; and the complement of the
modal probability plays the role of the error variance.
If all the categories have the same probability for the person
(this common probability is then the modal probability and is equal
to 1/m), then the item will be said to have zero reliability for that
person. If the modal probability is unity (the non-modal categories
each having zero probability) then the item is said to have perfect
reliability for the person, or a reliability of unity. Let $P_i$ be the modal
probability for the i-th person. The reliability coefficient of the item
for that person will be defined as
$$R_i = \frac{P_i - 1/m}{1 - 1/m}.$$
$R_i$ varies between zero and unity according as $P_i$ varies between 1/m and unity.
The mean of $P_i$ over all people in the population will be denoted
by $\alpha$; this is the mean modal probability for the population. The reliability coefficient of the item for the population is defined to be
$$\rho = \frac{\alpha - 1/m}{1 - 1/m}.$$
$\rho$ is actually the mean of the $R_i$; that is, the population reliability of
the item is the mean of the individual reliabilities.
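The two definitions can be checked numerically. The following Python sketch is illustrative only: the function names and the four modal probabilities are assumptions introduced here, not part of the paper. It shows that the group coefficient computed from the mean modal probability agrees with the mean of the individual coefficients.

```python
def individual_reliability(modal_prob, m):
    """R_i = (P_i - 1/m) / (1 - 1/m) for an item with m categories."""
    if m < 2:
        raise ValueError("an item needs at least two categories")
    if not (1.0 / m <= modal_prob <= 1.0):
        raise ValueError("a modal probability must lie between 1/m and 1")
    return (modal_prob - 1.0 / m) / (1.0 - 1.0 / m)


def group_reliability(modal_probs, m):
    """rho = (alpha - 1/m) / (1 - 1/m), alpha being the mean modal probability."""
    alpha = sum(modal_probs) / len(modal_probs)
    return (alpha - 1.0 / m) / (1.0 - 1.0 / m)


# Hypothetical modal probabilities of four people on a three-category item.
m = 3
modal_probs = [1.0, 0.8, 0.5, 1.0 / 3.0]

individual = [individual_reliability(p, m) for p in modal_probs]
rho = group_reliability(modal_probs, m)
mean_individual = sum(individual) / len(individual)

print(individual)
print(rho, mean_individual)  # the two values agree up to rounding
```

Because the coefficient is a linear function of the modal probability, averaging before or after the transformation gives the same result.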
If $\rho$ is zero, then all* of the $R_i$ must be zero; and if $\rho$ is unity,
then all* of the $R_i$ must be unity. Zero group reliability implies zero individual reliabilities, and perfect group reliability implies perfect individual reliabilities.
An intermediate group reliability coefficient is a
value around which the individual reliabilities
can cluster with various degrees of dispersion; it is not assumed that all individuals are
equally reliable. It should be clear, though, that the closer $\rho$ is to
zero or to unity, the greater the restraint to the variance of the $R_i$.
The experimental problem is to obtain information about the
$R_i$. To obtain this directly would require actual repeated experiments under the same conditions, which is a difficult, if not impossible, thing to do empirically in many cases. But fortunately it is
possible to obtain information about $\rho$ from only a single trial. Lower
bounds to $\rho$ can be determined from only a single trial, so that we
can say the group reliability of the item is at least a certain amount.
The notion of a lower bound to a reliability coefficient was introduced in (4) for the case of quantitative variables. To derive a
lower bound from a single trial for the quantitative case requires
either that (a) the variable be the sum of at least two experimentally
independent items,† or that (b) the item be correlated with one or
more experimentally independent items.‡
Condition (a) of course cannot be met by a qualitative item and
has no analogue for the qualitative case. Instead, we have the far
simpler phenomenon that only the frequency distribution of the population on the item itself is needed to determine a lower bound, as is described in the next section. An analogue to condition (b) does hold
and provides the lower bound based on sub-frequencies described
in §4.
If two experimentally independent trials are actually made for
an item, then also an upper bound can be computed for the reliability
coefficient, as is illustrated in §5.
In order to obtain a lower bound from a single trial, or an upper bound from two trials, it is assumed here that the population of
people is indefinitely large. If the population is finite or if a sample
from an indefinitely large population is used for the computations,
then the bounds derived here will of course be subject to sampling
error. The sampling theory of the bounds has not yet been completely
worked out, but it seems clear that they should be quite reliable when
computed from samples of the size ordinarily used in public opinion
surveys.
* Except possibly for an infinitesimal proportion of the people.
† This is true for the first five bounds in (4).
‡ See the theorem in §17 of (4) on which the sixth bound is based.
3. The Marginal Lower Bound
Let $A_1, A_2, \ldots, A_m$ be the relative frequencies of the respective
m categories of the item on a single trial of the population. In §7
and §10 below it is proved that* $\alpha$ is not less than the largest of the
A's. For example, suppose each person in a large population were
asked if he agreed to a certain political proposition, and suppose the
responses were distributed over the categories yes, undecided, and no, with 60% in the most frequent category.
Here, m = 3; the largest A is .60; so we can state that $\alpha \ge .60$; and
we can further state that
$$\rho \ge \frac{.60 - 1/3}{1 - 1/3} = .40.$$
If the responses had been:
yes          33 1/3%
undecided    33 1/3
no           33 1/3
then we would state that $\alpha \ge 1/3$ and that
$$\rho \ge 0,$$
which provides no information about $\rho$. The responses can have anything from perfect reliability to zero reliability if the population is
equally divided over the m categories. But if there are unequal frequencies, then there must be at least some group reliability.
* Except possibly for an infinitesimal proportion of the trials.
Such a lower bound, based on marginal frequencies, cannot approach unity, of course, unless the population piles up in one category. For example, if, on a true-false question, 90% of a large class
of students answered correctly (or incorrectly), then for that question we can state that
$$\rho \ge \frac{.90 - 1/2}{1 - 1/2} = .80.$$
It is of interest to notice that, for a fixed maximum marginal frequency,
the lower bound increases with the number of categories. If 90% of
a large class of students answered a multiple-choice question correctly, where there were four choices, then for that question the marginal
lower bound would be .87, which is larger than the bound of .8 for
the true-false question in the preceding paragraph. Similarly, a multiple-choice question which was answered correctly by 50% of the
population can be said to have some reliability, whereas nothing can
be said about the reliability of a true-false question which 50% answered correctly, if this per cent is the only information available.
Many an item will actually have a reliability coefficient that is
far greater than its marginal lower bound, so that it is desirable to
seek a more efficient lower bound. A better bound is described in
the next section.
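The marginal lower bound of this section needs only the frequency distribution from one trial. A minimal Python sketch follows; the function name `marginal_lower_bound` is an assumption made here, and wherever the text states only the largest frequency, the split of the remaining responses is hypothetical (it does not affect the bound).

```python
def marginal_lower_bound(frequencies):
    """Return (bound on alpha, bound on rho) from single-trial relative frequencies.

    alpha is not less than the largest marginal frequency, and
    rho = (alpha - 1/m) / (1 - 1/m) with m the number of categories.
    """
    m = len(frequencies)
    if m < 2:
        raise ValueError("an item needs at least two categories")
    if abs(sum(frequencies) - 1.0) > 1e-9:
        raise ValueError("relative frequencies must sum to 1")
    alpha_bound = max(frequencies)
    rho_bound = (alpha_bound - 1.0 / m) / (1.0 - 1.0 / m)
    return alpha_bound, rho_bound


# Three categories, largest frequency .60 (the other two values are hypothetical).
three_way = marginal_lower_bound([0.60, 0.25, 0.15])
# Equal thirds: the bound carries no information about rho.
equal_thirds = marginal_lower_bound([1 / 3, 1 / 3, 1 / 3])
# True-false question answered correctly by 90%.
true_false = marginal_lower_bound([0.90, 0.10])
# Four-choice question answered correctly by 90% (wrong answers split hypothetically).
four_choice = marginal_lower_bound([0.90, 0.04, 0.03, 0.03])

for name, result in [("three-way", three_way), ("equal thirds", equal_thirds),
                     ("true-false", true_false), ("four-choice", four_choice)]:
    print(name, round(result[1], 2))
```

The last two calls illustrate the remark that, for a fixed largest frequency, the bound rises with the number of categories.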
4. The Lower Bound From Joint Occurrences
By relating the item to another experimentally independent item
or set of items, it is possible to improve on the marginal lower bound.
It is proved in §8 and §10 below that $\alpha$ for an item is* not smaller
than the sum of its largest sub-frequencies (expressed as proportions) in the joint occurrence table of that item with any other item
obtained from a single trial, provided the table is based on a large
population of individuals and the items are experimentally independent.
As an example, consider the joint occurrences in a single trial of
two items, U and V, for a large population. U has three categories:
$U_1$, $U_2$, and $U_3$; and V has two categories: $V_1$ and $V_2$. Suppose the
joint occurrences in a single trial are as follows:
[Joint occurrence table of V (rows) by U (columns); the cell entries are not recoverable from the source. The surviving row, apparently the column totals for $U_1$, $U_2$, $U_3$, reads .40, .30, .30.]
* Except possibly for an infinitesimal proportion of the trials.
V has three columns of sub-frequencies, one for each value of U. The
largest sub-frequencies in each column are, respectively, .40, .20, and
.30. Then we can state that $\alpha$ for V is not less than
.40 + .20 + .30 = .90,
or $\rho \ge .80$. This is a great improvement over the marginal lower
bound for V.
Similarly, we obtain a better lower bound for U from the same
data. From the table of joint frequencies, we can state that $\alpha$ for U
is not less than the sum of the largest sub-frequencies of U:
.40 + .30 = .70,
so that for U,
$$\rho \ge \frac{.70 - 1/3}{1 - 1/3},$$
or $\rho \ge .55$. This is not extremely high, but still it is an improvement
over the marginal lower bound for U, which, the largest marginal of U being .40, is
$$\rho \ge \frac{.40 - 1/3}{1 - 1/3} = .10.$$
If a lower bound obtained from a bivariate table is not very high
and if it is believed that an item is much more reliable than this lower
bound, then it may be worth while to seek a better lower bound by
considering relationships with more than one other item, provided
the additional items are also experimentally independent of the one
whose reliability is sought. The sum of the highest sub-frequencies
of the item in the multivariate table is* again a lower bound to $\alpha$.
In particular, if an item is perfectly related to any set whatsoever of items experimentally independent of it, then we can state that
the item is perfectly reliable.
* Except possibly for an infinitesimal proportion of the trials.
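The bound from joint occurrences can be sketched in a few lines of Python. The function name and the `target` argument are assumptions made for this illustration. The table used is hypothetical as well: it is one table consistent with the figures quoted in the example of this section (column maxima .40, .20, .30 and row maxima .40, .30), not necessarily the table the example was computed from.

```python
def joint_lower_bound(table, target):
    """Lower bounds on alpha and rho for one item of a two-way joint occurrence table.

    table[r][c] is the proportion of the population in row category r and
    column category c on a single trial. target is "rows" if the item whose
    reliability is sought indexes the rows, "columns" if it indexes the columns.
    """
    if target == "rows":
        m = len(table)
        # Largest sub-frequency of the row item within each column.
        maxima = [max(column) for column in zip(*table)]
    elif target == "columns":
        m = len(table[0])
        # Largest sub-frequency of the column item within each row.
        maxima = [max(row) for row in table]
    else:
        raise ValueError("target must be 'rows' or 'columns'")
    alpha_bound = sum(maxima)
    rho_bound = (alpha_bound - 1.0 / m) / (1.0 - 1.0 / m)
    return alpha_bound, rho_bound


# Hypothetical table: rows are V_1, V_2; columns are U_1, U_2, U_3.
table = [
    [0.40, 0.20, 0.00],
    [0.00, 0.10, 0.30],
]

alpha_v, rho_v = joint_lower_bound(table, target="rows")
alpha_u, rho_u = joint_lower_bound(table, target="columns")
print(round(alpha_v, 2), round(rho_v, 2))
print(round(alpha_u, 2), round(rho_u, 2))
```

The same function extends to a multivariate table if the categories of the other items are first crossed into a single set of columns.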
5. The Reliability Coefficient and Two Independent Trials
From a single trial, only a lower bound can be established for
the group reliability coefficient. But if it is possible to make two experimentally independent trials of the same item, then it is possible
also to set an upper bound to the group reliability coefficient. In practice, the lower and upper bounds will often be very close to each
other, so that the coefficient will be determined within a relatively
small interval. [In the quantitative case, it may be remembered, two
trials suffice to determine the reliability coefficient exactly. See (4).]
Two upper bounds are established in §9 below, the second being
somewhat better than the first, but involving slightly more arithmetic.
As an example of the use of the bounds, consider a trichotomous item
repeated twice--but experimentally independently--on a large population, with the following results:
                      First trial
Second trial     1      2      3     Total
     1          .10    .15    .05     .30
     2          .15    .25    .05     .45
     3          .05    .05    .15     .25
   Total        .30    .45    .25    1.00
The joint occurrence table must* be symmetric (indeed, it must* be
Gramian) for two independent trials, if the population is very large.
A lower bound for $\alpha$ is obtained as in §4, by adding up the highest sub-frequency in each row (column):
$\alpha \ge .15 + .25 + .15 = .55.$
It is important to notice that $\alpha$ is not the same as the average probability of remaining in the same category on two trials. If $\gamma$ is the
average probability of remaining in the same category on two trials,
then, as is shown in §9 and §10 below, $\gamma$ can be set equal to the sum
of the principal diagonal elements:
$\gamma = .10 + .25 + .15 = .50.$
$\alpha$ is always greater than $\gamma$, unless both are equal to unity (in which
case $\rho = 1$). This inequality is true even if all the largest sub-frequencies are in the principal diagonal.
A simple upper bound for $\alpha$ is the square root of $\gamma$:
$$\alpha \le \sqrt{\gamma}.$$
Hence, for our example we can state that $\alpha \le .71$. A better upper
bound is
$$\alpha \le \frac{1 + \sqrt{(m-1)(m\gamma - 1)}}{m},$$
where m is the number of categories in the item. For the present
example m = 3 and we can set $\gamma = .50$, so we can state that
$$\alpha \le \frac{1 + \sqrt{(2)(.50)}}{3} = .67.$$
Hence, combining both the lower and upper bounds into one statement:
$.55 \le \alpha \le .67,$
$.33 \le \rho \le .50.$
* Except possibly for an infinitesimal proportion of the pairs of trials.
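The arithmetic of this section can be retraced with a short Python sketch. The function name is an assumption, and so is the table itself: its cells are the only symmetric table consistent with the marginals (.30, .45, .25) and the diagonal (.10, .25, .15) quoted in the example, so it should be read as a reconstruction rather than as the original data.

```python
import math


def two_trial_bounds(table):
    """Bounds on alpha and rho from a square test-retest joint occurrence table."""
    m = len(table)
    # Lower bound: sum of the largest sub-frequency in each row.
    alpha_lower = sum(max(row) for row in table)
    # gamma: proportion staying in the same category on both trials.
    gamma = sum(table[j][j] for j in range(m))
    if m * gamma < 1.0:
        raise ValueError("gamma below 1/m; the table cannot come from two independent trials")
    alpha_upper_simple = math.sqrt(gamma)
    alpha_upper_better = (1.0 + math.sqrt((m - 1) * (m * gamma - 1.0))) / m

    def to_rho(alpha):
        return (alpha - 1.0 / m) / (1.0 - 1.0 / m)

    return {
        "gamma": gamma,
        "alpha": (alpha_lower, alpha_upper_better),
        "alpha_upper_simple": alpha_upper_simple,
        "rho": (to_rho(alpha_lower), to_rho(alpha_upper_better)),
    }


# Reconstructed symmetric table (an assumption, see the paragraph above).
table = [
    [0.10, 0.15, 0.05],
    [0.15, 0.25, 0.05],
    [0.05, 0.05, 0.15],
]

bounds = two_trial_bounds(table)
print({key: value for key, value in bounds.items()})
```

With real samples the observed table will not be exactly symmetric; how to symmetrize it is a sampling question the paper leaves open, so the sketch does not attempt it.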
As a final comment, it should be noted that $\rho$ is bounded from
two trials, even though an individual $R_i$ cannot be estimated very
well. A larger series of trials on a particular person would be necessary to estimate his $R_i$.
6. Analytical Definitions
Consider an item U with the m categories $U_1, U_2, \ldots, U_m$. An
indefinitely large population of individuals is supposed to be given
indefinitely many trials on the item. In this section and in §7, we
assume nothing about experimental independence of the trials. The
definition of, and lower bounds for, the reliability coefficient of U
will be established in terms of parameters defined over all persons
and trials. That the lower bounds can be observed from a single trial
is shown in §10, using the hypothesis of experimental independence.
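The response of each person to the item on each trial can be represented by $a_{ijk}$, where
$$a_{ijk} = \begin{cases} 1 & \text{if individual } i \text{ is in } U_j \text{ on trial } k, \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
The probability over trials that individual i will be in $U_j$ is
$$p_{ij} = E_k\, a_{ijk}, \tag{2}$$
where E signifies the expected (i.e., mean) value over the indicated
subscript. Let $P_i$ be the largest of the $p_{ij}$ for individual i; this will
be called his modal probability. Then
$$p_{ij} \le P_i \quad (j = 1, 2, \ldots, m;\ i = 1, 2, \ldots) \tag{3}$$
and, since $\sum_j p_{ij} = 1$, we have
$$P_i \ge \frac{1}{m}. \tag{4}$$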
For a given individual, if for a certain j it is true that $p_{ij} = P_i$, then
$U_j$ is a modal category for that person. When there are two or more
modal categories for a person, one of these can be arbitrarily designated to be the modal category to represent a "true" value for the
person. $P_i$ is the probability that a person will be at his modal value.
The reliability coefficient for the i-th person is defined to be
$$R_i = \frac{P_i - 1/m}{1 - 1/m}. \tag{5}$$
From (4) and the fact that $P_i \le 1$, we have
$$0 \le R_i \le 1. \tag{6}$$
The reliability of the item for the population is defined as follows. Let $\alpha$ be the mean modal probability for the population:
$$\alpha = E_i\, P_i. \tag{7}$$
The reliability coefficient of U for the population is then defined as:
$$\rho = \frac{\alpha - 1/m}{1 - 1/m}. \tag{8}$$
From (4) and (7),
we find that
$$0 \le \rho \le 1.$$
Taking expectations over i for both members of (5) and using (7), we have
$$\rho = E_i\, R_i;$$
that is, the reliability coefficient for the population is the mean of the
individual reliability coefficients.
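The definitions of this section are stated over indefinitely many trials. As a rough illustration only, the Python sketch below applies them to a small, hypothetical record of repeated trials, using each person's observed relative frequencies in place of the probabilities over trials. The function names and the data are assumptions, and with so few trials the results are crude estimates rather than the parameters themselves.

```python
from collections import Counter


def modal_probability(trials):
    """Relative frequency of the most common category in one person's trials."""
    counts = Counter(trials)
    return max(counts.values()) / len(trials)


def reliabilities_from_trials(responses, m):
    """Return (individual coefficients, mean modal probability, group coefficient).

    responses[i] lists the categories chosen by person i over repeated trials;
    m is the number of categories of the item.
    """
    modal = [modal_probability(trials) for trials in responses]
    scale = 1.0 - 1.0 / m
    individual = [(p - 1.0 / m) / scale for p in modal]
    alpha = sum(modal) / len(modal)
    rho = (alpha - 1.0 / m) / scale
    return individual, alpha, rho


# Hypothetical record: three people, ten trials each, on a yes/no item.
responses = [
    ["yes"] * 10,               # always at the modal value
    ["yes"] * 6 + ["no"] * 4,   # modal value "yes" on 6 trials of 10
    ["yes"] * 5 + ["no"] * 5,   # two modal categories; either may be designated
]

individual, alpha, rho = reliabilities_from_trials(responses, m=2)
print(individual)
print(alpha, rho)
```

The group coefficient again equals the mean of the individual coefficients, as the last equation of this section states.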
7. Derivation of the Marginal Lower Bound
The marginal relative frequency of a category is the proportion
of observations falling into that category. Let $A_j$ be the marginal
relative frequency of $U_j$ over all trials and persons:
$$A_j = E_i E_k\, a_{ijk}. \tag{12}$$
From (2), (12) can be written
$$A_j = E_i\, p_{ij}, \tag{13}$$
so that from (3), $A_j \le E_i P_i$ for all j; or, from (7),
$$A_j \le \alpha \quad (j = 1, 2, \ldots, m). \tag{14}$$
Inequality (14) is the basis for our first lower bound. It states that $\alpha$
is not less than any of the marginals, so that in particular $\alpha$ is not
less than the largest marginal.
Let $\bar{A}$ be the largest of the $A_j$. Then from (8) and (14), we have
a lower bound to the group reliability coefficient:
$$\rho \ge \frac{\bar{A} - 1/m}{1 - 1/m}. \tag{15}$$
Each $A_j$ is defined over all trials as well as persons, and it is
desirable to bound $\alpha$ from but a single trial. In §10 it is shown that
the marginal for $U_j$ on a single trial is actually equal to $A_j$ (except
possibly for an infinitesimal proportion of the trials), provided that
the variation from trial to trial of each person is independent of the
variation of everyone else. In practice, then, the marginal from a
single trial can be used as being equal to $A_j$; and the largest marginal from a single trial can be used for $\bar{A}$, the largest of the $A_j$.
8. Derivation of the Lower Bound from Joint Occurrences
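Let V be an item with the n categories $V_1, V_2, \ldots, V_n$, and let
$$b_{igk} = \begin{cases} 1 & \text{if individual } i \text{ is in } V_g \text{ on trial } k, \\ 0 & \text{otherwise.} \end{cases} \tag{16}$$
Let $q_{ig}$ be the probability over trials that individual i will be in $V_g$:
$$q_{ig} = E_k\, b_{igk}. \tag{17}$$
Then
$$\sum_g q_{ig} = 1. \tag{18}$$
The proportion of joint occurrences of $U_j$ and $V_g$ over all trials and persons is
$$C_{jg} = E_i E_k\, a_{ijk} b_{igk}. \tag{19}$$
The two items are said to be experimentally independent* for the
i-th person if, for the given i,
$$E_k\, a_{ijk} b_{igk} = p_{ij} q_{ig} \quad (j = 1, 2, \ldots, m;\ g = 1, 2, \ldots, n). \tag{20}$$
Equation (20) states that, for individual i, the probability of the
joint occurrence of $U_j$ and $V_g$ (over the trials) is the product of the
separate probabilities. If (20) holds for all i, then taking expectations over i and using (19), we have
$$C_{jg} = E_i\, p_{ij} q_{ig}. \tag{21}$$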
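From (3) and (21),
$$C_{jg} \le E_i\, P_i q_{ig} \quad (j = 1, 2, \ldots, m;\ g = 1, 2, \ldots, n). \tag{22}$$
Let $\bar{C}_g$ be the largest value of $C_{jg}$ for fixed g, so that
$$C_{jg} \le \bar{C}_g \quad (j = 1, 2, \ldots, m). \tag{23}$$
Now, (22) states in particular that
$$\bar{C}_g \le E_i\, P_i q_{ig} \quad (g = 1, 2, \ldots, n). \tag{24}$$
Summing both members of (24) over g and using (18), we have
$$\sum_g \bar{C}_g \le E_i\, P_i = \alpha, \tag{25}$$
which provides the desired lower bound:
$$\rho \ge \frac{\sum_g \bar{C}_g - 1/m}{1 - 1/m}. \tag{26}$$
That each of the $\bar{C}_g$, and hence their sum, is observable from but
a single trial is proved in §10.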
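The lower bound in (26) is better than the marginal lower bound
in (15), a fact which can be seen as follows. Summing both members
of (19) over g and using (18) and (12), we have
$$A_j = \sum_g C_{jg}. \tag{27}$$
Hence, from (23), with $\bar{C}_g$ the largest $C_{jg}$ for fixed g,
$$A_j \le \sum_g \bar{C}_g \quad (j = 1, 2, \ldots, m), \tag{28}$$
so that in particular, for the largest marginal $\bar{A}$,
$$\bar{A} \le \sum_g \bar{C}_g. \tag{29}$$
* For further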
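The comparison of the two lower bounds can be spot-checked numerically. The Python sketch below is only an illustration under stated assumptions: the helper names are invented here, and the tables are random joint occurrence tables rather than data from the paper. It counts how often the bound from joint occurrences falls below the marginal bound, which by the argument above should never happen.

```python
import random


def marginal_alpha_bound(table):
    """Largest marginal of the row item: max over rows of the row total."""
    return max(sum(row) for row in table)


def joint_alpha_bound(table):
    """Sum over columns of the largest sub-frequency of the row item."""
    return sum(max(column) for column in zip(*table))


def random_joint_table(n_rows, n_cols, rng):
    """A random table of nonnegative proportions summing to 1."""
    cells = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
    total = sum(sum(row) for row in cells)
    return [[cell / total for cell in row] for row in cells]


rng = random.Random(0)  # fixed seed so the run is repeatable
violations = 0
for _ in range(1000):
    table = random_joint_table(3, 4, rng)
    # Small tolerance guards against floating-point rounding only.
    if joint_alpha_bound(table) < marginal_alpha_bound(table) - 1e-12:
        violations += 1

print(violations)
```

A check of this kind does not replace the proof; it only guards against slips when the bounds are coded.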
9. The Upper Bounds From Two Trials
The problem of two trials of a single item can be handled by regarding the item V of the preceding section to be a retrial of U, so
that expectations over k are interpreted as over a universe of pairs
of trials. Then m = n, $U_j = V_j$, and $p_{ij} = q_{ij}$, so that (21) becomes
$$C_{jg} = E_i\, p_{ij} p_{ig}. \tag{30}$$
Equation (30) shows that the table of joint occurrences for the test-retest of an item must be symmetric and Gramian.
The probability that person i will be in $U_j$ on both trials is, by
the multiplicative law of independent probabilities, $p_{ij}^2$. Let $\eta_i$ be
the probability that person i will be in some one category on both
trials. Then, by the additive law for mutually exclusive events,
$$\eta_i = \sum_j p_{ij}^2. \tag{31}$$
Let $\gamma$ be the mean probability that a person will remain in the same
category on both trials:
$$\gamma = E_i\, \eta_i. \tag{32}$$
Then, from (30) and (31),
$$\gamma = \sum_j C_{jj}. \tag{33}$$
Hence, $\gamma$ is observable from a single pair of trials, since the diagonal
elements $C_{jj}$ are observable, as is shown in the next section. $\gamma$ is the
basis for our two upper bounds.
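From (31), since one of the $p_{ij}$ is $P_i$, we have
$$\eta_i \ge P_i^2 \quad (i = 1, 2, \ldots). \tag{34}$$
Taking expectations over i, and using (32),
$$\gamma \ge E_i\, P_i^2. \tag{35}$$
Since the mean of squares is not less than the square of the mean,
$$E_i\, P_i^2 \ge (E_i\, P_i)^2 = \alpha^2. \tag{36}$$
Hence, from (35) and (36), we have the first upper bound:
$$\alpha \le \sqrt{\gamma}. \tag{37}$$
To improve on (37), we write (31) as
$$\eta_i = P_i^2 + {\sum_j}' p_{ij}^2, \tag{38}$$
where $\sum'$ indicates summation over the m - 1 non-modal categories.
Since the mean of squares is not less than the square of the mean,
$$\frac{1}{m-1} {\sum_j}' p_{ij}^2 \ge \Bigl( \frac{1}{m-1} {\sum_j}' p_{ij} \Bigr)^2. \tag{39}$$
Using (38) and (39) and the fact that ${\sum_j}' p_{ij} = 1 - P_i$, we have
$$\eta_i \ge P_i^2 + \frac{(1 - P_i)^2}{m - 1} \quad (i = 1, 2, \ldots). \tag{40}$$
Taking expectations over i in (40) and using (36) and (32),
$$\gamma \ge \frac{m\alpha^2 - 2\alpha + 1}{m - 1}, \tag{41}$$
or
$$m\alpha^2 - 2\alpha + [1 - (m-1)\gamma] \le 0. \tag{42}$$
From (42) we obtain the second upper bound for $\alpha$:
$$\alpha \le \frac{1 + \sqrt{(m-1)(m\gamma - 1)}}{m}. \tag{43}$$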
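The passage from (42) to the second upper bound is the solution of a quadratic inequality in the mean modal probability. The worked step below is a restatement added for convenience; it writes the mean modal probability as $\alpha$ and the mean probability of remaining in the same category as $\gamma$, and uses nothing beyond (42) itself.

```latex
% (42): m\alpha^2 - 2\alpha + [1 - (m-1)\gamma] \le 0.
% The quadratic opens upward (m > 0), so \alpha cannot exceed its larger root:
\alpha \le \frac{2 + \sqrt{4 - 4m\,[1 - (m-1)\gamma]}}{2m}
       = \frac{1 + \sqrt{1 - m + m(m-1)\gamma}}{m}
       = \frac{1 + \sqrt{(m-1)(m\gamma - 1)}}{m}.
% Check against the example of \S 5: m = 3, \gamma = .50 gives
% (1 + \sqrt{2 \times 0.5})/3 = 2/3 \approx .67.
```

The last equality uses the factorization $1 - m + m(m-1)\gamma = (m-1)(m\gamma - 1)$.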
10. Observability from a Single Trial
The proportion of the population observed in $U_j$ on trial k is
$$A_{j,k} = E_i\, a_{ijk}. \tag{44}$$
We shall prove that $A_{j,k} = A_j$ except possibly for an infinitesimal proportion of the trials, so that each of the total marginals is observable
from but a single trial. In particular, then, $\bar{A}$ will be equal to the
largest marginal observed on a single trial.
The proof consists of showing the variance of $A_{j,k}$ is zero over
all trials. Taking expectations over k for both members of (44) and
using (12) we have
$$E_k\, A_{j,k} = A_j. \tag{45}$$
Then the variance over trials of $A_{j,k}$ is
$$\sigma_j^2 = E_k\, A_{j,k}^2 - A_j^2. \tag{46}$$
To evaluate the right member, it is convenient first to consider a
finite population of N individuals and then to take the limit as $N \to \infty$.
For finite N, the operation $E_i$ is the operation $\frac{1}{N}\sum_i$. Then from (44),
we can write
$$A_{j,k}^2 = \frac{1}{N^2} \sum_h \sum_i a_{hjk} a_{ijk}. \tag{47}$$
We assume that what one person does on the item is experimentally
independent of what anybody else does on the item:
$$E_k\, a_{hjk} a_{ijk} = p_{hj} p_{ij} \quad (h \ne i). \tag{48}$$
If h = i, then of course, since $a_{ijk}^2 = a_{ijk}$,
$$E_k\, a_{ijk}^2 = p_{ij}. \tag{49}$$
Taking expectations over k in (47) and using (48) and (49) we have
$$E_k\, A_{j,k}^2 = \frac{1}{N^2} \sum_h \sum_i p_{hj} p_{ij} - \frac{1}{N^2} \sum_i p_{ij}^2 + \frac{1}{N^2} \sum_i p_{ij}, \tag{50}$$
or, using (13) and combining the last two terms,
$$E_k\, A_{j,k}^2 = A_j^2 + \frac{1}{N}\, E_i\, p_{ij} (1 - p_{ij}). \tag{51}$$
The expectation in the second term on the right is always finite, and in
fact is nonnegative and not greater than 1/4, so that the second term
has the limit zero as $N \to \infty$. Hence, $\lim E_k A_{j,k}^2 = A_j^2$, so that from (46),
$$\sigma_j = 0. \tag{52}$$
Therefore, $A_{j,k}$ is equal to its expectation $A_j$ except possibly in an
infinitesimal proportion of the trials.
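The rate at which the single-trial marginal settles down can be made concrete. For a finite population of N people who respond independently of one another, the variance over trials of the observed proportion in a category is 1/N times the mean of p(1 - p), where p is each person's probability of falling in that category. The Python sketch below evaluates this for a hypothetical worst case in which every p equals 1/2; the function name and the chosen population sizes are assumptions.

```python
def marginal_variance(probabilities):
    """Variance over trials of the observed proportion in one category.

    probabilities[i] is person i's probability of falling in the category;
    people are assumed to respond independently of one another.
    """
    n = len(probabilities)
    return sum(p * (1.0 - p) for p in probabilities) / n ** 2


variances = {}
for n in (10, 1000, 100000):
    # Hypothetical worst case: every person is equally likely to be in or out.
    probabilities = [0.5] * n
    variances[n] = marginal_variance(probabilities)
    print(n, variances[n], 1.0 / (4 * n))
```

Since p(1 - p) never exceeds 1/4, the variance is at most 1/(4N) whatever the individual reliabilities, which is why the marginals of a large population are stable from trial to trial.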
This observability from a single trial implies that the marginal
proportions of any item based on an indefinitely large population have
perfect test-retest reliability, regardless of the reliability of the individuals in the population.
More generally, in the joint occurrence table of two or more experimentally independent items, each of the joint frequencies observed on a single trial of a large population is perfectly reliable.
That is, if
$$C_{jg,k} = E_i\, a_{ijk} b_{igk},$$
and if $\sigma_{jg}^2$ is the variance of $C_{jg,k}$ over trials, then
$$\sigma_{jg}^2 = E_k\, C_{jg,k}^2 - C_{jg}^2 = 0,$$
except possibly for an infinitesimal proportion of the trials. The
proof is identical in form to that of the marginals.
Therefore, both the marginal lower bound and the lower bound
based on joint occurrences are observable from but a single trial; and
the upper bounds are observable from but a single pair of trials.
1. Guilford, J. P. Fundamental statistics in psychology and education. New York: McGraw-Hill, 1942.
2. Guttman, Louis. An outline of the statistical theory of prediction, in P. Horst, et al., The prediction of personal adjustment. New York: Social Science Research Council, 1941.
3. Guttman, Louis. A basis for scaling qualitative data. Amer. Sociol. Rev., 1944,
4. Guttman, Louis. A basis for analyzing test-retest reliability. Psychometrika, 1945, 10, 255-282.