Understanding Y haplotype matching probability

Charles H. Brenner (a, b)
published Forensic Sci. Int. Genet. 8 233–243, January 2014
(a) Human Rights Center, U.C. Berkeley, Berkeley, CA, United States
(b) DNA·VIEW, 6801 Thornhill Drive, Oakland, CA 94611-1336, United States
Abstract
The Y haplotype population-genetic terrain is better explored from a fresh perspective rather than by analogy with the more
familiar autosomal ideas. For haplotype matching probabilities, versus for autosomal matching probabilities, explicit attention to
modeling – such as how evolution got us where we are – is much more important while consideration of population frequency is
much less so. This paper explores, extends, and explains some of the concepts of “Fundamental problem of forensic mathematics –
the evidential strength of a rare haplotype match” [1]. That earlier paper presented and validated a “kappa method” formula for the
evidential strength when a suspect matches a previously unseen haplotype (such as a Y-haplotype) at the crime scene. Mathematical
implications of the kappa method are intuitive and reasonable. Suspicions to the contrary raised in [2] rest on elementary errors.
Critical to deriving the kappa method or any sensible evidential calculation is understanding that thinking about haplotype
population frequency is a red herring; the pivotal question is one of matching probability. But confusion between the two is
unfortunately institutionalized in much of the forensic world. Examples make clear why (matching) probability is not (population)
frequency and why uncertainty intervals on matching probabilities are merely confused thinking. Forensic matching calculations
should be based on a model, on stipulated premises. The model inevitably only approximates reality, and any error in the results
comes only from error in the model, the inexactness of the approximation. Sampling variation does not measure that inexactness
and hence is not helpful in explaining evidence and is in fact an impediment.
Alternative haplotype matching probability approaches that various authors have considered are reviewed. Some are based on
no model and cannot be taken seriously. For the others, some evaluation of the models is discussed. Recent evidence supports
the adequacy of the simple exchangeability model on which the kappa method rests. However, to make progress toward forensic
calculation of Y haplotype mixture evidence a different tack is needed. The “Laplace distribution” model of Andersen et al. [3],
which estimates haplotype frequencies by identifying haplotype clusters in population data, looks useful.
Keywords: haplotype, Y-haplotype, likelihood ratio, weight of evidence calculation, probability, model
1. Introduction – understanding Y haplotypes
For understanding the forensic use of Y-haplotype evidence, rather than adapt the methods and habits that have evolved for the analysis of autosomal DNA evidence it is more appropriate and productive to start over from the beginning. Evidence is quantified by a likelihood ratio built from the probability for a coincidental match by an innocent suspect; that fact remains. All else is up for grabs.
The genetic rules for the Y haplotype are different in several ways from the autosomal rules and these differences have population genetic consequences, which in turn affect matching probabilities. The most important genetic difference is of course the lack of recombination. One obvious consequence everyone knows: a matching probability can’t be obtained by multiplication across loci. The haplotype must be treated as a unit. A Y-haplotype thus seems analogous to an autosomal allele, but we will see that copying the treatment of autosomal loci would be to fall into a trap, to err in several respects. For a start, the idea that sample frequency approximates population frequency approximates matching probability is too careless when most haplotypes in the population are completely unrepresented in the sample, as is the usual case with Y-haplotypes composed of multiple STR loci. Another habit from autosomal practice that doesn’t apply sensibly to Y-haplotypes is the treatment of θ – the Cockerham/NRC II [4, 5] notation for allele sharing by common descent. The unexpected reason why it does not apply is explained in §2.4.
In the autosomal case, a simple model considers only allele probabilities (sometimes carelessly called frequencies – see §4) and taking θ into account is a refinement whose introduction adds a bit of accuracy by acknowledging that identical alleles are occasionally identical by descent (IBD). The Y-haplotype situation is the opposite: Identical alleles are nearly always IBD. Hence θ comes first and anything else is the minor refinement. Once we know θ we are close to the matching probability.
Terminology: Two or more haplotypes are considered “identical” if they are identical as far as determined, i.e. for Yfiler haplotypes it means having the same repeat number(s) at each STR locus. Haplotypes are “identical by descent” (IBD) if they are identical and all the haplotypes along the (necessarily patrilineal) paths connecting them to their common ancestor are also identical. “Identical by state” (IBS) as used here is synonymous with “identical”, an umbrella meaning in that IBS thus includes IBD as a subset.¹ Adopting the umbrella definition for IBS means some other term may be needed to mean IBS but not IBD and for this purpose I use the word “strict.” Identity strictly by state is also “convergence”.
The expression “unrelated man” is common in forensic practice; a good enough approximation sometimes. In the Y haplotype world, always remember that everyone is related, for that is nearly the only reason men match.
The historical development of autosomal calculation methods has been unsystematic, often intuitive, and to the extent that there are underlying models, which concepts have been incorporated (such as random mating, subpopulations, mutation) has of course been motivated by relevance to the autosomal situation. The appropriate and important concepts for a model useful for haplotype analysis may be different ones. It makes sense to begin by exploring the population genetics of haplotypes in order to gain an understanding, then to consider models and methods of calculation.
Preprint submitted to Forensic Science International: Genetics, February 9, 2014
2. Exploring Y haplotype populations
Before discussing a few approaches, good and bad, that have been suggested for Y haplotype matching probabilities, we will delve into the nature of Y haplotype populations in order to have some background that will be a helpful context for evaluating the approaches.
I see three ways to explore: data, theory, and simulations.
2.1. Exploring Y-haplotype data
Population samples (or “reference databases”) for the 17-locus Yfiler™ Y haplotypes for n > 1000 men are conveniently available on the Internet [6] for several different populations. Examples, simulations, data and analysis mentioned in this paper assume Yfiler haplotypes unless otherwise specified.
Examining the data, a few points are quickly obvious:
(a) The vast majority of types that occur at all in a population sample occur exactly once.² I use the symbol κ for this singleton proportion. κ gradually becomes smaller as the number of collected reference haplotypes increases and is larger when haplotypes include more loci and hence are more polymorphic. κ > 0.8 for reference databases considered in this paper.
    (i) The crime scene type will usually not be found in the population sample.
    (ii) Therefore this usual situation is the most important and most worthy of our attention.
    (iii) Sample frequency is a poor estimate for population frequency. Among the haplotypes that are not represented, the sample frequency of 0 obviously underestimates the true frequency. Further, in order to compensate for the large fraction of the population that is under-represented in the sample, those sample frequencies that are not 0 must on average greatly overestimate the true frequency.
    (iv) Point (a)iii may seem paradoxical, seemingly contradicting the intuition that sample frequency is an unbiased estimate for frequency. That intuition is correct in that averaging sample frequencies for a particular haplotype T over many repeated samplings is an unbiased estimate for population frequency. For a typical rare haplotype, repeated samplings give a sample frequency that is usually 0 but occasionally 1/n (where n = sample size), rarely more, with expected value equal to the population frequency – no bias. But the situation at hand is quite different; we fix our attention on a single sample and a single, necessarily non-zero, sample frequency such as 1/n (the common situation for casework), and consider the many haplotypes in the sample with that sample frequency. For example consider the set of population frequencies of the once-observed haplotypes in a database of size n = 1000. Not only are those frequencies well under 1/1000 on average, only a relative handful of individual frequencies exceed 1/1000.
(b) The probability that two randomly selected men have the same type is small – about 1/8800 among Caucasians, 1/3300 among Chinese, 1/13000 among US Blacks [6]. These numbers are calculated simply by comparing every pair of men in a database. Note that this calculation from pairwise comparisons is another way, different from sample frequency, for using the sample database to come up with matching probability. Implication: No type occurs many times. For example, among the 4102 Caucasian full profiles, κ = 84% are singletons (once occurring), and 98% of the men in the sample have a type shared with at most 5 men. Implications:
    (i) The average haplotype population frequency among observed Caucasians is 1/8800. For those haplotypes observed only once in the sample, the frequency must be even less. Therefore the singleton sample frequency of 1/4102 must be an overestimate by much more than two-fold. In fact, as is shown in [1], the actual amount of overestimate is by 1/(1 − 84%) or about a factor of 6.
(c) Also interesting if not so obvious, using a test devised by Slatkin [7]: Comparing the population sample with expectation under the Kimura etc. model of infinitely many selectively neutral alleles generally shows very small p-values, i.e. p < 0.01 for the Caucasian and other large datasets.³
(d) Various published data are available for the mutation rates at the various Y STR loci, and not surprisingly they are similar to autosomal STR rates: STR mutation rates average 1/350 per locus, so over the full 17 loci µ = 17/350, or 5%, per meiosis.
¹ Some writers use IBS differently – as an alternative or partially an alternative to IBD – but the umbrella usage looks to me more common especially among careful writers.
² For presently available sample sizes. If – somewhat unrealistically – sample size n were increased without increasing the number of loci, then to be sure eventually κ < 1/2 (for about n > 10930 according to [1]).
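The two database summaries used in points (a) and (b) can be sketched in a few lines. This is only an illustration on a made-up toy database, not the analysis behind the published figures; the function names are hypothetical, and κ is taken here to be the fraction of database entries whose type occurs exactly once, which is an assumption about the intended definition.

```python
from collections import Counter

def singleton_proportion(haplotypes):
    """Fraction of database entries whose haplotype occurs exactly once (assumed meaning of kappa)."""
    counts = Counter(haplotypes)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(haplotypes)

def pairwise_match_probability(haplotypes):
    """Fraction of ordered pairs of distinct entries that carry the same haplotype."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    matching_pairs = sum(c * (c - 1) for c in counts.values())
    return matching_pairs / (n * (n - 1))

# Toy database: each string stands in for a full multi-locus haplotype.
toy_db = ["A", "A", "B", "C", "D", "E", "E", "E", "F", "G"]
print("kappa =", singleton_proportion(toy_db))
print("pairwise match probability =", pairwise_match_probability(toy_db))
```

In a real database each entry would be a tuple of repeat numbers rather than a single letter; any hashable representation works with `Counter`.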
2.2. Theoretical explorations
The main focus of this and the following section is exploring and understanding the claim in §1 that the story of rare haplotype matching is essentially the story of common descent.
As a simplified but helpful model of a forensic Y-haplotype: Represent the data at each locus as an integer repeat amount; model a compound locus as if two loci; mutation occurs by single steps plus or minus at a locus and the mutation rate is the same 1/350 per meiosis at every locus. Under this model, a Yfiler haplotype is an ordered set of 17 integers, which is to say a point in a 17-dimensional integer lattice, and the evolutionary course of a haplotype as it is transmitted through generations with possible mutation is a 17-dimensional random walk. The mutation rate of µ = 1/20 per generation means that father and son carry the same haplotype with probability 0.95. If two contemporary men trace back a combined 2g generations to their most recent common patrilineal ancestor (MRCA), then with probability 0.95^{2g} no mutation occurred at any of the 2g meiotic events between them and their Yfiler haplotypes are IBD. For example, if the MRCA lived 100, 1000, or 10000 years ago then g ≈ 4, 40, or 400 and the IBD probabilities are 70%, 1/40 and 1/10^{16} (!) respectively. If they represent populations that have truly been separate for 10000 years, there is no chance for them to be identical by descent.
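The no-mutation probability (1 − µ)^{2g} is easy to evaluate directly. The sketch below simply plugs in µ = 1/20 and the three values of g mentioned in the text; the function name is hypothetical, and since the quoted figures are rounded and depend on the assumed generation length, the printed values need not reproduce them exactly.

```python
MU = 1 / 20  # per-generation haplotype mutation rate used in the text

def prob_ibd(g, mu=MU):
    """Probability that none of the 2g meioses separating two men involved a mutation."""
    return (1 - mu) ** (2 * g)

for years, g in [(100, 4), (1000, 40), (10000, 400)]:
    print(f"MRCA {years:>5} years ago, g = {g:>3}: Pr(IBD) = {prob_ibd(g):.3g}")
```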
What is the additional probability that two such men’s haplotypes are identical and not IBD? Identity strictly by state arises through convergent evolution, i.e. through multiple mutations whose net effect is to cancel one another out. The probability of convergence is easily seen to be very small. Since the mutation model is the same whether time goes forward or backward, the trail of haplotypes connecting two men is a random walk in the high-dimensional lattice, and convergence corresponds to a random walk returning to its starting point. That’s an unlikely event when the number of dimensions is large [8].
An example of a 3-dimensional integer lattice is a multi-story parking building with numbered rows and numbered slots within each row on each level. Now imagine 100 such buildings, arrayed in a 10 by 10 grid. That’s a 5-dimensional lattice of parking slots. Suppose you park your car and while you are gone someone randomly moves it a few times by step-wise mutation – “mutating” to the same slot in an adjacent row, then to the corresponding position one floor up or one structure to the East – like the meander through a haplotype’s genealogical mutational history. After even a few mutations your car is hopelessly lost. Equivalently, it is very unlikely that the car will mutate back to where you left it. Large-dimensional space is huge, cavernous and sparsely populated, and intuitions derived from one, two, or even three dimensions are a poor guide. In particular, in one or two dimensions random walks essentially always return to the start; in high dimensions rarely. In autosomal analysis each locus, such as vWA, can be considered as an independent unit, and considered that way, identity between two alleles has only a 10% or so chance of being IBD. (The same would be true for a single Y STR locus, but we are not considering Y loci singly.)
³ However a few of the datasets show non-significant p-values such as three samples of size |D| ≈ 300 from Malaysia in Table 2 of [1], for which p = 0.08, 0.46, and 0.66.
Figure 1: If the two IBS Y haplotypes of generation g are not IBD (are strictly IBS), then the lineage path connecting them through a common ancestor (MRCA) must include at least two mutational (dotted line) events.
The main chance for haplotype convergence is to have exactly two intervening mutations which cancel one another (Figure 1). In the model there are 34 different possible mutations (gain or loss at each of 17 loci), hence even assuming that exactly two mutations occur it is only 1/34 that they cancel. Computing the probability of 2-mutation convergence is a simple binomial exercise:
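\[
\begin{aligned}
\Pr(\text{strictly IBS match}) &= \Pr(\text{two haplotypes are convergent}) \\
&\approx \Pr(\text{2 cancelling mutations in } 2g \text{ generations}) \\
&= \mu^{2} (1-\mu)^{2g-2} \binom{2g}{2} \frac{1}{34}.
\end{aligned}
\]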
Cancelling patterns involving more than two mutations are improbable (and trickier to compute), but when considering long time spans – millennia – they are, although a slight chance, virtually the only chance of Y haplotype random matching and their probability is therefore at least of evolutionary interest. Figure 2 shows the relationship. Matching across a patrilineal separation of several centuries is nearly always because there are no intervening mutations at all, and matching across many millennia of separation is expected almost never to happen but will be after multiple mutations when it does. Indeed, comparing a recent study of 424 Chinese [9] with 10000 men in the various ABI populations showed no overlap of haplotypes. By comparison with the IBD probabilities above, the corresponding IBS probabilities increase slightly to 70%, 1/30 and 1/10^{8}.
Compare Y haplotype populations with the theoretically
much discussed “infinitely many neutral alleles” model of
Kimura, Crow, etc. (§3.4). The infinite alleles model assumes
that every mutation is to a new type, that there is no selection,
and that the population size is fixed. From Slatkin’s test [7, 10]
some of the assumptions don’t hold very well for Y haplotypes.
The first assumption is nearly true for Y haplotypes, the second
is debatable, the third violated. It would therefore be risky
to assume that Y haplotype populations conform to the infinite
alleles model.
Figure 2 (plot not reproduced; vertical axis: Pr(Yfiler types are identical), horizontal axis: years since MRCA, both logarithmic; curves: IBS, IBD, strictly IBS): Comparing probability of any match (IBS) with IBD and strictly IBS. A match strictly by state is very unlikely although when the MRCA was more than 1500 years ago, IBD is even more unlikely.
Figure 3 (diagram not reproduced; labels: MRCA (1500 A.D.), major IBD group, minor IBD groups, 500 years, 1000 men): Schematic representation of the IBS cohort for a typical haplotype T. A single large IBD pedigree tree accounts for 98% of all T haplotypes.
2.3. Y-haplotype population simulations
Figure 2 shows that IBD is the dominant category of IBS, especially in the situation of recent MRCA and correspondingly larger chance of IBS. But without knowing the distribution of MRCA times between pairs of haplotypes it can’t be interpreted to say by how much IBD dominates strict IBS. To answer that question I generated populations of 17-locus Y haplotypes by simulation. By tagging each individual with his mutation history it is possible to count an observed rate of IBD among the pairs that are IBS. The experimental result is Pr(IBD|IBS) ≈ 33/34, and the probability would increase to the limit of 1 as the number of loci increases. This says that if two men share the same Y haplotype it is nearly always (e.g. 97%) because they are IBD – opposite to the autosomal situation in which only a few percent of matching is IBD. Let “IBD group” mean a group of (patrilineally related) men who share the same Y haplotype because it was transmitted among them without mutation. The entire collection of men with a particular haplotype T – an “IBS group” – thus consists of one or more IBD groups. The condition that IBD accounts for 97% of the TT pairs of men ensures that one of the IBD groups must dominate, including over 98% of the IBS men.⁴ Thus the almost invariable pattern is that almost all the men with a given type belong to one dominant IBD group. Figure 3 illustrates the idea. How closely are they related? As we have noted, two men 400 generations apart are never IBD, so an IBD group is much closer than 200th cousins. The chance for two random men to be IBS is about 1/8800, so their chance to be IBD is about 97% of that or 1/9000. Hence for even a medium sized population of ten million men, the dominant IBD group for a particular type is perhaps 1000 men – too many to be brothers or even close cousins. More distant cousins are far more numerous though less likely to match. Computer simulations weighing these countervailing tendencies suggest that 10th to 30th cousin (measured via patrilineage) is roughly how related two matching men typically are. The story of a randomly selected man matching a crime scene haplotype is mostly the story of a randomly selected man being such a relative.
⁴ As you can easily convince yourself with a little mental shuffling of IBD group size proportions d_1, d_2, ..., with Σ d_i = 1, realizing that Pr(IBD|IBS) ≈ Σ d_i^2.
In summary Y-haplotype identity is overwhelmingly identity by descent, which confirms the assertion in §1 that the story of many-locus haplotype matching is predominantly a story of θ and which thus tends to justify the simplifying modeling assumption under which the “κ method” (§3.6.2 and [1]) ignores haplotype structure. Consideration of haplotype structure (such as by looking at mutational neighbors; see §3.3) is essentially an effort to evaluate the possibility of non-IBD matching, of convergence, but since that is so rare even a good job of estimating it would be only a minor refinement. Therefore I do not concur with the speculation in [2] that ignoring structure entails “a substantial loss of information”.
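The step from Pr(IBD|IBS) ≈ 33/34 to "one IBD group holds over 98% of the men" can be checked numerically. The sketch below rests on an argument that is my own reading of footnote 4, not a formula stated in the text: for a fixed largest share d_1, the sum Σ d_i^2 is largest when all remaining men form a single second group, so Σ d_i^2 ≤ d_1^2 + (1 − d_1)^2, which gives a lower bound on d_1. The function name is hypothetical.

```python
from math import sqrt

def min_dominant_share(sum_of_squares):
    """Lower bound on the largest group share d1, given s = sum(d_i^2).

    Assumes s <= d1^2 + (1 - d1)^2 (remaining men pooled in one group),
    solved for d1: d1 >= (1 + sqrt(2s - 1)) / 2.
    """
    return (1 + sqrt(2 * sum_of_squares - 1)) / 2

share = min_dominant_share(33 / 34)
print(f"dominant IBD group holds at least {share:.1%} of the IBS men")

# Rough size of the dominant IBD group in a population of ten million men,
# using the 1/9000 IBD chance quoted in the text.
print(f"approximate dominant group size: {10_000_000 / 9000:.0f} men")
```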
2.4. Thinking about θ
Classic allele matching probability for an autosomal locus
[2]:
\[
\Pr(TT) = p_T \Pr(T \mid T), \quad \text{where} \quad \Pr(T \mid T) = \theta + (1-\theta)\,p_T. \tag{1}
\]
Here Pr(TT) means the probability that two alleles such as those of the genotype of a person are both T and p_T is supposed to be the probability that a randomly selected allele is T.
Formula (1) says that there are two ways that the second allele examined can also be T – with probability θ it is T because the two alleles are IBD; with probability 1 − θ they are not IBD in which case it is T with probability p_T. If TT is understood to mean the two alleles of a genotype then θ means the inbreeding coefficient, a measure of relatedness between the parents.
Can formula (1), by suitable analogy, apply to Y-haplotypes? More generally TT might be two alleles chosen by whatever rule; then θ is the probability that two alleles chosen by that rule are IBD. If the scenario is that the two “alleles” are a crime scene Y-haplotype and the Y-haplotype of a randomly selected suspect, then θ means the chance that two randomly selected men belong to the same patrilineal IBD group as in Figure 3. Hence in the case of Y-haplotype matching the first term of (1), the θ, represents the chance of IBD matching, an interpretation proposed in [2]. But as we have discovered, the chance of IBD matching is already approximately the chance of matching. Hence the other term, (1 − θ)p_T, representing the chance of a non-IBD match, will be very small, dwarfed by comparison.
Then what is p_T? It is something like the conditional probability for a second allele to match the first given that they are not IBD, but the exact meaning is even stranger and less intuitive than that. The parameter θ is defined as a population-wide average and it may be larger or smaller than a haplotype-specific matching chance. In particular if the crime scene haplotype is a type not observed in the reference database (see §2.1(a)i) then it is likely to be less common than average and θ is already larger than the probability that a randomly selected suspect matches. It follows that p_T would be negative and therefore is not even a probability.
Example: For the Caucasian example database of §2.1(a), θ = 1/9000 and Pr(T|T) = 1/25000 (Table 1) when the crime type T is new. So p_T = −1/14000 by formula (1).
The analogy fails. The conclusion is that formula (1) was never an accurate formula even for autosomal alleles, but that fact has been overlooked because with forensic autosomal markers the second term dominates. This is another example of an approximation that is adequate in the autosomal arena but is not sensible when dealing with Y (or other) rare haplotypes.
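The negative value in the example follows from solving formula (1) for p_T, that is p_T = (Pr(T|T) − θ)/(1 − θ). A small sketch with exact rational arithmetic, using the two figures quoted in the example (the function name is hypothetical):

```python
from fractions import Fraction

def solve_p_T(theta, pr_t_given_t):
    """Solve Pr(T|T) = theta + (1 - theta) * p_T for p_T."""
    return (pr_t_given_t - theta) / (1 - theta)

theta = Fraction(1, 9000)          # population-average IBD chance from the text
pr_match_new = Fraction(1, 25000)  # matching chance for a previously unseen type (Table 1)

p_T = solve_p_T(theta, pr_match_new)
print("p_T =", p_T, "which is about -1 /", round(float(-1 / p_T)))
```

The result is negative, roughly −1/14000 as stated, so it cannot be interpreted as a probability.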
3. Some haplotype matching calculation approaches
This section surveys various matching calculation methods that have been suggested for haplotype evidence. We assume here (as in [1]) that reference data exists that is suitably representative of some conceptual population of possible donors of the evidential haplotype, and that the hypotheses regarding a suspect are that he is either the donor or is in effect randomly selected from the population. This idealization is adopted not because other assumptions don’t have practical importance – they do – but on the grounds of learning to walk before running.
Table 1 lists the methods discussed and gives a general sense of the range of likelihood ratios entailed, the largest of which represent the true strength of the evidence. Some of the smaller numbers may be acceptable as conservative. Through consideration of particular examples a few generalities become evident:
• It’s important to have a model.
• It’s sensible to consider the crime scene type as part of the sample database.
• A sample database can be treated as data in various ways.
I suggest that any sensible method in forensic mathematics must be grounded in a model – a presumed state of nature – and will also depend on data. Some of the approaches are based on a model. For notation used in the discussion refer to Table 2.

Table 1: Comparison of approaches assuming the Caucasian example database with n = 4102 and κ = 0.84 and a new haplotype
method                     section     likelihood ratio
blind counting             (§3.1)      4102/0 – i.e. infinite
Brenner counting           (§3.1.1)    4102
C.I. from 0 (SWGDAM)       (§3.2)      4102/3
frequency surveying        (§3.3)      ∼ 4100?
infinite alleles model     (§3.4)      13000
average matching chance    (§3.5)      8800
t-model                    (§3.6.1)    23000
κ method                   (§3.6.2)    25000
discrete Laplace model     (§3.7)      ∼ 25000

Table 2: Glossary of notation
notation       meaning
LR_Ω           matching LR assuming model Ω
LR_Ω(T,D)      matching LR assuming Ω and using explicitly mentioned data
T              type to match (data)
D              reference database augmented with the target haplotype (data)
|D|            size of D
Ω              population model
Ω_t, Ω_κ       particular models
θ              Pr(IBD) – ([4, 5] notation)
θ_Ewens        θ as in [11]; essentially the effective number of alleles, minus 1
κ              proportion of singletons in D
p              “popularity”; number of occurrences
∝              proportional to; i.e. if y = 2x then y ∝ x
3.1. Blind counting method
The “counting method” traditionally means that the number of types in the population reference sample that match the crime scene target type, and the size n of the population sample, are reported as representing the matching evidence. In the interesting and usual case that 0/n is reported, there is at least a possible inference that the trait is infinitely rare and hence the evidence is infinitely strong against the suspect. Obviously that’s not accurate or intended but it’s fair to ask the reporting expert what the court is meant to conclude. If the expert doesn’t have an answer then certainly the court won’t either and the evidence presentation is deficient, arguably lacking probative value. If the expert does have an answer, the expert should give the answer rather than be coy.
3.1.1. Counting method (per Brenner)
[1] introduces a modified definition for “counting method” which we adopt henceforth. Begin with the premise that the appropriate question for evaluating a DNA match is the probability that an innocent (meaning randomly selected with respect to the crime scene donor) suspect would have the target type, conditional upon the crime scene occurrence. Conditioning is equivalent to mentally extending the reference database by adding the target type [1]. The consequence is that a new type is reported as occurring one time in the extended database D. What calculation to then make from D remains to be decided and various possibilities will be discussed later, but of course the most obvious idea is to consider the sample frequency, 1/|D|, and it is a conservative estimate (i.e. an overestimate) of the probability that an innocent suspect will match as noted in §2.1(b)i.
“Overestimate the probability” means that the expected value of the sample frequency is greater than the probability in light of the data and everything we know about nature. That’s not a purely mathematical truth; the “nature” component is critical. In fact maybe it’s impossible to have a non-trivial correct formula without population genetic modelling assumptions – i.e. impossible to prove validity with mere mathematics and without appeal to population genetic reality (but §3.6.1 has the opposite speculation). As a thought experiment imagine a model called common-type under which nature has a very strong tendency to discourage types rarer than 25% population frequency. Under that model, if we observe five instances of a type T in a sample of size n = 50, it is much more likely that we have under-observed a common trait than over-observed a rare one, hence we would properly conclude that Pr(T) > 5/50, i.e. the counting method would be anti-conservative. It follows that accepting even the counting method entails accepting at least some weak modeling condition, e.g. ruling out common-type. In reality, because of drift and mutation, nature very much prefers rare Y STR haplotypes, like Figure 4(b), hence the counting method is quite conservative. The counting rule likelihood ratio
LR_counting = |D|    (2)
is therefore an understatement of the evidence that the suspect is the crime scene donor.
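To see how conservative the counting rule is for the running example, the sketch below compares formula (2) with the counting value inflated by the factor 1/(1 − κ) quoted in §2.1(b)i. Treating |D|/(1 − κ) as a stand-in for the κ method is an assumption made here for illustration; the method itself is defined in [1] and §3.6.2, and the function names are hypothetical.

```python
def lr_counting(db_size):
    """Counting-rule likelihood ratio, formula (2): LR = |D|."""
    return db_size

def lr_with_singleton_correction(db_size, kappa):
    """Counting LR scaled by 1 / (1 - kappa), the overestimate factor quoted in section 2.1(b)i."""
    return db_size / (1 - kappa)

db_size, kappa = 4102, 0.84  # Caucasian example database from Table 1
print("counting LR:", lr_counting(db_size))
print("LR after singleton correction:", round(lr_with_singleton_correction(db_size, kappa)))
```

The corrected value is about six times the counting value, in line with the 4102 versus roughly 25000 entries of Table 1.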
3.2. Confidence interval from zero
Something like the approach of SWGDAM recommendations [12] has been recommended by several forensic statisticians but I do not like it as it has no mathematical basis. It does not correspond to any model and it confuses concepts.
As noted above, the case of interest is a crime stain not found in the n-haplotype reference database. The analysis begins with the correct idea that the value of the evidence against a matching suspect can be expressed as the probability that a random (i.e. non-donor) person would match, conditional on the observation at the crime. Two errors in quick succession then abruptly derail the analysis: failing to condition, and replacing “probability” by the different and inappropriate concept of “population frequency”. The right question
Based on the evidence that a type has been seen once,
what is the probability to see it again?
thus has been twisted into the wrong question:
What is the population frequency of a never-seen
type?
An attempt is made to answer this by analogy with how a statistician might estimate the frequency of a type that has been
observed several times, namely confidence intervals – an artifice at best which becomes particularly dubious in the arena of
zero observations, the dreaded “confidence interval from zero.”
Statisticians know well that even at best a confidence interval
doesn’t really mean a range the frequency is likely to be in
(though that of course is how it will be understood), but rather
a converse of that: a range such that if the frequency is within it
the actual observation is likely. In some situations the approximation of one by the other may be plausible; zero observations
is not such a situation. Finally, the question arises as to what
confidence interval is appropriate. Historically 95% is an arbitrary choice. It’s not rooted in any principle. Possibly it has
proven its mettle in a practical way in arenas like manufacturing process control, but there is no reasonable analogy, let alone
logic, for how such an experience would translate to DNA, evidence, and justice. A comical debate lately arose about the
right size for the one-sided interval from zero in the SWGDAM
procedure. A reasonable analogy would be debating what kind
of tutu to put on a dog. In truth, adopting a confidence interval
from thin air is no more than a way to paper over and cover up
deeper seated illogic. See §4.
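For concreteness, the arithmetic behind a “confidence interval from zero” can be sketched as follows. This illustrates the computation being criticized, not a recommended method. It assumes the usual exact one-sided binomial bound for zero observations in n samples, 1 − α^(1/n), whose 95% version is approximately 3/n. The function name is hypothetical, and the database size 4102 is an assumption borrowed from the 4102/3 figure quoted in §4.2.

```python
def upper_bound_from_zero(n: int, confidence: float = 0.95) -> float:
    """Exact one-sided upper bound on a frequency after 0 hits in n samples.

    Solves (1 - f)**n = 1 - confidence for f.
    """
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n)


n = 4102  # assumed database size, for illustration only
bound_95 = upper_bound_from_zero(n, 0.95)
bound_99 = upper_bound_from_zero(n, 0.99)

# The reported "1 in ..." figure moves with the chosen confidence level.
print(f"95% bound: 1 in {1 / bound_95:.0f} (rule of three: 1 in {n / 3:.0f})")
print(f"99% bound: 1 in {1 / bound_99:.0f}")
```

The reported number depends directly on the confidence level chosen, which is the arbitrariness the text objects to.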
3.3. Frequency surveying
Haplotype “frequency surveying” [13, 14, 15] is a proposal
motivated by the notion that haplotypes which are near stepwise neighbors will tend to have similar frequencies, presumably because of mutation among neighbors. Hence it seeks to
augment the paucity of reference observations for a particular
haplotype by considering as well the richness or sparseness of
nearby haplotypes. However this tempting idea doesn’t hold up
on close consideration. It suffers from a handful of possibly
reparable shortcomings in execution, and an insuperable fundamental error.
The shortcomings begin with no explicit model. Therefore the curve-fitting approach isn’t derived mathematically, but
rather consists of the guesswork that it will be good enough to
assume that each one-step neighbor will contribute to a haplotype’s frequency as much as two two-step neighbors, three
three-step neighbors, and so on. Presumably the image is that
there is some kind of traffic among the haplotypes in a densely
populated region of the 17-dimensional lattice space so that
they replenish (or otherwise influence) one another through mutation. However, that image is tantamount to assuming a high
rate of convergence through mutation (even across several mutational differences), contrary to the story described in §2.2. On
the other hand with rare traits – which haplotypes are – genetic drift acts relatively quickly. Consequently I expect any
influence from a replenishment phenomenon to be dwarfed by
genetic drift. Haplotype frequencies are mostly just random.
Notwithstanding an unfortunate and false endorsement in [2] that the simultaneously published paper [15] presents a validation of frequency surveying ([15] itself makes no such claim), the frequency surveying approach cannot work. [16, 17, 18, 3] have further discussion. However, my impression from author Krawczak is that he now agrees the method is invalid and therefore perhaps it is a dead issue.
3.4. The infinite alleles model approach
From Ewens’ celebrated [11], the model of infinitely many neutral alleles suggests a prior distribution of haplotype frequencies that amounts to β(0, θEwens) (Figure 4(b) and formula 7). θEwens ≈ 8800 for Caucasians per §2.1(b). Given such a prior distribution one can apply Bayes’ theorem, incorporating sample data if available, to compute a posterior probability:
$$\Pr(\text{random match} \mid \text{crime stain haplotype}) = 1/(n + \theta_{\text{Ewens}})$$
for a type previously unseen in a database of size n. This result, mentioned in [1], was the first method I discovered for the rare haplotype matching probability. I decided though that the result is too risky to recommend in forensic practice for the reasons given at the end of §2.2.
3.5. The average matching chance
The empirical pairwise matching experiment described in §2.1(b) is the same as the experiment of comparing an innocent man with a crime stain. Therefore it can fairly be used in reporting the evidence (either for a haplotype not in the database as the estimate is then conservative, or if the reference database is lost) and an appropriate statement is easy to formulate and explain:
    The suspect matches the crime stain. If the crime
    stain donor is not the suspect, such a match occurs
    only one time in 8800. Therefore either the suspect is
    the donor or a 1-in-8800 coincidence has occurred.
3.6. Models motivated by the data
Efron [19] makes the point that in the modern statistical world of “big data”, models are sometimes implied by the data itself. Haplotype population sample data shows a prevalence of singletons, and some of the implications of this have already been mentioned in §2.1(a). If every type in the database were unique – as would be the case if we sampled complete genomes for example – the hypothesis of maximum likelihood would be that everyone in the world is unique (and a lower bound on the order of n² effective types). The presence of some repeated observations in the database implies an upper bound as well as a lower on the amount of diversity in the population. I considered two approaches exploiting this idea, the “t-model” and the “κ method”.
3.6.1. t-model
Suppose the population of haplotypes consisted of some large number t of equally frequent types, the model called Ωt in [1]. LRΩt = t. An observation that κ is the proportion of singletons in a sample of size n would be most consistent with
$$t = \frac{n}{-\ln \kappa} \qquad (3)$$
as is shown in [1]. The model of equal frequencies is artificial, but it may be possible to prove that it is conservative by demonstrating that it represents a worst possible case for matching singletons. I have not proven that, but also I’ve not been able to construct even a contrived counterexample population. Therefore it is a possibility that (3) is conservative without any population genetic assumptions.
3.6.2. κ method
The main result of [1] is the “κ method”, which refines the counting method. As in the counting method, augment the reference database to D by including the crime scene type in it. Let κ be the proportion of singletons in D. The following lemma quantifies remark §2.1(a)i.
Lemma 1. D includes the types of 1 − κ of the population.
Proof Think of the database samples D accumulated one at a time. The last man added to D is, with probability κ, one of the singletons and hence was unrepresented before he was added. The same will be true of the next man added – probability κ that his type will be previously unrepresented. That’s the same as saying that κ of the population is unrepresented in D. Therefore the proportion of the population that is represented is 1 − κ.⁵
Otherwise put, for types that do occur in D such as the crime scene type, the sample frequency on average over-represents the probability of being observed in a randomly selected man by an “inflation factor” I = 1/(1 − κ).
Modifying (2) by the factor I suggests the rule
$$\mathrm{LR}_{\Omega\kappa}(1) = |D|/(1 - \kappa) \qquad (4)$$
as the matching likelihood ratio for a singleton and in general
$$\mathrm{LR}_{\Omega\kappa}(p) = |D|/\bigl(p\,(1 - \kappa)\bigr) \qquad (5)$$
for types of popularity p in the reference sample. To clarify, suppose D is the Caucasian population sample (κ = 0.84) so I ≈ 6. Think of the partition of D into “popularity cohorts” – the singletons (p = 1) in aggregate, the doubletons (p = 2) in aggregate, etc:
D = singletons ∪ doubletons ∪ ...
If the inflation proportion I holds for each of the popularity cohorts individually, then (5) is correct (unbiased). However, unless we can rule out that some of the more-observed types may be legitimately common because of some special mechanism (Genghis Khan effect [21]? selection via hitchhiking?) the full generality of (5) would be hard to show and therefore is cautiously not recommended by [1]. For singletons though, (4) is well-supported by simulations. Incidentally for a rapidly growing population it may be notably conservative which implies that for at least some p > 1 the more general formula is anti-conservative, another reason for caution in its use at least in court. For humanitarian body identification and small p (5) seems reasonable though.
⁵ A rigorous version of this lemma would acknowledge that κ changes slightly with the addition of a new man, and the proof appeals to Robbins’ theorem [20].
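The formulas of §3.4 and §3.6 can be compared side by side in a few lines. The following is a minimal sketch, not the author's software: the function names are hypothetical, the database size 4102 is an assumed figure used only for illustration, and κ = 0.84 and θ = 8800 are the values quoted in the text. For simplicity the same size is used for n and |D| except where noted.

```python
import math


def lr_counting(db_size: int, p: int = 1) -> float:
    """Counting rule (2): |D| / p, where D already includes the crime scene type."""
    return db_size / p


def lr_kappa(db_size: int, kappa: float, p: int = 1) -> float:
    """Equations (4) and (5): the counting LR times the inflation factor 1/(1 - kappa)."""
    return db_size / (p * (1.0 - kappa))


def t_model_types(n: int, kappa: float) -> float:
    """Equation (3): number t of equally frequent types most consistent with kappa."""
    return n / (-math.log(kappa))


def match_prob_infinite_alleles(n: int, theta: float) -> float:
    """Section 3.4: matching probability 1/(n + theta) for a type unseen in n samples."""
    return 1.0 / (n + theta)


D_SIZE = 4102  # assumed size of D, for illustration only
KAPPA = 0.84   # singleton proportion quoted for the Caucasian sample
THETA = 8800   # theta_Ewens quoted in section 3.4

print("inflation factor I :", 1 / (1 - KAPPA))
print("LR counting        :", lr_counting(D_SIZE))
print("LR kappa, singleton:", round(lr_kappa(D_SIZE, KAPPA)))
print("LR t-model (= t)   :", round(t_model_types(D_SIZE, KAPPA)))
# Here n is the database without the crime scene observation, i.e. |D| - 1.
print("LR infinite alleles:", round(1 / match_prob_infinite_alleles(D_SIZE - 1, THETA)))
```

Under these assumed inputs the counting rule gives the smallest likelihood ratio of the four, consistent with the description of it as conservative.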
3.6.3. Doubts about κ by Buckleton, Krawczak and Weir
While not written as formal mathematics, [1] lays out arguments and results systematically – statement of problem, premises, model, derivation of results, validation of the κ method. Among various potential benefits of an explicit, linear, and deductive organization are clear communication and facilitating a substantive and reasoned discussion or argument. But the assertion “we have shown, Brenner’s approach ... suffers from potential anti-conservativeness in the way it inherently estimates haplotype frequencies” as part of the ending discussion in [2] is not backed up by substantive or reasoned discussion. Inquiry to the authors elicited that the remark referred to §5.2 of [2]. That section mentions three ideas which are respectively pointless, confused, and wrong as follows:
• “difficult to determine” – After a bit of mathematical manipulation aimed at gaining insight into the performance of the κ method, §5.2 of [2] comes to a rather complicated but correct expression for the expected value of my formula in terms of the (unknown) population frequencies of various haplotypes. If the complicated expression could be shown by analysis to have too large or too small an expected value, that would show that my method is conservative (ok) or anti-conservative (bad). The paper comes to neither conclusion, but rather, frankly admits that “It is a complex function ... therefore it is difficult to judge” based on this, sadly, dead-end approach. I sympathize – my notebooks are littered with ideas that didn’t pan out, but I don’t see any point in publishing them.
• “Brenner demonstrated himself” an example where the method is anti-conservative. For an artificial and unrealistic population consisting of 10000 equally rare haplotypes (Ω10000 in the notation of [1]), the κ formula would not work. But [1] doesn’t claim it would; rather it explicitly points out that the modelling condition of “equal over-representation” upon which the method rests can be violated by artificially constructed populations but seems valid for populations that might occur naturally. My own pathological example is therefore not evidence against my own method.
• elementary algebraic blunder – [2] derives from the κ formula the result that any singleton in a database rates to be seen more than once in a second realization of the same database – a surprising and counterintuitive conclusion if correct. One could then presume that sampling a third time would include still more copies of the ever burgeoning haplotype. But how to reconcile that with my claim of validation? A possible course, when one comes up with a startling result, is to check your work. Given p for the frequency of some type, 1 − (1 − p)^n is the probability for an n-sample to include the type at all, not “more than once” as [2] alleges.⁶
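The distinction in the last bullet is easy to check numerically. The sketch below uses the standard binomial expressions: “at all” is one minus the probability of zero copies, while “more than once” also subtracts the probability of exactly one copy. The function names and the particular values of p and n are hypothetical, chosen only for illustration.

```python
def prob_at_least_once(p: float, n: int) -> float:
    """Probability that an n-sample contains the type at all: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n


def prob_more_than_once(p: float, n: int) -> float:
    """Probability of two or more copies: subtract the zero-copy and one-copy terms."""
    p_zero = (1.0 - p) ** n
    p_one = n * p * (1.0 - p) ** (n - 1)
    return 1.0 - p_zero - p_one


n = 4102         # hypothetical sample size
p = 1.0 / 25000  # hypothetical frequency of a rare type

at_all = prob_at_least_once(p, n)
repeated = prob_more_than_once(p, n)
print(f"seen at all: {at_all:.4f}, seen more than once: {repeated:.4f}")
```

For a rare type the two quantities differ by roughly an order of magnitude, so reading one as the other changes the conclusion.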
3.7. discrete Laplace model
I argue above (§3.3) that the structural neighborhood of
a haplotype offers little information about the probability to
match the haplotype. That isn’t to say the structure contains
no information at all.
Andersen et al [3] present a method called the “discrete
Laplace model” based on modeling a population as a collection of subpopulations each of which is a collection of haplotypes clustered around a central (ancestral?) haplotype. The
model incorporates neighborhood information in that it calculates higher probabilities for haplotypes closer to a central haplotype. Analyzing the performance on many simulated populations, the paper concludes that while both κ and Laplace haplotype probabilities are close to unbiased compared to the actual frequency, the Laplace method is more accurate. Moreover
and very usefully the Laplace method is applicable even for unobserved haplotypes, a consideration that comes into play for
analyzing Y mixtures.
4. Confusing probability with frequency
Matching probability is not population frequency, though
confounding the two is a misconception that has long held sway
in forensic genetic practice. I sometimes use the phrase “probability is not frequency” but that slogan may incite accidental
misunderstanding because of its freighted and potentially ambiguous words.
Probability and frequency are related concepts to be sure. Indeed, a common (the “frequentist”) definition of the probability
of an event is the frequency with which that event would occur in an unending series of repeated experimental trials. Even
an anti-frequentist who refuses this as a definition probably accepts the intuition it represents at least for a reasonably objective instance of probability. Hájek [22]: “[F]inding out a relative frequency in a series of trials can often be the best ... way
of finding out the value of a probability.” The “probability” at
issue is matching probability. The simplest matching question
concerns an atomic trait – an autosomal allele or a Y-haplotype.
Such a DNA trait is found at a crime scene and we ask the
probability that the corresponding DNA from a randomly selected person would match. That’s a fairly objective example
of probability, the sort we should be able to handle without getting mired in the Bayesian/Frequentist ideological conflict of
20th century statisticians.
⁶ Through another slight confusion [2] instead writes 1 − (1 − p)^(n−1) but the difference is immaterial.
Of course higher-probability events occur more frequently whether in life or in the “repeated experiment” interpretation of probability. But that abstract sense of “frequency” isn’t the sense at issue, population frequency is. Population frequency is not abstract; it’s an actual fact about nature, a number but an unknown one that exists in the real world. And the institutionalized misconception in forensics is to confuse matching probability with population frequency. For example consider the probability for a man to match a crime scene Y-haplotype T assuming that the man has been randomly selected from the population. Population frequency means the number of men with T in the population divided by the number of men in the population.
“So what?” some will say. “Doesn’t the matching probability end up being the haplotype frequency in the population anyway? If 500 men in a population of 500000 have type T – population frequency is 1/1000 – then imagining a large number of repetitions of the evidential situation won’t random suspects match 1/1000 of the time?” That sounds quite reasonable. What is wrong with it?
It would be correct if the population frequency of T were known to be 1/1000. Then and only then are haplotype frequency and matching probability equivalent. For probability necessarily is a summary of data – known (or stipulated) facts – and cannot possibly depend on facts that are unknown. This is an undisputable and familiar principle.
For example we say team A has some probability to
win a game coming up this weekend based on our
knowledge of past encounters, information about the
players’ conditions, and other information.
If the weekend passes and we receive no information
whatever relevant to the game, then we can equally
ask in retrospect “What is the probability that team A
won?” and with the evidence the same, the answer
will be the same. Sometimes people say there is no
probability any more and either it is 0 or 1, depending
on what happened, but talk like that is facetious or
confused. The reality of winning changed over the
weekend; the probability of winning did not.
The probability of the team to win isn’t a property of the team or the game itself, but rather is a description of the particular knowledge we have about them. As the 19th century thinker JS Mill put it [23],
Every event is in itself certain, not probable; if we
knew all, we should either know positively that it will
happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our
present evidence.
Once we subscribe to the confused position that probability can depend on unknown facts, we confront the impossible question as to which unknown facts are privileged to be considered⁷ and which are not. The only reasonable view when evaluating probability is therefore the same as the judiciary’s principle for accepting evidence: only known facts are admissible.
Equating match probability to population frequency, though nearly universal in forensic genetics, is thus a muddled view in that it says probability rests on a fact that is in practice unknown. What can we do instead? Does it matter?
⁷ genotypes at additional untested loci?
4.1. What can we do instead?
The appropriate understanding of haplotype matching probability isn’t how often a random match will occur to the particular haplotype T . Rather:
The random matching probability to haplotype T is
how often a random match will occur in general to
a haplotype the data about which is the same as the
data we have about T .
This is a more abstract formulation because it abstracts away
consideration of T per se in favor of considering only data
about T . That’s progress because data is useful and valid fodder
for probability. Thus the re-formulation is a question that has at
least a chance of being answerable. And it has been answered
to various extents by the various methods mentioned in §3.1.1,
§3.4, §3.5, §3.6.1, §3.6.2, and §3.7.
Note immediately one consequence: If the data we know
about two different haplotypes is the same, it follows that the
random match probability to each of them is the same notwithstanding that of course they may have very different population
frequencies. Now, in practice we have some choice about what
we regard as data since before applying mathematics to a real-world situation some simplification is necessary. That is, you
have to choose a model. The κ model includes the simplifying assumption that a Y haplotype “sequence” (the repeat numbers) is merely a label, not data. Therefore the data specific
to a haplotype is only the number of observations (including
the reference database and the crime scene, i.e. in D) so two
singletons have the same matching probability. If you would
prefer to account for the possible significance of sequence similarity among Y haplotypes, then you might adopt a model under which the data about T includes the number of observations
not only of T itself, but also (for example) some function of the
numbers of 1-step, 2-step, etc. neighbors of T . But the principle
remains that if for another haplotype, U, the data is the same,
then T and U have the same random matching probability. If
there is no such U, if T is unique with respect to the data about
it, the principle remains, that the matching probability for T is
about the data, not about unknown population frequencies.
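The principle that matching probability depends only on the data about a haplotype can be illustrated with a toy database. The sketch below treats each haplotype as an opaque label, as the κ model does, and computes the κ-method likelihood ratio of formulas (4) and (5) from observation counts alone. The function name and the labels are made up for illustration. Note that the text recommends the formula for singletons and is cautious about p > 1, and that the sketch assumes D is not made up entirely of singletons (otherwise 1 − κ would be zero).

```python
from collections import Counter


def kappa_lr_by_type(database: list[str], crime_type: str) -> dict[str, float]:
    """Return the kappa-method LR for every type in D = database + crime scene type."""
    d = database + [crime_type]  # augment the database with the crime scene type
    counts = Counter(d)          # popularity p of each label
    singletons = sum(1 for c in counts.values() if c == 1)
    kappa = singletons / len(d)  # proportion of D made up of singletons
    # Assumes kappa < 1, i.e. at least one type is observed more than once.
    return {t: len(d) / (c * (1.0 - kappa)) for t, c in counts.items()}


toy_db = ["A", "A", "B", "C", "C", "C", "E"]  # made-up haplotype labels
lrs = kappa_lr_by_type(toy_db, "F")            # crime scene type "F" is new

for label, lr in sorted(lrs.items()):
    print(label, round(lr, 2))
```

The three singletons B, E and F receive the same value because the data about them is identical, whatever their unknown population frequencies may be.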
4.2. Does it matter?
Distinguishing probability from frequency doesn’t matter so
much for autosomal forensic work. The product rule covers up
a lot of sins. The right astronomical number and the wrong
astronomical number are not usually different in their practical
evidential impact.
But for Y, as Table 1 shows, the SWGDAM idea of 4102/3 ≈
1400 instead of LR = 25000 from a sound evidential evaluation means discarding a factor of 18 of evidentiary strength
from a terrestrial number. And it could be worse. I’ve handled cases with a single 1-step Y-haplotype discrepancy, i.e. between alleged father and son; consequently LR = 25000 × Pr(mutation) ≈ 70. That’s still pretty strong evidence, but wantonly reduce it a further 18-fold and you’ve got nothing.
Besides, it’s a good principle to do things right. A clear and correct understanding is a good foundation for progress. Random half-wrong ideas and guesses are not, and they’re also dangerous to bring to court. A good defense attorney should be able to shred an expert who talks nonsense, even if the nonsense is numerically better for the defense than the right number would be.
Schematically:
unsound: database ⇒ sample frequency
         sample frequency ⇒ frequency ± interval
         frequency ± interval ≈ probability ± interval
correct: data+model ⇒ probability
The unsound paradigm for matching probability amounts to: The type is very rare in the database. Therefore it is probably at least pretty rare in the population. The feeble logical conclusion is only that matching by an innocent suspect is probably unlikely.
A correct footing takes the larger perspective in which the database is viewed as evidence (“database as evidence” – [24]) drawn from a population that conforms to some modeling premises. Then we can say: The type is very rare in the database. That may be because the type is very rare in the population or it may be because the database is a very abnormal sample. Either way, for an innocent suspect to match requires a very unlikely circumstance, meaning the evidence is very strong that a matching suspect is not innocent.
Two examples follow of situations in which the matching probability to haplotype T has little relation to the population frequency of T. The first is a little artificial, but simpler.
4.3. Example – An island population
Thought experiment: In 14 A.D. a complete catalogue is compiled of the Y haplotypes of every man in a closed community. One particular type T happens to have a population frequency of exactly 1/100. Suppose by immigration controls and magically stopping mutation, no new types will be introduced. Fast forward 2000 years and the modern population, because of genetic drift, has completely different haplotype frequencies which are unknown – we have no new data. Just as a puzzle, what is the probability that a randomly selected man is type T?
Answer: The random man can trace his lineage back to a single man in 14 A.D., and that ancestor is 1/100 to be type T. Hence so is our man. Of course if we knew the modern population frequency the answer would be different, but based on the knowledge that we actually have the probability is 1/100. Matching probability is not population frequency; rather it is whatever inference is warranted from available evidence.
4.4. Example – Two gun Russian roulette
Bored with the traditional form of the game, we invent a two-gun version of Russian roulette using two 12-chamber revolvers R1, R2 with respectively f1 = 1/12 and f2 = 3/12 of the chambers occupied by lethal bullets (Figure 4(a)). Some agent randomly selects one of the guns, and then we fire from a randomly selected chamber at a man “Innocent Suspect”. What is the kill probability K?
• expected frequency analysis – We might call {fi} the “lethalities” of {Ri}. I trust we agree that using either gun alone to play the traditional one-gun Russian roulette, the kill probability K is the same as the lethality of the gun; K = f1 or K = f2 as the case may be. Back to the two-gun game, the probabilities to select each gun, w1 = w2 = 1/2, can be considered as weights. The kill probability is then the weighted average, the expected value
$$\begin{aligned} K &= E(f_i) \\ &= \textstyle\sum_i w_i f_i \qquad (6) \\ &= \text{average of } 1/12 \text{ and } 3/12 \\ &= 1/6. \end{aligned}$$
At the risk of insulting my readers, I expect that some of
them feel that since the constituent guns kill either more
or less often than 1/6 there is more to be said, and that
the kill probability is better expressed as something like a
“credible interval”,
K = 1/6 ± c, for some small number c.
Is it?
A simpler way to find K is to notice that the game is entirely equivalent to a one-gun game with a 24-chamber gun,
4 of which are loaded. The kill probability is therefore exactly
K = 4/24 = 1/6; no ifs, buts, or and-an-uncertainty-interval
about it.
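Both routes to K can be checked directly. The sketch below computes the weighted average of equation (6) exactly with rational arithmetic and then estimates the same quantity by simulating many rounds of the two-gun game. The variable names, the trial count and the random seed are arbitrary choices made for illustration.

```python
import random
from fractions import Fraction

lethalities = [Fraction(1, 12), Fraction(3, 12)]  # f1, f2
weights = [Fraction(1, 2), Fraction(1, 2)]        # w1, w2

# Equation (6): the expected value of the lethality, computed exactly.
kill_probability = sum(w * f for w, f in zip(weights, lethalities))


def play_once(rng: random.Random) -> bool:
    """One round: pick a gun at random, then a chamber at random."""
    f = rng.choice(lethalities)
    return rng.random() < float(f)


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 200_000
kills = sum(play_once(rng) for _ in range(trials))
simulated_rate = kills / trials

print("exact K     :", kill_probability)
print("simulated K :", round(simulated_rate, 4))
```

The exact value is a single number. The simulation scatters around it only because the number of trials is finite, not because K itself carries any uncertainty.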
Now consider a thought experiment involving haplotypes.
Suppose that we know that haplotype frequencies are controlled
by a whimsical agent Daemon Nature who monitors and controls all fertilization or birth events (à la Maxwell’s Daemon)
to ensure that all haplotypes occur with one of only two possible population frequencies, either f = 1/12 or f = 3/12, in
such a way as to ensure that the type T of a randomly selected
individual is equally likely to occur in the population with one
frequency or the other. A capital crime occurs and the crime
scene haplotype is T . An innocent suspect is arrested, who will
be executed if found to match T . What is the probability the
innocent suspect will match, and therefore die?
That scenario is exactly analogous to the roulette game. The
whole project in which Nature chooses one of the two values for
f then generates a population with freq(T ) = f (that part constitutes an evolutionary replicate) then T turns up at the crime,
is the analog of choosing a gun in the roulette game. Selecting, testing, and disposing one way or the other of an innocent
(i.e. random) suspect, is the analog of firing the gun. Just as in the roulette game, the haplotype matching probability isn’t the unknown number f; it’s an average of the values f might have, known as the expected value of f (per equation (6)). And as in the roulette game E(f) = 1/6 exactly. Despite that we have the extreme situation in which this is known to be significantly different from freq(T) – a difference of ±1/12 – there is no more to be said, no confidence or credible interval. Such intervals are reasonably used in statistics to describe uncertainty in estimating a population parameter such as freq(T) by partially describing the probability distribution of the parameter. Matching probability however is not frequency, is not a parameter. Such intervals do not apply to probability itself because the very idea of a “probability distribution of a probability” is totally pointless since, as we have just seen in this example, such a probability distribution would be exactly equivalent to its expected value. The mistaken practice of attaching an uncertainty interval to a probability is not seen, so far as I can determine, outside of forensic statistics.
Figure 4: discrete and continuous frequency spectra: (a) 2-gun roulette (§4.4): revolver selection probability (1/2 each) against lethality per revolver (1/12 and 3/12), with net kill probability K = 1/6; (b) “infinite alleles” haplotype frequency distribution (§4.5): relative probability w(f) against haplotype frequency f.
4.5. Frequency spectrum as a prior probability distribution
The situation in real life is more intricate than the preceding example but not different in principle. Nature provides a continuum of possible haplotype frequencies, so the two weights {wi} above are replaced by a prior probability distribution (Figure 4(b)) – a continuous “frequency spectrum” w(f) defined for all frequencies f ∈ [0, 1] – and the Σ becomes an ∫. For example if we assume the infinite alleles model discussed in §3.4 then the relative propensity for nature to create a haplotype with frequency f would be something like
$$w(f) \propto (1 - f)^{8800}/f, \qquad (7)$$
a distribution very strongly favoring very small values of f (Figure 4(b)). Next, in the real world of forensic genetics there is usually a population sample – e.g. D as in §3.1.1 – that can be incorporated via Bayes’ theorem. For example, let p, p ≥ 1, denote the number of observations of the haplotype of interest in D. That’s an event whose probability of occurrence, c(p, f), is easily written down:
$$c(p, f) = \Pr(p \text{ observations} \mid \text{population frequency} = f) = \binom{|D|}{p} f^{p} (1 - f)^{|D| - p}$$
and by Bayes’ theorem the matching probability m(p) is
$$m(p) = \int_{f=0}^{1} w(f)\, c(p, f)\, df. \qquad (8)$$
Note that this formula works even with no reference data. In that case p = 1 and D = {T} (because of “extending” the database with the crime scene observation per §3.1.1), c(1, f) = f and formula (8) looks very like formula (6).
For the case of w(f) as in formula (7), via standard properties of the beta distribution m(p) has the simple form given in §A.2.4 of [1]:
$$m(p) = \frac{p}{|D| + 8799}.$$
And as above that’s the matching probability, end. Any impulse to decorate a probability by credible or confidence intervals is only from momentum and confusion, not from mathematical reasoning.
Which is not to say the number is scientifically exact. It’s not; it will be to some extent wrong. But the source of error is uncertainty about the model and the premises, not sampling variation. If D doesn’t represent the relevant population, that can be a problem – the magnitude of which has nothing to do with sampling variation. Since the infinite alleles model isn’t an accurate description of evolution, the result cannot be accurate. In this regard having a larger D may paper over deficiencies in the model. But even so the extent of that isn’t measured by credible/confidence intervals or sampling variation.
5. Concluding discussion
This paper has two related aims. First is to clarify the correctness of the κ method [1] in the face of published criticism
[2]. The criticism is unsound, resting on misunderstanding and
errors as §3.6.3 shows. The κ method (4) holds up well for so
simple a model. Recent work, especially concordant results in
[3], confirms that the method is right within its assumptions, and
moreover shows that the model simplification sacrifices some
accuracy but usually not very much. An additional benefit of
the Laplace method [3], and to me a more important one, is that
it gives plausible probabilities for never-observed haplotypes
(which must be considered as part of analyzing Y mixtures)
and for multiply-observed haplotypes.
848
849
850
851
852
853
The second aim grew out of the first. Understanding the basis899
of the κ approach requires understanding the Y haplotype ter-900
rain. And that in turn means casting aside preconceptions that901
902
we hold from autosomal experience and resisting the impulse903
to assume obvious seeming parallels between the two. There904
905
are a great many important differences.
906
854
855
1. Obviously the product rule is out for Y haplotypes – noth-907
908
ing new, everyone knew that.
909
910
856
857
858
859
860
861
2. Sample frequency as an estimate for matching probability
(i.e. “blind counting”, §3.1) is not bad for autosomal
work although the augmented count (§3.1.1) is clearly better.
For Y haplotype work even the latter leaves a lot on
the table; sample frequency severely overestimates matching
probability.
3. To clear the table better, it helps a lot in the Y case to look
at the whole reference database, not just observations of the
target haplotype. For autosomal work we never thought to
do that (for the good reason that it wouldn’t help much).
4. Confidence or credible intervals represent careless thinking
but are mostly harmless in the autosomal arena. For Y,
that careless thinking is an impediment to understanding
and to expressing the actual evidential strength.
5. The “unrelated person” concept is an acceptable approximation
for autosomal work; a fatal one for understanding
Y haplotypes.
6. The autosomal idea of a θ correction is almost backwards
from the reality of the Y haplotype matching world where
θ is the starting point, and where literal application of
“standard” autosomal theory gives negative probabilities.
7. Thinking about models is vital for understanding and developing
a Y haplotype matching probability approach.
Some autosomal practice acknowledges population substructure –
good, but a mere academic exercise by comparison
with the importance of theory for Y.
8. Finally there is another distinction outside the ambit of this
paper, the common concern about possible geographical
clustering of Y haplotypes. To what extent clustering complicates
evidential calculation and what to do about it are
topics to explore.
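Items 2 and 3 can be illustrated with a toy calculation. The Python sketch below compares three estimates for a haplotype not previously seen in the database. The forms used are assumptions stated for illustration and should be checked against §3.1, §3.1.1 and equation (4): blind count p/n, augmented count (p + 1)/(n + 1), and a κ-style estimate (1 − κ)/(n + 1) in which κ is the fraction of singletons in the database after the crime-scene haplotype is added. The toy database and the function name are hypothetical.

```python
from collections import Counter


def haplotype_estimates(database, target):
    """Return three illustrative matching-probability estimates for `target`.

    The exact definitions are assumptions of this sketch, not quoted from
    the paper.
    """
    n = len(database)
    p = Counter(database)[target]

    blind = p / n                   # raw sample frequency
    augmented = (p + 1) / (n + 1)   # count the crime-scene observation too

    # The kappa-style estimate uses the whole database, not only the target:
    # kappa is the singleton fraction of the augmented database.
    augmented_counts = Counter(list(database) + [target])
    singletons = sum(1 for count in augmented_counts.values() if count == 1)
    kappa = singletons / (n + 1)
    kappa_estimate = (1 - kappa) / (n + 1)

    return {"blind": blind, "augmented": augmented,
            "kappa": kappa, "kappa_estimate": kappa_estimate}


# Toy database of nine haplotypes; "z" has not been seen before.
toy_database = ["a", "a", "a", "b", "b", "c", "d", "e", "f"]
toy_result = haplotype_estimates(toy_database, "z")
```

In this toy case half of the augmented database consists of singletons, so the κ-style estimate is half the augmented count. The more singletons a database contains, the more a count based on the target haplotype alone overstates the matching probability.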
Acknowledgements
I thank David DeGusta for invaluable advice with the
manuscript and the referees for prodding me to greater clarity.
[1] C. Brenner, Fundamental problem of forensic mathematics – the evidential value of a rare haplotype, Forensic Sci. Int. Genet. 4 (2010) 281–291.
[2] J. Buckleton, M. Krawczak, B. Weir, The interpretation of lineage markers in forensic DNA testing, Forensic Sci. Int. Genet. 5 (2011) 78–83.
[3] M. M. Andersen, P. S. Eriksen, N. Morling, The discrete Laplace exponential family and estimation of Y-STR haplotype frequencies, Journal of
Theoretical Biology 329 (2013) 39–51.
[4] C. Cockerham, Variance of gene frequencies, Evolution 23 n1 (1969) 72–84.
[5] J. F. Crow, et al, The evaluation of forensic DNA evidence, National Research Council, 1996.
[6] Applied Biosystems, http://www6.appliedbiosystems.com/yfilerdatabase/ (2009).
[7] M. Slatkin, An exact test for neutrality based on the Ewens sampling distribution, Genetical Research 64 (1994) 71–74.
[8] Wolfram Inc., Polya’s random walk constants, http://mathworld.wolfram.com/PolyasRandomWalkConstants.html.
[9] Long Bing et al, Population genetics for 17 Y-STR loci (AmpFlSTR Yfiler™) in Luzhou Han ethnic group, Forensic Sci. Int. Genet. 7(2)
(2013) e23–e26.
[10] M. Slatkin, A correction to the exact test based on the Ewens sampling
distribution, Genetical Research 68 (1996) 259–260.
[11] W. Ewens, The sampling theory of selectively neutral alleles, Theor Pop
Biol 3 (1972) 87–112.
[12] SWGDAM, SWGDAM haplotype frequencies, http://www.fbi.gov/about-us/lab/forensic-science-communications/fsc/oct2009/standards/2009_01_standards01.htm/ (2009).
[13] L. Roewer, M. Kayser, P. de Knijff, K. Anslinger, A. Betz, A. Caglia,
D. Corach, S. Füredi, L. Henke, M. Hidding, H. Kärgel, R. Lessig, M. Nagy,
V. Pascali, W. Parson, B. Rolf, C. Schmitt, R. Szibor, J. Teifel-Greding,
M. Krawczak, A new method for the evaluation of matches in nonrecombining genomes: application to Y-chromosomal short tandem repeat (STR)
haplotypes in European males, Forensic Sci. Int. 114,1 (2000) 31–43.
[14] M. Krawczak, Forensic evaluation of Y-STR haplotype matches: a comment, Forensic Sci. Int. 118,2 (2001) 114–115.
[15] S. Willuweit, A. Caliebe, M. M. Andersen, L. Roewer, Y-STR frequency
surveying method: A critical reappraisal, Forensic Sci. Int. Genet. 5
(2011) 84–90.
[16] C. Brenner, The “frequency surveying” approach cannot work, http://dna-view.com/downloads/documents/RareHaplotypes/Critique/%20haplotype/%20A4.pdf.
[17] A. Veldman, Evidential strength of Y-STR haplotype matches in forensic
DNA casework, Mathematisch Instituut, Universiteit Leiden, 2007.
[18] M. M. Andersen, A. Caliebe, A. Jochens, S. Willuweit, M. Krawczak,
Estimating trace-suspect match probabilities for singleton Y-STR haplotypes using coalescent theory, Forensic Sci. Int. Genet. 7(2) (2013) 264–271.
[19] B. Efron, Modern science and the Bayesian-frequentist controversy, http://www-stat.stanford.edu/~ckirby/brad/papers/2005NEWModernScience.pdf.
[20] H. Robbins, Estimating the total probability of the unobserved outcomes
of an experiment, Ann. Math. Stat. 39 (1968) 256–257.
[21] T. Zerjal, et al, The genetic legacy of the Mongols, Am. J. Hum. Genet.
72 (2003) 717–721.
[22] A. Hájek, “Mises redux” – redux: Fifteen arguments against finite frequentism, Erkenntnis 45 (1996) 209–227.
[23] J. Mill, System of Logic: Ratiocinative and Inductive; Being a Connected
View of the Principles of Evidence and the Methods of Scientific Investigation, Vol. 2, London: Longmans, Green, Reader, and Dyer. 62, 1868,
http://www.questia.com/PM.qst?a=o&d=5774540.
[24] P. Dawid, J. Mortera, Coherent evaluation of forensic evidence, J. R.
Statist. Soc. B 58(2) (1996) 425–443.
