Forensic identification: the Island Problem and its generalizations
Abstract
In forensics it is a classical problem to determine, when a suspect shares a property with a criminal, the probability that the suspect and the criminal are one and the same person. In this paper we give a detailed account of this problem in various degrees of generality. We start with the classical case where the probability of having the property, as well as the a priori probability of being the criminal, is the same for all individuals. We then generalize the solution to deal with heterogeneous populations, biased search procedures for the suspect, correlations, uncertainty about the subpopulations of the criminal and the suspect, and uncertainty about the frequencies. We also consider the effect of the way the search for the suspect is conducted, in particular when this is done by a database search. A recurring theme is that careful conditioning is of importance when one wants to quantify the “weight” of the evidence by a likelihood ratio. Apart from these mathematical issues, we also discuss the practical problems in applying these results to the legal process. The posterior probabilities of guilt are typically the same for all reasonable choices of the hypotheses, but this is not the whole story. The legal process might force one to dismiss certain hypotheses, for instance when the relevant likelihood ratio depends on prior probabilities. We discuss this and related issues as well. As such, the paper is relevant both from a theoretical and from an applied point of view.
Keywords: Island problem, Forensic identification, Weight of evidence, Posterior odds, Bayes’ rule.
1 Introduction
In 1968, a couple stood trial in a notorious case, known as “People of the State of California vs. Collins”. The pair had been arrested because they matched eyewitness descriptions. It was estimated by the prosecution that only one in twelve million couples would match this description. The jury were invited to consider the probability that the accused pair were innocent, and returned a verdict of guilty.
Later, the verdict was overturned, essentially because of the flaws in the statistical reasoning. The case sparked interest in the abstraction of this problem, which became known as the island problem, following terminology introduced by Eggleston [4]. Its formulation is the following. A crime has been committed by an unknown member of a population of n individuals. It is known that the criminal has a certain property Γ. Each individual has Γ (independently) with probability p. A random member of the population is tested and observed to have Γ. What is the probability that he is the criminal?
This problem has been studied quite extensively in the literature. For example, Balding and Donnelly [1] give a detailed account of the island problem as well as of its generalization to inhomogeneous populations or (alternatively) uncertainty about p. They also discuss the effects of a database search or a sequential search (i.e., a search which stops when the first Γ-bearer is found). Dawid and Mortera have studied the generalization of the island problem to the case where the evidence may be unreliable [2, 3].
The current paper is expository in the sense that some of the above mentioned results are reproduced, albeit presented in a somewhat different way, and a research article in the sense that we consider generalizations which to our knowledge have not appeared elsewhere. Apart from the expository versus research nature, there is another duality in this paper, namely the distinction between the purely mathematical view versus a more applied viewpoint, and we elaborate on this issue first.
Most texts focus on the “likelihood ratio”, the quantity that transforms “prior” odds of guilt, that is, before seeing the evidence, into “posterior” odds after seeing the evidence. There is good reason to do so. Indeed, the likelihood ratio is often viewed as the weight of the evidence; it is therefore the quantity of interest for a forensic lab, which is unable or not allowed to compute prior (or posterior, for that matter) odds, this being the domain of the court. However, this already implies a first question. Which part of the available data should be seen as the evidence, and which part is “just” background information? In other words: which evidence do we consider and what is the context? Indeed, the weight of the evidence, that is, the value of the likelihood ratio, sometimes depends on which of the available information is regarded as background information or as evidence (and of course also on the propositions that one is interested in proving). From a purely mathematical point of view, concentrating on the “posterior probabilities”, that is, the probability that a suspect is guilty given background information and/or evidence, settles the issue. Indeed, it is well known ([7]) that the posterior probabilities are invariant under different choices of the hypotheses as long as they are “conditionally equivalent given the data”. Hence, from a purely mathematical point of view, the situation is quite clear, and one should concentrate on the posterior probabilities rather than on the likelihood ratios.
However, from a legal perspective things are not so simple. The likelihood ratio is, as mentioned earlier, supposed to be in the domain of the statistical expert, but what if this likelihood ratio involves prior probabilities itself? We will see concrete examples of this in this article, and in these cases the classical point of view (likelihood ratio is for the expert, the rest is for the court) does not seem to immediately apply. If we have the choice among various likelihood ratios, are there reasons to prefer one over the other? Also this question will be addressed in particular cases in this paper.
For the island problem, the above discussion is relevant as soon as the population has subpopulations, each with their own frequency. In that case, considering the fact that the criminal has Γ as background information on the one hand, or as evidence on the other, leads to different likelihood ratios, but the posterior odds are (of course) the same. We will go into this phenomenon in detail, considering subpopulations simultaneously with uncertainty about the subpopulations to which the criminal and the suspect belong, together with uncertainty about the frequencies in each of the subpopulations. Another possibility which we will consider is that of correlation or a biased search (i.e., the choice of suspect depends on the true identity of the criminal).
The outline of this paper is as follows. In Section 2, we review the classical island problem. We then consider in Section 3 the effect of having a biased search protocol, and of having correlations; we show that these two different types of having dependencies are strongly related to each other. In Section 4, we treat the case where the population is a disjoint union of subpopulations, each with their own frequency and prior probability of having issued the criminal. In Section 5, we consider the effect of uncertainty of the frequencies, both in a homogeneous and heterogeneous population. In addition, we investigate the effect on the likelihood ratio of uncertainty about the criminal’s and the suspect’s subpopulations. Section 6 deals with the case in which a suspect is found through a match in a database. Finally, in Section 7 we present a significant number of numerical examples.
We have tried to include all details of the computations, but at the same time to state our conclusions in a nontechnical and accessible way. Our main conclusions can be recognized in the text as bulleted lists. As such, we hope that our contribution is interesting and useful for mathematicians, forensic scientists and legal representatives alike.
2 The classical case
Our starting point is a collection X = {1, …, n} of individuals. All forthcoming random variables are defined on a (non-specified) probability space with probability measure P. The random variables C and S take values in X and represent the criminal and the suspect respectively. Furthermore, we have a characteristic Γ, for which we introduce indicator random variables Γ_1, …, Γ_n, with Γ_i taking value 1 if individual i has the characteristic and 0 otherwise. The Γ_i are independent of C in the sense that P(Γ_1 = x_1, …, Γ_n = x_n | C = i) = P(Γ_1 = x_1, …, Γ_n = x_n) for all i and all x_1, …, x_n. The number of Γ-bearers is written as U = #{i ∈ X : Γ_i = 1}.
We are primarily interested in the conditional probability that the suspect is the criminal, given that both bear Γ; often we follow the habit of stating the so-called posterior odds in favour of guilt, that is,

(2.1) P(C = S | Γ_C = 1, Γ_S = 1) / P(C ≠ S | Γ_C = 1, Γ_S = 1).

Since we will often be working conditional on {S = s}, we introduce the notation

P_s(·) = P(· | S = s).

We define the events G = {C = S}, E_C = {Γ_C = 1}, E_S = {Γ_S = 1}, and E = E_C ∩ E_S. We will sometimes refer to the event E_C (or similar events) as “information”, and to E_S (or similar events) as “evidence”; this is just colloquial use of language, and sometimes we will view E_C as part of the evidence.

With this notation, (2.1) reads

P_s(G | E) / P_s(G^c | E),

which can be rewritten in two different ways, namely

(2.2) P_s(G | E)/P_s(G^c | E) = [P_s(E_S | G ∩ E_C) / P_s(E_S | G^c ∩ E_C)] · [P_s(G | E_C) / P_s(G^c | E_C)]

or

(2.3) P_s(G | E)/P_s(G^c | E) = [P_s(E | G) / P_s(E | G^c)] · [P_s(G) / P_s(G^c)].

The left hand side of these equations is called the posterior odds. In (2.2), we arrive at the posterior odds by “starting out” with the background information E_C via the quotient P_s(G | E_C)/P_s(G^c | E_C), called the prior odds. These prior odds are transformed into the posterior odds by multiplication with P_s(E_S | G ∩ E_C)/P_s(E_S | G^c ∩ E_C). This latter quotient is called the likelihood ratio and is supposed to be a measure of the strength of the evidence E_S. On the other hand, in (2.3) we “start out” from the prior odds P_s(G)/P_s(G^c), that is, we interpret both E_C and E_S as evidence. The likelihood ratio in that case is P_s(E | G)/P_s(E | G^c) and measures the “combined” strength of the evidence E_C and E_S.
In this section, treating the classical case, we assume that C and S are independent and that C is uniformly distributed on X. Furthermore, the Γ_i are independent and identically Bernoulli distributed with success probability p. These assumptions are not without problems when applied to concrete legal cases. The assumption that C is uniformly distributed means that we a priori regard each member of the population as equally likely to be the criminal. It is probably the case that computations based on this assumption cannot be used as legal evidence. However, many of the computations below can also be done with other choices for the distribution of C. Having a particular choice in mind does allow us to compare various formulas in a meaningful way. The independence and equidistribution of the Γ_i will be relaxed later on in this paper, in various ways: one can consider subpopulations with different frequencies, allow dependencies between the Γ_i, or incorporate uncertainty in the probability p. Also the independence between C and S will be relaxed later on.
The outcomes in the current section do not depend on the particular s we condition on, but for the sake of consistency, we do write P_s instead of P. The independence between C and S now implies that P_s(G) = P(C = s) = 1/n. Both likelihood ratios in (2.2) and (2.3) are equal to 1/p. It easily follows that

(2.4) P_s(G | E) = 1/(1 + (n − 1)p).

In this case it does not really matter which viewpoint one takes: the likelihood ratio is a function of p alone, and does not involve any prior knowledge. Of course, as mentioned before, in a legal setting it is not clear that uniform priors are acceptable or useful, and starting from other prior probabilities is of course possible in this framework.
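The classical answer can be checked numerically. The sketch below is our own illustration (the function name and parameters are ours, not part of the original analysis): it computes the posterior probability of guilt exactly, by enumerating all bearer configurations for a small population, and compares it with 1/(1 + (n − 1)p).

```python
from itertools import product

def posterior_guilt(n, p, s=0):
    """Exact P(C = s | E_C, E_S) by enumeration: C is uniform on
    range(n), the indicators Gamma_i are i.i.d. Bernoulli(p) and
    independent of C, and the suspect s is fixed in advance."""
    num = den = 0.0
    for gammas in product([0, 1], repeat=n):
        w = 1.0                      # probability of this configuration
        for g in gammas:
            w *= p if g else 1 - p
        for c in range(n):           # C = c has prior probability 1/n
            if gammas[c] == 1 and gammas[s] == 1:   # E_C and E_S both occur
                den += w / n
                if c == s:           # the guilt event G
                    num += w / n
    return num / den

n, p = 8, 0.1
print(posterior_guilt(n, p))         # equals 1/(1 + (n-1)p)
print(1 / (1 + (n - 1) * p))
```

The same enumeration routine is easily adapted to the generalizations considered later, which is why we spell it out here.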
In the next two subsections we will examine (for the classical case) how P_s(G | E) is related to the random variable U. It turns out that we may express P_s(G | E) both as the inverse of the expectation of U and as the expectation of the inverse of U, as long as we condition correctly.
2.1 Expected number of bearers.
Before anyone is tested for Γ, U has a Bin(n, p) distribution. When the crime is committed and it is observed that the criminal has Γ, we condition on E_C and obtain

P(U = k | E_C) = P(E_C | U = k) P(U = k) / P(E_C) = (k/n) P(U = k) / p.

It follows that the probability that U = k + 1, given E_C, is equal to the probability that a random variable with a Bin(n − 1, p) distribution takes the value k, i.e., U − 1 is conditionally distributed as Bin(n − 1, p). Hence, writing E(· | E_C) for expectation with respect to P(· | E_C), we have

E(U | E_C) = 1 + (n − 1)p.

Thus, the posterior probability of guilt is given by the inverse of the expected number of bearers, where this expectation takes into account that there is a specific individual, the criminal, who has Γ:

(2.5) P_s(G | E) = 1/E(U | E_C).

Intuitively this makes sense: the criminal is a bearer, any one of the bearers is equally likely to be the criminal, and we have found one of them. So we have to compute the expected number of bearers, given the knowledge that C is one of them.
2.2 Expected inverse number of bearers
As we have seen, conditionally on E_C, U − 1 is distributed as Bin(n − 1, p). Therefore, one expects 1 + (n − 1)p bearers of Γ. If we in addition also condition on E_S, we compute

P(U = k | E) = P(E | U = k) P(U = k) / P(E) = (k/n)² P(U = k) / P(E),

since given U = k, the uniformly chosen criminal and the fixed suspect s each bear Γ with probability k/n, and these two events are conditionally independent. In particular, P(E) = E(U²)/n². We use this calculation to obtain:

(2.6) E(1/U | E) = Σ_{k=1}^{n} (1/k) P(U = k | E)

(2.7) = Σ_{k=1}^{n} (k/n²) P(U = k) / P(E) = E(U)/E(U²)

(2.8) = np / (np(1 − p) + n²p²) = 1/(1 + (n − 1)p).

Summarizing,

(2.9) P_s(G | E) = E(1/U | E).

So P_s(G | E) is in fact also equal to the expectation of 1/U, conditioned, however, not on E_C but on E = E_C ∩ E_S. This can be understood in an intuitive way: both C and S have Γ, and they have been sampled with replacement, so the probability that they are equal is the inverse of the number of bearers. This number is unknown, so we have to take expectations, given the knowledge E_C and E_S.

When we compare this explanation with the one of (2.5), we see the importance of careful conditioning.
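Both conditional-expectation identities can be verified with the same kind of exact enumeration as before; again this is our own sketch with hypothetical names. It computes E(U | E_C) and E(1/U | E) and checks them against 1 + (n − 1)p and its inverse.

```python
from itertools import product

def bearer_expectations(n, p, s=0):
    """Return (E[U | E_C], E[1/U | E_C and E_S]) by exact enumeration,
    with C uniform on range(n) and the suspect s fixed."""
    eu_num = eu_den = 0.0      # for E[U | E_C]
    inv_num = inv_den = 0.0    # for E[1/U | E_C, E_S]
    for gammas in product([0, 1], repeat=n):
        w = 1.0
        for g in gammas:
            w *= p if g else 1 - p
        u = sum(gammas)
        for c in range(n):
            pw = w / n
            if gammas[c] == 1:            # the information E_C
                eu_den += pw
                eu_num += pw * u
                if gammas[s] == 1:        # the evidence E_S as well
                    inv_den += pw
                    inv_num += pw / u
    return eu_num / eu_den, inv_num / inv_den

n, p = 7, 0.2
eu, einv = bearer_expectations(n, p)
print(eu)      # 1 + (n-1)p, cf. (2.5)
print(einv)    # 1/(1 + (n-1)p), cf. (2.9)
```

Note that the two expectations are taken under different conditionings, exactly as the discussion above requires.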
2.3 Effect of a search, Yellin’s formula
So far, C and S were supposed to be independent of each other. In this subsection, we consider a different situation. The random variable C, representing the criminal, is still supposed to be uniformly distributed, but the definition of S is different: we repeatedly select individuals from X, with or without replacement, until a Γ-bearer is found, without keeping any records of the search itself, such as its duration. The bearer found this way is denoted by S; if there is no bearer in the population, the search fails and S remains undefined. As before we write G = {S = C}, and note that in this situation, given E_C, the event E_S automatically occurs.

As above, we are interested in the probability of guilt given E_C and E_S which, since P(E_S | E_C) = 1, reduces to P(G | E_C), and this conditional probability is easy to compute: given E_C and the number of bearers U, the search produces each bearer with equal probability, so that

(2.10) P(G | E_C) = E(1/U | E_C) = (1 − (1 − p)^n)/(np).

This formula was published by Yellin in [10] as the solution to this version of the island problem with a search. Sometimes, however, it is quoted in the literature (e.g. in [1]) as a solution to the island problem without a search as we have discussed it, for which it is incorrect.
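A quick check of (2.10), in the same illustrative style (our own function names): the expectation of 1/U under the conditional distribution U = 1 + Bin(n − 1, p) agrees with the closed form, and differs from the no-search answer (2.4).

```python
from math import comb

def yellin_expectation(n, p):
    """E[1/U | E_C], where U = 1 + Binomial(n-1, p)."""
    return sum(comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k) / (k + 1)
               for k in range(n))

def yellin_closed(n, p):
    """Yellin's closed form (2.10)."""
    return (1 - (1 - p)**n) / (n * p)

n, p = 10, 0.15
print(yellin_expectation(n, p))    # same value as the closed form
print(yellin_closed(n, p))
print(1 / (1 + (n - 1) * p))       # the no-search answer (2.4) differs
```

This makes the warning above concrete: quoting Yellin's formula for the no-search problem would overstate the probability of guilt.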
2.4 Conclusions
• The classical version of the island problem is not difficult to solve, but the relation between the probability of guilt and the expected number of bearers is rather subtle. The basic formula is

P_s(G | E) = 1/(1 + (n − 1)p) = 1/E(U | E_C) = E(1/U | E).

• In the case of a search we have P(G | E_C) = E(1/U | E_C) and this leads to

P(G | E_C) = (1 − (1 − p)^n)/(np).

These outcomes are independent of s.
• For the value of the likelihood ratio, it does not matter whether or not one interprets E_C as background information or as evidence; in both cases the value is 1/p and this quantity does not depend on any prior knowledge.
• The prior odds, the likelihood ratio and (hence) the posterior odds are all independent of s.
3 Dependencies
In this section we relax the condition that the Γ_i are independent random variables, or that C and S are independent. To this end, we define
(3.1) 
(3.2) 
3.1 Independent
First we assume that the Γ_i are independent (not necessarily identically distributed) random variables, but C and S are not. This is the case, for instance, in a biased search situation. It also accounts for selection effects, where certain members of the population are more likely to become a suspect than others. Now (2.1) becomes
(3.3)  
In this last expression (3.3), the first term is the likelihood ratio in the case of a search such that the probability of selecting a given suspect does not depend on the identity of the criminal. In particular, this holds for a search where S is uniformly random, but other distributions of S may also satisfy this criterion.
The middle term in (3.3) is the term that accounts for the bias of the search, i.e., it expresses the effect of the dependence between C and S.
The last term of (3.3) is the “prior odds”, the odds in favour of G when E_C is taken into account. It is of course also possible to start from the prior odds P_s(G)/P_s(G^c); this will yield the same posterior odds, but a different expression for the likelihood ratio. We will make this explicit for some special cases later on.
3.2 Arbitrary
We now assume again that C and S are independent, but we drop the assumption that the Γ_i are independent. In that case, we can write
Since we have assumed that the Γ_i are independent of C and S, we have
and we continue as
(3.4)  
(3.5) 
As for the case of a biased search, the first term is the likelihood ratio that we obtain in the case where the correlations do not play a role. The middle term, analogously to (3.3),
(3.6) 
accounts for the correlations, and the last term
describes the prior odds, conditional on E_C. If we remove this conditioning, we get
(3.7)  
(3.8) 
As for (3.5), the last line contains three terms: the likelihood ratio in the uncorrelated case, the term due to the correlation and the prior odds.
3.3 Comparison of biased search and correlations
When we compare the posterior odds (3.3) and (3.5) of the two situations, we see that the expressions are very similar. Both have a correction factor in the denominator. In fact, when C and S are independent, the two cases reduce to each other under a suitable identification of the bias and correlation terms. A trivial example of this is obtained when S is uniform on X and the Γ_i are independent Bernoulli random variables. More generally, every case of a biased search without correlations is equivalent (as far as the probability of guilt is concerned) to a case where the search is unbiased but the Γ_i are suitably correlated.
4 Heterogeneous populations
In this section we consider the situation where the population consists of several subpopulations, each with their own frequency of Γ and each with their own probability of containing the criminal. To model this, we write X as a disjoint union of m subpopulations X_1, …, X_m:

(4.1) X = X_1 ∪ ⋯ ∪ X_m,

with X_j ∩ X_k = ∅ whenever j ≠ k. If i ∈ X_j, we say that i is in subpopulation j and write g(i) = j. Let n_j be the size of subpopulation X_j. Let

(4.2) P(Γ_i = 1) = p_{g(i)},

where the frequencies p_1, …, p_m are positive and at most 1. We assume that the random variables Γ_i are independent Bernoulli variables with success probability p_{g(i)}; hence they are not identically distributed, as their distribution varies across subpopulations.
4.1 Posterior probability of guilt
It follows from the above that the correlation terms disappear. Therefore, it follows from (3.5) and (3.7) that

(4.3) P_s(G | E)/P_s(G^c | E) = P(C = s) / Σ_{i≠s} P(C = i) p_{g(i)}.

We can work this out in more detail in the case where C and S are independent and C restricted to each subpopulation is uniform:

(4.4) P(C = i) = π_{g(i)}/n_{g(i)}, where π_j = P(C ∈ X_j).
This assumption is not a real restriction, since we assume that all Γ_i are independent: it is always possible to split up the population into parts such that the Γ_i are i.i.d. on each part and (4.4) holds (a trivial decomposition would be into singletons).
First, we define c_j to be the probability that the criminal belongs to X_j, given that he has Γ:

(4.5) c_j = P(C ∈ X_j | E_C) = π_j p_j / Σ_k π_k p_k.

Now, P(C = s) = π_{g(s)}/n_{g(s)} and Σ_{i≠s} P(C = i) p_{g(i)} = Σ_j π_j p_j − π_{g(s)} p_{g(s)}/n_{g(s)}, and (4.3) can be rewritten as

(4.6) P_s(G | E)/P_s(G^c | E) = c_{g(s)} / (p_{g(s)} (n_{g(s)} − c_{g(s)})).
4.2 Likelihood ratios
It follows from (4.6) that, whether C and S are independent or not, the likelihood ratio conditioned on E_C is given by

(4.7) P_s(E_S | G ∩ E_C)/P_s(E_S | G^c ∩ E_C) = 1/p_{g(s)}.

If we assume independence of C and S and that C restricted to each subpopulation is uniform, then we obtain for the combined likelihood ratio of E_C and E_S

(4.8) P_s(E | G)/P_s(E | G^c) = (1 − π_{g(s)}/n_{g(s)}) / (Σ_j π_j p_j − π_{g(s)} p_{g(s)}/n_{g(s)}).

We note two special cases. First, when n_{g(s)} is large (which means that the prior probability of guilt for s is small), (4.8) is approximately equal to

(4.9) 1 / Σ_j π_j p_j,

in which the subpopulation to which s belongs plays no special role. A second special case arises when we take the subpopulation of s to consist of s alone, and only one other subpopulation. This is the standard practice for many forensic labs: there is a default population (the local population), and only two hypotheses are considered: either C = s, or C is from the default population. In that case, the likelihood ratio (4.8) is equal to

(4.10) 1/p_δ,

where p_δ is the frequency in the default population and the prior probability that C is from the default population is equal to 1 − P(C = s).
4.3 Discussion
It seems that (at least) two likelihood ratios can be used to answer the informal question “What is the weight of the evidence that the suspect has the same characteristic as the criminal?”. Contrary to the classical case described in Section 2, the weight of the evidence depends on whether we consider the fact that the criminal has Γ to be evidence or background information. Depending on that choice and on the prior odds of guilt for s, we may arrive at the reciprocal of either p_{g(s)}, the weighted mean Σ_j π_j p_j, or p_δ. These quantities may be very different. This articulates the fact that one should be very careful with the use of such likelihood ratios, and that one should primarily be interested in posterior odds rather than in likelihood ratios. A similar warning in a different situation can be found in [6] and [7].
On the other hand, if one wants to divide the ingredients in the computation of the posterior odds into parts that are for the court to decide, and parts that are for an expert witness to provide, one faces difficulties. We will now go into these in some detail.
4.3.1 Choice of evidence
The difference between the choice of conditioning on E_C or not is directly related to the difference between the questions “What is the probability that s has Γ, if innocent?” and “What is the probability that C has Γ, if s is innocent?”; or, more informally, “How else can we explain that s has Γ?” versus “How else can we explain that C has Γ?”. Indeed, if we consider both E_C and E_S as evidence to be expressed by a single likelihood ratio, then we can first consider E_S, and then E_C given E_S. But without knowledge of E_C, the probability that s has Γ is the same under G as under G^c, so the likelihood ratio of E_C and E_S together is in fact the same as the likelihood ratio of E_C, given E_S. Thus, the issue here is that we need to decide whether the fact that C has Γ counts as evidence against s, or not. Should the fact that C has a certain characteristic count as (legal) evidence against someone, because he belongs to a subpopulation in which the characteristic is more common? Or do we only consider the fact that s has the characteristic, knowing that C has it, as evidence? It seems unlikely that an answer can be given in full generality, but it is important to realize that the value of the evidence will depend on it.
4.3.2 Role of expert
Legal systems generally wish to make a distinction between the strength of the evidence and the strength of the case. Ideally, the expert witness informs the court about the strength of the evidence (i.e., gives a likelihood ratio), and the court combines this information with its prior to draw conclusions about the strength of the case. The prior is not discussed with, or communicated to, the expert. Hence, for this to be possible, the likelihood ratio should not depend on the prior of the court. Looking at (4.8), however, it is apparent that this likelihood ratio does depend on the prior probabilities π_j and on the size n_{g(s)} of the suspect's subpopulation. The value of the legal evidence, if taken to be both E_C and E_S, thus is a function of the prior, and seems as such to be generally not admissible in court. In the special case (4.10), however, it is; but in that case we only obtain useful information if the assumption that either C = s, or C is from the default population, is justified.
The likelihood ratio (4.7) does not suffer from these problems: it is a function of the suspect's subpopulation only, irrespective of any prior on s or on any other person or group. Thus, if a court has somehow arrived at prior odds conditional on E_C, it can use the expert's information to proceed. But it must now be made clear to the court that there is a distinction between the priors with or without E_C taken into account, and that to compute one from the other it also needs expert information.
4.3.3 In practice: which likelihood ratio?
We end this discussion by pointing out some pros and cons of the likelihood ratios (4.7) and (4.8). Clearly, (4.7) only involves the suspect. This is a conceptually satisfactory property, since it allows for a clear distinction between prior probabilities and the value of the evidence, as we have pointed out above. It may also provide a safeguard against using irrelevant information as evidence. Consider, for example, the following hypothetical scenario: at a crime scene, a hair of the criminal is found. Analysis by a forensic hair expert shows that the criminal must belong to a certain subpopulation. Later, a suspect s from this subpopulation is found. From the hair a mitochondrial DNA profile is generated, and s's mitochondrial DNA profile matches it. The court wishes to be informed about the value of that match. Clearly, it only makes sense to report (4.7), since it is at this point already known that C and s are from the same subpopulation. But the DNA expert may not know this, and if it is standard procedure to report a variant of (4.8), e.g. (4.10), then a profile frequency for the default, or even the world's, population could be reported.
On the other hand, an advantage of (4.8) is that it reduces the value of the evidence if there is a plausible alternative to s: if there are other groups in which Γ is relatively frequent, and which have a positive prior probability, then (4.8) decreases whereas (4.7) does not. But as we have seen, (4.8) can only do this because it makes use of all the prior probabilities, and as such it is likely to be inadmissible as legal evidence, especially if the court leaves the choice of prior to the expert. A possible way out would be for the expert to report all the frequencies p_j separately to the court.
Of course, in practice p_{g(s)} may be hard for the expert to determine, because he only has data about other populations, or because it is not immediately clear to which subpopulation s belongs, or even what the subpopulations themselves are. In that case, it may be practical (though potentially dangerous) to use (4.10) and report 1/p_δ (together with the hypotheses!), if the frequency in the default population is the only statistic that the expert has knowledge of.
The difference in numerical value of (4.7) and (4.8) may lead to the prosecution and the defence having different preferences for the use of (4.7) or (a variant of) (4.8). For example, if p_{g(s)} is much smaller than the weighted mean Σ_j π_j p_j, the prosecution will prefer (4.7), but the defence will point out that in the population as a whole there are subpopulations in which Γ is much more common, and therefore try to persuade the court that (4.8) better reflects the value of the match. The court should realize that both points of view can be justified: the prosecutor focuses on the suspect and comes up with the likelihood that s has Γ, if not guilty; the defence focuses on the criminal and points out that C need not be s, since there are other good candidates.
To better understand the influence of uncertainty about the frequencies in the different populations and about the suspect’s and the criminal’s subpopulation, we proceed with a more detailed model involving these issues in Section 5.
4.4 Expected number of bearers
If we choose the distribution of C as we did in the classical case, i.e., uniform on X, then we can again express the posterior probability of guilt as the inverse of an expected number of bearers. We compute E(U | E_S) = 1 + Σ_{i≠s} p_{g(i)}, and from (4.6) it follows that

(4.11) P_s(G | E) = 1/E(U | E_S),

which is the analogue of (2.5). The reader may check that similarly,

P_s(G | E) = E(1/U | E).

This is the analogue of (2.9).
4.5 Without conditioning on
Assume that is uniformly distributed on , and suppose we do not condition on . Concentrating on the conditional probability of we obtain
(4.12) 
The first term in the summation is computed above already, so we need only to compute . Since information about and its status does not say anything about , we have that
Hence it follows that
where
Hence the posterior probability of guilt is a weighted average of the conditioned ones, with weights .
4.6 Conclusions
• The probability of guilt in this situation is equal to

P_s(G | E) = c_{g(s)} / (c_{g(s)} + p_{g(s)} (n_{g(s)} − c_{g(s)})),

and this answer depends on s via the frequency p_{g(s)} of Γ in the subpopulation of s, the distribution of C, and the size n_{g(s)} of the subpopulation of s. The sizes of the other subpopulations do not play a role other than in the assessment of the π_j and thereby of the c_j, i.e., in the distribution of C.
• For the value of the likelihood ratio, it does matter whether or not E_C is interpreted as background information or as evidence. For the probability of guilt this distinction is, of course, irrelevant, but we have seen that there can be reasons to prefer a particular choice. It is preferable to use a likelihood ratio which does not involve any prior knowledge. The prior should then, in theory, be estimated by the juror.
• The probability of guilt, conditioning only on the fact that the suspect has Γ but not on the identity (subpopulation) of the suspect, is the weighted average of the individual conditional probabilities, with weight factors w_s. The sizes of the subpopulations and the distribution of C do not play a role in the weights.
5 Uncertainty about the frequency of Γ
In this section we assume that the frequency p is not known with certainty. Instead, we describe the frequency by a probability distribution.
5.1 Classical case
We assume that there are no subpopulations. The random variable C is uniform on X, and C and S are independent. To model the uncertainty about the frequency, we assume that there is a random variable Θ, taking values in [0, 1] and with density f, such that conditional on Θ = θ, the Γ_i are independent Bernoulli variables with success probability θ. We let μ denote the expectation of Θ and σ² its variance. We again condition on S = s whenever we compute odds, but all results in this section are independent of s.
Definition 5.1.
The distribution of Θ is called the prior-to-crime distribution, and the distribution of Θ conditioned on E_C is called the prior-to-suspect distribution. Finally, the distribution of Θ conditioned on both E_C and E_S is called the post-match distribution. The corresponding densities are denoted by f, f_{E_C} and f_E respectively.
Since P(E_C | Θ = θ) = θ, the continuous version of Bayes' theorem implies that

(5.1) f_{E_C}(θ) = θ f(θ)/μ.

Furthermore, we have

(5.2) f_E(θ) = θ (1 + (n − 1)θ) f(θ) / (μ + (n − 1)(μ² + σ²)).

To see this, note that

(5.3) f_E(θ) = P(E_S | E_C, Θ = θ) f_{E_C}(θ) / P(E_S | E_C),

where P(E_S | E_C, Θ = θ) = (1 + (n − 1)θ)/n, since given E_C, the suspect is the criminal with probability 1/n and otherwise bears Γ with probability θ. Now compute the denominator:

(5.4) P(E_S | E_C) = ∫₀¹ P(E_S | E_C, Θ = θ) f_{E_C}(θ) dθ

(5.5) = ∫₀¹ ((1 + (n − 1)θ)/n) · (θ f(θ)/μ) dθ

(5.6) = (μ + (n − 1) E(Θ²)) / (nμ)

(5.7) = (μ + (n − 1)(μ² + σ²)) / (nμ).

From this, the claim readily follows.
The expectation of Θ given E_C is expressed in terms of f by

(5.8) E(Θ | E_C) = ∫₀¹ θ f_{E_C}(θ) dθ = E(Θ²)/μ = μ + σ²/μ.

The expected number of bearers, given E_C, is now given by

(5.9) E(U | E_C) = 1 + (n − 1) E(Θ | E_C) = 1 + (n − 1)(μ + σ²/μ).
As in the classical case where p is known (cf. (2.5)), the inverse of this expression is equal to the posterior probability of guilt, since

(5.10) P_s(G | E)/P_s(G^c | E) = (1/(n − 1)) · P_s(E | G)/P_s(E | G^c)

(5.11) = (1/(n − 1)) · μ/(μ² + σ²), since P_s(E | G) = E(Θ) and P_s(E | G^c) = E(Θ²), so that

(5.12) P_s(G | E) = μ / (μ + (n − 1)(μ² + σ²)) = 1/(1 + (n − 1)(μ + σ²/μ)).

Since the prior probability of guilt is 1/n just as before, the likelihood ratio is μ/(μ² + σ²) = 1/(μ + σ²/μ). Since this likelihood ratio is not controversial in this case, we concentrate on the posterior probability of guilt in terms of the various conditional distributions.
As in the classical case (cf. (2.8)), we also have P_s(G | E) = E(1/U | E). Indeed,

(5.13) E(1/U | E) = ∫₀¹ E(1/U | E, Θ = θ) f_E(θ) dθ

(5.14) = ∫₀¹ (1/(1 + (n − 1)θ)) · (θ (1 + (n − 1)θ) f(θ)/(μ + (n − 1)(μ² + σ²))) dθ

(5.15) = μ / (μ + (n − 1)(μ² + σ²)) = P_s(G | E),

where we used that, conditionally on Θ = θ, (2.8) gives E(1/U | E, Θ = θ) = 1/(1 + (n − 1)θ).
The expectation E(Θ | E_C) only depends on f and not on the population size. This is to be expected, since learning that a (randomly chosen) population member has Γ is not informative about the population size. This changes when we learn E_S, the fact that a randomly selected islander has Γ as well. Indeed, in a small population this is more likely to happen, since we are more likely to accidentally select the criminal. In the extreme case where n = 1, E_S cannot offer any new information, but for other n, it does. It follows from (5.2) that

E(Θ | E) = (E(Θ²) + (n − 1) E(Θ³)) / (μ + (n − 1) E(Θ²)).

We can also write

E(Θ | E) = (E(Θ | E_C) + (n − 1) E(Θ² | E_C)) / (1 + (n − 1) E(Θ | E_C)),

if we want to express E(Θ | E) in terms of f_{E_C}. The above formula can be rewritten as

E(Θ | E) − E(Θ | E_C) = (n − 1) Var(Θ | E_C) / (1 + (n − 1) E(Θ | E_C)) ≥ 0,

where Var(Θ | E_C) denotes the variance of Θ under f_{E_C}, with equality only if n = 1 or Var(Θ | E_C) = 0 (as expected, cf. the remark above).
It is perhaps worth mentioning that one can reconstruct f from f_{E_C}, and f_{E_C} from f_E. Indeed, we have

(5.16) f(θ) = (f_{E_C}(θ)/θ) / ∫₀¹ (f_{E_C}(u)/u) du,

and

(5.17) f_{E_C}(θ) = (f_E(θ)/(1 + (n − 1)θ)) / ∫₀¹ (f_E(u)/(1 + (n − 1)u)) du.

To see this, note that from (5.1) we have

(5.18) f(θ) = μ f_{E_C}(θ)/θ.

On the other hand, integrating this relation over θ gives

1 = μ ∫₀¹ (f_{E_C}(u)/u) du,

and the first claim (5.16) follows.

For (5.17) we simply note from (5.1) and (5.2) that

(5.19) f_E(θ) = f_{E_C}(θ)(1 + (n − 1)θ)/c,

where c = 1 + (n − 1) E(Θ | E_C). Integrating this equation gives

∫₀¹ f_E(u)/(1 + (n − 1)u) du = 1/c,

and this expresses c in terms of f_E. Substituting this into (5.19) gives (5.17).
As a conclusion, we have seen that

f_{E_C}(θ) ∝ θ f(θ) and f_E(θ) ∝ θ (1 + (n − 1)θ) f(θ),

so one has

E(Θ) ≤ E(Θ | E_C) ≤ E(Θ | E).
5.1.1 Conclusions
• The conditional probability of guilt expressed in terms of f is

(5.20) P_s(G | E) = 1/(1 + (n − 1)(μ + σ²/μ)).

Therefore, ignoring the uncertainty (i.e., using μ instead of μ + σ²/μ) is unfavourable to the suspect. If, on the other hand, one incorrectly assumes that there is uncertainty, then this is favourable to the suspect.
• The conditional probability of guilt expressed in terms of f_{E_C} is

(5.21) P_s(G | E) = 1/(1 + (n − 1) E(Θ | E_C)).

In this case, the uncertainty in Θ is irrelevant in the sense that its variance plays no role.
• The conditional probability of guilt expressed in terms of f_E is

(5.22) P_s(G | E) = E(1/(1 + (n − 1)Θ) | E).

Ignoring the uncertainty in Θ (obtaining 1/(1 + (n − 1) E(Θ | E))) would be favourable to the suspect.
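The first conclusion can be illustrated with an exact enumeration once more; this is our own sketch (hypothetical names), using a simple two-point distribution for the unknown frequency Θ.

```python
from itertools import product

def guilt_uncertain_freq(n, thetas, weights, s=0):
    """Exact P(C = s | E_C, E_S) when the frequency Theta takes the
    values thetas[k] with probabilities weights[k]; C is uniform on
    range(n) and the suspect s is fixed."""
    num = den = 0.0
    for t, wt in zip(thetas, weights):
        for gammas in product([0, 1], repeat=n):
            w = wt
            for g in gammas:
                w *= t if g else 1 - t
            for c in range(n):
                if gammas[c] and gammas[s]:
                    den += w / n
                    if c == s:
                        num += w / n
    return num / den

n = 8
thetas, weights = [0.01, 0.09], [0.5, 0.5]
mu = 0.05                                # E[Theta]
m2 = 0.5 * (0.01**2 + 0.09**2)           # E[Theta^2] = mu^2 + sigma^2
exact = guilt_uncertain_freq(n, thetas, weights)
closed = 1 / (1 + (n - 1) * m2 / mu)     # uses mu + sigma^2/mu, cf. (5.20)
naive = 1 / (1 + (n - 1) * mu)           # ignores the variance
print(exact, closed)                     # the two agree
print(naive)                             # larger: unfavourable to the suspect
```

The naive value, which ignores the variance of Θ, overstates the probability of guilt, in line with the conclusion above.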
5.2 Uncertainty about the criminal’s subpopulation
Suppose that, as in Section 4, the population is divided into subpopulations X_1, …, X_m, and that the criminal has characteristic Γ. We let Θ_j be the random variable modelling the frequency of Γ in X_j. The expectation resp. variance of Θ_j are denoted by μ_j resp. σ_j². So, if j ≠ k then Θ_j and Θ_k are independent, and furthermore conditional on