Suppose there are two coins and the percentage that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \), \(p \neq q \), and the values are given and known. If you are free to flip one of the coins any number of times, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?

The distribution of heads after \(n\) flips for a coin will be a binomial distribution with means at \(pn\) and \(qn\).

** Setting Up Our Hypothesis Test **

Let’s say we want to test if our coin is the \(p\) coin and let’s say we arbitrarily decide to call the smaller probability \(p\), i.e. \(p < q\). We know that coin flips give us a binomial distribution, and we know the standard deviation of a binomial random variable \(X_p\) (let \(X_p\) or \(X_{p,n}\) be a binomial random variable for the number of flips that are heads, where the probability of a head on a flip is \(p\) and we do \(n\) number of flips), which is:

$$ \textrm{Standard Deviation of }{X_p} = \sqrt{ Var\left( {X_p} \right) } = \sqrt{ np(1-p) } $$

—–

Digression: we can also split our \(n\) Bernoulli trial coin flips that make up our binomial random variable \(X_p\) into \(m\) number of binomial random variables \(X_{p,k}\) each with \(k\) trials, such that \(k \times m = n\). Then the standard error of the mean proportion of heads from \(m\) binomial random variables (each with \(k\) trials) is:

$$ \textrm{Standard error of the mean} = \sqrt{ Var\left( \overline{X_{p,k}} \right) } = \sqrt{ Var \left( {1 \over m} \sum_{i=1}^{m} {X_{p,k}} \right) } $$

$$= \sqrt{ Var(\sum_{i=1}^{m} X_{p,k}) \over m^2 } = \sqrt{ m \cdot Var(X_{p,k}) \over m^2 }= \sqrt{ {m \cdot kp(1-p) \over m^2 } } = \sqrt{ { kp(1-p) \over m} } $$

This standard error above is for the random variable \(X_{p,k}\), each of which has \(k\) Bernoulli trials. In other words, the standard deviation of \( {1 \over m} \sum_{i=1}^{m} X_{p,k} \) is \( \sqrt{ kp(1-p) \over m }\). But if you simply change \(k\) to \(km = n\) and reduce \(m\) to \(1\), you get the same result as if you took all \(km = n\) trials as the number of trials for one binomial random variable, our original \(X_p\): where we now say that the standard deviation of \( {1 \over 1} \sum_{i=1}^{1} X_{p,n} = X_{p,n} = X_p \) is \( \sqrt{ np(1-p) \over 1 } = \sqrt{ np(1-p) } \).

By going from \(m\) repetitions of \(X_{p,k}\) to \(1\) repetition of \(X_{p,n}\), both the mean and the standard deviation is multiplied by \(m\). The mean of \(X_{p,k}\) is \(kp\) and the mean of \(X_{p,n}\) is \(mkp = np\); the standard deviation of \(X_{p,k}\) is \( \sqrt{ kp(1-p) } \) and the standard deviation of \(X_{p,n}\) is \( \sqrt{ mkp(1-p) } =\sqrt{ np(1-p) } \). The standard error of the mean of \(m\) repetitions of \(X_{p,k}\) is \( \sqrt{ { kp(1-p) \over m} } \) while the mean of \(m\) repetitions of \(X_{p,k}\) is of course just \( {1 \over m} \sum_{i=1}^{m} \mathbb{E} \left[ X_{p,k} \right] = {1 \over m} m (kp) = kp \). So when going from \(1\) repetition of \(X_{p,k}\) to \(m\) repetitions of \(X_{p,k}\), the mean goes from \(kp\) to \(mkp = np\) and the standard error of the mean of \(X_{p,k}\) goes from \( \sqrt{ { kp(1-p) \over m} } \) to the standard deviation of \( X_{p,n} \) by multiplying the standard error of the mean of \( X_{p,k} \) by \(m\): \( m \cdot \sqrt{ { kp(1-p) \over m} } = \sqrt{ { m^2 \cdot kp(1-p) \over m} } = \sqrt{ { mkp(1-p)} } = \sqrt{ { np(1-p)} } \).

—–

Knowing the standard deviation of our random variable \(X_p\), a 0.05 significance level for a result that “rejects” the null would mean some cutoff value \(c\) where \(c > pn\). If \(x_p\) (the sample number of heads from \(n\) coin tosses) is “too far away” from \(pn\), i.e. we have \(x_p > c\), then we reject the null hypothesis that we have been flipping the \(p\) coin.

But note that if we choose a \(c\) that far exceeds \(qn\) as well, we are in a weird situation. If \(x_p > c\), then \(x_p \) is “too large” for \(pn\) but also is quite larger than \(qn\) (i.e. \( x_p > qn > pn \) ). This puts us in an awkward situation because while \(x_p \) is much larger than \(pn\), making us want to reject the hypothesis that we have were flipping the \(p\) coin, it is also quite larger than \(qn\), so perhaps we obtained a result that was pretty extreme “no matter which coin we had.” If we assume the null hypothesis that we have the \(p\) coin, our result \(x_p \) is very unlikely, but it is also quite unlikely even if we had the \(q\) coin, our alternative hypothesis. But still, it is more unlikely that it is the \(p\) coin than it is the \(q\) coin, so perhaps it’s not that awkward. But what if \(x_p\) does not exceed \(c\)? Then we can’t reject the null hypothesis that we have the \(p\) coin. But our sample result of \(x_p\) might in fact be closer to \(qn\) than \(pn\) – \(x_p\) might even be right on the dot of \(qn\) – and yet we aren’t allowing ourselves to use that to form a better conclusion, which is a truly awkward situation.

If \(c\) is, instead, somewhere in between \(pn\) and \(qn\), and \(x_p > c\), we may reject the null hypothesis that our coin is the \(p\) coin, while \(x_p\) is in a region close to \(q\), i.e. a region that is a more likely result if we actually had been flipping the \(q\) coin, bringing us closer to the conclusion that this is the \(q\). However, if we reverse the experiment – if we use the same critical value \(c\) and say that if \(x_p < c\) then we reject our null hypothesis that \(q\) is our coin, then the power and significance of the test for each coin is different, which is also awkward.

Above, the pink region is the probability that \(X_p\) ends in the critical region, where \(x_p > c\), assuming the null hypothesis that we have the \(p\) coin. This is also the Type I Error rate (a.k.a. false positive) – the probability that we end up falsely rejecting the null hypothesis, assuming that the null hypothesis is true.

Above, the green region is the power \(1-\beta\), the probability that we get a result in the critical region \(x_p > c\) assuming that the alternative hypothesis is true, that we have the \(q\) coin. The blue-gray region is \(\beta\), or the Type II Error rate (a.k.a. false negative) – the probability that we fail to reject the null hypothesis (that we have the \(p\) coin) when what’s actually true is the alternative hypothesis (that we have the \(q\) coin).

Now let us “reverse” the experiment with the same critical value – we want to test our null hypothesis that we have the \(q\) coin:

We have \(x_p < c\). We fail to reject the null hypothesis that we have the \(p\) coin, and on the flip side we would reject the null hypothesis that we have the \(q\) coin. but we have failed a tougher test (the first one, with a small \(\alpha_p\)) and succeeded in rejecting an easier test (the second one, with a larger \(\alpha_q\)). In hypothesis testing, we would like to be conservative, so it is awkward to have failed a tougher test but "be ok with it" since we succeeded with an easier test. Common sense also, obviously, says that something is strange when \(x_p\) is closer to \(q\) than \(p\) and yet we make the conclusion that since \(x_p\) is on the "\(p\)-side of \(c\)," we have the \(p\) coin. In reality, we wouldn't take one result and apply two hypothesis tests on that one result. But we would like the one test procedure to make sense with whichever null hypothesis we start with, \(p\) coin or \(q\) coin (since it is arbitrary which null hypothesis we choose in the beginning, for we have no knowledge of which coin we have before we start the experiment).

What we can do, then, is to select \(c\) so that the probability, under the hypothesis that we have the \(p\) coin, that \(X_p > c\) is equal to the probability that, under the hypothesis that we have the \(q\) coin, that \(X_q < c\). In our set up, we have two binomial distributions, which are discrete, as opposed to the normal distributions above. In addition, binomial distributions, unless the mean is at \(n/2\), are generally not symmetric, as can be seen in the very first figure, copied below as well, where the blue distribution is symmetric but the green one is not.

We can pretend that the blue distribution is the binomial distribution for the \(p\) coin and the green distribution for the \(q\) coin. The pmf of a binomial random variable, say \(X_p\) (that generates Heads or Tails with probability of Heads \(p\)) is:

$$ {n \choose h} p^h (1-p)^{n-h} $$

where \(n\) is the total number of flips and \(h\) is the number of Heads among those flips. We let \(c\) be the critical number of Heads that would cause us to reject the null hypothesis that the coin we have is the \(p\) coin in favor of the alternative hypothesis that we have the \(q\) coin. The area of the critical region, i.e. the probability that we get \(c_H\) heads or more assuming the hypothesis that we have the \(p\) coin, is:

$$ Pr(X_p > c) = \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] $$

And the reverse, the probability that we get \(c_H\) heads or less assuming the hypothesis that we have the \(q\) coin, is:

$$ Pr(X_q < c) = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$ So we want to set these two equal to each other and solve for \(c\): $$ \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$ But since the binomial distribution is discrete, there may not be a \(c\) that actually works. For large \(n\), a normal distribution can approximate the binomial distribution. In that case, we can draw the figure below, which is two normal distributions, each centered on \(pn\) and \(qn\) (the means of the true binomial distributions), and since normal distributions are symmetric, the point at which the distributions cross will be our critical value. The critical regions for \(X_p\) (to the right of \(c\)) and for \(X_q\) (to the left of \(c\)) will have the same area.

If we pretend that these normal distributions are binomial distributions, i.e. if we pretend that our binomial distributions are symmetric (i.e. we pretend that \(n\) is going to be large enough that both our binomial distributions of \(X_p\) and \(X_q\) are symmetric enough), then to find \(c\) we can find the value on the horizontal axis at which, i.e. the number of Heads at which, the two binomial probability distributions are equal to each other.

$$ {n \choose c} p^c (1-p)^{n-c} = {n \choose c} q^c (1-q)^{n-c} $$

$$ p^c (1-p)^{n-c} = q^c (1-q)^{n-c} $$

$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n-c} = 1 $$

$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n} \left({1-q \over 1-p}\right)^c = 1 $$

$$ \left({p(1-q) \over q(1-p)}\right)^c = \left({1-q \over 1-p}\right)^{n} $$

$$ c \cdot log \left({p(1-q) \over q(1-p)}\right) = n \cdot log \left({1-q \over 1-p}\right) $$

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

The mean of a binomial distribution \(X_p\) has mean \(pn\) with standard deviation \(\sqrt{np(1-p)}\). With a normal distribution \(X_{\textrm{norm}}\) with mean \(\mu_{\textrm{norm}}\) and standard deviation \(\sigma_{\textrm{norm}}\), the value \( c_{\alpha} = X_{\textrm{norm}} = \mu_{\textrm{norm}} = 1.645\sigma_{\textrm{norm}}\) is the value where the area from that value \(c_{\alpha}\) to infinity is \(0.05 = \alpha\). Thus, \( c_{\alpha} \) is the critical value for a normal random variable where the probability that \(X_{\textrm{norm}} > c_{\alpha} = 0.05)\). So for a binomial random variable \(X_p\), we would have \(c_{\textrm{binomial, }\alpha} = pn + 1.645\sqrt{np(1-p)}\).

Thus, we have that this critical value for a binomial random variable \(X_p\):

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

must also be

$$ c_{\textrm{binomial, }\alpha} \geq pn + 1.645\sqrt{np(1-p)} $$

for the area to the right of \(c\) to be \(\leq 0.05\). To actually find the critical value \(c_{\textrm{binomial, }\alpha}\), we can just use

$$ c_{\textrm{binomial, }\alpha} \geq pn + 1.645\sqrt{np(1-p)} $$

Since we are given the values of \(p\) and \(q\), we would plug in those values to find the required \(n\) needed to reach this condition for the critical value. So we have

$$ n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) = pn + 1.645\sqrt{np(1-p)} $$

$$ \sqrt{n} = 1.645\sqrt{p(1-p)} / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) – p \right] $$

$$ n = 1.645^2p(1-p) / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) – p\right]^2 $$

For example, if \(p = 0.3\) and \(q = 0.7\), we have \(n = 14.2066 \), or rather, \(n \geq 14.2066 \).

Wolfram Alpha calculation of above, enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) – p )^2; p = 0.3, q = 0.7

Note that if we switch the values so that \(p = 0.7\) and \(q = 0.3\), or switch the \(p\)’s and \(q\)’s in the above equation for \(n\), we obtain the same \(n_{\textrm{min}}\). This makes sense since our value for \(n_{\textrm{min}}\) depends on \(c\) and \(c\) is the value on the horizontal axis at which the two normal distributions from above (approximations of binomial distributions) with means at \(pn\) and \(qn\) cross each other. Thus, we set up the distributions so that that whole problem is symmetric.

So if we generate a sample such that the number of samples is \(n \geq 14.2066\), we can use our resulting \(x_p\) and make a hypothesis test regarding if we have the \(p\) or \(q\) coin with \(\alpha = 0.05\) significance level.

If \(p\) and \(q\) are closer, say \(p = 0.4\) and \(q = 0.5\), then we have \(n \geq 263.345\). This makes intuitive sense, where the closer the probabilities are of the two coins, the more times we have to flip our coin to be more sure that we have one of the coins rather than the other. To be more precise, the smaller the **effect size** is, the larger sample size we need in order to get the certainty about a result. An example of the effect size is **Cohen’s d** where:

$$\textrm{Cohen’s d } = {\mu_2 – \mu_1 \over \textrm{StDev / Pooled StDev}} $$

Wolfram Alpha calculation of above for \(n\) with \(p = 0.4\) and \(q = 0.5\), or enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) – p )^2; p = 0.4, q = 0.5

From here, where the question is asked originally, is an answer that finds the exact values for the two \(n_{\textrm{min}}\) using R with the actual binomial distributions (not using normal distributions as approximations):

https://math.stackexchange.com/a/2033739/506042

Due to the discrete-ness of the distributions, the \(n_{\textrm{min}}\)’s found are slightly different: \(n_{\textrm{min}} = 17\) for the first case and \(n_{\textrm{min}} = 268\) for the second case. I.e., the difference comes from using the normal distribution as an approximation for the binomial distribution.