Test Coin

https://math.stackexchange.com/questions/2033370/how-to-determine-the-number-of-coin-tosses-to-identify-one-biased-coin-from-anot/2033739#2033739

Suppose there are two coins and the probability that each coin lands heads is $p$ and $q$, respectively, with $p, q \in [0,1]$ given and known. If you are free to flip one of the coins, how many times $n$ do you have to flip it to decide at some significance level $ \left( \textrm{say } \alpha = 0.05 \right) $ whether it’s the $p$ coin or the $q$ coin that you’ve been flipping?

The number of heads after $n$ flips follows a binomial distribution, with mean $pn$ for the $p$ coin and $qn$ for the $q$ coin.

Figure: Two binomial distributions, $n = 20$. The means are $pn = 10$ and $qn = 14$.
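As a quick numerical check on that picture, here is a minimal Python sketch of the two binomial distributions. The values $p = 0.5$ and $q = 0.7$ are not stated in the post; they are inferred from the means quoted above ($pn = 10$, $qn = 14$ with $n = 20$). The helper name `binomial_pmf` is made up for this illustration.

```python
from math import comb

def binomial_pmf(n, prob):
    """Return [P(X = k) for k = 0..n] for X ~ Binomial(n, prob)."""
    return [comb(n, k) * prob**k * (1 - prob)**(n - k) for k in range(n + 1)]

n = 20
p, q = 0.5, 0.7  # assumed values, chosen so that pn = 10 and qn = 14

pmf_p = binomial_pmf(n, p)
pmf_q = binomial_pmf(n, q)

# Mean of each distribution: sum of k * P(X = k)
mean_p = sum(k * w for k, w in enumerate(pmf_p))
mean_q = sum(k * w for k, w in enumerate(pmf_q))

# Shared probability mass: the smaller it is, the easier the coins are to tell apart
overlap = sum(min(a, b) for a, b in zip(pmf_p, pmf_q))
print(mean_p, mean_q, overlap)
```

The overlap shrinks as $n$ grows, which is the intuition behind asking how many flips are needed.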

The Usual Hypothesis Test

In the usual hypothesis test, for example with data $x_i, i=1, 2, 3, …, n$ from a random variable $X$, to find out if the mean $ \mu $ is $\leq$ some constant $\mu_0$:

\begin{align}
H_0 & : \mu \leq \mu_0 \quad ( \textrm{and } X \sim N(\mu_0, \textrm{ some } \sigma^2 ) ) \\
H_1 & : \mu > \mu_0
\end{align}

If the sample mean of the data points $ \overline{x} $ is “too large compared to” $ \mu_0 $, then we reject the null hypothesis $ H_0 $.

If we have the probability distribution of the random variable (even if we don’t know the true value of the mean $ \mu $), we may be able to know something about the probability distribution of a statistic obtained from manipulating the sample data, e.g. the sample mean. This, the probability distribution of a statistic (obtained from manipulating sample data), is called the sampling distribution. And a property of the sampling distribution, the standard deviation of a statistic, is the standard error. For example, the standard error of the mean is:

Sample Data: $x$ $\qquad$ Sample Mean: $ \overline{x} $

Variance: $ Var(x) $ $\qquad$ Standard Deviation: $ StDev(x) = \sigma(x) = \sqrt{Var(x)} $

Variance of the Sample Mean (for independent $x_i$): $ Var( \overline{x} ) = Var \left( \frac{1}{n} \sum_{i=1}^{n}{ x_i } \right) = \frac{1}{n^2} \sum_{i=1}^{n} { Var(x_i) } = \frac{1}{n^2} n Var(x) = \frac{1}{n} Var(x) = {\sigma^2 \over n} $

Standard Deviation of the Sample Mean, Standard Error of the Mean: $ \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} $

Thus, if the data are $i.i.d.$ (independent and identically distributed) draws of $X$, the sample mean $ \overline{x} $ we obtain from the data has a standard deviation of $ \frac{\sigma}{\sqrt{n}} $. This is smaller than $\sigma$, the standard deviation of the original $X$, so the distribution of $\overline{X}$ is narrower around the mean than that of $X$. $\overline{X}$ therefore lets us home in on what the data says about $ \mu $ better than a single draw of $X$ does: it gives a narrower, more precise “range of certainty” at the same significance level.

Thus, given our sample $x_i, i = 1, \dots, n $, we can calculate the statistic $ \overline{x} = \frac{1}{n} \sum_{i=1}^{n} {x_i} $, our sample mean. From the data (or given information), we would like to calculate the standard error of the mean, the standard deviation of this sample mean as a random variable (where the sample mean is a statistic, i.e. can be treated as a random variable): $ \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} $. This standard error of the mean gives us a “range of certainty” around the $\overline{x}$ with which to make an inference.
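To see the $ \sigma / \sqrt{n} $ shrinkage numerically, here is a small simulation sketch. The values of `mu`, `sigma`, `n` and the number of repetitions are illustrative choices, not taken from the post, and the normal distribution is used only as a convenient example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

mu, sigma, n = 5.0, 2.0, 25   # illustrative values, not from the post
num_samples = 4000

# Draw many samples of size n and record each sample mean
sample_means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(num_samples)
]

# Spread of the sample means versus the formula sigma / sqrt(n)
empirical_se = statistics.pstdev(sample_means)
theoretical_se = sigma / n ** 0.5

print(empirical_se, theoretical_se)
```

The empirical spread of the sample means should land close to $ \sigma / \sqrt{n} = 0.4 $, well below the spread $ \sigma = 2 $ of a single observation.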

A. If we know/are given the true standard deviation $ \sigma $

If we are given the true standard deviation $ \sigma $ of the random variable $ X $, then we can calculate the standard error of the sample mean: $ \frac{\sigma}{\sqrt{n}} $. So under the null hypothesis $ H_0: \mu \leq \mu_0 $, we want to check if the null hypothesis can hold against a test using the sample data.

A.a Digression about $H_0: \mu \leq \mu_0$ and $H_0: \mu = \mu_0$

If the $\mu$ we infer from the sample data is “too extreme,” in this case “too large” compared to $\mu_0$, i.e. the test statistic is greater than some critical value $c(\mu_0)$ that depends on $\mu_0$, we reject the null hypothesis. If we instead check some $\mu_1 < \mu_0$ (allowed, since our null hypothesis is $ H_0: \mu \leq \mu_0 $), the critical value $c(\mu_1)$ will be less extreme than $c(\mu_0)$ (in other words $ c(\mu_1) < c(\mu_0) $), and thus it would be "easier to reject" the null hypothesis using $ c(\mu_1) $. Rejecting a null hypothesis is reaching a conclusion, so the test should be conservative: we want it to be "the hardest to reject" that we can make it. The "hardest to reject" part of the range $H_0: \mu \leq \mu_0$ is $ \mu = \mu_0 $, where the critical value $ c(\mu_0) $ is the largest possible. Testing with a $\mu_1 < \mu_0$ means that we may obtain a test statistic that is too extreme/large for $\mu_1$ (i.e. $ t > c(\mu_1) $) but not too extreme/large for $\mu_0$ (i.e. $ t \not> c(\mu_0) $). But if we test using $\mu_0$ and the test statistic is extreme/large enough to reject the null hypothesis $\mu = \mu_0$, it also rejects every other null hypothesis $\mu = \mu_1$ with $\mu_1 < \mu_0$.

So under the null hypothesis $ H_0: \mu \leq \mu_0 $ or the “effective” null hypothesis $ H_0: \mu = \mu_0 $, we have that $ X \sim N(\mu_0, \sigma^2) $ with $ \sigma $ known, and we have that $ \overline{X} \sim N(\mu_0, \sigma^2/n) $. This means that

$ \frac{\overline{X} - \mu_0} { ^{\sigma}/_{\sqrt{n}} } \sim N(0, 1) $

Then we can use a standard normal table to find the $ \alpha = 0.05 $ cutoff on the standard normal: for a one-tailed test, the cutoff is at $ Z_{\alpha} = 1.645 $ where $ Z \sim N(0, 1) $. So if

$ \frac{\overline{X} - \mu_0} { ^{\sigma}/_{\sqrt{n}} } > 1.645 = Z_{\alpha} $,

then this result is “too large compared to $ \mu_0 $” so we reject the null hypothesis $ H_0: \mu \leq \mu_0 $. If $ \frac{\overline{X} - \mu_0} { ^{\sigma}/_{\sqrt{n}} } \leq 1.645 = Z_{\alpha} $, then we fail to reject the null hypothesis $ H_0: \mu \leq \mu_0 $.
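The known-$\sigma$ test above can be sketched in a few lines of Python. The function name `one_sided_z_test` and the data are invented for illustration; the critical value comes from the standard library's `NormalDist().inv_cdf` instead of a printed table.

```python
from statistics import NormalDist, fmean

def one_sided_z_test(data, mu_0, sigma, alpha=0.05):
    """Test H0: mu <= mu_0 against H1: mu > mu_0 with known sigma.

    Returns (z, critical_value, reject).
    """
    n = len(data)
    z = (fmean(data) - mu_0) / (sigma / n ** 0.5)
    critical_value = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return z, critical_value, z > critical_value

# Made-up data for illustration only
data = [10.9, 11.4, 10.2, 11.8, 11.1, 10.7, 11.5, 11.0, 10.6]
z, crit, reject = one_sided_z_test(data, mu_0=10.0, sigma=1.5)
print(z, crit, reject)
```

With these made-up numbers the sample mean is about 11.02 and the standard error is $1.5/\sqrt{9} = 0.5$, so $z \approx 2.04 > 1.645$ and the null hypothesis is rejected.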

B. If we don’t know the standard deviation $ \sigma $

If we don’t know the value of the standard deviation $ \sigma $ of our random variable $ X \sim N( \mu, \sigma^2 ) $ (which would be somewhat expected if we already don’t know the value of the mean $\mu$ of $ X $), then we need to estimate $ \sigma $ from our data $ x_i, i = 1, 2, \dots, n $. We can estimate $ \sigma $ with the sample standard deviation $ s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } $, that is, by computing the sample variance $ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } $ and then taking its square root.

However, note that while the sample variance $s^2$ is an unbiased estimator of $\sigma^2$:

\begin{align}
\mathbb{E}\left[s^2\right] & = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu + \mu - \overline{x})^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left( (x_i - \mu) - (\overline{x} - \mu) \right)^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ (x_i - \mu)^2 - 2 (x_i - \mu) (\overline{x} - \mu) + (\overline{x} - \mu)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) \sum_{i=1}^{n} { (x_i - \mu) } + \sum_{i=1}^{n} { (\overline{x} - \mu)^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) (n \overline{x} - n \mu) + n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 n (\overline{x} - \mu)^2 + n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \mathbb{E} \left[ (x_i - \mu)^2 \right] } - n \mathbb{E} \left[ (\overline{x} - \mu)^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \mathbb{E} \left[ x_i^2 - 2 \mu x_i + \mu^2 \right] } - n \mathbb{E} \left[ \overline{x}^2 - 2 \mu \overline{x} + \mu^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu \mathbb{E} [x_i] + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu \mathbb{E} [\overline{x}] + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu^2 + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu^2 + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \left( \mathbb{E} [x_i] \right)^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \left( \mathbb{E} [\overline{x}] \right)^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { Var(x_i) } - n Var(\overline{X}) \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \sigma^2 } - n \frac{\sigma^2}{n} \right) \\
& = \frac{1}{n-1} \left( n \sigma^2 - \sigma^2 \right) = \sigma^2
\end{align}

that does not allow us to say that the square root of the above estimator gives us an unbiased estimator for the standard deviation $ \sigma $. In other words:

$ \mathbb{E}\left[s^2\right] = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] = \sigma^2 $

but

$ \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \neq \sigma $

because the expectation function and the square root function do not commute:

$ \sigma = \sqrt{\sigma^2} = \sqrt{ \mathbb{E}[s^2] } \neq \mathbb{E}[\sqrt{s^2}] = \mathbb{E}[s] $
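Both claims can be checked exactly on a tiny example by enumerating every possible sample. The sketch below uses rolls of a fair die with sample size 3; this is an illustrative choice (the unbiasedness derivation only needs i.i.d. data with finite variance, not normality), and the helper name `sample_variance` is made up here.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

faces = [1, 2, 3, 4, 5, 6]   # one roll of a fair die (illustrative choice)
n = 3                        # sample size

# True mean and variance of a single roll, as exact fractions
mu = Fraction(sum(faces), len(faces))
sigma2 = sum((Fraction(x) - mu) ** 2 for x in faces) / len(faces)

def sample_variance(sample):
    """Sample variance with the 1/(n-1) factor, computed exactly."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)

# Every possible sample of size n is equally likely
samples = list(product(faces, repeat=n))
expected_s2 = sum(sample_variance(s) for s in samples) / len(samples)
expected_s = sum(sqrt(sample_variance(s)) for s in samples) / len(samples)

print(expected_s2 == sigma2)       # E[s^2] equals sigma^2 exactly
print(expected_s, sqrt(sigma2))    # E[s] falls below sigma
```

The average of $s^2$ over all 216 samples equals $\sigma^2 = 35/12$ exactly, while the average of $s$ comes out smaller than $\sigma$.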

B.a The sample standard deviation $ s = \sqrt{s^2} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } $ is a biased estimator of $ \sigma $

In fact, we can infer the direction of the bias of $ s $. The square root function $ f(x) = \sqrt{x} $ is a concave function. A concave function $ f $ satisfies:

$$ \forall x_1, x_2 \in X, \forall t \in [0, 1]: \quad f(tx_1 + (1 - t) x_2 ) \geq tf(x_1) + (1 - t) f(x_2) $$

The left-hand side of the inequality is the blue portion of the curve $ \{ f( \textrm{mixture of } x_1 \textrm{ and } x_2 ) \} $ and the right-hand side of the inequality is the red line segment $ \{ \textrm{a mixture of } f(x_1) \textrm{ and } f(x_2) \} $. We can see visually what it means for a function to be concave: between two arbitrary $x$-values $x_1$ and $x_2$, the blue portion is always $\geq$ the red portion.

Jensen’s Inequality says that if $ g(x) $ is a convex function, then:

$$ g( \mathbb{E}[X] ) \leq \mathbb{E}\left[ g(X) \right] $$

and if $ f(x) $ is a concave function, then:

$$ f( \mathbb{E}[X] ) \geq \mathbb{E}\left[ f(X) \right] $$

The figure above showing the concave function $f(x) = \sqrt{x}$ gives an intuitive illustration of Jensen’s Inequality as well (since Jensen’s Inequality can be said to be a generalization of the “mixture” of $x_1$ and $x_2$ property of convex and concave functions to the expectation operator). The left-hand side $ f(\mathbb{E}[X]) $ is like $ f( \textrm{a mixture of } X \textrm{ values} ) $ and the right-hand side $ \mathbb{E}\left[ f(X) \right] $ is like $ {\textrm{a mixture of } f(X) \textrm{ values} } $ where the “mixture” in both cases is the “long-term mixture” of $ X $ values that is determined by the probability distribution of $ X $.

Since $ f(z) = \sqrt{z} $ is a concave function, going back to our estimation of the standard deviation of $ X $ using $\sqrt{s^2}$, we have
\begin{align}
f( \mathbb{E}[Z] ) & \geq \mathbb{E}\left[ f(Z) \right] \longrightarrow \\
\sqrt{\mathbb{E}[Z]} & \geq \mathbb{E}\left[ \sqrt{Z} \right] \longrightarrow \\
\sqrt{ \mathbb{E}[s^2] } & \geq \mathbb{E}\left[ \sqrt{s^2} \right] \longrightarrow \\
\sqrt{ Var(X) } & \geq \mathbb{E}\left[s\right] \\
\textrm{StDev} (X) = \sigma(X) & \geq \mathbb{E}\left[s\right] \\
\end{align}

Thus, $ \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \leq \sigma $. So $ s $ is a biased estimator that, on average, underestimates the true $\sigma$.

However, the exact bias $ \textrm{Bias}(s) = \mathbb{E} [s] - \sigma $ is not as clean to show.

https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

For normally distributed $X$, $ \frac{(n-1)s^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } $ follows a $ \chi^2 $ distribution with $ n-1 $ degrees of freedom. Consequently, $ \sqrt{ \frac{(n-1)s^2}{\sigma^2} } = \frac{\sqrt{n-1}s}{\sigma} = \frac{1}{\sigma} \sqrt{ \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } $ follows a $ \chi $ distribution with $ n-1 $ degrees of freedom. A $ \chi $ distribution with $k$ degrees of freedom has mean $ \mu_{\chi} = \sqrt{2} \frac{\Gamma ( ^{(k+1)} / _2 ) } { \Gamma ( ^k / _2 )} $, where $ \Gamma(z) $ is the Gamma function, so with $ k = n-1 $ we have $ \mathbb{E} \left[ \frac{\sqrt{n-1}s}{\sigma} \right] = \mu_{\chi} $.

https://en.wikipedia.org/wiki/Gamma_function

If $n$ is a positive integer, then $ \Gamma(n) = (n - 1)! $. If $z$ is a complex number with positive real part, then $ \Gamma(z) = \int_{0}^{\infty}{x^{z-1} e^{-x} dx} $; the function extends by analytic continuation to all complex numbers except the non-positive integers, where it has poles (it goes to $\infty$ or $-\infty$).

From the mean of a $ \chi $ distribution above, we have:

$ \mathbb{E}[s] = {1 \over \sqrt{n - 1} } \cdot \mu_{\chi} \cdot \sigma $

and replacing $k$ with $n-1$ degrees of freedom for the value of $\mu_{\chi}$, we have:

$ \mathbb{E}[s] = \sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } \cdot \sigma $

Wikipedia tells us that:

$ \sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } = c_4(n) = 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} + O(n^{-4}) $

So we have:

$ \textrm{Bias} (s) = \mathbb{E}[s] - \sigma = c_4(n) \cdot \sigma - \sigma = ( c_4(n) - 1) \cdot \sigma $

$ = \left( \left( 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} + O(n^{-4}) \right) - 1 \right) \cdot \sigma = - \left( {1 \over 4n} + {7 \over 32n^2} + {19 \over 128n^3} + O(n^{-4}) \right) \cdot \sigma $

Thus, as $n$ becomes large, the magnitude of the bias becomes small.

From Wikipedia, these are the values of $n$, $c_4(n)$, and the numerical value of $ c_4(n) $:

\begin{array}{|l|r|c|}
\hline
n & c_4(n) & \textrm{Numerical value of } c_4(n) \\
\hline
2 & \sqrt{2 \over \pi} & 0.798… \\
3 & {\sqrt{\pi} \over 2} & 0.886… \\
5 & {3 \over 4}\sqrt{\pi \over 2} & 0.940… \\
10 & {108 \over 125}\sqrt{2 \over \pi} & 0.973… \\
100 & – & 0.997… \\
\hline
\end{array}
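The table values can be reproduced directly from the Gamma-function formula for $c_4(n)$. The sketch below uses `math.lgamma` (the log of the Gamma function) so that large $n$ does not overflow; the function name `c4` is just a label for this example.

```python
from math import exp, lgamma, pi, sqrt

def c4(n):
    """c4(n) = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2)."""
    # lgamma avoids overflow of Gamma for large n
    return sqrt(2.0 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))

for n in (2, 3, 5, 10, 100):
    # First terms of the series expansion, for comparison
    series = 1 - 1 / (4 * n) - 7 / (32 * n**2) - 19 / (128 * n**3)
    print(n, c4(n), series)
```

The truncated series is only a rough match at $n = 2$, and it approaches the exact value as $n$ grows.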

Thus, for the most part, we don’t have to worry too much about this bias, especially with large $n$. So we have
$ \sigma \approx \mathbb{E}[s] = \mathbb{E}[\sqrt{s^2}] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] $

More rigorously, our estimator $ \hat{\sigma} = s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } $ is a consistent estimator of $ \sigma $ (even though it is a biased estimator of $ \sigma $).

An estimator is consistent if $ \forall \epsilon > 0 $:

$$ \lim\limits_{n \to \infty} \textrm{Pr } (|\hat{\theta} - \theta| > \epsilon ) = 0 $$

In other words, as $ n \to \infty $, the probability that our estimator $ \hat{\theta} $ “misses” the true value of the parameter $\theta$ by greater than some arbitrary positive amount (no matter how small) goes to $0$.

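For the sample standard deviation $s$ as our estimator of the true standard deviation $\sigma$ (i.e. let $\hat{\sigma} = s$), we have $ \mathbb{E}[s^2] = \sigma^2 $ and $ \mathbb{E}[s] = c_4(n) \sigma $, so the mean squared error is

$ \mathbb{E}\left[ (\hat{\sigma} - \sigma)^2 \right] = \mathbb{E}[s^2] - 2 \sigma \mathbb{E}[s] + \sigma^2 = 2 \sigma^2 \left( 1 - c_4(n) \right) \to 0 \textrm{ as } n \to \infty $

since $ c_4(n) \to 1 $. Applying Markov’s inequality to $ (\hat{\sigma} - \sigma)^2 $ then gives, for every $ \epsilon > 0 $:

$ \lim_{n \to \infty} \textrm{Pr } (|\hat{\sigma} - \sigma| > \epsilon) \leq \lim_{n \to \infty} \frac{ 2 \sigma^2 \left( 1 - c_4(n) \right) }{ \epsilon^2 } = 0 $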

Since $s$ is a consistent estimator of $\sigma$, we are fine to use $s$ to estimate $\sigma$ as long as we have large $n$.

So back to the matter at hand: we want to know the sampling distribution of $\overline{X} $ to see “what we can say” about $\overline{X}$, specifically its standard deviation, i.e. the standard error of the mean. Not knowing the true standard deviation $\sigma$ of $X$, we use a consistent estimator of $\sigma$ to estimate it: $s = \sqrt{{1 \over n-1} \sum_{i=1}^n {(x_i - \overline{x})^2}}$.

So instead of the case where we know the value of $\sigma$,
$\overline{X} \sim N(\mu, \sigma^2/n)$,
we have something like:
$\overline{X} \quad \textrm{“}\sim\textrm{”} \quad N(\mu, s^2/n)$

When we know the value of $\sigma$, we have
${ \overline{X} - \mu \over \sigma/\sqrt{n} } \sim N(0,1) $.
When we don’t know the value of $\sigma$ and use the estimate $s$, instead of having something like
${ \overline{X} - \mu \over s/\sqrt{n} } \quad \textrm{“}\sim\textrm{”} \quad N(0,1) $,
we actually have the exact distribution:
${ \overline{X} - \mu \over s/\sqrt{n} } \sim T_{n-1} $,
Student’s t-distribution with $n-1$ degrees of freedom.

Thus, finally, when we don’t know the true standard deviation $\sigma$, under the null hypothesis $ H_0: \mu \leq \mu_0 $ (tested at $ \mu = \mu_0 $), we can use the expression above to create a test statistic
$ t = { \overline{x} - \mu_0 \over s/\sqrt{n} } \sim T_{n-1} $
and check it against a critical value of Student’s t-distribution with $n-1$ degrees of freedom, $T_{n-1}$, at some significance level, say $\alpha = 0.05$.

So if the test statistic exceeds our critical value at $\alpha = 0.05$:

$ t = { \overline{x} - \mu_0 \over s/\sqrt{n} } > T_{n-1, \alpha} $

then we reject our null hypothesis $ H_0: \mu \leq \mu_0 $ at $\alpha = 0.05$ significance level. If not, then we fail to reject our null hypothesis.
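A minimal sketch of this unknown-$\sigma$ test, using only the Python standard library. The data and the function name `one_sided_t_statistic` are made up for illustration. Because the standard library has no t-distribution quantile function, the critical value is typed in from a t-table (the upper 5% point of $T_9$ is about 1.833) instead of being computed.

```python
from statistics import fmean, stdev

def one_sided_t_statistic(data, mu_0):
    """Return (t, degrees_of_freedom) for H0: mu <= mu_0 vs H1: mu > mu_0."""
    n = len(data)
    s = stdev(data)                     # sample standard deviation, 1/(n-1) factor
    t = (fmean(data) - mu_0) / (s / n ** 0.5)
    return t, n - 1

# Made-up data for illustration only
data = [10.9, 11.4, 10.2, 11.8, 11.1, 10.7, 11.5, 11.0, 10.6, 11.3]
t, dof = one_sided_t_statistic(data, mu_0=10.5)

# Critical value looked up in a t-table: upper 5% point of T_9 is about 1.833
t_critical = 1.833
print(t, dof, t > t_critical)
```

Here $\overline{x} = 11.05$ and $s/\sqrt{n} = 0.15$, so $t \approx 3.67$, which exceeds 1.833 and leads to rejecting $H_0$ at the 0.05 level.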


If the null hypothesis $ H_0 $ fully specifies a probability distribution, then we know the standard deviation of a single data point and hence the standard error of the sample mean; otherwise the sample data gives us a sample standard deviation with which to estimate the standard error.

Back to our case with 2 coins. Let’s say we want to test if our coin is the $p$ coin and let’s say we arbitrarily decide to call the smaller probability $p$, i.e. $p \leq q$. We know that coin flips give us a binomial distribution, and we know the standard error of the mean proportion of heads from $n$ flips. So a 0.05 significance level would mean some cutoff value $c$ where $c > p$. But note that if $c$ ends up really big relative to $q$, e.g. it gets close to $q$ or even exceeds $q$, we are in a weird situation.

We can decide on some cutoff value $c$ between $p$ and $q$. If we move $c$ around, the significance level and the power of the test change, whether we are testing $p$ or $q$.
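One way to make the cutoff idea concrete is to compute both error rates exactly from the binomial distribution and search for the smallest $n$ at which some cutoff keeps both below 0.05. This is only a sketch of one possible decision rule, not necessarily the procedure the linked answer uses. The values $p = 0.5$ and $q = 0.7$ are assumptions (they match $pn = 10$, $qn = 14$ at $n = 20$), and all function names are invented for this example.

```python
from math import comb

def upper_tail(n, prob, k):
    """P(X >= k) for X ~ Binomial(n, prob)."""
    return sum(comb(n, j) * prob**j * (1 - prob)**(n - j) for j in range(k, n + 1))

def error_rates(n, p, q, k):
    """Decision rule: say 'q coin' if heads >= k, else 'p coin' (assumes p < q).

    alpha = P(say q | coin is p), beta = P(say p | coin is q).
    """
    alpha = upper_tail(n, p, k)
    beta = 1.0 - upper_tail(n, q, k)
    return alpha, beta

def smallest_n(p, q, alpha_max=0.05, beta_max=0.05, n_limit=2000):
    """Smallest n for which some cutoff k keeps both error rates within bounds."""
    for n in range(1, n_limit + 1):
        for k in range(n + 2):
            alpha, beta = error_rates(n, p, q, k)
            if alpha <= alpha_max and beta <= beta_max:
                return n, k, alpha, beta
    return None

p, q = 0.5, 0.7   # assumed values, matching pn = 10 and qn = 14 at n = 20
n, k, alpha, beta = smallest_n(p, q)
print(n, k, k / n, alpha, beta)
```

The printed ratio `k / n` plays the role of the cutoff $c$: raising it lowers `alpha` but raises `beta`, which is the trade-off between significance level and power described above.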
