Suppose there are two coins and the percentage that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \) and the values are given and known. If you are free to flip one of the coins, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?
The distribution of heads after \(n\) flips for a coin will be a binomial distribution with means at \(pn\) and \(qn\).
The Usual Hypothesis Test
In the usual hypothesis test, for example with data \(x_i, i=1, 2, 3, …, n\) from a random variable \(X\), to find the if the mean \( \mu \) is \(\leq\) some constant \(\mu_0\):
\begin{align}
H_0 & : \mu \leq \mu_0 ( \textrm{ and } X \sim N(\mu_0, \textrm{ some } \sigma^2 ) )
H_1 & : \mu > \mu_0
\end{align}
If the sample mean of the data points \( \overline{x} \) is “too large compared to” \( \mu_0 \), then we reject the null hypothesis \( H_0 \).
If we have the probability distribution of the random variable (even if we don’t know the true value of the mean \( \mu \)), we may be able to know something about the probability distribution of a statistic obtained from manipulating the sample data, e.g. the sample mean. This, the probability distribution of a statistic (obtained from manipulating sample data), is called the sampling distribution. And a property of the sampling distribution, the standard deviation of a statistic, is the standard error. For example, the standard error of the mean is:
Sample Data: \(x\) \(\qquad\) Sample Mean: \( \overline{x} \)
Variance: \( Var(x) \) \(\qquad\) Standard Deviation: \( StDev(x) = \sigma(x) = \sqrt{Var(x)} \)
Variance of the Sample Mean: \( Var( \overline{x} ) = Var \left( \frac{1}{n} \sum_{i=0}^{n}{ x_i } \right) = \frac{1}{n^2} \sum_{i=0}^{n} { Var(x_i) } = \frac{1}{n^2} n Var(x) = \frac{1}{n} Var(x) = {\sigma^2 \over n} \)
Standard Deviation of the Sample Mean, Standard Error of the Mean: \( \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \)
Thus, if the random variable is \(i.i.d.\) (independent and identically distributed), then with the sample mean \( \overline{x} \) we obtain from the data, we can assume this \( \overline{x} \) has a standard deviation of \( \frac{\sigma}{\sqrt{n}} \). This standard deviation, being smaller than the standard deviation of the original \(X\), i.e. \(\sigma\), means that \(\overline{X}\) is narrower around the mean than \(X\). This means \(\overline{X}\) gives us a better ability to hone in on what the data says about \( \mu \) than \(X\)’s ability to hone, i.e. a narrower, more precise, “range of certainty,” from the sample data, with the same significance level.
Thus, given our sample \(x_i, i = 1, \dots, n \), we can calculate the statistic \( \overline{x} = \frac{1}{n} \sum_{i=1}^{n} {x_i} \), our sample mean. From the data (or given information), we would like to calculate the standard error of the mean, the standard deviation of this sample mean as a random variable (where the sample mean is a statistic, i.e. can be treated as a random variable): \( \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \). This standard error of the mean gives us a “range of certainty” around the \(\overline{x}\) with which to make an inference.
A. If we know/are given the true standard deviation \( \sigma \)
If we are given the true standard deviation \( \sigma \) of the random variable \( X \), then we can calculate the standard error of the sample mean: \( \frac{\sigma}{\sqrt{n}} \). So under the null hypothesis \( H_0: \mu \leq \mu_0 \), we want to check if the null hypothesis can hold against a test using the sample data.
A.a Digression about \(H_0: \mu \leq \mu_0\) and \(H_0: \mu = \mu_0\)
If the \(\mu\) we infer from the sample data is “too extreme,” in this case “too large” compared to \(\mu_0\), i.e. the test statistic is > some critical value that depends on \(\mu_0\), i.e. \(c(\mu_0)\), we reject the null hypothesis. If we check a \(\mu_1\) that is \(\mu_1 < \mu_0\) (since our null hypothesis is \( H_0: \mu < \mu_0 \)), our critical value \(c(\mu_1)\) will be less extreme than \(c(\mu_0)\) (in other words \( c(\mu_1) < c(\mu_0) \)), and thus it would be "easier to reject" the null hypothesis if using \( c(\mu_1) \). Rejecting a hypothesis test ought to be conservative since rejecting a null hypothesis is reaching a conclusion, so we would like the test to be "the hardest to reject" that we can (a conclusion, i.e. a rejection here, should be as conservative as possible). The "hardest to reject" part of the range of \(H_0: \mu \leq \mu_0\) would be \( \mu = \mu_0 \) where the critical value \( c(\mu_0) \) would be the largest possible critical value. Testing a \(\mu_1 < \mu_0\) would mean that we may obtain a test statistic that rejects is too extreme/large) for \(\mu_1\) (i.e. \( t > c(\mu_1) \) ) but not too extreme/large for \(\mu_0\) (i.e. \( t \not> c(\mu_0) \) ). But if we test using \(\mu_0\), if the test statistic is extreme/large enough that we reject the null hypothesis of \(\mu = \mu_0\), that would also reject all other null hypotheses using \(\mu_1\) where \(\mu_1 < \mu_0\).
So under the null hypothesis \( H_0: \mu \leq \mu_0 \) or the “effective” null hypothesis \( H_0: \mu = \mu_0 \), we have that \( X \sim N(\mu_0, \sigma^2) \) with \( \sigma \) known, and we have that \( \overline{X} \sim N(\mu_0, \sigma^2/n) \). This means that
\( \frac{\overline{X} – \mu_0} { ^{\sigma}/_{\sqrt{n}} } \sim N(0, 1) \)
Then we can use a standard normal table to find where on the standard normal is the \( \alpha = 0.05 \) cutoff – for a one-tailed test, the cutoff is at \( Z_{\alpha} = 1.645 \) where \( Z \sim N(0, 1) \). So if
\( \frac{\overline{X} – \mu_0} { ^{\sigma}/_{\sqrt{n}} } > 1.645 = Z_{\alpha} \),
then this result is “too large compared to \( \mu_0 \)” so we reject the null hypothesis \( H_0: \mu \leq \mu_0 \). If \( \frac{\overline{X} – \mu_0} { ^{\sigma}/_{\sqrt{n}} } \leq 1.645 = Z_{\alpha} \), then we fail to reject the null hypothesis \( H_0: \mu \leq \mu_0 \).
B. If we don’t know the standard deviation \( \sigma \)
If we don’t know the value of the standard deviation \( \sigma \) of our random variable \( X \sim N( \mu, \sigma^2 ) \) (which would be somewhat expected if we already don’t know the value of the mean \(\mu\) of \( X \)), then we need to estimate \( \sigma \) from our data \( x_i, i = 1, 2, \dots, n \). We can estimate \( \sigma \) by taking the sample standard deviation of \( x_i, i = 1, \dots, n \) by doing \( s = \sqrt{ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \), or rather the sample variance \( s^2 = { \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \) and then taking the square root of that.
However, note that while the estimator for the sample variance is unbiased:
\begin{align}
\mathbb{E}\left[s^2\right] & = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } \right] = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { \left[ (x_i -\mu + \mu – \overline{x})^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { \left[ \left( (x_i -\mu) – (\overline{x} – \mu) \right)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { \left[ (x_i – \mu)^2 – 2 (x_i – \mu) (\overline{x} – \mu) + (\overline{x} – \mu)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { \left[ (x_i – \mu)^2 – 2 (x_i – \mu) (\overline{x} – \mu) + (\overline{x} – \mu)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { (x_i – \mu)^2 } – 2 (\overline{x} – \mu) \sum_{i=0}^{n} { (x_i – \mu) } + \sum_{i=0}^{n} { (\overline{x} – \mu)^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { (x_i – \mu)^2 } – 2 (\overline{x} – \mu) (n \overline{x} – n \mu) + n (\overline{x} – \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { (x_i – \mu)^2 } – 2 n (\overline{x} – \mu)^2 + n (\overline{x} – \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=0}^{n} { (x_i – \mu)^2 } – n (\overline{x} – \mu)^2 \right] \\
& = \frac{1}{n-1} \sum_{i=0}^{n} { \mathbb{E} \left[ (x_i – \mu)^2 \right] } – n \mathbb{E} \left[ (\overline{x} – \mu)^2 \right] = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \mathbb{E} \left[ (x_i – \mu)^2 \right] } – n \mathbb{E} \left[ (\overline{x} – \mu)^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \mathbb{E} \left[ x_i^2 – 2 \mu x_i + \mu^2 \right] } – n \mathbb{E} \left[ \overline{x}^2 – 2 \mu \overline{x} + \mu^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] – 2 \mu \mathbb{E} [x_i] + \mu^2 \right) } – n \left( \mathbb{E} \left[ \overline{x}^2 \right] – 2 \mu \mathbb{E} [\overline{x}] + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] – 2 \mu^2 + \mu^2 \right) } – n \left( \mathbb{E} \left[ \overline{x}^2 \right] – 2 \mu^2 + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] – \mu^2 \right) } – n \left( \mathbb{E} \left[ \overline{x}^2 \right] – \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] – \left( \mathbb{E} [x_i] \right)^2 \right) } – n \left( \mathbb{E} \left[ \overline{x}^2 \right] – \left( \mathbb{E} [\overline{x}] \right)^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \left( Var(x_i) \right) } – n Var(\overline{X}) \right) = \frac{1}{n-1} \left( \sum_{i=0}^{n} { \left( \sigma^2 \right) } – n \frac{\sigma^2}{n} \right) \\
& = \frac{1}{n-1} \left( n \sigma^2 – \sigma^2 \right) = \sigma^2 \\
\end{align}
that does not allow us to say that the square root of the above estimator gives us an unbiased estimator for the standard deviation \( \sigma \). In other words:
\( \mathbb{E}\left[s^2\right] = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } \right] = \sigma^2 \)
but
\( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \right] \neq \sigma \)
because the expectation function and the square root function do not commute:
\( \sigma = \sqrt{\sigma^2} = \sqrt{ \mathbb{E}[s^2] } \neq \mathbb{E}[\sqrt{s^2}] = \mathbb{E}[s] \)
B.a The sample standard deviation \( s = \sqrt{s^2} = \sqrt{ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \) is a biased estimator of \( \sigma \)
In fact, we can infer the bias of \( \mathbb{E} [s] \) to some extent. The square root function \( f(x) = \sqrt{x} \) is a concave function. A concave function \( f \) is:
$$ \forall x_1, x_2 \in X, \forall t \in [0, 1]: \quad f(tx_1 + (1 – t) x_2 ) \geq tf(x_1) + (1 – t) f(x_2) $$
The left-hand side of the inequality is the blue portion of the curve \( \{ f( \textrm{mixture of } x_1 \textrm{ and } x_2 ) \} \) and the right-hand side of the inequality is the red line segment \( \{ \textrm{a mixture of } f(x_1) \textrm{ and } f(x_2) \} \). We can see visually what it means for a function to be concave, where between to arbitrary \(x\)-values \(x_1\) and \(x_2\), the blue portion is always \(\geq\) the red portion between two \(x\)-values, .
Jensen’s Inequality says that if \( g(x) \) is a convex function, then:
$$ g( \mathbb{E}[X] ) \leq \mathbb{E}\left[ g(X) \right] $$
and if \( f(x) \) is a concave function, then:
$$ f( \mathbb{E}[X] ) \geq \mathbb{E}\left[ f(X) \right] $$
The figure above showing the concave function \(f(x) = \sqrt{x}\) gives an intuitive illustration of Jensen’s Inequality as well (since Jensen’s Inequality can be said to be a generalization of the “mixture” of \(x_1\) and \(x_2\) property of convex and concave functions to the expectation operator). The left-hand side \( f(\mathbb{E}[X]) \) is like \( f( \textrm{a mixture of } X \textrm{ values} ) \) and the right-hand side \( \mathbb{E}\left[ f(X) \right] \) is like \( {\textrm{a mixture of } f(X) \textrm{ values} } \) where the “mixture” in both cases is the “long-term mixture” of \( X \) values that is determined by the probability distribution of \( X \).
Since \( f(z) = \sqrt{z} \) is a concave function, going back to our estimation of the standard deviation of \( X \) using \(\sqrt{s^2}\), we have
\begin{align}
f( \mathbb{E}[Z] ) & \geq \mathbb{E}\left[ f(Z) \right] \longrightarrow \\
\sqrt{\mathbb{E}[Z]} & \geq \mathbb{E}\left[ \sqrt{Z} \right] \longrightarrow \\
\sqrt{ \mathbb{E}[s^2] } & \geq \mathbb{E}\left[ \sqrt{s^2} \right] \longrightarrow \\
\sqrt{ Var(X) } & \geq \mathbb{E}\left[s\right] \\
\textrm{StDev} (X) = \sigma(X) & \geq \mathbb{E}\left[s\right] \\
\end{align}
Thus, \( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \right] \leq \sigma \). So \( \mathbb{E} [s] \) is biased and underestimates the true \(\sigma\).
However, the exact bias \( \textrm{Bias}(s) = \mathbb{E} [s] – \sigma \) is not as clean to show.
https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
\( \frac{(n-1)s^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } \sim \) a \( \chi^2 \) distribution with \( n-1 \) degrees of freedom. In addition, \( \sqrt{ \frac{(n-1)s^2}{\sigma^2} } = \frac{\sqrt{n-1}s}{\sigma} = \frac{1}{\sigma} \sqrt{ \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \sim \) a \( \chi \) distribution with \( n-1 \) degrees of freedom. A \( \chi \) distribution with \(k\) degrees of freedom has mean \( \mathbb{E} \left[ \frac{\sqrt{n-1}s}{\sigma} \right] = \mu_{\chi} = \sqrt{2} \frac{\Gamma ( ^{(k+1)} / _2 ) } { \Gamma ( ^k / _2 )} \) where \( \Gamma(z) \) is the Gamma function.
https://en.wikipedia.org/wiki/Gamma_function
If \(n\) is a positive integer, then \( \Gamma(n) = (n – 1)! \). If \(z\) is a complex number that is not a non-positive integer, then \( \Gamma(z) = \int_{0}^{\infty}{x^{z-1} e^{-x} dx} \). For non-positive integers, \( \Gamma(z) \) goes to \(\infty\) or \(-\infty\).
From the mean of a \( \chi \) distribution above, we have:
\( \mathbb{E}[s] = {1 \over \sqrt{n – 1} } \cdot \mu_{\chi} \cdot \sigma \)
and replacing \(k\) with \(n-1\) degrees of freedom for the value of \(\mu_{\chi}\), we have:
\( \mathbb{E}[s] = \sqrt{ {2 \over n – 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } \cdot \sigma \)
Wikipedia tells us that:
\( \sqrt{ {2 \over n – 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } = c_4(n) = 1 – {1 \over 4n} – {7 \over 32n^2} – {19 \over 128n^3} – O(n^{-4}) \)
So we have:
\( \textrm{Bias} (s) = \mathbb{E}[s] – \sigma = c_4(n) \cdot \sigma – \sigma = ( c_4(n) – 1) \cdot \sigma \)
\( = \left( \left( 1 – {1 \over 4n} – {7 \over 32n^2} – {19 \over 128n^3} – O(n^{-4}) \right) – 1 \right) \cdot \sigma = – \left( {1 \over 4n} + {7 \over 32n^2} + {19 \over 128n^3} + O(n^{-4}) \right) \cdot \sigma \)
Thus, as \(n\) becomes large, the magnitude of the bias becomes small.
From Wikipedia, these are the values of \(n\), \(c_4(n)\), and the numerical value of \( c_4(n) \):
\begin{array}{|l|r|c|}
\hline
n & c_4(n) & \textrm{Numerical value of } c_4(n) \\
\hline
2 & \sqrt{2 \over \pi} & 0.798… \\
3 & {\sqrt{\pi} \over 2} & 0.886… \\
5 & {3 \over 4}\sqrt{\pi \over 2} & 0.940… \\
10 & {108 \over 125}\sqrt{2 \over \pi} & 0.973… \\
100 & – & 0.997… \\
\hline
\end{array}
Thus, for the most part, we don’t have to worry too much about this bias, especially with large \(n\). So we have
\( \mathbb{E}[\hat{\sigma}] \approx \mathbb{E}[s] = \mathbb{E}[\sqrt{s^2}] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \right] \)
More rigorously, our estimator \( \hat{\sigma} = s = \sqrt{ \frac{1}{n-1} \sum_{i=0}^{n} { \left[ (x_i – \overline{x})^2 \right] } } \) is a consistent estimator of \( \sigma \) (even though it is a biased estimator of \( \sigma \)).
An estimator is consistent if \( \forall \epsilon > 0 \):
$$ \lim\limits_{n \to \infty} \textrm{Pr } (|\hat{\theta} – \theta| > \epsilon ) = 0 $$
In other words, as \( n \to \infty \), the probability that our estimator \( \hat{\theta} \) “misses” the true value of the parameter \(\theta\) by greater than some arbitrary positive amount (no matter how small) goes to \(0\).
For the sample standard deviation \(s\) as our estimator of the true standard deviation \(\sigma\) (i.e. let \(\hat{\sigma} = s\)),
\( \lim_{n \to \infty} (|\hat{\sigma} – \sigma|) = \lim_{n \to \infty} ( | c_4(n) \sigma – \sigma |) = (| \sigma – \sigma |) = 0 \)
so
\( \lim_{n \to \infty} \textrm{Pr } (|\hat{\sigma} – \sigma| > \epsilon) = \textrm{Pr } ( 0 > \epsilon ) = 0 \)
Since \(s\) is a consistent estimator of \(\sigma\), we are fine to use \(s\) to estimate \(\sigma\) as long as we have large \(n\).
So back to the matter at hand: we want to know the sampling distribution of \(\overline{X} \) to see “what we can say” about \(\overline{X}\), specifically, the standard deviation of \(\overline{X}\), i.e. the standard error of the mean of \(X\). Not knowing the true standard deviation \(\sigma\) of \(X\), we use a consistent estimator of \(\sigma\) to estimate it: \(s = \sqrt{{1 \over n-1} \sum_{i=1}^n {(x_i – \overline{x})^2}}\).
So instead of the case where we know the value of \(\sigma\)
\(\overline{X} \sim N(\mu, \sigma^2/n)\)
we have, instead something like:
\(\overline{X} \quad “\sim” \quad N(\mu, s^2/n)\)
When we know the value of \(\sigma\), we have
\({ \overline{X} – \mu \over \sigma/\sqrt{n} } \sim N(0,1) \)
When we don’t know the value of \(\sigma\) and use the estimate \(s\) instead of having something like
\({ \overline{X} – \mu \over s/\sqrt{n} } \quad “\sim” \quad N(0,1) \)
we actually have the exact distribution:
\({ \overline{X} – \mu \over s/\sqrt{n} } \sim T_{n-1} \)
the student’s t-distribution with \(n-1\) degrees of freedom.
Thus, finally, when we don’t know the true standard deviation \(\sigma\), under the null hypothesis \( H_0: \mu \leq \mu_0 \), we can use the expression above to create a test statistic
\( t = { \overline{x} – \mu_0 \over s/\sqrt{n} } ~ T_{n-1} \)
and check it against the student’s t-distribution with \(n-1\) degrees of freedom \(T_{n-1}\) with some critical value with some significance level, say \(\alpha = 0.05\).
So if the test statistic exceeds our critical value \(\alpha 0.05\):
\( t = { \overline{x} – \mu_0 \over s/\sqrt{n} } > T_{n-1, \alpha} \)
then we reject our null hypothesis \( H_0: \mu \leq \mu_0 \) at \(\alpha = 0.05\) significance level. If not, then we fail to reject our null hypothesis.
asdf
we know the standard deviation of a data point
If under the null hypothesis \( H_0 \) we have a probability distribution, the sample data gives us a sample standard deviation, i.e. the standard error.
Back to our case with 2 coins. Let’s say we want to test if our coin is the \(p\) coin and let’s say we arbitrarily decide to call the smaller probability \(p\), i.e. \(p \leq q\). We know that coin flips give us a binomial distribution, and we know the standard error of the mean proportion of heads from \(n\) flips. So a 0.05 significance level would mean some cutoff value \(c\) where \(c > p\). But note that if \(c\) ends up really big relative to \(q\), e.g. it gets close to \(q\) or even exceeds \(q\), we are in a weird situation.
we can decide on some cutoff value \(c\) between \(p\) and \(q\). If we change around \(c\), what happens is that the significance level and the power of the test, whether testing \(p\) or \(q\), changes.