# Category: Math

## Test Coin2

https://math.stackexchange.com/questions/2033370/how-to-determine-the-number-of-coin-tosses-to-identify-one-biased-coin-from-anot/2033739#2033739

Suppose there are two coins whose probabilities of landing Heads are $$p$$ and $$q$$, respectively, where $$p, q \in [0,1]$$, $$p \neq q$$, and both values are given and known. If you are free to flip one of the coins any number of times, how many times $$n$$ do you have to flip it to decide, at some significance level $$\left( \textrm{say } \alpha = 0.05 \right)$$, whether it’s the $$p$$ coin or the $$q$$ coin that you’ve been flipping?

The distribution of heads after $$n$$ flips for a coin will be a binomial distribution with means at $$pn$$ and $$qn$$.

Two binomial distributions, n = 20. The means are pn = 10 and qn = 14.

Setting Up Our Hypothesis Test

Let’s say we want to test if our coin is the $$p$$ coin and let’s say we arbitrarily decide to call the smaller probability $$p$$, i.e. $$p < q$$. We know that coin flips give us a binomial distribution, and we know the standard deviation of a binomial random variable $$X_p$$ (let $$X_p$$ or $$X_{p,n}$$ be a binomial random variable for the number of flips that are heads, where the probability of a head on a flip is $$p$$ and we do $$n$$ number of flips), which is:

$$\textrm{Standard Deviation of }{X_p} = \sqrt{ Var\left( {X_p} \right) } = \sqrt{ np(1-p) }$$
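As a quick numerical check of this formula, here is a minimal Python sketch that computes the mean and standard deviation directly from the binomial pmf and compares them with $$np$$ and $$\sqrt{np(1-p)}$$. The helper names `binomial_pmf` and `mean_and_sd` are made up for illustration, and $$n = 20$$, $$p = 0.5$$ are simply the values from the figure caption above.

```python
from math import comb, sqrt


def binomial_pmf(n, p):
    """Return [Pr(X = h) for h = 0..n] for X ~ Binomial(n, p)."""
    return [comb(n, h) * p**h * (1 - p) ** (n - h) for h in range(n + 1)]


def mean_and_sd(pmf):
    """Mean and standard deviation of a distribution on 0..len(pmf)-1."""
    mean = sum(h * w for h, w in enumerate(pmf))
    var = sum((h - mean) ** 2 * w for h, w in enumerate(pmf))
    return mean, sqrt(var)


n, p = 20, 0.5  # values taken from the figure caption; any n and p work
mean, sd = mean_and_sd(binomial_pmf(n, p))
print(mean, n * p)                 # mean from the pmf vs. np
print(sd, sqrt(n * p * (1 - p)))   # sd from the pmf vs. sqrt(np(1-p))
```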

—–

Digression: we can also split our $$n$$ Bernoulli trial coin flips that make up our binomial random variable $$X_p$$ into $$m$$ number of binomial random variables $$X_{p,k}$$ each with $$k$$ trials, such that $$k \times m = n$$. Then the standard error of the mean proportion of heads from $$m$$ binomial random variables (each with $$k$$ trials) is:

$$\textrm{Standard error of the mean} = \sqrt{ Var\left( \overline{X_{p,k}} \right) } = \sqrt{ Var \left( {1 \over m} \sum_{i=1}^{m} {X_{p,k}} \right) }$$
$$= \sqrt{ Var(\sum_{i=1}^{m} X_{p,k}) \over m^2 } = \sqrt{ m \cdot Var(X_{p,k}) \over m^2 }= \sqrt{ {m \cdot kp(1-p) \over m^2 } } = \sqrt{ { kp(1-p) \over m} }$$

This standard error above is for the random variable $$X_{p,k}$$, each of which has $$k$$ Bernoulli trials. In other words, the standard deviation of $${1 \over m} \sum_{i=1}^{m} X_{p,k}$$ is $$\sqrt{ kp(1-p) \over m }$$. But if you simply change $$k$$ to $$km = n$$ and reduce $$m$$ to $$1$$, you get the same result as if you took all $$km = n$$ trials as the number of trials for one binomial random variable, our original $$X_p$$: where we now say that the standard deviation of $${1 \over 1} \sum_{i=1}^{1} X_{p,n} = X_{p,n} = X_p$$ is $$\sqrt{ np(1-p) \over 1 } = \sqrt{ np(1-p) }$$.

Going from one $$X_{p,k}$$ to $$X_{p,n}$$, i.e. to the sum of $$m$$ repetitions of $$X_{p,k}$$, the mean is multiplied by $$m$$ and the standard deviation by $$\sqrt{m}$$. The mean of $$X_{p,k}$$ is $$kp$$ and the mean of $$X_{p,n}$$ is $$mkp = np$$; the standard deviation of $$X_{p,k}$$ is $$\sqrt{ kp(1-p) }$$ and the standard deviation of $$X_{p,n}$$ is $$\sqrt{ mkp(1-p) } =\sqrt{ np(1-p) }$$. The average of $$m$$ repetitions of $$X_{p,k}$$, on the other hand, has mean $${1 \over m} \sum_{i=1}^{m} \mathbb{E} \left[ X_{p,k} \right] = {1 \over m} m (kp) = kp$$ and standard deviation (the standard error of the mean) $$\sqrt{ { kp(1-p) \over m} }$$. So when going from the average of $$m$$ repetitions of $$X_{p,k}$$ to their sum $$X_{p,n}$$, both quantities are multiplied by $$m$$: the mean goes from $$kp$$ to $$mkp = np$$, and the standard error of the mean becomes the standard deviation of $$X_{p,n}$$: $$m \cdot \sqrt{ { kp(1-p) \over m} } = \sqrt{ { m^2 \cdot kp(1-p) \over m} } = \sqrt{ { mkp(1-p)} } = \sqrt{ { np(1-p)} }$$.
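The scaling relations in this digression are easy to confirm numerically. The sketch below is only an arithmetic check, with a hypothetical split of $$n = 20$$ flips into $$m = 4$$ blocks of $$k = 5$$ flips; the function names are illustrative.

```python
from math import isclose, sqrt


def sd_binomial(trials, p):
    """Standard deviation of a Binomial(trials, p) count of heads."""
    return sqrt(trials * p * (1 - p))


def se_of_block_mean(k, m, p):
    """Standard error of the average of m independent Binomial(k, p) counts."""
    return sqrt(k * p * (1 - p) / m)


k, m, p = 5, 4, 0.5  # hypothetical split: m = 4 blocks of k = 5 flips
n = k * m

# Summing the m blocks scales the standard deviation of one block by sqrt(m).
sum_check = isclose(sqrt(m) * sd_binomial(k, p), sd_binomial(n, p))
# The sum is m times the average, so m * (standard error) gives the sd of X_{p,n}.
mean_check = isclose(m * se_of_block_mean(k, m, p), sd_binomial(n, p))
print(sum_check, mean_check)
```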

—–

Knowing the standard deviation of our random variable $$X_p$$, a 0.05 significance level for a result that “rejects” the null would mean some cutoff value $$c$$ where $$c > pn$$. If $$x_p$$ (the sample number of heads from $$n$$ coin tosses) is “too far away” from $$pn$$, i.e. we have $$x_p > c$$, then we reject the null hypothesis that we have been flipping the $$p$$ coin.

But note that if we choose a $$c$$ that far exceeds $$qn$$ as well, we are in a weird situation. If $$x_p > c$$, then $$x_p$$ is “too large” for $$pn$$ but is also considerably larger than $$qn$$ (i.e. $$x_p > qn > pn$$). This is awkward because while $$x_p$$ is much larger than $$pn$$, making us want to reject the hypothesis that we were flipping the $$p$$ coin, it is also considerably larger than $$qn$$, so perhaps we obtained a result that was pretty extreme “no matter which coin we had.” If we assume the null hypothesis that we have the $$p$$ coin, our result $$x_p$$ is very unlikely, but it is also quite unlikely even if we had the $$q$$ coin, our alternative hypothesis. Still, it is more unlikely under the $$p$$ coin than under the $$q$$ coin, so perhaps it’s not that awkward. But what if $$x_p$$ does not exceed $$c$$? Then we can’t reject the null hypothesis that we have the $$p$$ coin. But our sample result $$x_p$$ might in fact be closer to $$qn$$ than to $$pn$$ (it might even be right on the dot of $$qn$$), and yet we aren’t allowing ourselves to use that to form a better conclusion, which is a truly awkward situation.

If $$c$$ is, instead, somewhere in between $$pn$$ and $$qn$$, and $$x_p > c$$, we may reject the null hypothesis that our coin is the $$p$$ coin, while $$x_p$$ is in a region close to $$qn$$, i.e. a region that is a more likely result if we had actually been flipping the $$q$$ coin, bringing us closer to the conclusion that this is the $$q$$ coin. However, if we reverse the experiment, i.e. if we use the same critical value $$c$$ and say that if $$x_p < c$$ then we reject the null hypothesis that $$q$$ is our coin, then the power and significance of the test are different for each coin, which is also awkward.

Above, the pink region is the probability that $$X_p$$ ends in the critical region, where $$x_p > c$$, assuming the null hypothesis that we have the $$p$$ coin. This is also the Type I Error rate (a.k.a. false positive) – the probability that we end up falsely rejecting the null hypothesis, assuming that the null hypothesis is true.

Above, the green region is the power $$1-\beta$$, the probability that we get a result in the critical region $$x_p > c$$ assuming that the alternative hypothesis is true, that we have the $$q$$ coin. The blue-gray region is $$\beta$$, or the Type II Error rate (a.k.a. false negative) – the probability that we fail to reject the null hypothesis (that we have the $$p$$ coin) when what’s actually true is the alternative hypothesis (that we have the $$q$$ coin).
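To make these three quantities concrete, here is a small sketch that computes them from exact binomial tails rather than from the smooth curves in the figure. The values $$n = 20$$, $$p = 0.5$$, $$q = 0.7$$ come from the figure caption near the top; the cutoff $$c = 12$$ is a hypothetical choice between $$pn$$ and $$qn$$, and `upper_tail` is an illustrative helper.

```python
from math import comb


def upper_tail(n, prob, c):
    """Pr(X > c) for X ~ Binomial(n, prob)."""
    return sum(
        comb(n, h) * prob**h * (1 - prob) ** (n - h) for h in range(c + 1, n + 1)
    )


n, p, q = 20, 0.5, 0.7
c = 12  # hypothetical cutoff between pn = 10 and qn = 14

alpha = upper_tail(n, p, c)  # Type I error rate: reject the p coin when it is the p coin
power = upper_tail(n, q, c)  # power: reject the p coin when it is really the q coin
beta = 1 - power             # Type II error rate
print(alpha, power, beta)
```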

Now let us “reverse” the experiment with the same critical value – we want to test our null hypothesis that we have the $$q$$ coin:

We have $$x_p < c$$. We fail to reject the null hypothesis that we have the $$p$$ coin, and on the flip side we would reject the null hypothesis that we have the $$q$$ coin. But the rejection we failed to get came from the tougher test (the first one, with a small $$\alpha_p$$), while the rejection we did get came from the easier test (the second one, with a larger $$\alpha_q$$). In hypothesis testing, we would like to be conservative, so it is awkward to have failed a tougher test but "be ok with it" since we succeeded with an easier test. Common sense also says that something is strange when $$x_p$$ is closer to $$qn$$ than to $$pn$$ and yet we conclude that since $$x_p$$ is on the "$$p$$-side of $$c$$," we have the $$p$$ coin. In reality, we wouldn't take one result and apply two hypothesis tests to it. But we would like the one test procedure to make sense with whichever null hypothesis we start with, $$p$$ coin or $$q$$ coin (since it is arbitrary which null hypothesis we choose in the beginning, for we have no knowledge of which coin we have before we start the experiment).

What we can do, then, is to select $$c$$ so that the probability that $$X_p > c$$, under the hypothesis that we have the $$p$$ coin, is equal to the probability that $$X_q < c$$, under the hypothesis that we have the $$q$$ coin. In our setup, we have two binomial distributions, which are discrete, as opposed to the normal distributions above. In addition, binomial distributions, unless the mean is at $$n/2$$, are generally not symmetric, as can be seen in the very first figure, copied below as well, where the blue distribution is symmetric but the green one is not.

We can pretend that the blue distribution is the binomial distribution for the $$p$$ coin and the green distribution for the $$q$$ coin. The pmf of a binomial random variable, say $$X_p$$ (that generates Heads or Tails with probability of Heads $$p$$) is:

$${n \choose h} p^h (1-p)^{n-h}$$

where $$n$$ is the total number of flips and $$h$$ is the number of Heads among those flips. We let $$c$$ be the critical number of Heads that would cause us to reject the null hypothesis that the coin we have is the $$p$$ coin in favor of the alternative hypothesis that we have the $$q$$ coin. The area of the critical region, i.e. the probability that we get $$c$$ heads or more assuming the hypothesis that we have the $$p$$ coin, is:

$$Pr(X_p \geq c) = \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right]$$

And the reverse, the probability that we get $$c$$ heads or fewer assuming the hypothesis that we have the $$q$$ coin, is:

$$Pr(X_q \leq c) = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right]$$

So we want to set these two equal to each other and solve for $$c$$:

$$\sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right]$$

But since the binomial distribution is discrete, there may not be a $$c$$ that actually works. For large $$n$$, a normal distribution can approximate the binomial distribution. In that case, we can draw the figure below, which is two normal distributions, each centered on $$pn$$ and $$qn$$ (the means of the true binomial distributions), and since normal distributions are symmetric, the point at which the distributions cross will be our critical value. The critical regions for $$X_p$$ (to the right of $$c$$) and for $$X_q$$ (to the left of $$c$$) will have the same area.
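Because an exact solution may not exist, one way to explore the discrete case is a brute-force search for the integer $$c$$ that makes the two tail sums as close as possible. The sketch below does that; the function names and the example values $$n = 20$$, $$p = 0.3$$, $$q = 0.7$$ are assumptions for illustration, not part of the original question.

```python
from math import comb


def pmf(n, prob, h):
    """Pr(X = h) for X ~ Binomial(n, prob)."""
    return comb(n, h) * prob**h * (1 - prob) ** (n - h)


def tail_at_least(n, prob, c):
    """Pr(X >= c) for X ~ Binomial(n, prob)."""
    return sum(pmf(n, prob, h) for h in range(c, n + 1))


def tail_at_most(n, prob, c):
    """Pr(X <= c) for X ~ Binomial(n, prob)."""
    return sum(pmf(n, prob, h) for h in range(0, c + 1))


def most_balanced_cutoff(n, p, q):
    """Integer c in 0..n minimising |Pr(X_p >= c) - Pr(X_q <= c)|."""
    return min(
        range(n + 1),
        key=lambda c: abs(tail_at_least(n, p, c) - tail_at_most(n, q, c)),
    )


n, p, q = 20, 0.3, 0.7  # hypothetical example values
c = most_balanced_cutoff(n, p, q)
print(c, tail_at_least(n, p, c), tail_at_most(n, q, c))
```

In this mirror-image example ($$q = 1 - p$$) the two tails balance exactly at $$c = n/2$$; for other pairs the search only returns the closest integer.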

If we pretend that these normal distributions are binomial distributions, i.e. if we pretend that our binomial distributions are symmetric (i.e. we pretend that $$n$$ is going to be large enough that both our binomial distributions of $$X_p$$ and $$X_q$$ are symmetric enough), then to find $$c$$ we can find the value on the horizontal axis at which, i.e. the number of Heads at which, the two binomial probability distributions are equal to each other.

$${n \choose c} p^c (1-p)^{n-c} = {n \choose c} q^c (1-q)^{n-c}$$
$$p^c (1-p)^{n-c} = q^c (1-q)^{n-c}$$
$$\left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n-c} = 1$$
$$\left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n} \left({1-q \over 1-p}\right)^c = 1$$
$$\left({p(1-q) \over q(1-p)}\right)^c = \left({1-q \over 1-p}\right)^{n}$$
$$c \cdot log \left({p(1-q) \over q(1-p)}\right) = n \cdot log \left({1-q \over 1-p}\right)$$

$$c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right)$$
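The closed form for $$c$$ can be checked by plugging it back into both sides of the first equation of the derivation. In the sketch below the shared binomial coefficient is dropped and logarithms are compared, since $$c$$ is generally not an integer; `crossing_point` and `log_kernel` are illustrative names.

```python
from math import isclose, log


def crossing_point(n, p, q):
    """Number of heads c at which the p-coin and q-coin pmfs are equal."""
    return n * log((1 - q) / (1 - p)) / log(p * (1 - q) / (q * (1 - p)))


def log_kernel(n, prob, c):
    """log of prob^c (1 - prob)^(n - c), i.e. the pmf without n-choose-c."""
    return c * log(prob) + (n - c) * log(1 - prob)


n, p, q = 20, 0.5, 0.7  # values from the figure caption
c = crossing_point(n, p, q)
print(c)
print(isclose(log_kernel(n, p, c), log_kernel(n, q, c)))
```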

A binomial random variable $$X_p$$ has mean $$pn$$ and standard deviation $$\sqrt{np(1-p)}$$. For a normal random variable $$X_{\textrm{norm}}$$ with mean $$\mu_{\textrm{norm}}$$ and standard deviation $$\sigma_{\textrm{norm}}$$, the value $$c_{\alpha} = \mu_{\textrm{norm}} + 1.645\sigma_{\textrm{norm}}$$ is the value where the area from $$c_{\alpha}$$ to infinity is $$0.05 = \alpha$$. Thus, $$c_{\alpha}$$ is the critical value for a normal random variable, where $$Pr(X_{\textrm{norm}} > c_{\alpha}) = 0.05$$. So for a binomial random variable $$X_p$$, approximated by a normal distribution, we would have $$c_{\textrm{binomial, }\alpha} = pn + 1.645\sqrt{np(1-p)}$$.

Thus, we have that this critical value for a binomial random variable $$X_p$$:

$$c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right)$$

must also satisfy

$$c \geq pn + 1.645\sqrt{np(1-p)}$$

for the area to the right of $$c$$ to be $$\leq 0.05$$. The smallest $$n$$ for which this holds is the one where $$c$$ equals the critical value $$c_{\textrm{binomial, }\alpha} = pn + 1.645\sqrt{np(1-p)}$$.

Since we are given the values of $$p$$ and $$q$$, we would plug in those values to find the required $$n$$ needed to reach this condition for the critical value. So we have

$$n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) = pn + 1.645\sqrt{np(1-p)}$$

$$\sqrt{n} = 1.645\sqrt{p(1-p)} / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p \right]$$

$$n = 1.645^2p(1-p) / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p\right]^2$$

For example, if $$p = 0.3$$ and $$q = 0.7$$, we have $$n = 14.2066$$, or rather, $$n \geq 14.2066$$.

Wolfram Alpha calculation of above, enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.3, q = 0.7
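The same expression can be evaluated in Python. This is a direct transcription of the formula for $$n$$ above, with $$z = 1.645$$ hard-coded as the default to match the text; the function name is illustrative.

```python
from math import log


def min_flips_normal_approx(p, q, z=1.645):
    """n from the closed-form expression above (normal approximation).

    z = 1.645 is the one-tailed 5% cutoff used in the text.
    """
    crossing_ratio = log((1 - q) / (1 - p)) / log(p * (1 - q) / (q * (1 - p)))
    return z**2 * p * (1 - p) / (crossing_ratio - p) ** 2


print(round(min_flips_normal_approx(0.3, 0.7), 4))  # 14.2066
```

The function can be reused for any other pair of known probabilities.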

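Note that if we switch the values so that $$p = 0.7$$ and $$q = 0.3$$, we obtain the same $$n_{\textrm{min}}$$. This makes sense since our value for $$n_{\textrm{min}}$$ depends on $$c$$, and $$c$$ is the value on the horizontal axis at which the two normal distributions from above (approximations of binomial distributions) with means at $$pn$$ and $$qn$$ cross each other. Here $$q = 1 - p$$, so the two distributions are mirror images of each other and the whole problem is symmetric. For a pair that is not a mirror image, swapping the $$p$$’s and $$q$$’s in the equation for $$n$$ changes the result slightly, because the formula uses the standard deviation of the null-hypothesis coin only: $$p = 0.5$$ and $$q = 0.4$$ give $$n \approx 266.96$$, while $$p = 0.4$$ and $$q = 0.5$$ give $$n \approx 263.345$$.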

So if we generate a sample such that the number of samples is $$n \geq 14.2066$$, we can use our resulting $$x_p$$ and make a hypothesis test regarding if we have the $$p$$ or $$q$$ coin with $$\alpha = 0.05$$ significance level.

If $$p$$ and $$q$$ are closer, say $$p = 0.4$$ and $$q = 0.5$$, then we have $$n \geq 263.345$$. This makes intuitive sense: the closer the probabilities of the two coins are, the more times we have to flip our coin to be sure that we have one of the coins rather than the other. To be more precise, the smaller the effect size is, the larger the sample size we need in order to reach the same certainty about a result. An example of an effect size is Cohen’s d, where:

$$\textrm{Cohen’s d } = {\mu_2 - \mu_1 \over \textrm{StDev (or pooled StDev)}}$$

Wolfram Alpha calculation of above for $$n$$ with $$p = 0.4$$ and $$q = 0.5$$, or enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.4, q = 0.5

The answer linked below, from the page where the question was originally asked, finds the exact values for the two $$n_{\textrm{min}}$$ using R with the actual binomial distributions (not using normal distributions as approximations):

https://math.stackexchange.com/a/2033739/506042

Due to the discrete-ness of the distributions, the $$n_{\textrm{min}}$$’s found are slightly different: $$n_{\textrm{min}} = 17$$ for the first case and $$n_{\textrm{min}} = 268$$ for the second case. I.e., the difference comes from using the normal distribution as an approximation for the binomial distribution.

## Test Coin

https://math.stackexchange.com/questions/2033370/how-to-determine-the-number-of-coin-tosses-to-identify-one-biased-coin-from-anot/2033739#2033739

Suppose there are two coins whose probabilities of landing Heads are $$p$$ and $$q$$, respectively, where $$p, q \in [0,1]$$ and both values are given and known. If you are free to flip one of the coins, how many times $$n$$ do you have to flip it to decide, at some significance level $$\left( \textrm{say } \alpha = 0.05 \right)$$, whether it’s the $$p$$ coin or the $$q$$ coin that you’ve been flipping?

The distribution of heads after $$n$$ flips for a coin will be a binomial distribution with means at $$pn$$ and $$qn$$.

Two binomial distributions, n = 20. The means are pn = 10 and qn = 14.

The Usual Hypothesis Test

In the usual hypothesis test, for example with data $$x_i, i=1, 2, 3, \dots, n$$ from a random variable $$X$$, to test whether the mean $$\mu$$ is $$\leq$$ some constant $$\mu_0$$:

\begin{align}
H_0 & : \mu \leq \mu_0 ( \textrm{ and } X \sim N(\mu_0, \textrm{ some } \sigma^2 ) ) \\
H_1 & : \mu > \mu_0
\end{align}

If the sample mean of the data points $$\overline{x}$$ is “too large compared to” $$\mu_0$$, then we reject the null hypothesis $$H_0$$.

If we have the probability distribution of the random variable (even if we don’t know the true value of the mean $$\mu$$), we may be able to know something about the probability distribution of a statistic obtained from manipulating the sample data, e.g. the sample mean.  This, the probability distribution of a statistic (obtained from manipulating sample data), is called the sampling distribution.  And a property of the sampling distribution, the standard deviation of a statistic, is the standard error.  For example, the standard error of the mean is:

Sample Data: $$x$$ $$\qquad$$ Sample Mean: $$\overline{x}$$

Variance: $$Var(x)$$ $$\qquad$$ Standard Deviation: $$StDev(x) = \sigma(x) = \sqrt{Var(x)}$$

Variance of the Sample Mean (using the independence of the $$x_i$$): $$Var( \overline{x} ) = Var \left( \frac{1}{n} \sum_{i=1}^{n}{ x_i } \right) = \frac{1}{n^2} \sum_{i=1}^{n} { Var(x_i) } = \frac{1}{n^2} n Var(x) = \frac{1}{n} Var(x) = {\sigma^2 \over n}$$

Standard Deviation of the Sample Mean, Standard Error of the Mean: $$\frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}}$$
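A short simulation can illustrate the $$\sigma / \sqrt{n}$$ result. The sketch below assumes normally distributed data with hypothetical values $$\sigma = 2$$ and $$n = 25$$, draws many samples, and compares the spread of the resulting sample means with the formula.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(0)  # fixed seed so the run is repeatable

sigma, n, repetitions = 2.0, 25, 5000  # hypothetical values

# Draw many samples of size n and record each sample mean.
sample_means = [
    fmean([random.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(repetitions)
]

empirical_se = pstdev(sample_means)  # spread of the simulated sample means
theoretical_se = sigma / sqrt(n)     # sigma / sqrt(n) = 0.4 here
print(empirical_se, theoretical_se)
```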

Thus, if the observations are i.i.d. (independent and identically distributed), then the sample mean $$\overline{x}$$ we obtain from the data has a standard deviation of $$\frac{\sigma}{\sqrt{n}}$$. This standard deviation, being smaller than the standard deviation $$\sigma$$ of the original $$X$$, means that $$\overline{X}$$ is distributed more narrowly around the mean than $$X$$. This means $$\overline{X}$$ gives us a better ability to home in on what the data says about $$\mu$$ than $$X$$ does, i.e. a narrower, more precise “range of certainty” from the sample data, at the same significance level.

Thus, given our sample $$x_i, i = 1, \dots, n$$, we can calculate the statistic $$\overline{x} = \frac{1}{n} \sum_{i=1}^{n} {x_i}$$, our sample mean.  From the data (or given information), we would like to calculate the standard error of the mean, the standard deviation of this sample mean as a random variable (where the sample mean is a statistic, i.e. can be treated as a random variable): $$\frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}}$$. This standard error of the mean gives us a “range of certainty” around the $$\overline{x}$$ with which to make an inference.

A. If we know/are given the true standard deviation $$\sigma$$

If we are given the true standard deviation $$\sigma$$ of the random variable $$X$$, then we can calculate the standard error of the sample mean: $$\frac{\sigma}{\sqrt{n}}$$.  So under the null hypothesis $$H_0: \mu \leq \mu_0$$, we want to check if the null hypothesis can hold against a test using the sample data.

A.a Digression about $$H_0: \mu \leq \mu_0$$ and $$H_0: \mu = \mu_0$$

If the $$\mu$$ we infer from the sample data is “too extreme,” in this case “too large” compared to $$\mu_0$$, i.e. the test statistic is greater than some critical value that depends on $$\mu_0$$, call it $$c(\mu_0)$$, we reject the null hypothesis. If we check a $$\mu_1$$ with $$\mu_1 < \mu_0$$ (which our null hypothesis $$H_0: \mu \leq \mu_0$$ allows), our critical value $$c(\mu_1)$$ will be less extreme than $$c(\mu_0)$$ (in other words $$c(\mu_1) < c(\mu_0)$$), and thus it would be "easier to reject" the null hypothesis if using $$c(\mu_1)$$. Rejecting a null hypothesis ought to be conservative, since rejecting is reaching a conclusion, so we would like the test to be "the hardest to reject" that we can. The "hardest to reject" part of the range of $$H_0: \mu \leq \mu_0$$ is $$\mu = \mu_0$$, where the critical value $$c(\mu_0)$$ is the largest possible critical value. Testing a $$\mu_1 < \mu_0$$ would mean that we may obtain a test statistic that is too extreme/large for $$\mu_1$$ (i.e. $$t > c(\mu_1)$$) but not too extreme/large for $$\mu_0$$ (i.e. $$t \not> c(\mu_0)$$). But if we test using $$\mu_0$$ and the test statistic is extreme/large enough that we reject the null hypothesis of $$\mu = \mu_0$$, that would also reject all other null hypotheses using $$\mu_1$$ where $$\mu_1 < \mu_0$$.

So under the null hypothesis $$H_0: \mu \leq \mu_0$$ or the “effective” null hypothesis $$H_0: \mu = \mu_0$$, we have that $$X \sim N(\mu_0, \sigma^2)$$ with $$\sigma$$ known, and we have that $$\overline{X} \sim N(\mu_0, \sigma^2/n)$$.  This means that

$$\frac{\overline{X} - \mu_0} { \sigma / \sqrt{n} } \sim N(0, 1)$$

Then we can use a standard normal table to find where on the standard normal is the $$\alpha = 0.05$$ cutoff – for a one-tailed test, the cutoff is at $$Z_{\alpha} = 1.645$$ where $$Z \sim N(0, 1)$$.  So if

$$\frac{\overline{X} - \mu_0} { \sigma / \sqrt{n} } > 1.645 = Z_{\alpha}$$,

then this result is “too large compared to $$\mu_0$$” so we reject the null hypothesis $$H_0: \mu \leq \mu_0$$. If $$\frac{\overline{X} - \mu_0} { \sigma / \sqrt{n} } \leq 1.645 = Z_{\alpha}$$, then we fail to reject the null hypothesis $$H_0: \mu \leq \mu_0$$.
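The decision rule for the known-$$\sigma$$ case can be sketched as a small function. The function name and the observations are made up for illustration, and the cutoff is computed with the standard library’s `NormalDist` rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist, fmean


def one_sided_z_test(xs, mu0, sigma, alpha=0.05):
    """Test H0: mu <= mu0 against H1: mu > mu0 when sigma is known.

    Returns (z, cutoff, reject).
    """
    n = len(xs)
    z = (fmean(xs) - mu0) / (sigma / sqrt(n))
    cutoff = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return z, cutoff, z > cutoff


data = [10.9, 9.8, 10.4, 11.1, 10.2, 10.7, 9.9, 10.6]  # made-up observations
z, cutoff, reject = one_sided_z_test(data, mu0=10.0, sigma=1.0)
print(z, cutoff, reject)
```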

B. If we don’t know the standard deviation $$\sigma$$

If we don’t know the value of the standard deviation $$\sigma$$ of our random variable $$X \sim N( \mu, \sigma^2 )$$ (which would be somewhat expected if we already don’t know the value of the mean $$\mu$$ of $$X$$), then we need to estimate $$\sigma$$ from our data $$x_i, i = 1, 2, \dots, n$$. We can estimate $$\sigma$$ by taking the sample standard deviation of $$x_i, i = 1, \dots, n$$ by doing $$s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } }$$, or rather by computing the sample variance $$s^2 = { \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } }$$ and then taking the square root of that.

However, note that while the sample variance $$s^2$$ is an unbiased estimator of $$\sigma^2$$:

\begin{align}
\mathbb{E}\left[s^2\right] & = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \right] = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ (x_i - \mu + \mu - \overline{x})^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ \left( (x_i - \mu) - (\overline{x} - \mu) \right)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ (x_i - \mu)^2 - 2 (x_i - \mu) (\overline{x} - \mu) + (\overline{x} - \mu)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) \sum_{i=1}^{n} { (x_i - \mu) } + \sum_{i=1}^{n} { (\overline{x} - \mu)^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) (n \overline{x} - n \mu) + n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 n (\overline{x} - \mu)^2 + n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \mathbb{E} \left[ (x_i - \mu)^2 \right] } - n \mathbb{E} \left[ (\overline{x} - \mu)^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \mathbb{E} \left[ x_i^2 - 2 \mu x_i + \mu^2 \right] } - n \mathbb{E} \left[ \overline{x}^2 - 2 \mu \overline{x} + \mu^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu \mathbb{E} [x_i] + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu \mathbb{E} [\overline{x}] + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu^2 + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu^2 + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \left( \mathbb{E} [x_i] \right)^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \left( \mathbb{E} [\overline{x}] \right)^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { Var(x_i) } - n Var(\overline{x}) \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \sigma^2 } - n \frac{\sigma^2}{n} \right) \\
& = \frac{1}{n-1} \left( n \sigma^2 - \sigma^2 \right) = \sigma^2
\end{align}

that does not allow us to say that the square root of the above estimator gives us an unbiased estimator for the standard deviation $$\sigma$$. In other words:

$$\mathbb{E}\left[s^2\right] = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \right] = \sigma^2$$

but

$$\mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \right] \neq \sigma$$

because the expectation function and the square root function do not commute:

$$\sigma = \sqrt{\sigma^2} = \sqrt{ \mathbb{E}[s^2] } \neq \mathbb{E}[\sqrt{s^2}] = \mathbb{E}[s]$$
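A simulation makes the gap between $$\mathbb{E}[s^2]$$ and $$\mathbb{E}[s]$$ visible. The sketch below assumes normal data with hypothetical values $$\sigma = 1$$ and a deliberately small $$n = 5$$, so that the effect is easy to see; `sample_variance` is an illustrative helper.

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the run is repeatable

sigma, n, repetitions = 1.0, 5, 20000  # hypothetical values


def sample_variance(xs):
    """Sample variance with the n - 1 denominator."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


variances = [
    sample_variance([random.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(repetitions)
]

mean_s2 = sum(variances) / repetitions                  # estimates E[s^2]
mean_s = sum(sqrt(v) for v in variances) / repetitions  # estimates E[s]
print(mean_s2, sigma**2)  # close to each other
print(mean_s, sigma)      # the average s falls short of sigma
```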

B.a The sample standard deviation $$s = \sqrt{s^2} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } }$$ is a biased estimator of $$\sigma$$

In fact, we can infer the direction of the bias of $$s$$. The square root function $$f(x) = \sqrt{x}$$ is a concave function. A function $$f$$ is concave if:

$$\forall x_1, x_2 \in X, \forall t \in [0, 1]: \quad f(tx_1 + (1 - t) x_2 ) \geq tf(x_1) + (1 - t) f(x_2)$$

The left-hand side of the inequality is the blue portion of the curve $$\{ f( \textrm{mixture of } x_1 \textrm{ and } x_2 ) \}$$ and the right-hand side of the inequality is the red line segment $$\{ \textrm{a mixture of } f(x_1) \textrm{ and } f(x_2) \}$$. We can see visually what it means for a function to be concave: between two arbitrary $$x$$-values $$x_1$$ and $$x_2$$, the blue portion is always $$\geq$$ the red portion.

Jensen’s Inequality says that if $$g(x)$$ is a convex function, then:

$$g( \mathbb{E}[X] ) \leq \mathbb{E}\left[ g(X) \right]$$

and if $$f(x)$$ is a concave function, then:

$$f( \mathbb{E}[X] ) \geq \mathbb{E}\left[ f(X) \right]$$

The figure above showing the concave function $$f(x) = \sqrt{x}$$ gives an intuitive illustration of Jensen’s Inequality as well (since Jensen’s Inequality can be said to be a generalization of the “mixture” of $$x_1$$ and $$x_2$$ property of convex and concave functions to the expectation operator). The left-hand side $$f(\mathbb{E}[X])$$ is like $$f( \textrm{a mixture of } X \textrm{ values} )$$ and the right-hand side $$\mathbb{E}\left[ f(X) \right]$$ is like $${\textrm{a mixture of } f(X) \textrm{ values} }$$ where the “mixture” in both cases is the “long-term mixture” of $$X$$ values that is determined by the probability distribution of $$X$$.

Since $$f(z) = \sqrt{z}$$ is a concave function, going back to our estimation of the standard deviation of $$X$$ using $$\sqrt{s^2}$$, we have
\begin{align}
f( \mathbb{E}[Z] ) & \geq \mathbb{E}\left[ f(Z) \right] \longrightarrow \\
\sqrt{\mathbb{E}[Z]} & \geq \mathbb{E}\left[ \sqrt{Z} \right] \longrightarrow \\
\sqrt{ \mathbb{E}[s^2] } & \geq \mathbb{E}\left[ \sqrt{s^2} \right] \longrightarrow \\
\sqrt{ Var(X) } & \geq \mathbb{E}\left[s\right] \\
\textrm{StDev} (X) = \sigma(X) & \geq \mathbb{E}\left[s\right] \\
\end{align}

Thus, $$\mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \right] \leq \sigma$$. So $$s$$ is a biased estimator that underestimates the true $$\sigma$$ on average.

However, the exact bias $$\textrm{Bias}(s) = \mathbb{E} [s] - \sigma$$ is not as clean to show.

https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

For normally distributed data, $$\frac{(n-1)s^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \sim$$ a $$\chi^2$$ distribution with $$n-1$$ degrees of freedom. In addition, $$\sqrt{ \frac{(n-1)s^2}{\sigma^2} } = \frac{\sqrt{n-1}s}{\sigma} = \frac{1}{\sigma} \sqrt{ \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \sim$$ a $$\chi$$ distribution with $$n-1$$ degrees of freedom. A $$\chi$$ distribution with $$k$$ degrees of freedom has mean $$\mu_{\chi} = \sqrt{2} \frac{\Gamma ( ^{(k+1)} / _2 ) } { \Gamma ( ^k / _2 )}$$, where $$\Gamma(z)$$ is the Gamma function, so with $$k = n-1$$ we have $$\mathbb{E} \left[ \frac{\sqrt{n-1}s}{\sigma} \right] = \mu_{\chi}$$.

https://en.wikipedia.org/wiki/Gamma_function

If $$n$$ is a positive integer, then $$\Gamma(n) = (n - 1)!$$. If $$z$$ is a complex number with positive real part, then $$\Gamma(z) = \int_{0}^{\infty}{x^{z-1} e^{-x} dx}$$, and the function extends by analytic continuation to all complex numbers except the non-positive integers. At the non-positive integers, $$\Gamma(z)$$ has poles, i.e. it goes to $$\infty$$ or $$-\infty$$.

From the mean of a $$\chi$$ distribution above, we have:

$$\mathbb{E}[s] = {1 \over \sqrt{n - 1} } \cdot \mu_{\chi} \cdot \sigma$$

and replacing $$k$$ with $$n-1$$ degrees of freedom for the value of $$\mu_{\chi}$$, we have:

$$\mathbb{E}[s] = \sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } \cdot \sigma$$

Wikipedia tells us that:

$$\sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } = c_4(n) = 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} + O(n^{-4})$$

So we have:

$$\textrm{Bias} (s) = \mathbb{E}[s] - \sigma = c_4(n) \cdot \sigma - \sigma = ( c_4(n) - 1) \cdot \sigma$$

$$= \left( \left( 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} + O(n^{-4}) \right) - 1 \right) \cdot \sigma = - \left( {1 \over 4n} + {7 \over 32n^2} + {19 \over 128n^3} + O(n^{-4}) \right) \cdot \sigma$$

Thus, as $$n$$ becomes large, the magnitude of the bias becomes small.

From Wikipedia, these are the values of $$n$$, $$c_4(n)$$, and the numerical value of $$c_4(n)$$:

\begin{array}{|l|r|c|}
\hline
n & c_4(n) & \textrm{Numerical value of } c_4(n) \textrm{ (rounded)} \\
\hline
2 & \sqrt{2 \over \pi} & 0.7979 \\
3 & {\sqrt{\pi} \over 2} & 0.8862 \\
5 & {3 \over 4}\sqrt{\pi \over 2} & 0.9400 \\
10 & {128 \over 105}\sqrt{2 \over \pi} & 0.9727 \\
100 & - & 0.9975 \\
\hline
\end{array}
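The values of $$c_4(n)$$ can be reproduced from the Gamma-function expression with a few lines of Python. This sketch uses `math.lgamma` so that large $$n$$ does not overflow, and compares against the first terms of the series quoted above; the function names are illustrative.

```python
from math import exp, lgamma, pi, sqrt


def c4(n):
    """E[s] / sigma for n i.i.d. normal observations (n >= 2)."""
    # Work with log-Gamma and exponentiate the difference to avoid overflow.
    return sqrt(2.0 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))


def c4_series(n):
    """First terms of the expansion quoted above, without the remainder term."""
    return 1 - 1 / (4 * n) - 7 / (32 * n**2) - 19 / (128 * n**3)


for n in (2, 3, 5, 10, 100):
    print(n, round(c4(n), 4), round(c4_series(n), 4))
```

The series is only accurate for larger $$n$$; for $$n = 2$$ it differs visibly from the exact value.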

Thus, for the most part, we don’t have to worry too much about this bias, especially with large $$n$$. So with $$\hat{\sigma} = s$$ we have
$$\mathbb{E}[\hat{\sigma}] = \mathbb{E}[s] = \mathbb{E}[\sqrt{s^2}] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \right] \approx \sigma$$

More rigorously, our estimator $$\hat{\sigma} = s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } }$$ is a consistent estimator of $$\sigma$$ (even though it is a biased estimator of $$\sigma$$).

An estimator is consistent if $$\forall \epsilon > 0$$:

$$\lim\limits_{n \to \infty} \textrm{Pr } (|\hat{\theta} - \theta| > \epsilon ) = 0$$

In other words, as $$n \to \infty$$, the probability that our estimator $$\hat{\theta}$$ “misses” the true value of the parameter $$\theta$$ by greater than some arbitrary positive amount (no matter how small) goes to $$0$$.

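For the sample standard deviation $$s$$ as our estimator of the true standard deviation $$\sigma$$ (i.e. let $$\hat{\sigma} = s$$), the shrinking bias alone does not settle the question, because $$s$$ is random and only its expectation equals $$c_4(n) \sigma$$. Using $$\mathbb{E}[s^2] = \sigma^2$$ and $$\mathbb{E}[s] = c_4(n) \sigma$$, the mean squared error is

$$\mathbb{E} \left[ (\hat{\sigma} - \sigma)^2 \right] = \mathbb{E}[s^2] - 2 \sigma \mathbb{E}[s] + \sigma^2 = 2 \sigma^2 \left( 1 - c_4(n) \right) \to 0 \textrm{ as } n \to \infty$$

so, by Chebyshev’s inequality,

$$\lim_{n \to \infty} \textrm{Pr } (|\hat{\sigma} - \sigma| > \epsilon) \leq \lim_{n \to \infty} { \mathbb{E} \left[ (\hat{\sigma} - \sigma)^2 \right] \over \epsilon^2 } = 0$$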

Since $$s$$ is a consistent estimator of $$\sigma$$, we are fine to use $$s$$ to estimate $$\sigma$$ as long as we have large $$n$$.

So back to the matter at hand: we want to know the sampling distribution of $$\overline{X}$$ to see “what we can say” about $$\overline{X}$$, specifically, the standard deviation of $$\overline{X}$$, i.e. the standard error of the mean of $$X$$. Not knowing the true standard deviation $$\sigma$$ of $$X$$, we use a consistent estimator of $$\sigma$$ to estimate it: $$s = \sqrt{{1 \over n-1} \sum_{i=1}^n {(x_i - \overline{x})^2}}$$.

So instead of the case where we know the value of $$\sigma$$,
$$\overline{X} \sim N(\mu, \sigma^2/n)$$
we have something like
$$\overline{X} \quad “\sim” \quad N(\mu, s^2/n)$$

When we know the value of $$\sigma$$, we have
$${ \overline{X} - \mu \over \sigma/\sqrt{n} } \sim N(0,1)$$
When we don’t know the value of $$\sigma$$ and use the estimate $$s$$, instead of having something like
$${ \overline{X} - \mu \over s/\sqrt{n} } \quad “\sim” \quad N(0,1)$$
we actually have the exact distribution (for normally distributed data):
$${ \overline{X} - \mu \over s/\sqrt{n} } \sim T_{n-1}$$
the Student’s t-distribution with $$n-1$$ degrees of freedom.

Thus, finally, when we don’t know the true standard deviation $$\sigma$$, under the null hypothesis $$H_0: \mu \leq \mu_0$$, we can use the expression above to create a test statistic
$$t = { \overline{x} - \mu_0 \over s/\sqrt{n} } \sim T_{n-1}$$
and check it against the critical value of the Student’s t-distribution with $$n-1$$ degrees of freedom, $$T_{n-1}$$, at some significance level, say $$\alpha = 0.05$$.

So if the test statistic exceeds our critical value at $$\alpha = 0.05$$:

$$t = { \overline{x} - \mu_0 \over s/\sqrt{n} } > T_{n-1, \alpha}$$

then we reject our null hypothesis $$H_0: \mu \leq \mu_0$$ at the $$\alpha = 0.05$$ significance level. If not, then we fail to reject our null hypothesis.
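
To make the recipe concrete, here is a small Python sketch of the one-sided test. The data are made up purely for illustration, the function name `t_statistic` is my own, and the critical value 1.833 is the upper 5% point of $$T_9$$ read from a standard t-table rather than computed (the standard library has no t-distribution quantile function).

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """t = (x_bar - mu0) / (s / sqrt(n)), with s the n-1 sample standard deviation."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

# Hypothetical sample of n = 10 observations (illustration only).
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.0, 5.7, 5.4, 5.2, 5.5]
mu0 = 5.0

# Assumed table value: upper 5% critical value of T with 9 degrees of freedom.
T_CRIT_9_DF = 1.833

t = t_statistic(sample, mu0)
print("t =", round(t, 3))
print("reject H0" if t > T_CRIT_9_DF else "fail to reject H0")
```

With a different $$n$$ the critical value has to be looked up again, since it depends on the $$n-1$$ degrees of freedom.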

we know the standard deviation of a data point

If under the null hypothesis $$H_0$$ we have a probability distribution, the sample data gives us a sample standard deviation, i.e. the standard error.

Back to our case with 2 coins.  Let’s say we want to test if our coin is the $$p$$ coin and let’s say we arbitrarily decide to call the smaller probability $$p$$, i.e. $$p < q$$.  We know that coin flips give us a binomial distribution, and we know the standard error of the mean proportion of heads from $$n$$ flips.  So a 0.05 significance level would mean some cutoff value $$c$$ where $$c > p$$.  But note that if $$c$$ ends up really big relative to $$q$$, e.g. it gets close to $$q$$ or even exceeds $$q$$, we are in a weird situation: the test would rarely reject the $$p$$ coin even when we are actually flipping the $$q$$ coin.

We can decide on some cutoff value $$c$$ between $$p$$ and $$q$$.  If we move $$c$$ around, the significance level and the power of the test, whether testing $$p$$ or $$q$$, both change.
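
The trade-off can be seen numerically. The sketch below assumes the $$n = 20$$, $$p = 0.5$$, $$q = 0.7$$ setup from the figure earlier in this post, and expresses the cutoff as a number of heads `k_cut` (so $$c$$ = `k_cut` / $$n$$); the function names are my own.

```python
from math import comb

def upper_tail(n, prob, k_min):
    """P(X >= k_min) for X ~ Binomial(n, prob)."""
    return sum(comb(n, k) * prob**k * (1 - prob)**(n - k)
               for k in range(k_min, n + 1))

def level_and_power(n, p, q, k_cut):
    """Reject 'this is the p coin' when the number of heads is >= k_cut."""
    alpha = upper_tail(n, p, k_cut)   # chance of rejecting when it really is the p coin
    power = upper_tail(n, q, k_cut)   # chance of rejecting when it is the q coin
    return alpha, power

for k_cut in range(11, 17):
    alpha, power = level_and_power(20, 0.5, 0.7, k_cut)
    print(k_cut, round(alpha, 4), round(power, 4))
```

Raising the cutoff lowers the significance level but also lowers the power, so for a fixed $$n$$ the two cannot both be improved by moving $$c$$ alone.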

## Barcodes and Modular Arithmetic

Barcodes

Here is an example of a UPC-A barcode, taken from Wikipedia:

UPC-A barcode example

A UPC-A barcode has 12 digits.  The first digit indicates how the number is generally used – for example, a particular industry might use a certain number for certain kinds of items.  The twelfth and last digit is a check digit.  It is computed from the other 11 digits when the barcode is created, and later on it lets us check whether the digits contain an error.

The check digit is constructed as follows:

We have 11 digits:

$$ABCDEFGHIJK$$

So let $$L$$ be the twelfth and last digit.  We sum the digits in the odd positions and multiply by 3, and add to that the sum of the digits in the even positions:

$$3\cdot(A+C+E+G+I+K)+(B+D+F+H+J)$$

We take this modulo 10, or the remainder of this when divided by 10.  If this is 0, that is our twelfth digit; if not, subtract this from 10 and that is our twelfth digit.

$$\text{Let}\ S = (3\cdot(A+C+E+G+I+K)+(B+D+F+H+J))$$

L=
\begin{cases}
0, & \text{if}\ S \equiv 0 \pmod{10} \\
10 - (S \bmod 10), & \text{otherwise}
\end{cases}
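
The construction above can be sketched in a few lines of Python. The function name `upc_check_digit` is my own, and the sample digits `03600029145` are, as far as I know, the first 11 digits of the barcode pictured in the Wikipedia article.

```python
def upc_check_digit(first_eleven):
    """Return the twelfth (check) digit L for an 11-digit string ABCDEFGHIJK."""
    digits = [int(ch) for ch in first_eleven]
    if len(digits) != 11:
        raise ValueError("expected exactly 11 digits")
    odd_sum = sum(digits[0::2])    # positions 1, 3, ..., 11 (A, C, E, G, I, K)
    even_sum = sum(digits[1::2])   # positions 2, 4, ..., 10 (B, D, F, H, J)
    s = 3 * odd_sum + even_sum
    # 0 if S is a multiple of 10, otherwise 10 - (S mod 10).
    return (10 - s % 10) % 10

print(upc_check_digit("03600029145"))
```

For this example $$S = 3 \cdot 14 + 16 = 58$$, so $$L = 10 - 8 = 2$$.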

So the logic is that if all 12 digits are correct, they satisfy the check digit equation:

$$3\cdot(A+C+E+G+I+K)+(B+D+F+H+J+L) \equiv 0 \pmod{10}$$

If there is an error in the 12th digit, of course the check digit equation won’t be satisfied.  If there is an error in any one single digit among the first 11 digits, then the check digit equation will also not be satisfied.  Thus, the check digit equation will detect any single digit error.

To see that a single digit error among the first 11 digits will cause the check digit equation to not be satisfied, first take a digit in one of the even positions.  If it is off from the correct value by $$d$$ (where $$d$$ is nonzero and between $$-9$$ and $$9$$), the left-hand side of the check digit equation changes by $$d$$, which is not a multiple of 10, so the equation no longer holds.  But what about the digits in the odd positions, whose sum is multiplied by 3?

Now take a digit in one of the odd positions and call it $$O$$.  The question then is, if the digit is off from the correct value by say $$d$$, how will that manifest itself in $$S$$ as well as $$S \pmod{10}$$?  The correct $$O$$ gives a term $$3 \cdot O$$ in $$S$$ while an incorrect digit of say $$O + d$$ gives a term $$3 \cdot O + 3 \cdot d$$.  The sum is therefore off by $$3 \cdot d$$, and because 3 and 10 share no common factor, $$3 \cdot d$$ is never a multiple of 10 for any nonzero $$d$$ between $$-9$$ and $$9$$.  So the error still shows up modulo 10.
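
The single-digit claim is small enough to check by brute force. This sketch (function names are my own, and the sample code `036000291452` is assumed to be a valid UPC-A number) corrupts each of the 12 positions with each of the 9 wrong digits and collects any corruption that still passes the check digit equation.

```python
def upc_is_valid(code):
    """Check 3*(A+C+E+G+I+K) + (B+D+F+H+J+L) == 0 (mod 10) for a 12-digit string."""
    digits = [int(ch) for ch in code]
    if len(digits) != 12:
        return False
    return (3 * sum(digits[0::2]) + sum(digits[1::2])) % 10 == 0

def undetected_single_errors(code):
    """Every one-digit corruption of `code` that still passes the check."""
    misses = []
    for pos, ch in enumerate(code):
        for wrong in "0123456789":
            if wrong != ch:
                corrupted = code[:pos] + wrong + code[pos + 1:]
                if upc_is_valid(corrupted):
                    misses.append(corrupted)
    return misses

print(upc_is_valid("036000291452"))
print(undetected_single_errors("036000291452"))

# The odd positions work because 3*d mod 10 is never 0 for d = 1..9.
print(sorted(3 * d % 10 for d in range(1, 10)))
```

The last line prints each of the digits 1 through 9 exactly once, which is another way of saying that multiplying by 3 just shuffles the nonzero remainders modulo 10.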

## Portfolio Insurance and Black Monday, October 19, 1987

On the thirtieth anniversary of Black Monday, the stock market crash of October 19th and 20th in 1987, there have been mentions of “portfolio insurance” having possibly exacerbated the crash.

Portfolio insurance, in principle, is exactly what you might expect it to be: if you own a stock, Stock A, you insure it with a put option on Stock A.  Your position becomes equivalent to a call option on Stock A until the put option expires, with the price of this position being the premium of the put option when you bought it.

If you are managing a portfolio on behalf of clients, though, and you just need to insure the portfolio up to a certain date, after which, say, you hand over the portfolio, then to buy American put options to insure the portfolio would be unnecessary.  European put options would suffice.  So let’s suppose that we are only interested in European options.

In the article that I cite at the bottom (Abken, 1987), it seems that at the time, buying put options as insurance had a few issues.  This is assuming that the portfolio we want to insure is a stock index: the S&P 500 index.  The issues were:

• It’s implied that only American options were available (which we would expect to have a premium over European options).

Thus, instead of using put options to insure the portfolio, the portfolio and put options are replicated by holding some of the money in the portfolio and some of it in bonds, Treasury bills, that we assume to provide us with the risk-free rate.

Without worrying about the math, the Black-Scholes equation gives us a way to represent our stock index S and put options P as:

$$S + P = S \cdot N_1 + K \cdot DF \cdot N_2$$
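
The symbols in this identity are not spelled out above, so the following Python sketch rests on my own reading of them: $$N_1 = N(d_1)$$ and $$N_2 = N(-d_2)$$ from the standard Black-Scholes formula for a European put on a non-dividend-paying asset, and $$DF = e^{-rT}$$ as the risk-free discount factor. The parameter values are made up for illustration.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def replicate_stock_plus_put(S, K, r, sigma, T):
    """Split S + P into a stock holding S*N(d1) and a bond holding K*DF*N(-d2)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    DF = exp(-r * T)                                   # assumed: continuous compounding
    put = K * DF * norm_cdf(-d2) - S * norm_cdf(-d1)   # Black-Scholes European put
    stock_part = S * norm_cdf(d1)
    bond_part = K * DF * norm_cdf(-d2)
    return put, stock_part, bond_part

# Hypothetical inputs: index at 100, strike 100, 5% rate, 20% volatility, 1 year.
put, stock_part, bond_part = replicate_stock_plus_put(100.0, 100.0, 0.05, 0.2, 1.0)
print(round(put, 4), round(stock_part, 4), round(bond_part, 4))
print(round(100.0 + put, 4), round(stock_part + bond_part, 4))
```

The two numbers on the last line agree, which is the identity above: holding the insured portfolio is the same as holding part of the money in the index and the rest in Treasury bills, with the split changing as $$S$$ and the time to expiry change.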

Source:

Abken, Peter A.  “An Introduction to Portfolio Insurance.”  Economic Review, November/December 1987: 2-25.

## Testing MathJax-LaTeX

https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference

At first, we sample $$f(x)$$ in the $$N$$ ($N$ is odd) equidistant points around $$x^*$$:

$f_k = f(x_k),\: x_k = x^*+kh,\: k=-\frac{N-1}{2},\dots,\frac{N-1}{2}$

where $$h$$ is some step.

Then we interpolate points $$(x_k,f_k)$$ by polynomial

\begin{equation}
\label{eq:poly} \tag{1}
P_{N-1}(x)=\sum_{j=0}^{N-1}{a_jx^j}
\end{equation}

Its coefficients $$a_j$$ are found as a solution of system of linear equations:
\label{eq:sys} \tag{asdf}

\label{eq:sys2} \tag{asdf2}

Backslash left and right parentheses:

$\left( \frac{1}{2} \right) \qquad ( \frac{1}{2} ) \\ ( \frac{1}{2} )$
$\left( \frac{1}{2} \right) \qquad \left( \frac{1}{2} )$

$$1 \quad \frac ab \quad 2 \quad \frac{c}{d} \quad 3 \quad {e \over f} \quad 4 \quad {}^g/_h \quad 5 \quad i/j \quad 6 \quad$$

$$1+1=2 \textrm{ centered equation } 1+1=2$$

$$1+1=2 \textrm{ left equation } 1+1=2$$

\begin{align}
1 + 1 & = 2.00000000 \textrm{ aligned to character}\\
& = 2.0000000000000000 \\
& = 1.99999999999 \\
\end{align}

Here are references to existing equations: \ref{eq:poly}, \eqref{eq:sys}.
Here is reference to non-existing equation \eqref{eq:unknown}.

X=
\begin{cases}
0, & \text{if}\ a=1 \\
1, & \text{otherwise}
\end{cases}

$$\lim_{x\to 1}$$

$$\lim_{x\to 1}$$

$$default, \it Italics, \bf bold, \sf sans serif, \tt typewriter, \rm default Roman, \it italics$$

$$horizontal spacing: back slash\ comma\, ! \! > \> : \: ; \; enspace \enspace quad \quad qquad \qquad end$$

$$hskip1point \hskip1pt hskip2point \hskip 2pt hskip10point \hskip10pt hskip3point \hskip 3pt 1ex \hspace{1ex} 1em \hspace{1em} 2em \hskip2em lengthofasdf \hphantom{<asdf>} backslash \ tilde ~ end$$

$$\tiny tiny$$

$$default$$

$$\scriptsize scriptsize \small small \normalsize normalsize or default, \large large$$

$$\normalsize normalsize or default, \large large$$

$$\Large Large \LARGE LARGE \huge huge \Huge Huge1$$

$$\Large \LARGE \huge \Huge Huge2$$

$$\Huge Huge3$$

## Brainteaser: The Monty Hall Problem

You are on a game show and presented with 3 doors.  Behind one is a car and behind the other 2 are goats.  You want to choose the door with a car behind it, as if you do so, you win the car.  You choose one door.  Then, the host opens one of the other doors, which reveals a goat behind it.  The host gives you a choice to either switch your door to the other one that’s still closed or keep your original choice.  Should you switch doors?

If your strategy is to stick to your original choice, your probability of choosing the door with the car behind it is 1/3.  Let’s see what happens if you switch.  So you choose a door, the host reveals one of the other doors with a goat behind it, and asks if you want to switch.  What has happened up to this point?  There’s a 1/3 chance that you picked the door with the car behind it, which means that if you switch, you are switching to a door with a goat behind it.  There’s a 2/3 chance that you picked a door with a goat behind it, which means that if you switch, you are switching to a car behind it.  So if your strategy is to always switch, there’s a 1/3 chance you get a goat in the end (because you happened to choose a car on your first choice, which has a probability of 1/3) and a 2/3 chance you get a car in the end (because you happened to choose a goat on your first choice, which has a probability of 2/3).  So the best strategy is to switch.
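
The 1/3 versus 2/3 split can be confirmed by enumerating every case exactly. This is a minimal sketch with my own function name; it assumes the car and the first pick are each uniformly random and that the host, when he has two goat doors to choose from, opens either with equal probability.

```python
from fractions import Fraction
from itertools import product

def monty_hall_exact():
    """Exact win probabilities (stay, switch) for the 3-door game."""
    stay = switch = Fraction(0)
    doors = range(3)
    for car, pick in product(doors, doors):        # 9 equally likely (car, pick) pairs
        goat_doors = [d for d in doors if d != pick and d != car]
        for opened in goat_doors:                  # host opens a goat door you did not pick
            weight = Fraction(1, 9) / len(goat_doors)
            other = next(d for d in doors if d not in (pick, opened))
            if pick == car:
                stay += weight
            if other == car:
                switch += weight
    return stay, switch

print(monty_hall_exact())
```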

The host revealing one of the doors gives you additional information.  Switching lets you use that information, assuming that it was unlikely that you got a car on your original choice.

Perhaps a more intuitive answer is if there are 100 doors.  One has a car behind it and 99 of them have goats behind them.  Choose one door, the host reveals another door with a goat behind it, and asks if you want to switch.  If you don’t switch, there’s a 1/100 chance that you chose the door with a car behind it.  But if you switch, assuming that you probably didn’t choose the right door on your first try (because 1/100 is small), now, you have a 1/98 chance of choosing the right door (because the host has revealed one door with a goat behind it and you’re giving up your original door).  Of course 1/98 is better than 1/100.  The exact probability of getting the right door with the switching strategy is 99/100 × 1/98 (probability that you chose the wrong door on the first try × probability of choosing the right door after accepting the offer to switch).  99/100 × 1/98 = 1/100 × (99/98) > 1/100 where 1/100 is the probability of getting the car with not switching, and so switching is better than not switching.
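
The same exact calculation works for any number of doors. In this sketch (function name my own) the host opens exactly one goat door, as in the paragraph above, and a switcher then picks uniformly among the doors that are neither the original pick nor the opened one. By symmetry the first pick is fixed to door 0.

```python
from fractions import Fraction

def win_probabilities(n_doors):
    """Exact (stay, switch) win probabilities when the host opens one goat door."""
    stay = switch = Fraction(0)
    pick = 0                                        # by symmetry, fix the first pick
    for car in range(n_doors):
        p_car = Fraction(1, n_doors)
        goat_doors = [d for d in range(n_doors) if d != pick and d != car]
        for _opened in goat_doors:                  # host's choice, uniform over goat doors
            p_branch = p_car / len(goat_doors)
            if car == pick:
                stay += p_branch
            else:
                # n_doors - 2 doors are left to switch to; exactly one hides the car.
                switch += p_branch * Fraction(1, n_doors - 2)
    return stay, switch

print(win_probabilities(3))
print(win_probabilities(100))
```

For 100 doors this gives 1/100 for staying and 99/9800 for switching, matching 99/100 × 1/98.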

## Brainteaser: 100 Prisoners in a Line and Extension

There are 100 prisoners.  An executioner tells them that tomorrow morning, he will line them up so that each of them is facing the back of the head of another prisoner, except for one prisoner at the end of the line.  In other words, prisoner 1 sees the backs of the heads of prisoners 2-100, prisoner 2 sees the backs of the heads of prisoners 3-100, …, prisoner 99 only sees the back of the head of prisoner 100, and prisoner 100 doesn’t see any prisoners in front of him.  The executioner tells them that he will put either a red or blue hat on each prisoner, then, starting with prisoner 1 (the one who can see 99 other prisoners in front of him), will ask each prisoner what color hat he is wearing.  The prisoner says a color, and if he is wrong, the executioner will silently kill him (prisoner 1 would be killed in a way that prisoners 2-100 won’t know if he was killed or not).  If he is right, the executioner will keep him alive.  Then the executioner will move to prisoner 2, ask the same question, kill him if he’s wrong, and keep him alive if he is right.  The executioner keeps doing this for every prisoner up to prisoner 100.  The prisoners are allowed to discuss together in the night what to do the next day.  What should their plan be in order to maximize the number of survivors?  For clarity, what should their plan be in order to maximize the number of survivors in the worst-case scenario (any random guess by a prisoner ends up being wrong)?

A sort of baseline answer is that prisoner 1 says the color of the hat worn by the prisoner right in front of him or her, thus sacrificing his or her life with a guess.  Prisoner 2 is guaranteed to live.  Repeat this for every pair of prisoners, giving us at least 50 prisoners alive at the end.  With 2 colors of hats, it makes intuitive sense that this would be an answer.  Unintuitively, this is far from the answer :-/

One key, or hint, that may remain unemphasized when this brainteaser is presented to people, is that when a prisoner makes and says his guess for his own color, that guess is heard by all the other prisoners.  If each guess is correct, that provides valuable information to the later prisoners.

Let’s say there are only 3 prisoners and we are the middle prisoner.  We see only one prisoner in front of us; say he is wearing a red hat.  The prisoner behind us, who speaks first, sees either 1 or 2 red hats.  So the prisoner behind us can use a code (e.g. “Red” = there is 1 red hat in front of me, “Blue” = there are 2 red hats in front of me) to tell us this.  This allows us to state our own hat color correctly.  Additionally, the prisoner in front of us will have gained two pieces of information: how many red hats there are among the last 2 prisoners and what hat the middle prisoner was wearing.  In other words, initially, there were either 1 or 2 red hats worn by the last two prisoners.  The middle prisoner can answer correctly after the first prisoner sacrifices himself or herself by announcing the code.  If the first prisoner announces that there are 2 red hats in front of him, the middle prisoner will definitely say that he himself is wearing a red hat, leaving 1 red hat for the last prisoner.  If the first prisoner announces that there is 1 red hat in front of him, and then the middle prisoner says “Red,” the last prisoner knows that they are Blue, while if the middle prisoner says “Blue,” the last prisoner knows that they are Red.

Let’s say there are 4 prisoners in a line.  The first prisoner sees 0, 1, 2, or 3 red hats in front of him or her.  But as long as the second prisoner announces his or her own hat color correctly, that will provide information for the later prisoners.  So how can the first prisoner announce information so that at least the second prisoner will get his or her own hat color correctly?  The second prisoner sees 0, 1, or 2 red hats in front of him or her.  The answer is that the first prisoner announces the oddness or evenness of the number of red hats he or she sees.  Whatever the second prisoner sees in front of him and whatever the first prisoner sees in front of him can only differ by 0 red hats or 1 red hat (whatever hat the second prisoner is wearing).  Thus, the key is, when there is a difference of at most one at each step, oddness and evenness conveys enough information to tell us what has changed.  So the first prisoner sacrifices himself by announcing, say, “Red” for an even number of red hats and “Blue” for an odd number of red hats that he sees in front of him.  This allows the second person to say his hat color correctly.  The third person knows whether the number of red hats among the last 3 people was odd or even, plus what exact hat color the second person has, plus, of course, what exact hat color the last person, the person in front of him, has.  Effectively, the third person knows the hat colors of all 3 people at the end of the line except his own, plus the information that the first person provides: the oddness or evenness of the number of red hats for those 3 people.  This is enough information for the third person to figure out what color hat he has.  It’s the same with the last person.

So with 100 people, the first person sacrifices himself by announcing, using a code, the oddness or evenness of one of the colors.  The second person has exact knowledge of the colors of the 98 people in front of him plus the oddness or evenness of one of the colors for all 99 people excluding the first person (i.e. the 98 people in front of him plus himself), giving him correct knowledge of his own color.  The third person now has exact knowledge of the color of the person behind him and the colors of the 97 people in front of him, plus the oddness or evenness of one of the colors for the 99 people that includes him, giving him enough information to figure out his own color.  This continues until the whole line is finished.  Thus, at least 99 out of 100 people can be saved with this strategy.
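
Here is a minimal simulation of the odd/even strategy for two colors, assuming the encoding 1 = red and 0 = blue and that the first prisoner’s spoken color stands for “odd number of red hats” (red) or “even” (blue). The function name is my own.

```python
import random

def survivors_with_parity(hats):
    """hats[0] speaks first and sees hats[1:]; each hat is 1 (red) or 0 (blue)."""
    n = len(hats)
    # Prisoner 1 announces the parity of the red hats he sees, spoken as a color.
    announced = sum(hats[1:]) % 2
    guesses = [announced]
    reds_heard = 0                       # reds announced by prisoners 2..i-1
    for i in range(1, n):
        reds_seen = sum(hats[i + 1:])    # reds still in front of prisoner i+1
        # Own hat is whatever makes heard + own + seen match the announced parity.
        guess = (announced - reds_heard - reds_seen) % 2
        guesses.append(guess)
        reds_heard += guess
    return sum(g == h for g, h in zip(guesses, hats))

random.seed(0)
worst = min(survivors_with_parity([random.randint(0, 1) for _ in range(100)])
            for _ in range(1000))
print(worst)
```

Only the first prisoner is ever at risk: his announcement matches his own hat about half the time by luck, and everyone after him is always right.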

Extension:

What if the executioner uses more colors?

In our above case, we had 2 colors, and we sacrificed 1 prisoner at the beginning of the line to announce the oddness or evenness of one of the colors for the 99 people he sees in front of him.  Since all prisoners know the number of prisoners that the first prisoner sees (99), everyone only needs to keep track of one of the colors, say red.  The first prisoner announces the oddness or evenness of red, and each subsequent prisoner counts how many reds there are in the 99 to see if they also have a red hat or not.

If we have 3 colors, the first prisoner that can be saved would see x prisoners in front with 3 different colors and needs to figure out what color hat he has on.  Extending the strategy from above, if we sacrifice two prisoners before him, they can announce the oddness or evenness of two of the colors.  This is enough information for the first prisoner we save to work out what color hat he has.  All subsequent prisoners will then have exact knowledge of the hat colors of all prisoners that can be saved except for their own, which they deduce from the oddness or evenness of the 2 colors that the first two sacrificed prisoners announced.  So in this case, we sacrifice 2 prisoners at the start and the 98 subsequent prisoners can be saved.

Let us apply the same logic to more colors.  If the executioner uses y different colors where 1 ≤ y ≤ 100, the first y – 1 prisoners sacrifice themselves by announcing the oddness or evenness of y – 1 colors.  The remaining 100 – (y – 1) prisoners will have enough information to correctly state their hat color.  If the executioner uses more colors than there are prisoners, we don’t have enough prisoners to sacrifice to convey accurate information about the oddness or evenness of the colors to the prisoners at the end.  In addition, we can always default back to the “baseline” solution, where each pair works together by sacrificing one prisoner (who simply announces the color of the hat in front of him) and saving the other one (who simply says the color that was announced by the prisoner before him), and guarantee at least 50 prisoners saved.  Thus, for 2 ≤ y ≤ 50, the “sacrifice for odd or even” strategy saves 99 down to 51 people.  For y = 51, the strategy saves 100 – 50 = 50 people, which is the same as the result for the “default pair sacrifice” strategy.  For y > 51 (and even if y > 100), the “default pair sacrifice” strategy can always save 50 people and becomes better than the “sacrifice for odd or even” strategy.
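
The extension can be simulated the same way. This sketch is my own reading of the strategy described above: colors are numbered 0 to y – 1, the j-th sacrificed prisoner encodes the oddness or evenness of color j among the prisoners who come after all the announcers, and the last color needs no announcement because it is what remains when no parity is off. Only the prisoners after the announcers are counted, since the announcers survive only by luck.

```python
import random

def guaranteed_survivors(hats, n_colors):
    """hats[0] speaks first; colors are 0..n_colors-1. Count non-announcers who guess right."""
    k = n_colors - 1                     # number of sacrificed announcers
    group = list(hats[k:])               # the prisoners the strategy tries to save
    # Announcer j conveys the parity of color j within the group.
    parities = [group.count(color) % 2 for color in range(k)]
    heard = []                           # answers already given within the group
    saved = 0
    for i, own in enumerate(group):
        known = heard + group[i + 1:]    # every hat in the group except his own
        guess = n_colors - 1             # default: the color with no announced parity
        for color in range(k):
            if known.count(color) % 2 != parities[color]:
                guess = color            # this color's parity is off by one: it is mine
                break
        heard.append(guess)
        saved += (guess == own)
    return saved

random.seed(0)
for y in (2, 3, 10, 50, 51, 60):
    hats = [random.randrange(y) for _ in range(100)]
    print(y, guaranteed_survivors(hats, y), "pair strategy guarantees 50")
```

The printed counts equal 100 – (y – 1), so the crossover with the pair strategy’s guaranteed 50 happens at y = 51.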

100 people are in a room.

1. All 100 of them are perfect logicians.
2. They are told that at least one person in the room has blue paint on their forehead.
3. They are told that once you deduce that you have blue paint on your forehead, the next time that the lights are turned off, leave the room.

All 100 people have actually had their foreheads painted blue (but of course, each of them doesn’t know this at this point – they can only see the other people’s foreheads).  The light is turned off, then on, then off, on, etc.  What happens?

So each person sees 99 other people with blue paint on their heads.  While this is the situation we begin with, it doesn’t seem to help with solving the problem at all.  The key for this problem is to start as small as possible and then expand.

Start with 1 person.  1 person in a room sees 0 other people.  Thus, if there is at least 1 person in the room with blue paint, he or she must be it.  The light goes off, and then on, and we see 0 people in the room, as the person has left.

Let’s say we have 2 people.  Put ourselves into the shoes of one of them.  They see 1 person in the room with blue paint on their forehead, and don’t know if there is blue paint on their own forehead.  But if there was no blue paint on their forehead, then the other person should deduce that they must be the one with blue paint on their forehead, and will be gone by the next light.  The light is turned off, then on.  Since both people see the other person with blue paint, both remain.  Now, each person knows that the other person looked at their forehead and saw blue paint, and so each person knows that they have blue paint on their own forehead.  The light turns off and on, and there are 0 people in the room.

I think you know where this is going (although I find the logic the most difficult from here).  3 people in the room.  Each person sees 2 other people with blue paint on their foreheads.  The additional key here is, each person needs to think, “What if I don’t have blue paint?  If what happens then is a contradiction, then I must have blue paint.”  Choosing one person’s perspective – our “first” person – we first posit that we don’t have blue paint.  In that case, each of the other 2 people sees 1 person without blue paint and 1 person with blue paint.  Our existence as someone without blue paint doesn’t matter in their calculations.  Each of them thinks, “There is one other person in this room with blue paint.  If they see me without blue paint as well, then they should disappear by the next light.”  The light turns off, then on.  All 3 people are still there.  So each of the other 2 people thinks, “Since that other person didn’t leave, I must have blue paint.  So I will leave by the next light.”  The light turns off and on.  But since the truth is that all 3 people have blue paint, the other 2 people won’t disappear.  Instead, each of them is thinking the same thing about the other 2 people in the room that they see have blue paint on their foreheads.  Everyone waited two turns to see if the other people would make a move.  Since they didn’t, everyone has found a contradiction to “I don’t have blue paint,” and thus everyone deduces that they have blue paint on their own forehead.  Thus, the third time that the light goes off and on, the 3 people have left the room.

4 people in the room.  Assume you don’t have blue paint, so your being there doesn’t affect the others’ logic.  There are 3 people wondering if they have blue paint and they each see 2 other people with blue paint.  After 3 turns of the light going off and on, they should all leave.  If they don’t, we have a contradiction, so we have blue paint.  So on the 4th light, all 4 people leave.

5 people in the room.  Described another way: Let’s say we don’t have blue paint.  There are 4 other people with blue paint.  Let’s label them A, B, C, and D.  D is wondering if he or she has blue paint, looking at A, B, and C.  D first assumes he has no paint and is thinking, “C is thinking if he doesn’t have blue paint, then after 2 turns, A and B will disappear.”  After 2 turns, A and B remain.  D is thinking, “So now, C will conclude that he has blue paint.  So on the 3rd turn, A, B, and C should leave.”  After the 3rd turn, A, B, and C remain.  D is thinking, “OK, so there’s a contradiction to the assumption that I don’t have blue paint.  Thus, I have blue paint, and will disappear on the 4th turn.”  On the 4th turn, we see that A, B, C, and D still remain.  Thus, we have a contradiction to our first assumption that we have no blue paint.  We have blue paint, so on the 5th turn, we leave.  Everyone else also has the same logic process, so on the 5th turn, everyone leaves.

If there are 100 people in the room, all with blue paint on their foreheads, first assume that you don’t have blue paint on your forehead.  So then, your existence shouldn’t matter to the other 99 people’s logic.  Let’s label us A.  There are 100 people in the room: A, B, C, …, X, Y, Z, AA, AB, …, CV.  Person A first assumes they have no paint, and thinks, “B must be thinking, if I don’t have paint, then, C would think, if I don’t have blue paint… etc.”  Basically, we are testing the assumption that everyone first assumes that they themselves don’t have blue paint on their forehead.  It doesn’t make intuitive sense since anyone can see that there are at least 99 other people with paint, but it’s the key step.  Assume, what if everyone from A to CV thought that they didn’t have blue paint?  Or rather that A assumes they don’t have blue paint and that B assumes that B doesn’t have blue paint and B assumes that … CU assumes CU doesn’t have blue paint and that CV assumes that they don’t have blue paint?  Well, this is a contradiction, because at least 1 person must have blue paint.  Now, let’s assume A to CU thinks that they don’t have blue paint and CU sees CV has blue paint and must assume that CV sees everyone else with no paint.  After 1 turn, CV doesn’t leave (because it’s not true that the other 99 people don’t have blue paint), and thus we have a contradiction and CU must believe that they have blue paint on their forehead as well.  After turn 2, CU doesn’t leave though (because it’s not true that the 98 other people other than CV and CU don’t have blue paint), so we have a contradiction and CT must believe that they have blue paint.  Keep going until turn 99, where B doesn’t leave because it’s not true that A doesn’t have blue paint (if B saw that A doesn’t have blue paint, B should have left on turn 99).  We have a contradiction, so A concludes that they have blue paint, and so on turn 100, everyone leaves.

It’s a lot easier to rely on the formula we built from the smaller examples that “With a room of x people, they all leave at once after x turns.”  But I find the intuition disappears with large numbers.  The above paragraph is an attempt to describe the intuition, the key being that we assume that all x people assume that they don’t have blue paint, and then one by one contradict that (because in reality, everyone has blue paint), until we’ve contradicted all cases down to 1 person assuming they have no paint.  Once that is contradicted on the xth turn, after that, everyone leaves at once, since everyone has the same logic process.
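
The formula can be written as a tiny simulation. Note the assumption: rather than modeling the nested “what does B think C thinks” reasoning, this sketch encodes only the rule that the induction produces, namely that a perfect logician who sees m blue foreheads leaves when the light goes off for the (m + 1)th time if nobody has left before then. The function name is my own.

```python
def first_exit(has_blue):
    """Return (turn, people_who_leave) for the first turn on which anyone leaves."""
    if not any(has_blue):
        raise ValueError("the puzzle's premise requires at least one blue forehead")
    n = len(has_blue)
    turn = 0
    while True:
        turn += 1
        # Person i sees the blue foreheads of everyone except themselves.
        leavers = [i for i in range(n)
                   if turn == 1 + sum(has_blue[j] for j in range(n) if j != i)]
        if leavers:
            return turn, leavers

print(first_exit([True]))
print(first_exit([True, True, False]))
print(first_exit([True] * 100)[0])
```

In the mixed example the two painted people leave on turn 2, one turn before the unpainted person (who sees 2 blue foreheads) would have concluded anything, which is why unpainted people never leave by mistake.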

There are 3 people placed in a room.  They all have perfect logic.  The 3 people are told by a host that a number has been written on each of their foreheads.  Each of the 3 numbers is unique, they are all positive, and they relate to each other such that A + B = C (i.e. one is the sum of the other two).  In the room, each person can only see the other two people’s numbers, as they cannot see their own foreheads.

Suppose you are one of the 3 and you see one person with “20” on their forehead and the other person with “30.”  The host asks you, then the person with “20,” and then the person with “30” what number is on their heads and all 3 say that they don’t know.  The host then asks, again, you, then the person with “20,” and then the person with “30” what number is on their head, and all 3 answer correctly.  How does this happen?

The key to this brainteaser is to calculate the logic of each person’s point of view, i.e. put yourself in each of their shoes.  The annoying part of solving this brainteaser, then, is having to keep track of 3 different points of view.

“First” person: If you see “20” and “30,” that means you are either 10 or 50.  So you don’t know what’s on your forehead among these two numbers.

The “20” person: You see either 1.) “30” and “10” or 2.) “30” and “50.”  In case 1.) you are either 20 or 40.  In case 2.) you are either 20 or 80.  So you don’t know.

The “30” person: You see either 1.) “20” and “10” or 2.) “20” and “50.”  In case 1.) you must be 30 because you cannot be 10 as well as the “First” person.  So the key here is that if you see one person has number “x” and another has number “2x,” you know you cannot also have “x” on your forehead.  You must be “3x.”  So in this case, the “30” person would know the answer that he or she has 30 on his or her head.  In case 2.) the “30” person has either 30 or 70, and so he or she wouldn’t know.

Since after the first round, everyone answered that he or she did not know, that means that we cannot have the “30” person’s case 1.), which is that he sees “20” and “10.”  In other words, our “First” person cannot have 10.  He has 50 on his forehead.  So when the host asks the “First” person the second time, he or she will answer 50.

The most illuminating and clean part of the problem is just up to here, but in an attempt for completeness, I kept going.

From the “20” person’s point of view, we assume that he or she is able to figure out the above sort of logic on his or her own.  What the “20” person sees is “30” and “50,” which means that he or she is either 20 or 80.  Somehow, the “50” person figured out on his or her own on the second round of questioning that they have 50 on their head.  The logic is that in order to find out what your number is on the second round, you are using someone’s “I don’t know” answer in the first round of questioning.  So if the “20” person indeed has 20 on his or her head, they can deduce that the “50” person is able to figure out all the above and that his or her number is 50 on the second round.  If the “20” person has 80 instead, the “First” person sees “80” and “30” and is thus wondering if his or her number is 50 or 110, and the “30” person sees “80” and “50” and is wondering if they’re 30 or 130.  In none of these cases is a person announcing that they are not seeing an “x” and “2x” situation (which is what the “First” person experiences: seeing a “2x” and “3x” situation, and then seeing that the “3x” person doesn’t immediately say that he or she knows that his or her number is “3x”).  If the “20” person has 20, then, again, the “First” person sees that the “30” person is announcing that they aren’t seeing an “x” and “2x” situation, which means that the “First” person can’t have 10 and must have 50.  This causes the “20” person to know that his or her number is 20.

Similarly, the “30” person sees “50” and “20” and initially doesn’t know if he or she is 30 or 70.  If it’s 70, the other people either see “70” and “50” or “70” and “20,” which doesn’t allow the situation described above of someone announcing that he or she doesn’t see an “x” and “2x” situation.  If it’s 30, then everything that’s been discussed happens, and so it must be 30.

The key basically is that if someone sees “x” and “2x,” they should know immediately that they are 3x.  If someone sees “2x” and “3x,” they are immediately on high alert to see if the “3x” person immediately knows that he or she is 3x.  If the “3x” person doesn’t know, that is an announcement that the “3x” person did not see an “x” and “2x” situation, which means that the person we started with must be “5x.”  So, in an “x” and “2x” situation, you know immediately that you are 3x.  In a “2x” and “3x” situation, if everyone says that he or she doesn’t know in the first round, that announces that no one saw (and the “3x” person in particular did not see) an “x” and “2x” situation, which means that you must be 5x.
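
Keeping track of three points of view is exactly the kind of bookkeeping a short program can do. The sketch below is my own formalization, not something from the puzzle statement: a “world” is the triple of numbers (yours, the “20” person’s, the “30” person’s), each person considers the two values their own number could take, and a candidate world is thrown out if anyone in it would have answered differently from what was actually heard so far.

```python
def is_valid(world):
    """Three distinct positive numbers, one of which is the sum of the other two."""
    a, b, c = world
    return (min(world) > 0 and len(set(world)) == 3
            and (a == b + c or b == a + c or c == a + b))

def candidates(world, history):
    """Worlds the next person to answer still considers possible.

    history[s] is True if the person asked at step s announced their number.
    """
    person = len(history) % 3
    x, y = (v for i, v in enumerate(world) if i != person)
    found = []
    for own in {x + y, abs(x - y)}:
        w = world[:person] + (own,) + world[person + 1:]
        # Keep w only if everyone in w would have answered as actually heard.
        if is_valid(w) and all(knows(w, history[:s]) == history[s]
                               for s in range(len(history))):
            found.append(w)
    return found

def knows(world, history):
    return len(candidates(world, history)) == 1

def simulate(world, turns):
    """Answers in asking order; None means 'I don't know'."""
    history, answers = [], []
    for _ in range(turns):
        person = len(history) % 3
        found = candidates(world, history)
        known = len(found) == 1
        answers.append(found[0][person] if known else None)
        history.append(known)
    return answers

print(simulate((50, 20, 30), 6))
```

Running it on the world (50, 20, 30) reproduces the story: three “I don’t know” answers in the first round, then 50, 20, and 30 in the second.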