# Author: Econopunk

## A Thing About the Hot Hand Fallacy and the “Law of Small Numbers”

There was an interesting post and discussion on the NBA subreddit of Reddit on the Hot Hand phenomenon and whether or not it is a fallacy.

A Numberphile video on the topic:

An article on the topic:

https://www.scientificamerican.com/article/do-the-golden-state-warriors-have-hot-hands/

In some parts of the Numberphile video, Professor Lisa Goldberg emphasizes that the issue of the “Law of Small Numbers,” when looking at the hot hand phenomenon, comes from the fact that we don’t get to see what happens after an H at the end of a sequence. The Scientific American article describes the “Law of Small Numbers” this way:

Early in their careers, Amos Tversky and Daniel Kahneman considered the human tendency to draw conclusions based on a few observations, which they called the ‘‘law of small numbers’’.

Let a sequence be a string of shots of some length. A shot is either a make H or a miss T. So a sequence of 3 shots might be:

$$ HTH $$

A make, a miss, and then a make. So looking at that, we see that after the first H, we missed, which is evidence against the hot hand. We don’t care what happens after a miss, the T. We can’t see what happens after the last shot, which is a make. This is what’s noted as causing the “Law of Small Numbers.”

A moment from the Numberphile video illustrating the probabilities of H after an H for each possible sequence of 3 shots, and the average of those probabilities:

And here, this “Law of Small Numbers” causes the average probability of H’s after an H to be 2.5/6. When the sequence is a finite length, the probability of an H after an H (or a T after a T) is biased below 0.5. As the sequence gets longer and tends toward infinity, the probability of an H after an H (or a T after a T) goes toward 0.5.

While all this is true, let’s look a little closer at what’s going on in this illustration to understand why and how exactly this bias occurs.

All possibilities of sequences of 3 shots:

$$ n = \textrm{3} $$

$$ \textrm{Average probability} = \frac{2.5}{6} = 0.416\bar{6} $$
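The three columns referred to throughout (the sequence, its hot hand evaluations, and the share of those evaluations that are H) can be rebuilt with a few lines of Python. This is a minimal sketch, not taken from the video; the function name `hot_hand_row` is made up for illustration, and the row order it prints may differ from the order used in the video's table.

```python
from fractions import Fraction
from itertools import product

def hot_hand_row(seq):
    """Return (evaluations, proportion) for one sequence such as 'HTH'.

    evaluations: the shots that directly follow an H (the "2nd column").
    proportion: share of those shots that are H (the "3rd column"),
                or None when no shot follows an H.
    """
    evals = "".join(b for a, b in zip(seq, seq[1:]) if a == "H")
    if not evals:
        return evals, None
    return evals, Fraction(evals.count("H"), len(evals))

n = 3
rows = [("".join(s),) + hot_hand_row("".join(s)) for s in product("TH", repeat=n)]
for seq, evals, prop in rows:
    print(f"{seq}  {evals or '-':<3} {prop if prop is not None else '-'}")

# Average the 3rd column over the sequences that have at least one evaluation.
props = [prop for _, _, prop in rows if prop is not None]
average = sum(props) / len(props)
print(f"sum = {float(sum(props))}, count = {len(props)}, average = {float(average):.4f}")
```

Using `Fraction` keeps the 2.5/6 exact instead of relying on floating point.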

Assuming that an H and a T each appear with 0.5 probability and there is no memory, i.e. no hot hand, each of the above 8 sequences is equally probable. The average probability over the 6 cases where we can evaluate whether there is a hot hand or not (cases that have an H in the first or second shot) is calculated to be 2.5/6 < 0.5.
**But let’s count the number of H’s and T’s in the second column. There are 4 H’s and 4 T’s!** So we have:

$$ \frac {\textrm{Number of H’s}}{\textrm{Number of H’s \& T’s}} = \frac {4}{8} = 0.5 $$

So it’s as if we’ve undercounted the cases where there are 2 shots that are “hot hand evaluations,” the last two sequences at the bottom of the list. In all (8) sequences of length 3, how many hot hand evaluations in total were there? (How many H’s or T’s in the 2nd column?) 8. How many of those were H’s? 4. So we have a hot hand make probability of 0.5.
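The pooled count is easy to check by brute force. The sketch below (the helper name `pooled_counts` is an assumption for illustration) adds up hot hand makes and hot hand evaluations over every sequence of length n, instead of averaging per sequence.

```python
from itertools import product

def pooled_counts(n):
    """Count hot hand makes and hot hand evaluations pooled over all 2**n sequences."""
    makes = evaluations = 0
    for seq in product("HT", repeat=n):
        for prev, nxt in zip(seq, seq[1:]):
            if prev == "H":          # nxt is a hot hand evaluation
                evaluations += 1
                makes += nxt == "H"  # True counts as 1
    return makes, evaluations

for n in (3, 4, 5, 6):
    makes, evaluations = pooled_counts(n)
    print(n, makes, evaluations, makes / evaluations)
```

For n = 3 this gives 4 makes out of 8 evaluations, and the pooled ratio stays at 0.5 for the other lengths too, in contrast with the per-sequence average.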

It doesn’t necessarily mean that the way they counted hot hand makes in the Numberphile video is wrong. It’s just a particular way of counting it that causes a particular bias. It also may be the particular way the human instinct feels hot handedness – as an average of the probability of hot hand makes over different sequences. In other words, that way of counting may better model how we “feel” or evaluate hot handedness in real world situations.

**So why is the average probability over sequences < 0.5?**

When we evaluate hot-handedness, we are looking at shots that come after an H. Suppose we write down a list or table of every possible shot sequence of length \(n\), ordered from fewer H’s to more H’s: starting from the sequence of all T’s and ending with the sequence of all H’s. We noted above that if we count all the hot hand makes in all sequences (the H’s in the 2nd column), the proportion of hot hand H’s among all hot hand evaluations (the number of H’s or T’s in the 2nd column) is 1/2. When we look at the list of sequences, what we notice is that a lot of the hot hand H’s (the 2nd column) are concentrated in the sequences toward the bottom. But each of these H-heavy sequences contributes only one entry to the 3rd column, a probability of 1 or near 1, no matter how many hot hand H’s it contains.

$$ n = \textrm{4} $$

$$ \textrm{Average probability} = \frac{5.6\bar{6}}{14} \approx 0.405 $$

$$ n = \textrm{5} $$

$$ \textrm{Average probability} = \frac{12.25}{30} \approx 0.408\bar{3} $$

Assuming equal probability of H and T on any given shot and no memory between shots: the entire list of sequences (the 1st column) will have an equal number of H’s and T’s. Additionally, all the hot hand evaluations (the 2nd column) will have an equal number of H’s and T’s.

Looking at the 1st column, we go from more T’s at the top to more H’s at the bottom in a smooth manner. Looking at the 2nd column though, we go from rows of T’s and as we go down we find that a lot of H’s are “bunched up” towards the bottom. But remember that we have a “limited” number of H’s in the 2nd column as well, namely 50% of all hot hand evaluations are H’s and 50% are T’s.

**Let’s look closely at how the pattern in the 1st column causes more H’s to be bunched up in the lower sequences in the 2nd column, and also if there is any pattern to the T’s when we look across different sequences.**

Higher sequences have fewer H’s (looking at the 1st column), which means more HT’s in those sequences as well, i.e. more hot hand misses. Lower sequences have more H’s, which means more HH’s in those sequences, i.e. more hot hand makes. This means that, looking at the 2nd column, higher sequences have more T’s and lower sequences have more H’s. **Lower sequences “use up” more of the “limited amount” of H’s** (limited because the numbers of H’s and T’s in the 2nd column are equal). Thus, H’s in the 2nd column are “bunched up” in the lower sequences as well. This causes there to be fewer sequences with higher probability (the 3rd column) than sequences with lower probability. Perhaps this is what brings the average probability below 0.5.

A naive look at the 2nd column shows that the highest sequences have a lone T as their hot hand evaluation, and the hot hand evaluations of many other higher sequences end with a T. This makes sense: if a sequence consists of a lot of T’s, its H’s are unlikely to fall on the last two shots of the sequence, like …HH, which is what’s needed for the hot hand evaluations in the 2nd column to end with an H. And as long as a T is the last shot, the hot hand evaluations of the sequence will end with a T, since any lone H or streak of H’s in the sequence is followed by a T, either the last T of the sequence itself (…HHT) or the first of the consecutive T’s that lead up to it (…HHTT…T).

Let’s divide up all the sequences in the 1st column into categories by how a sequence ends in its last 2 shots, and use that to work out what the last hot hand evaluation in the 2nd column will be for each category. There are 4 possible ways to have the last 2 shots: TT, TH, HT, and HH. If a sequence ends in …TT, the “…” portion is either all T’s or, if it has any H’s, its last streak of H’s runs into a T at or before the second-to-last shot (either …H…TTT or …HTT). So in all cases but one (where the entire sequence is T’s and so there is no hot hand evaluation for the 2nd column), the last hot hand evaluation in the 2nd column will be a T. If a sequence ends in …TH, the thinking is similar to the …TT case, since the very last H doesn’t provide us with an additional hot hand evaluation (the sequence ends right there), so the 2nd column also ends in a T. If a sequence ends in …HT, the last T there is our last hot hand evaluation, so the 2nd column also ends in a T. If a sequence ends in …HH, then the 2nd column ends in an H. So about 3/4 of all sequences end their 2nd column with a T. (\( \frac{3}{4} \cdot 2^n - 2 \) sequences to be exact, since the sequences of all T’s and of \((n-1)\) T’s followed by an H don’t have any hot hand evaluations.) **Thus, the T’s in the 2nd column are “spread out more evenly” across the different sequences**, since \( \frac{3}{4} \cdot 2^n - 2 \) of the \( 2^n \) sequences have a T for their last hot hand evaluation (the 2nd column), while the H’s are “bunched up” in the lower sequences. Thus, a relatively large number of sequences, especially sequences that are higher up, have their probabilities (the 3rd column) influenced by T’s in the 2nd column, bringing the average probability across sequences down.
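The case analysis by last two shots can be checked by enumeration. In this sketch (the name `last_evaluation_counts` is hypothetical) each sequence is classified by the last entry of its 2nd column, and the count of sequences ending in a T is printed next to 3/4 of the 2^n sequences minus the 2 that have no evaluation.

```python
from itertools import product

def last_evaluation_counts(n):
    """Count sequences by the last entry of their 2nd column: 'T', 'H', or none."""
    counts = {"T": 0, "H": 0, None: 0}
    for seq in product("HT", repeat=n):
        evals = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == "H"]
        counts[evals[-1] if evals else None] += 1
    return counts

for n in (3, 4, 5, 6):
    c = last_evaluation_counts(n)
    # columns: n, ends in T, ends in H, no evaluation, (3/4) * 2**n - 2
    print(n, c["T"], c["H"], c[None], 3 * 2**n // 4 - 2)
```

For n = 3 the split is 4 sequences ending in T, 2 ending in H, and 2 with no evaluation at all.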

$$ n = \textrm{6} $$

$$ \textrm{Average probability} \approx 0.4161 $$

**As \( n \) grows larger, the average probability seems to drift up.**
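The averages quoted for n = 3, 4, 5 and 6 can be reproduced exactly by brute force. This is a small sketch with a made-up function name, `average_over_sequences`; it assumes p = 1/2 so that every sequence is equally likely. Going by the figures above, the average first dips from n = 3 (about 0.4167) to n = 4 (about 0.405) and rises after that; the loop can be extended to see how it moves for longer sequences.

```python
from fractions import Fraction
from itertools import product

def average_over_sequences(n):
    """Average of the per-sequence proportion of H after H (the 3rd column)."""
    props = []
    for seq in product("HT", repeat=n):
        evals = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == "H"]
        if evals:  # skip sequences with no hot hand evaluation
            props.append(Fraction(evals.count("H"), len(evals)))
    return sum(props) / len(props)

for n in range(3, 11):
    avg = average_over_sequences(n)
    print(n, avg, round(float(avg), 4))
```

The enumeration grows as 2^n, so this direct approach is only practical for short sequences.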

Looking at the top of the list of sequences for \( n = 4 \), there are 3 sequences with a 0 in the 3rd column. These 3 sequences consist of 1 H and 3 T’s (and TTTH is uncounted because there is no hot hand evaluation in that sequence). At the bottom, we have the HHHH sequence giving a 1 in the 3rd column, and then 4 sequences that have 3 H’s and 1 T. The entries in the 3rd column for these 4 sequences are 1, 0.5, 0.5, and 0.667.

For sequences of \( n = 5 \), there are then 4 sequences at the top of the list that give a 0 in the 3rd column. At the bottom, the HHHHH sequence gives a 1 in the 3rd column, and then the sequences with 4 H’s and 1 T give 1, 0.667, 0.667, 0.667, 0.75 in the 3rd column.

For sequences of \( n = 6 \), there are then 5 sequences at the top of the list that give a 0 in the 3rd column. At the bottom, the HHHHHH sequence gives a 1 in the 3rd column, and then the sequences with 5 H’s and 1 T give 1, 0.75, 0.75, 0.75, 0.75, 0.8 in the 3rd column.

This pattern shows that as \( n \) increases, we get \( (n - 1) \) sequences at the top of the list that always give 0’s in the 3rd column. At the bottom there is always 1 sequence of all H’s that gives a 1 in the 3rd column. Then for the sequences with \( (n - 1) \) H’s and 1 T, we always have 1 sequence of THH…HH that gives a 1 in the 3rd column, then \( (n - 2) \) sequences that give a \( \frac{n - 3}{n - 2} \) in the 3rd column, and always 1 sequence of HH…HT that gives a \( \frac{n - 2}{n - 1} \) in the 3rd column. So as \( n \) becomes large, the entries in the 3rd column for these sequences with \( (n - 1) \) H’s and 1 T get closer to 1. For small \(n\), such as \(n = 4\), those entries are as low as 0.5 and 0.667. But the entries in the 3rd column for the sequences high in the list with 1 H and \( (n - 1) \) T’s remain at 0 for any \(n\). Thus, as \( n \) becomes large, the lower sequence entries in the 3rd column become larger, shifting the average probability over sequences up.

Roughly speaking, when we only have one shot make in a sequence of shots (only 1 H among \(n-1\) T’s), we have only one hot hand evaluation possible, which is the shot right after the make. Ignoring the case of TT…TH, that hot hand evaluation coming after the H will always be a miss. **Thus, when there is only one shot make in a sequence, the hot hand probability is always 0.** On the other hand, when we have only one shot miss in a sequence, ignoring the TH…HH case, we will have 1 hot hand miss and many hot hand makes. **Thus, our hot hand probability in these sequences with only 1 T will always be less than 1, and approaches 1 as \( n \) approaches \( \infty \).** In a rough way, this lack of balance between the high sequences and low sequences drags down the average probability over the sequences below 0.5, with the amount that’s dragged down mitigated by larger and larger \( n \).
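The imbalance between the two ends of the list can be made concrete with a short sketch. The helper names `proportion`, `one_make` and `one_miss` are made up for illustration; the code lists the 3rd-column entries for sequences with exactly one H and for sequences with exactly one T.

```python
from fractions import Fraction

def proportion(seq):
    """Proportion of H among shots that follow an H, or None if there are none."""
    evals = [nxt for prev, nxt in zip(seq, seq[1:]) if prev == "H"]
    return Fraction(evals.count("H"), len(evals)) if evals else None

def one_make(n):
    """3rd-column entries for sequences with exactly one H (TT...TH is skipped)."""
    seqs = ["T" * i + "H" + "T" * (n - 1 - i) for i in range(n)]
    return [proportion(s) for s in seqs if proportion(s) is not None]

def one_miss(n):
    """3rd-column entries for sequences with exactly one T, from TH...H to H...HT."""
    seqs = ["H" * i + "T" + "H" * (n - 1 - i) for i in range(n)]
    return [proportion(s) for s in seqs]

for n in (4, 5, 6, 20):
    print(n, [str(p) for p in one_make(n)], [str(p) for p in one_miss(n)])
```

The one-make entries stay pinned at 0 for every n, while the one-miss entries creep toward 1 as n grows, which is the asymmetry described above.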

A possibly interesting observation or interpretation of this is **how it might lead to the human mind “feeling” the gambler’s fallacy (e.g. consecutive H’s means a T “has to come” soon) and the hot hand fallacy (e.g. consecutive H’s means more H’s to come)**. The above results show that in finite-length sequences, when a human averages in their mind the probability of hot hand instances across sequences, i.e. across samples or experiences, the average probability is < 0.5. In other words, across experiences, the human mind “feels” the gambler’s fallacy, that reversals after consecutive results are more likely. But when a human happens to find themselves in one of the lower sequences on a list, where there are relatively more H’s than T’s in the 1st column, the hot hand evaluations (the 2nd column) are likely to have a lot more H’s than what you’d expect, because H’s are “bunched up” towards the bottom of the 2nd column. What you expect are reversals; that’s what “experience” and the gambler’s fallacy that results from that experience tell us. But when we find ourselves in a sequence low in the list, the hot hand instances (the 2nd column) give us an inordinately high number of hot hand makes. So when we’re hot, it feels like we’re really hot, giving us the hot hand fallacy.

An actually rigorous paper on this subject, also found in a comment from the Reddit post, is *Miller, Joshua B. and Sanjurjo, Adam, Surprised by the Gambler’s and Hot Hand Fallacies? A Truth in the Law of Small Numbers*. **One of the proofs they present is a proof that the average probability of hot hand makes across sequences is less than the standalone probability of a make** (i.e. using our example, the average of the entries in the 3rd column is less than 0.5, the probability of an individual make).

Let

$$ \boldsymbol{X} = \{X_i\}_{i=1}^n $$

be a sequence of 0’s and 1’s that is \(n\) long. An \( X_i = 0 \) represents a miss and an \( X_i = 1 \) represents a make.

From the sequence \( \boldsymbol{X} \), we excerpt out the hot hand evaluations, which are shots that occur after \( k \) made shots. In our example, we are just concerned with \( k = 1\). The hot hand evaluation \(i\)’s are

$$ I_k( \boldsymbol{X} ) := \{i : \Pi_{j=i-k}^{i-1} X_j = 1\} \subseteq \{k+1,…,n\} $$

So \( I_k( \boldsymbol{X} ) \) is defined to be the \( i \)’s where the product of the \(k\) preceding \(X\)’s is 1, and \(i\) can only be from among \( \{k+1,…,n\} \). So for example, let \(k=2\) and \(n=6\). Then firstly, an \(i\) that is in \( I_k(\boldsymbol{X}) \) can only be among \( \{3,4,5,6\} \) because if \(i = 1,2\), there aren’t enough preceding shots – we need 2 preceding shots made to have the \(i\)th shot be a hot hand evaluation. Ok, so let’s look at \(i = 4\). Then,

$$ \Pi_{j=4-2}^{4-1} X_j = X_2 \cdot X_3 $$

This makes sense. If we are looking at \(i = 4\), we need to see if the 2 preceding shots, \(X_2\) and \(X_3\) are both 1.
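The definition of \( I_k(\boldsymbol{X}) \) translates almost word for word into code. In this sketch the function name `hot_hand_indices` and the example sequence are made up for illustration; trials are numbered from 1 to match the notation above.

```python
def hot_hand_indices(x, k):
    """Return I_k(x): the 1-based trials i whose k preceding trials are all makes.

    x is a list of 0's (misses) and 1's (makes); i ranges over k+1, ..., n.
    """
    n = len(x)
    return [i for i in range(k + 1, n + 1)
            if all(x[j - 1] == 1 for j in range(i - k, i))]

x = [1, 1, 1, 0, 1, 1]           # hypothetical sequence with n = 6
print(hot_hand_indices(x, k=2))  # trials preceded by 2 straight makes
print(hot_hand_indices(x, k=1))  # trials preceded by 1 make
```

With k = 2, trial 4 is included because \(X_2\) and \(X_3\) are both 1, while trial 5 is excluded because \(X_4 = 0\).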

**The theorem stated in full is**:

Let

$$ \boldsymbol{X} = \{X_i\}_{i=1}^n $$

with \( n \geq 3 \) be a sequence of independent (and identical) Bernoulli trials, each with probability of success \( 0 \lt p \lt 1 \). Let

$$ \hat{P}_k(\boldsymbol{X}) := \sum_{i \in I_k(\boldsymbol{X})} \frac{X_i}{|I_k(\boldsymbol{X})|} $$

Then, \( \hat{P}_k \) is a biased estimator of

$$ \mathbb{P} ( X_t = 1 | \Pi_{j=t-k}^{t-1} X_j = 1 ) \equiv p $$

for all \(k\) such that \(1 \leq k \leq n - 2\). In particular,

$$ \mathbb{E} \left[ \hat{P}_k (\boldsymbol{X}) | I_k(\boldsymbol{X}) \neq \emptyset \right] \lt p $$

We have the \(n \geq 3 \) because when \( n = 2 \), we actually won’t have the bias. We have HH, HT, TH, TT, and if \( p = 1/2 \), we have the HH giving us a hot hand evaluation of H and the HT giving us a hot hand evaluation of T, so that’s 1 hot hand make out of 2 hot hand evaluations, giving us the \( \hat{P} = 1/2 \) with no bias.

We have \( \hat{P}_k( \boldsymbol{X} ) \) as our estimator of the hot hand make probability. It’s taking the sum of all \(X_i\)’s where \(i\) is a hot hand evaluation (the preceding \(k\) shots all went in) and dividing it by the number of hot hand evaluations – in other words, the hot hand makes divided by the hot hand evaluations. Note that we are just looking at one sequence \( \boldsymbol{X} \) here.
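As a sketch, the estimator for a single sequence can be written as below. The name `hot_hand_estimate` is made up for illustration, and returning `None` for a sequence with no hot hand evaluation is a choice made here, since the estimator is undefined in that case.

```python
from fractions import Fraction

def hot_hand_estimate(x, k):
    """Return P-hat_k(x): makes on hot hand evaluations / number of evaluations.

    Returns None when I_k(x) is empty, where the estimator is undefined.
    """
    n = len(x)
    # 0-based i: shot i is an evaluation when the k shots before it are all 1.
    evals = [x[i] for i in range(k, n) if all(x[i - k:i])]
    return Fraction(sum(evals), len(evals)) if evals else None

print(hot_hand_estimate([1, 1, 0], k=1))           # HHT
print(hot_hand_estimate([1, 1, 1, 0, 1, 1], k=2))  # hypothetical 6-shot sequence
print(hot_hand_estimate([0, 0, 1], k=1))           # TTH has no evaluation
```

For HHT the two evaluations are a make and a miss, giving 1/2, which matches the 3rd-column entry for that row in the n = 3 case.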

\( \mathbb{P} (X_t = 1 | \Pi_{j=t-k}^{t-1} X_j = 1 ) \equiv p \) is the actual probability of a hot hand make. Since we are assuming that the sequence \( \boldsymbol{X} \) is \(i.i.d.\), the probability of a hot hand make is the same as the probability of any make, \(p\).

\(k\) is restricted to \(1 \leq k \leq n - 2\) since if \(k = n - 1 \) then the only possible hot hand evaluation is when the first \(n-1\) shots are all made. Then we would be evaluating at most 1 shot in a sequence, the last shot. Similar to the case above where \(n=2\), the estimator would be unbiased. If \(k = n\), then we would never have any hot hand evaluation, since even a sequence of all makes would only qualify the next shot, the \((n+1)\)th, as a hot hand evaluation, and that shot is not in the sequence.

\( \mathbb{E} \left[ \hat{P}_k (\boldsymbol{X}) | I_k(\boldsymbol{X}) \neq \emptyset \right] \lt p \) is saying that the expectation of the estimator (given that we have some hot hand evaluations) underestimates the true \(p\).
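For short sequences the conditional expectation in the theorem can be computed exactly by enumerating every sequence and weighting it by its probability. This is a brute-force sketch with a hypothetical function name, `expected_estimate`; it is an illustration of the statement, not the paper's method.

```python
from fractions import Fraction
from itertools import product

def expected_estimate(n, k, p):
    """Exact E[P-hat_k(X) | I_k(X) non-empty] for i.i.d. Bernoulli(p) shots."""
    p = Fraction(p)
    weighted_sum = Fraction(0)   # sum of P(X = x) * P-hat_k(x) over x in F
    prob_f = Fraction(0)         # P(F), the chance of at least one evaluation
    for x in product((0, 1), repeat=n):
        evals = [x[i] for i in range(k, n) if all(x[i - k:i])]
        if not evals:
            continue
        prob_x = p ** sum(x) * (1 - p) ** (n - sum(x))
        weighted_sum += prob_x * Fraction(sum(evals), len(evals))
        prob_f += prob_x
    return weighted_sum / prob_f

print(expected_estimate(3, 1, Fraction(1, 2)))   # the n = 3 case
print(expected_estimate(2, 1, Fraction(1, 2)))   # n = 2: no bias
print(expected_estimate(4, 3, Fraction(1, 3)))   # k = n - 1: no bias
print(expected_estimate(6, 2, Fraction(1, 3)))   # k within 1, ..., n - 2
```

The first call reproduces the 2.5/6 average, and the two edge cases (n = 2, and k = n - 1) come out equal to p, in line with the restrictions in the theorem.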

Here is the rigorous proof provided by the paper in its appendix:

First,

$$ F:= \{ \boldsymbol{x} \in \{ 0,1 \}^n : I_k (\boldsymbol{x}) \neq \emptyset \} $$

\(F\) is defined to be the sample space of sequences \(\boldsymbol{x}\), where a sequence is an instance of \(\boldsymbol{X}\) that is made up of \(n\) entries of either \(0\)’s or \(1\)’s and that has a non-zero number of hot hand evaluations. In other words, \(F\) is all the binary sequences of length \(n\) that give an entry in the 3rd column, like the lists of sequences we wrote down for \(n = 3,4,5,6\) above minus the sequences with no hot hand evaluation (for \(k = 1\), TT…TT and TT…TH). By having \( I_k (\boldsymbol{x}) \neq \emptyset \), we have that \( \hat{P}_k(\boldsymbol{x}) \) is well-defined.

The probability distribution over \(F\) is given by

$$ \mathbb{P} (A|F) := \frac{ \mathbb{P} (A \cap F) } {\mathbb{P}(F)} \text{ for } A \subseteq \{0,1\}^n $$

where

$$ \mathbb{P}(\boldsymbol{X} = \boldsymbol{x})= p^{\sum_{i=1}^{n} x_i} (1 - p)^{n - \sum_{i=1}^{n} x_i} $$

So the probability of an event \(A\) (a set of sequences) happening given the sample space \(F\) is the probability of drawing a sequence that is in both \(A\) and \(F\) divided by the probability of drawing a sequence in \(F\). If our sample space were simply the space of all possible sequences of length \(n\), then this statement would be trivial.

The probability of some sequence \(\boldsymbol{x}\) happening is the probability that \( \sum_{i=1}^{n} x_i \) shots are makes and \( n - \sum_{i=1}^{n} x_i \) shots are misses, in that particular order. When we have \( p = 1/2 \), this simplifies to

$$ \mathbb{P}(\boldsymbol{X} = \boldsymbol{x})= \left( \frac{1}{2} \right)^{\sum_{i=1}^{n} x_i} \left( \frac{1}{2} \right)^{n - \sum_{i=1}^{n} x_i} = \left( \frac{1}{2} \right)^n = \frac{1}{2^n}$$

Draw a sequence \( \boldsymbol{x} \) at random from \(F\) according to the distribution \( \mathbb{P} ( \boldsymbol{X} = \boldsymbol{x} | F ) \) and then draw one of the shots, i.e. one of the trials \( \tau \) from \( \{k+1,…,n\} \), where \( \tau \) is a uniform draw from the trials of \( \boldsymbol{X} \) that come after \( k \) makes. So for

$$ \boldsymbol{x} \in F \text{ and } t \in I_k(\boldsymbol{x}) $$

we have that

$$ \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) = \frac{1}{|I_k(\boldsymbol{x})|} $$

So \(\boldsymbol{x}\) is some instance of a sequence from the sample space and \(t\) is one of the shots or trials from the sequence \(\boldsymbol{x}\) that is a hot hand evaluation, i.e. \(t\) is one of the hot hand evaluations from sequence \(\boldsymbol{x}\). Then the probability of \(\tau\) drawn being a particular \(t\) is like uniformly drawing from all of the possible hot hand evaluations, i.e. the probability of drawing 1 element out of the number of hot hand evaluations.

Instead, when

$$ t \in I_k(\boldsymbol{x})^C \cap \{k+1,…,n\} $$

i.e. if we are looking at trials among \(\{k+1,…,n\}\) that are *not* hot hand evaluation trials, then

$$ \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) = 0 $$

i.e. the random \( \tau \)th trial we draw will never pick from among those trials that are not hot hand evaluations. A \( \tau \) draw is only from among the hot hand evaluation trials.

Then, the unconditional probability distribution of \( \tau \) that can possibly follow \(k\) consecutive makes/successes, i.e. \(t \in \{k+1,…,n\}\), is

$$ \mathbb{P}(\tau = t | F ) = \sum_{\boldsymbol{x} \in F} \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}, F) \mathbb{P}( \boldsymbol{X} = \boldsymbol{x} | F) $$

So given the sample space of all sequences \(F\), i.e. we may be dealt any possible sequence from the sample space, the probability of drawing a particular hot hand evaluation trial \(\tau\) is the probability of drawing a particular hot hand trial given a certain sequence \(\boldsymbol{x}\) multiplied by the probability of drawing that sequence \(\boldsymbol{x}\) given the sample space of all possible sequences, summed over all possible sequences in the sample space.

Then, there is an identity that is shown, which is:

$$ \mathbb{E} \left[ \hat{P}_k(\boldsymbol{X}) | F \right] = \mathbb{P}(X_\tau = 1 | F) $$

From the definition above \( \hat{P}_k(\boldsymbol{X}) \), the estimator of \(p\) given a single sequence \(\boldsymbol{X}\):

$$ \hat{P}_k(\boldsymbol{X}) := \sum_{i \in I_k(\boldsymbol{X})} \frac{X_i}{|I_k(\boldsymbol{X})|} $$

we can write:

$$ \hat{P}_k(\boldsymbol{x}) = \sum_{t \in I_k(\boldsymbol{x})} \frac{x_t}{|I_k(\boldsymbol{x})|} = \sum_{t \in I_k(\boldsymbol{x})} x_t \cdot \frac{1}{|I_k(\boldsymbol{x})|} $$

$$ = \sum_{t \in I_k(\boldsymbol{x})} \left[ x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) \right] $$

$$ = \sum_{t \in I_k(\boldsymbol{x})} x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) + 0 = \sum_{t \in I_k(\boldsymbol{x})} x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) + \sum_{t \notin I_k(\boldsymbol{x})} 0 $$

$$ = \sum_{t \in I_k(\boldsymbol{x})} x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) + \sum_{t \notin I_k(\boldsymbol{x})} x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) $$

since

$$ \text{if } \{t \notin I_k(\boldsymbol{x})\} \text{, then } \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) = 0 $$

So

$$ \hat{P}_k(\boldsymbol{x}) = \sum_{t \in I_k(\boldsymbol{x})} x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) + \sum_{t \notin I_k(\boldsymbol{x})} x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) $$

$$ = \sum_{t = k+1}^n x_t \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}) $$

The paper then makes a step in footnote 44 that I have not quite figured out, but the best I can make of it is this. Looking at what we’ve arrived at for \( \hat{P}_k(\boldsymbol{x}) \), we see that we sum across all trials \(t\) from \(k+1\) to \(n\). Also, we’re effectively only summing across trials \(t\) where \(t \in I_k(\boldsymbol{x})\) because for \(t \notin I_k(\boldsymbol{x})\), we have \( \mathbb{P} (\tau = t | \boldsymbol{X} = \boldsymbol{x}) = 0 \).

So we are to add up the \(x_t\) for \(t\)’s that, most importantly, satisfy \(t \in I_k(\boldsymbol{x})\). The logic, I think, goes like this:

$$ \sum_{t = k+1}^n x_t = \text{ some sum of 0’s and 1’s like } 1 + 0 + … + 1 + 0 $$

$$ = \sum_{t=k+1}^n \mathbb{P}(X_t = 1 | \text{ for each } \tau = t, \boldsymbol{X} = \boldsymbol{x} ) = \sum_{t=k+1}^n \mathbb{P}(X_t = 1 | \tau = t, \boldsymbol{X} = \boldsymbol{x} ) $$

The strange thing is that what was an instance of a random variable \(x_t\), an actual numerical value that can come about empirically and thus allows us to estimate with the estimator \(\hat{P}\), has turned into a probability. One way to make sense of it: once the entire sequence \(\boldsymbol{x}\) is given, \(X_t\) is no longer random, so the probability that \(X_t = 1\) is 1 when \(x_t = 1\) and 0 when \(x_t = 0\), i.e. it equals \(x_t\).

Being given a valid sequence \( \boldsymbol{x} \) only makes sense if we have a sample space, so we also write:

$$ \sum_{t=k+1}^n \mathbb{P}(X_t = 1 | \tau = t, \boldsymbol{X} = \boldsymbol{x}, F ) $$

as well as

$$ \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}, F ) $$

We refrain from thinking we can say that \( \mathbb{P}(X_t = 1 | \tau = t, \boldsymbol{X} = \boldsymbol{x}, F) = p \) as this is part of the intuitive assumption that we are examining. Instead, regarding \(p\), we restrict ourselves to only being allowed to say:

$$ \mathbb{P} ( X_t = 1 | \Pi_{j=t-k}^{t-1} X_j = 1 ) \equiv p $$

So now we have:

$$ \hat{P}_k(\boldsymbol{x}) = \sum_{t=k+1}^n \left[ \mathbb{P}(X_t = 1 | \tau = t, \boldsymbol{X} = \boldsymbol{x}, F ) \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}, F ) \right] $$

When we take the expectation with \(F\) given, we are taking the argument above with respect to \(\boldsymbol{X}\) for all \(\boldsymbol{x} \in F\). So:

$$ \mathbb{E} \left[ \hat{P}_k(\boldsymbol{X}) | F \right] = \mathbb{E}_{\boldsymbol{X} \text{ for } \boldsymbol{x} \in F} \left[ \hat{P}_k(\boldsymbol{x}) | F \right] $$

$$ = \sum_{t=k+1}^n \left[ \mathbb{E}_{\boldsymbol{X} \text{ for } \boldsymbol{x} \in F} \left[ \mathbb{P}(X_t = 1 | \tau = t, \boldsymbol{X} = \boldsymbol{x}, F ) \cdot \mathbb{P}(\tau = t | \boldsymbol{X} = \boldsymbol{x}, F ) | F \right] \right] $$

$$ = \sum_{t=k+1}^n \left[ \mathbb{P}(X_t = 1 | \tau = t, F ) \cdot \mathbb{P}(\tau = t | F ) \right] $$

$$ = \mathbb{P}(X_\tau = 1 | F ) $$
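The identity \( \mathbb{E} [ \hat{P}_k(\boldsymbol{X}) | F ] = \mathbb{P}(X_\tau = 1 | F) \) can be sanity-checked numerically by computing the two sides separately. The sketch below uses a made-up function name, `both_sides`: the left side averages the per-sequence estimator, and the right side accumulates the probability that a uniformly drawn hot hand evaluation trial \(\tau\) is a make.

```python
from fractions import Fraction
from itertools import product

def both_sides(n, k, p):
    """Return (E[P-hat_k(X) | F], P(X_tau = 1 | F)) computed separately."""
    p = Fraction(p)
    prob_f = Fraction(0)
    lhs = Fraction(0)    # expectation of the per-sequence estimator
    rhs = Fraction(0)    # chance that the uniformly drawn trial tau is a make
    for x in product((0, 1), repeat=n):
        idx = [t for t in range(k, n) if all(x[t - k:t])]   # I_k(x), 0-based
        if not idx:
            continue
        prob_x = p ** sum(x) * (1 - p) ** (n - sum(x))
        prob_f += prob_x
        lhs += prob_x * Fraction(sum(x[t] for t in idx), len(idx))
        for t in idx:                      # P(tau = t | X = x) = 1 / |I_k(x)|
            if x[t] == 1:
                rhs += prob_x * Fraction(1, len(idx))
    return lhs / prob_f, rhs / prob_f

print(both_sides(3, 1, Fraction(1, 2)))
print(both_sides(5, 2, Fraction(2, 5)))
```

The two numbers agree because drawing \(\tau\) uniformly from a sequence's evaluations and asking whether it is a make is the same as taking that sequence's proportion of makes.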

which is our identity we were looking for. We also note that

$$ \mathbb{P}(\tau = t | F) \gt 0 \text{ for } t \in \{k+1,…,n\} $$

Next, we divide up \(t\) into \( t \lt n\) and \(t = n\). We show that

$$ \mathbb{P} (X_t = 1 | \tau = t, F) \lt p \text{ when } t \lt n $$

and

$$ \mathbb{P} (X_{t = n} = 1 | \tau = n, F) = p \text{ when } t = n $$

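so that, since the weights \( \mathbb{P}(\tau = t | F) \) are all positive and sum to 1 over \( t \in \{k+1,…,n\} \), and since \( k \leq n - 2 \) guarantees at least one \( t \lt n \),

$$ \mathbb{P} (X_\tau = 1 | F) = \sum_{t=k+1}^{n-1} \mathbb{P} (X_t = 1 | \tau = t, F) \cdot \mathbb{P} (\tau = t | F) + p \cdot \mathbb{P} (\tau = n | F) $$

$$ \lt \sum_{t=k+1}^{n-1} p \cdot \mathbb{P} (\tau = t | F) + p \cdot \mathbb{P} (\tau = n | F) = p $$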
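The split into \( t \lt n \) and \( t = n \) can be seen directly in small cases. This brute-force sketch (the function name `by_position` is an assumption for illustration) computes, for each trial position \(t\), the weight \( \mathbb{P}(\tau = t | F) \) and the conditional make probability \( \mathbb{P}(X_t = 1 | \tau = t, F) \).

```python
from fractions import Fraction
from itertools import product

def by_position(n, k, p):
    """Return {t: (P(tau = t | F), P(X_t = 1 | tau = t, F))} for t = k+1, ..., n."""
    p = Fraction(p)
    prob_f = Fraction(0)
    joint_tau = {t: Fraction(0) for t in range(k + 1, n + 1)}    # P(tau = t, F)
    joint_make = {t: Fraction(0) for t in range(k + 1, n + 1)}   # P(X_t = 1, tau = t, F)
    for x in product((0, 1), repeat=n):
        # 1-based t is an evaluation when x_{t-k}, ..., x_{t-1} are all makes.
        idx = [t for t in range(k + 1, n + 1) if all(x[t - 1 - k:t - 1])]
        if not idx:
            continue
        prob_x = p ** sum(x) * (1 - p) ** (n - sum(x))
        prob_f += prob_x
        for t in idx:
            w = prob_x / len(idx)          # P(X = x) * P(tau = t | X = x)
            joint_tau[t] += w
            joint_make[t] += w * x[t - 1]
    return {t: (joint_tau[t] / prob_f, joint_make[t] / joint_tau[t])
            for t in joint_tau}

result = by_position(3, 1, Fraction(1, 2))
for t, (weight, make_prob) in result.items():
    print(t, weight, make_prob)
total = sum(weight * make_prob for weight, make_prob in result.values())
print(total)
```

For n = 3, k = 1, p = 1/2, trial 2 has a conditional make probability of 1/3 and trial 3 has exactly 1/2, each with weight 1/2, so the weighted total is 5/12.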

First, we write

$$ \mathbb{P} (X_t = 1 | \tau = t, F) = \mathbb{P} (X_t = 1 | \tau = t, F_t) $$

where

$$ F_t := \{\boldsymbol{x} \in \{0,1\}^n : \Pi_{i=t-k}^{t-1} x_i = 1 \} $$

So while \(F\) is the sample space of sequences \(\boldsymbol{x}\), here we have \(F_t\) being the sample space of sequences where the trial in the \(t\)th position \(x_t\) is a hot hand evaluation trial. We have that \( \tau = t \) is already given so we know that \(X_t\) is a hot hand evaluation, so going from \(F\) to \(F_t\) doesn’t change anything there.

Then, we write:

$$ \mathbb{P} (X_t = 1 | F_t) = p \text{ and } \mathbb{P} (X_t = 0 | F_t) = 1 - p $$

In the above case, the logic seems to be that with only \(F_t\) being given, and \(F_t\) meaning that all \(x_t\)’s are unconditional hot hand evaluations, it simply means that these \(X_t\)’s have a probability \(p\) of being a success.

In the above case of

$$ \mathbb{P}(X_t=1 | \tau = t, F) = \mathbb{P}(X_t=1 | \tau = t, F_t) $$

we can also expand \(p\) by the law of total probability over which trial is drawn, writing \(s\) for the summation index so that it is not confused with the fixed \(t\):

$$ p = \mathbb{P}(X_t = 1 | F_t ) = \sum_{s = k+1}^{n} \left[ \mathbb{P}(X_t=1 | \tau = s, F_t) \cdot \mathbb{P}(\tau = s | F_t) \right] $$

$$ = \sum_{s = k+1}^{n} \left[ \left[ \sum_{\boldsymbol{x} \in F_t} \mathbb{P}(X_t=1 | \tau = s, \boldsymbol{X} = \boldsymbol{x}, F_t) \cdot \mathbb{P}( \boldsymbol{X} = \boldsymbol{x} | \tau = s, F_t ) \right] \cdot \mathbb{P}(\tau = s | F_t) \right] $$

My attempt at the intuition that \( \mathbb{P}(X_t=1 | \tau = t, F_t) \lt p \) (for \(t \lt n\)) is the same as what I said above. Looking at

$$ \mathbb{P}(X_t=1 | \tau = t, F_t) = \sum_{\boldsymbol{x} \in F_t} \mathbb{P}(X_t=1 | \tau = t, \boldsymbol{X} = \boldsymbol{x}, F_t) \cdot \mathbb{P}( \boldsymbol{X} = \boldsymbol{x} | \tau = t, F_t ) $$

for simplicity, let’s assume that with \(p = 1/2\), all sequences in the sample space are equally likely, i.e. a sequence is drawn uniformly. Think of the previous lists of sequences we had, where the frequency of successes or H’s from the top part of the list going down is relatively sparse and gets very frequent at the bottom. So while we draw uniformly from the list of sequences, we are more likely to draw a sequence with less successes/H’s overall than if we could consider trials from the entire sample space. Thus, the probability of drawing a success/H given some sequence ends up being \( \lt p \) on average: the H’s are “bunched up” at the bottom of the list of sequences.

Using Bayes’ Theorem, we write:

$$ \frac{ \mathbb{P} (X_t = 1 | \tau = t, F_t) }{ \mathbb{P} (X_t = 0 | \tau = t, F_t) } = \frac{ \mathbb{P} ( \tau = t | X_t = 1, F_t) \cdot \mathbb{P}(X_t = 1 | F_t) }{\mathbb{P}( \tau = t | F_t)} \cdot \frac{\mathbb{P}( \tau = t | F_t)}{ \mathbb{P} ( \tau = t | X_t = 0, F_t) \cdot \mathbb{P}(X_t = 0 | F_t) } $$

$$ = \frac{ \mathbb{P} ( \tau = t | X_t = 1, F_t) \cdot \mathbb{P}(X_t = 1 | F_t) }{ \mathbb{P} ( \tau = t | X_t = 0, F_t) \cdot \mathbb{P}(X_t = 0 | F_t) } $$

$$ = \frac{ \mathbb{P} ( \tau = t | X_t = 1, F_t) \cdot p }{ \mathbb{P} ( \tau = t | X_t = 0, F_t) \cdot (1 - p) } $$

Let’s write the denominator of the left-hand side in terms of the numerator of the left-hand side and the probability terms of the right-hand side as some unknown, say \(Y\):

$$ \frac{ \mathbb{P} (X_t = 1 | \tau = t, F_t) }{ 1 - \mathbb{P} (X_t = 1 | \tau = t, F_t) } = Y \cdot \frac{p}{1-p} $$

$$ \mathbb{P} (X_t = 1 | \tau = t, F_t) = Y \cdot \frac{p}{1-p} \cdot \left({ 1 - \mathbb{P} (X_t = 1 | \tau = t, F_t) } \right) $$

$$ = Y \cdot \frac{p}{1-p} - Y \cdot \frac{p}{1-p} \cdot \mathbb{P} (X_t = 1 | \tau = t, F_t) $$

$$ \mathbb{P} (X_t = 1 | \tau = t, F_t) + Y \cdot \frac{p}{1-p} \cdot \mathbb{P} (X_t = 1 | \tau = t, F_t) = Y \cdot \frac{p}{1-p} $$

$$ \mathbb{P} (X_t = 1 | \tau = t, F_t) \cdot \left( 1 + Y \cdot \frac{p}{1-p} \right) = Y \cdot \frac{p}{1-p} $$

$$ \mathbb{P} (X_t = 1 | \tau = t, F_t) = \frac{Y \cdot \frac{p}{1-p} } {\left( 1 + Y \cdot \frac{p}{1-p} \right)} = \frac{Y \cdot \frac{p}{1-p} } {\left( \frac{1-p}{1-p} + \frac{Y \cdot p}{1-p} \right)} $$

$$ = \frac{Y \cdot p } { ({1-p}) + Y \cdot p } = \text{ RHS (right-hand side) } $$

If \(Y=1\), then \( \mathbb{P} (X_t = 1 | \tau = t, F_t) = p \).

The derivative of the right-hand side with respect to Y is:

$$ \frac{d}{dY} \left( \frac{Y \cdot p } { ({1-p}) + Y \cdot p } \right) $$

$$ = p \cdot \left( ({1-p}) + Y \cdot p \right)^{-1} - Y \cdot p \cdot \left( ({1-p}) + Y \cdot p \right)^{-2} \cdot p $$

$$ = \frac {p \cdot \left( ({1-p}) + Y \cdot p \right) } {\left( ({1-p}) + Y \cdot p \right)^{2}} - \frac {Y \cdot p^2 } {\left( ({1-p}) + Y \cdot p \right)^{2} } = \frac { p \cdot (1 - p) } {\left( ({1-p}) + Y \cdot p \right)^{2} } $$

The derivative of the right-hand side with respect to Y is always positive for any \(Y\). So as we decrease \(Y\) from 1 so that \(Y \lt 1\), then the right-hand side decreases from \(p\) and we would have

$$ \mathbb{P} (X_t = 1 | \tau = t, F_t) = \frac{Y \cdot p } { ({1-p}) + Y \cdot p } \lt p $$

So to show that \( \mathbb{P} (X_t = 1 | \tau = t, F_t) \lt p \), we show that

$$ Y = \frac{ \mathbb{P} ( \tau = t | X_t = 1, F_t) }{ \mathbb{P} ( \tau = t | X_t = 0, F_t) } \lt 1 $$

or

$$ \mathbb{P} ( \tau = t | X_t = 1, F_t) \lt \mathbb{P} ( \tau = t | X_t = 0, F_t) $$
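To get a feel for how the ratio \(Y\) moves the conditional make probability, here is a tiny sketch of the right-hand side derived above. The function name `make_prob_from_ratio` is made up for illustration.

```python
from fractions import Fraction

def make_prob_from_ratio(y, p):
    """Return Y*p / ((1 - p) + Y*p), the right-hand side solved for above."""
    y, p = Fraction(y), Fraction(p)
    return y * p / ((1 - p) + y * p)

p = Fraction(1, 2)
for y in (Fraction(1, 4), Fraction(1, 2), Fraction(1), Fraction(2)):
    print(y, make_prob_from_ratio(y, p))
```

With \(p = 1/2\), a ratio of \(Y = 1\) returns \(p\) itself, and any \(Y \lt 1\) returns something smaller. For instance \(Y = 1/2\), which is what direct enumeration gives for \(n = 3\), \(k = 1\), \(t = 2\), returns 1/3.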

We write:

$$ \mathbb{P} ( \tau = t | X_t = 0, F_t) = \sum_{\boldsymbol{x} \in F_t: x_t = 0} \mathbb{P} ( \tau = t | X_t = 0, \boldsymbol{X} = \boldsymbol{x}, F_t) \cdot \mathbb{P} ( \boldsymbol{X} = \boldsymbol{x}|X_t = 0, F_t) $$

$$ = \sum_{\boldsymbol{x} \in F_t: x_t = 0} \mathbb{P} ( \tau = t | X_t = 0, \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}, F_t) \cdot \mathbb{P} ( \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}|X_t = 0, F_t) $$

where given \(\boldsymbol{x}\), we define \( \boldsymbol{x_{-t}} := (x_1,…,x_{t-1},x_{t+1},…,x_n) \). Since we are already given that \( X_t = 0 \), to say here that we are given \( \boldsymbol{X} = \boldsymbol{x} \) is more like saying that we are given \( X_t = 0 \) and \( \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}} \).

We also write:

$$ \mathbb{P} ( \tau = t | X_t = 1, F_t) = \sum_{\boldsymbol{x} \in F_t: x_t = 1} \mathbb{P} ( \tau = t | X_t = 1, \boldsymbol{X} = \boldsymbol{x}, F_t) \cdot \mathbb{P} ( \boldsymbol{X} = \boldsymbol{x}|X_t = 1, F_t) $$

$$ = \sum_{\boldsymbol{x} \in F_t: x_t = 1} \mathbb{P} ( \tau = t | X_t = 1, \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}, F_t) \cdot \mathbb{P} ( \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}|X_t = 1, F_t) $$

Then we compare:

$$ \mathbb{P} ( \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}|X_t = 0, F_t) \text{ and } \mathbb{P} ( \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}|X_t = 1, F_t) $$

and see that they are equal, since the trials are i.i.d. Bernoulli trials and so the distribution of \( \boldsymbol{X_{-t}} \) does not depend on the value of \(X_t\).

Then we compare:

$$ \mathbb{P} ( \tau = t | X_t = 0, \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}, F_t ) \text{ and } \mathbb{P} ( \tau = t | X_t = 1, \boldsymbol{X_{-t}}= \boldsymbol{x_{-t}}, F_t ) $$

The former is the probability of picking a particular hot hand evaluation trial, the \(t\)th trial, given that \(X_t = 0\). The latter is the probability of picking that same trial given that \(X_t = 1\). Note that in the latter, because \(X_t = 1\), the \((t+1)\)th trial is also a hot hand evaluation trial, whereas in the former, because \(X_t = 0\), the \((t+1)\)th trial is not a hot hand evaluation trial. (Thus here, we are assuming that \(t \lt n\).) Because of this, although the rest of the trials \( \boldsymbol{X_{-t}}= \boldsymbol{x_{-t}} \) are identical in both cases, the latter has at least one more hot hand evaluation trial than the former, i.e.

$$ |I_k(\boldsymbol{x}) | \text{ where } X_t = 0 \lt |I_k(\boldsymbol{x}) | \text{ where } X_t = 1 $$

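Since each hot hand evaluation trial is equally likely to be the one picked, having more of them lowers the probability of picking any particular one, which gives us

$$ \mathbb{P} ( \tau = t | X_t = 0, \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}, F_t ) \gt \mathbb{P} ( \tau = t | X_t = 1, \boldsymbol{X_{-t}}= \boldsymbol{x_{-t}}, F_t ) $$

and, since the two conditional distributions of \( \boldsymbol{X_{-t}} \) compared above are equal, summing over \( \boldsymbol{x_{-t}} \) gives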

$$ \sum_{\boldsymbol{x} \in F_t: x_t = 0} \mathbb{P} ( \tau = t | X_t = 0, \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}, F_t) \cdot \mathbb{P} ( \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}|X_t = 0, F_t) $$

$$ \gt \sum_{\boldsymbol{x} \in F_t: x_t = 1} \mathbb{P} ( \tau = t | X_t = 1, \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}, F_t) \cdot \mathbb{P} ( \boldsymbol{X_{-t}} = \boldsymbol{x_{-t}}|X_t = 1, F_t) $$

This shows us that:

$$ \mathbb{P} (X_t = 1 | \tau = t, F) \lt p \text{ when } t \lt n $$

For \(t = n\), since the value of \(X_n\) doesn’t affect the number of hot hand evaluation trials, we have

$$ \mathbb{P} ( \tau = n | X_n = 0, \boldsymbol{X_{-n}} = \boldsymbol{x_{-n}}, F_n ) = \mathbb{P} ( \tau = n | X_n = 1, \boldsymbol{X_{-n}}= \boldsymbol{x_{-n}}, F_n ) $$

and thus we have

$$ \mathbb{P} (X_{t=n} = 1 | \tau = n, F) = p \text{ when } t = n $$

So we have

$$ \mathbb{P} (X_t = 1 | \tau = t, F) \lt p \text{ when } t \in \{ k+1,\dots,n-1\} $$

and

$$ \mathbb{P} (X_{t=n} = 1 | \tau = n, F) = p \text{ when } t = n $$

So

$$ \mathbb{P}(X_t = 1 | F ) $$

$$ = \sum_{t=k+1}^n \left[ \mathbb{P}(X_t = 1 | \tau = t, F ) \cdot \mathbb{P}(\tau = t | F ) \right] $$

$$ = \sum_{t=k+1}^{n-1} \left[ \mathbb{P}(X_t = 1 | \tau = t, F ) \cdot \mathbb{P}(\tau = t | F ) \right] + \left[ \mathbb{P}(X_n = 1 | \tau = n, F ) \cdot \mathbb{P}(\tau = n | F ) \right]$$

and since the probabilities \( \mathbb{P}(\tau = t | F ) \) sum to \(1\) over the \(t\)’s, and \( \mathbb{P}(X_t = 1 | \tau = t, F ) \lt p \) for every \(t \lt n\), we have (as long as \( \mathbb{P}(\tau = t | F ) \gt 0 \) for at least one \(t \lt n\))

$$ \mathbb{P}(X_t = 1 | F ) \lt \sum_{t=k+1}^{n-1} \left[ p \cdot \mathbb{P}(\tau = t | F ) \right] + \left[ p \cdot \mathbb{P}(\tau = n | F ) \right] = p $$
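As a sanity check on the inequality above, here is a minimal brute-force sketch. It assumes a fair coin (\(p = 1/2\), so every sequence is equally likely) and assumes that the trial \(\tau\) is picked uniformly among the hot hand evaluation trials of a sequence; the function name `average_streak_proportion` is made up for this illustration. Under those assumptions, \( \mathbb{P}(X_\tau = 1 | F) \) is just the average, over sequences that have at least one evaluation trial, of the proportion of makes among the evaluation trials.

```python
from fractions import Fraction
from itertools import product


def average_streak_proportion(n, k=1):
    """Average proportion of makes on trials that follow k straight makes.

    Assumes p = 1/2, so all 2**n sequences are equally likely. Sequences
    with no hot hand evaluation trial are skipped (conditioning on F).
    """
    proportions = []
    for seq in product((0, 1), repeat=n):
        # 0-based indices t whose previous k shots were all makes
        eval_trials = [t for t in range(k, n) if all(seq[t - k:t])]
        if not eval_trials:
            continue
        makes = sum(seq[t] for t in eval_trials)
        proportions.append(Fraction(makes, len(eval_trials)))
    return sum(proportions) / len(proportions)


print(average_streak_proportion(3, k=1))  # 5/12, i.e. the 2.5/6 from the video
```

For \(n = 3\) and \(k = 1\) this reproduces the 2.5/6 from the Numberphile illustration, which is below \(p = 0.5\) as the derivation says it should be.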

## Malthus and Ricardo, Wages and Rent

https://en.wikipedia.org/wiki/Iron_law_of_wages

https://en.wikipedia.org/wiki/Law_of_rent

https://en.wikipedia.org/wiki/An_Essay_on_the_Principle_of_Population

http://blogs.worldbank.org/health/female-education-and-childbearing-closer-look-data

Ferdinand Lassalle’s Iron Law of Wages, following from Malthus, and David Ricardo’s Law of Rent are among the earliest relatively quantitative statements or observations in economics and can IMHO be considered a sort of ancestor of modern economics.

In the Iron Law of Wages, as population increases, the labor supply increases and thus the wage price decreases – which does mean that we assume labor demand is unaffected by population and thus labor demand is effectively exogenous. Wages continue to decrease until they hit subsistence levels for laborers. A further decrease in wages is unsustainable as laborers will literally be unable to sustain themselves, which may cause a decrease in population. A decrease in population, i.e. a decrease in the labor supply, pushes wages back up to the long-term level, which is the minimum subsistence level. If the wage price is above subsistence level, population will rise (the assumption is that any wage above the subsistence level contributes to population growth) until the wage decreases to the subsistence level.

Malthus’s Iron Law of Population is the observation that, given enough food, population grows exponentially (geometrically) while agricultural output grows linearly (arithmetically). Agricultural output is limited by 1. the amount of new land that can be put to agricultural use and 2. the amount of additional intensification that can increase the output of existing agricultural land, which Malthus understandably assumes to have diminishing returns. For the former limit, his evidence is the population growth in the early United States, where new land was plentiful (despite the existence of Native peoples on those lands), while his evidence for the latter limit, the diminishing returns of agricultural intensification, is an appeal to the common sense of the times. That may be understandable: we can suppose that it would be hard for someone in the early 1800s to think that agricultural output could grow to accommodate an exponentially growing population or that, in the future, longer years of education would lead to declining fertility rates. Since linear growth has no hope of staying above exponential growth in the long run, Malthus’s conclusion is that once population hits the level where the masses can only afford a subsistence level of living, that will be the long-run equilibrium for wages and quality of life. There may be ameliorating factors such as improvements in agricultural technology, delay in bearing children, and contraception, or direct decreases in population such as war and disease, but Malthus’s opinion was that none of that can overturn the Iron Law of Population. In any case, once people are living at subsistence levels, whether it is war, disease, or famine that keeps population from going above this long-run equilibrium doesn’t change the fact that the factors doing so are painful to humanity.

## The Terms of Trade of Brazil

Source: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

An article in the New York Times by Paul Krugman talked about a current economic downturn in Brazil. What happened:

First, the global environment deteriorated sharply, with plunging prices for the commodity exports still crucial to the Brazilian economy. Second, domestic private spending also plunged, maybe because of an excessive buildup of debt. Third, policy, instead of fighting the slump, exacerbated it, with fiscal austerity and monetary tightening even as the economy was headed down.

What didn’t happen:

Maybe the first thing to say about Brazil’s crisis is what it wasn’t. Over the past few decades those who follow international macroeconomics have grown more or less accustomed to “sudden stop” crises in which investors abruptly turn on a country they’ve loved not wisely but too well. That was the story of the Mexican crisis of 1994-5, the Asian crises of 1997-9, and, in important ways, the crisis of southern Europe after 2009. It’s also what we seem to be seeing in Turkey and Argentina now.

We know how this story goes: the afflicted country sees its currency depreciate (or, in the case of the euro countries, its interest rates soar). Ordinarily currency depreciation boosts an economy, by making its products more competitive on world markets. But sudden-stop countries have large debts in foreign currency, so the currency depreciation savages balance sheets, causing a severe drop in domestic demand. And policymakers have few good options: raising interest rates to prop up the currency would just hit demand from another direction.

But while you might have assumed that Brazil was a similar case — its 9 percent decline in real G.D.P. per capita is comparable to that of sudden-stop crises of the past — it turns out that it isn’t. Brazil does not, it turns out, have a lot of debt in foreign currency, and currency effects on balance sheets don’t seem to be an important part of the story. What happened instead?

Slowly going over the three points that Krugman made in the beginning:

1. Commodity prices went down and Brazil exports a lot of commodities.

Brazil’s exports in 2016:

From: https://atlas.media.mit.edu/en/visualize/tree_map/hs92/export/bra/all/show/2016/

At a glance, we have among commodities: vegetable products, mineral products (5% crude petroleum, 10% iron and copper ore), foodstuffs, animal products, metals, and precious metals. Though picking out these may be over or underestimating the true percentage of commodity exports among all of Brazil’s exports, let’s use these for our approximation. The total percentage of these products is about 60%, where around 36% are agricultural commodities, around 27% are metal commodities (metals + iron and copper ore), around 5% is crude petroleum, and around 2% are precious metals. These categorizations that I did are improvisational and not following any definitions – they are simplifications.

Looking at the S&P GSCI Agricultural & LiveStock Index Spot (SPGSAL):

From https://www.marketwatch.com/investing/index/spgsal/charts?countrycode=xx

we definitely do see a downtrend in the last several years in agricultural commodities.

Looking at the S&P GSCI Industrial Metals Index Spot (GYX):

From: https://markets.ft.com/data/indices/tearsheet/charts?s=GYX:IOM

there was a decline from 2011 but a rise from 2016.

Looking at the S&P GSCI Precious Metals Index Spot (SPGSPM):

From: https://markets.ft.com/data/indices/tearsheet/charts?s=SPGSPM:IOM

it’s been flat since around 2013.

Looking at S&P GSCI Crude Oil Index Spot (G39):

From: https://markets.ft.com/data/indices/tearsheet/charts?s=G39:IOM

it has been low after a decline in 2014 with volatility in 2017-2018.

But instead of eyeballing this phenomenon with a bunch of different charts, there’s a way to summarize it mathematically in one chart, called the **terms of trade**.

Investopedia’s definition of terms of trade:

What are ‘Terms of Trade – TOT’?

Terms of trade represent the ratio between a country’s export prices and its import prices. The ratio is calculated by dividing the price of the exports by the price of the imports and multiplying the result by 100. When a country’s TOT is less than 100%, more capital is leaving the country than is entering the country. When the TOT is greater than 100%, the country is accumulating more capital from exports than it is spending on imports.

But how exactly do you calculate the “price of exports and imports” of a country like, say, Brazil, which has USD 190B of exports a year and surely thousands if not more different products, and what do we do about the changing quantities of each of those products every year? How do we understand the terms of trade in a way that doesn’t vaguely seem like the current account balance (whose trade component is the total value of exports minus imports, or net value of exports: \( EX - IM = \sum_{i}{p_i \cdot q_i} - \sum_{i}{p'_i \cdot q'_i} \), where \( p_i\), \( q_i \) are the price and quantity of export product \(i\) and \( p'_i\), \( q'_i \) are the price and quantity of import product \(i\))?

The answer is by deciding on a base year to compare the year in question. For example, for the prices of products in the year in question, we sum the values of exports for each product in that year, i.e. \( \sum_{i} {p_{i,n} \cdot q_{i,n}} \) where \(i\) is the index for each different product and \(n\) is the year in question. For the prices of products in the base year \(0\), we take the price of each product \(i\) in that base year multiplied by the quantity of that product \(i\) in the year in question \(n\). In other words, we fix the quantity of each product \(q_i\) to the quantity of each product in the year in question \(q_{i,n}\) so that we are strictly comparing prices between year \(n\) and \(0\) and not letting changes in quantity \(q\) get in the way. This is the Paasche index.

Another way we can do this is: for the prices of products in the year in question \(n\), we sum the prices of each product in that year \( p_{i,n} \) multiplied by the quantity of each product from the base year \( q_{i,0} \), and for the prices in the base year \(0\), we take the price of each product \(i\) in that base year multiplied by the quantity of that product \(i\) also in the base year \(0\). So this time, instead of fixing the quantity of each product to the year in question \(n\), we fix the quantity of each product to the base year \(0\). This is the Laspeyres index.

Paasche index:

$$ P_{\textrm{Paasche}} = \frac{\sum_{i}{p_{i,n} \cdot q_{i,n}}}{\sum_{i}{p_{i,0} \cdot q_{i,n}}} $$

Laspeyres index:

$$ P_{\textrm{Laspeyres}} = \frac{\sum_{i}{p_{i,n} \cdot q_{i,0}}}{\sum_{i}{p_{i,0} \cdot q_{i,0}}} $$

From: https://en.wikipedia.org/wiki/Price_index?oldformat=true#Paasche_and_Laspeyres_price_indices

Thus, by using such a price index calculation we “cancel out” the effect of changing export or import quantities so that we are only looking at the change in the price of exports or imports between two time periods. With a base year \(0\), we can calculate the price index for exports in year \(n\), the price index for imports in year \(n\), and then divide the former by the latter (and multiply by 100) to obtain the terms of trade for year \(n\).
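To make the calculation concrete, here is a small sketch of the two price indices and the resulting terms of trade. The function names and the two-product export and import baskets are invented purely for illustration (they are not Brazil’s actual trade data); prices and quantities are keyed by product name.

```python
def laspeyres_index(p0, pn, q0):
    """Price index with quantities fixed at the base year 0."""
    return sum(pn[i] * q0[i] for i in q0) / sum(p0[i] * q0[i] for i in q0)


def paasche_index(p0, pn, qn):
    """Price index with quantities fixed at the year in question n."""
    return sum(pn[i] * qn[i] for i in qn) / sum(p0[i] * qn[i] for i in qn)


def terms_of_trade(export_index, import_index):
    """Export price index over import price index, times 100."""
    return 100 * export_index / import_index


# Hypothetical baskets: every number below is made up for the example.
export_p0 = {"soybeans": 10.0, "iron ore": 20.0}   # base-year prices
export_pn = {"soybeans": 8.0, "iron ore": 15.0}    # year-n prices
export_q0 = {"soybeans": 100.0, "iron ore": 50.0}  # base-year quantities
export_qn = {"soybeans": 120.0, "iron ore": 40.0}  # year-n quantities

import_p0 = {"machinery": 50.0, "fuel": 5.0}
import_pn = {"machinery": 55.0, "fuel": 5.0}
import_q0 = {"machinery": 10.0, "fuel": 100.0}

export_index = laspeyres_index(export_p0, export_pn, export_q0)  # 0.775
import_index = laspeyres_index(import_p0, import_pn, import_q0)  # 1.05
print(terms_of_trade(export_index, import_index))  # about 73.8
```

In this made-up example export prices fell and import prices rose a little relative to the base year, so the terms of trade drop below 100 even though the export quantities in `export_qn` changed; the quantities never enter the Laspeyres calculation beyond the base year.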

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

A terms of trade chart quantitatively summarizes all the above eyeballing we did with the visualization of Brazil’s exports and the charts of commodities indices as well as the eyeballing we didn’t do with Brazil’s imports. And we see what we expect in the above graph, which is a drop in Brazil’s terms of trade in the last several years.

2. Brazil’s consumer spending declined due to rising household debt (the red graph):

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

3. Brazil implemented fiscal austerity to try to deal with “long-term solvency problems” and raised interest rates to try to deal with inflation, which was caused by depreciation in the currency. The currency depreciated due to lower commodity prices, which of course is also reflected in the terms of trade graph above.

Depreciating currency (blue) and inflation (change in or first derivative of red):

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

Interest rates raised to combat inflation:

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

We can see that interest rates rise in late 2015 as a response to rising inflation. Inflation drops as a response in the next couple of years, but this rise in interest rates contributed to the slowdown in Brazil’s economy.

From: https://fred.stlouisfed.org/series/BRANGDPRPCH

So we have a drop in the terms of trade (due to a drop in commodity prices), a drop in consumer spending (due to a rise in household debt in preceding years), and then fiscal austerity and monetary contraction as government policy responses, causing a recession in Brazil.

## test

Centered regular text.

Centering equation by centering the latex text:

Above does not work with tables. Multiple lines with \\ look awkward. Only first line is centered.

Aligning equations using \begin{array}. To center, begin the latex with <p align="center"></p>. Will need to check by going between Visual and Text modes. Note that there probably will be strange behavior by the location of </p> in the Text mode when going between the two modes and will probably have to correct it multiple times. Note that the lines need to be tight in Text mode (no extra line of space between the latex code), and this needs to be edited and checked in Text mode, not Visual mode:

Centered table:

rcl: three columns, the first column right-justified, the middle one centered, and the third column left-justified

## Test Coin2

Suppose there are two coins and the probability that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \), \(p \neq q \), and the values are given and known. If you are free to flip one of the coins any number of times, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?

The number of heads after \(n\) flips of a coin follows a binomial distribution, with mean \(pn\) for the \(p\) coin and \(qn\) for the \(q\) coin.

**Setting Up Our Hypothesis Test**

Let’s say we want to test if our coin is the \(p\) coin and let’s say we arbitrarily decide to call the smaller probability \(p\), i.e. \(p < q\). We know that coin flips give us a binomial distribution, and we know the standard deviation of a binomial random variable \(X_p\) (let \(X_p\) or \(X_{p,n}\) be a binomial random variable for the number of flips that are heads, where the probability of a head on a flip is \(p\) and we do \(n\) number of flips), which is:

$$ \textrm{Standard Deviation of }{X_p} = \sqrt{ Var\left( {X_p} \right) } = \sqrt{ np(1-p) } $$

—–

Digression: we can also split our \(n\) Bernoulli trial coin flips that make up our binomial random variable \(X_p\) into \(m\) binomial random variables \(X_{p,k}\), each with \(k\) trials, such that \(k \times m = n\). Then the standard error of the mean number of heads across the \(m\) binomial random variables (each with \(k\) trials) is:

$$ \textrm{Standard error of the mean} = \sqrt{ Var\left( \overline{X_{p,k}} \right) } = \sqrt{ Var \left( {1 \over m} \sum_{i=1}^{m} {X_{p,k}} \right) } $$

$$= \sqrt{ Var(\sum_{i=1}^{m} X_{p,k}) \over m^2 } = \sqrt{ m \cdot Var(X_{p,k}) \over m^2 }= \sqrt{ {m \cdot kp(1-p) \over m^2 } } = \sqrt{ { kp(1-p) \over m} } $$

This standard error above is for the random variable \(X_{p,k}\), each of which has \(k\) Bernoulli trials. In other words, the standard deviation of \( {1 \over m} \sum_{i=1}^{m} X_{p,k} \) is \( \sqrt{ kp(1-p) \over m }\). But if you simply change \(k\) to \(km = n\) and reduce \(m\) to \(1\), you get the same result as if you took all \(km = n\) trials as the number of trials for one binomial random variable, our original \(X_p\): where we now say that the standard deviation of \( {1 \over 1} \sum_{i=1}^{1} X_{p,n} = X_{p,n} = X_p \) is \( \sqrt{ np(1-p) \over 1 } = \sqrt{ np(1-p) } \).

By going from one \(X_{p,k}\) to one \(X_{p,n}\), the mean is multiplied by \(m\) and the standard deviation is multiplied by \(\sqrt{m}\). The mean of \(X_{p,k}\) is \(kp\) and the mean of \(X_{p,n}\) is \(mkp = np\); the standard deviation of \(X_{p,k}\) is \( \sqrt{ kp(1-p) } \) and the standard deviation of \(X_{p,n}\) is \( \sqrt{ mkp(1-p) } =\sqrt{ np(1-p) } \). The standard error of the mean of \(m\) repetitions of \(X_{p,k}\) is \( \sqrt{ { kp(1-p) \over m} } \), while the mean of \(m\) repetitions of \(X_{p,k}\) is of course just \( {1 \over m} \sum_{i=1}^{m} \mathbb{E} \left[ X_{p,k} \right] = {1 \over m} m (kp) = kp \). The sum of the \(m\) repetitions, which is \(X_{p,n}\), is \(m\) times their mean. So going from the mean of the \(m\) repetitions to their sum multiplies the expected value by \(m\), from \(kp\) to \(mkp = np\), and multiplies the standard error of the mean by \(m\) to give the standard deviation of \( X_{p,n} \): \( m \cdot \sqrt{ { kp(1-p) \over m} } = \sqrt{ { m^2 \cdot kp(1-p) \over m} } = \sqrt{ { mkp(1-p)} } = \sqrt{ { np(1-p)} } \).

—–

Knowing the standard deviation of our random variable \(X_p\), a 0.05 significance level for a result that “rejects” the null would mean some cutoff value \(c\) where \(c > pn\). If \(x_p\) (the sample number of heads from \(n\) coin tosses) is “too far away” from \(pn\), i.e. we have \(x_p > c\), then we reject the null hypothesis that we have been flipping the \(p\) coin.

But note that if we choose a \(c\) that far exceeds \(qn\) as well, we are in a weird situation. If \(x_p > c\), then \(x_p \) is “too large” for \(pn\) but is also much larger than \(qn\) (i.e. \( x_p > qn > pn \) ). This puts us in an awkward situation because while \(x_p \) is much larger than \(pn\), making us want to reject the hypothesis that we were flipping the \(p\) coin, it is also much larger than \(qn\), so perhaps we obtained a result that was pretty extreme “no matter which coin we had.” If we assume the null hypothesis that we have the \(p\) coin, our result \(x_p \) is very unlikely, but it is also quite unlikely even if we had the \(q\) coin, our alternative hypothesis. But still, it is more unlikely that it is the \(p\) coin than it is the \(q\) coin, so perhaps it’s not that awkward. But what if \(x_p\) does not exceed \(c\)? Then we can’t reject the null hypothesis that we have the \(p\) coin. But our sample result of \(x_p\) might in fact be closer to \(qn\) than \(pn\) – \(x_p\) might even be right on the dot of \(qn\) – and yet we aren’t allowing ourselves to use that to form a better conclusion, which is a truly awkward situation.

If \(c\) is, instead, somewhere in between \(pn\) and \(qn\), and \(x_p > c\), we may reject the null hypothesis that our coin is the \(p\) coin, while \(x_p\) is in a region close to \(qn\), i.e. a region that is a more likely result if we actually had been flipping the \(q\) coin, bringing us closer to the conclusion that this is the \(q\) coin. However, if we reverse the experiment – if we use the same critical value \(c\) and say that if \(x_p < c\) then we reject our null hypothesis that \(q\) is our coin – then the power and significance of the test for each coin are different, which is also awkward.

Above, the pink region is the probability that \(X_p\) ends in the critical region, where \(x_p > c\), assuming the null hypothesis that we have the \(p\) coin. This is also the Type I Error rate (a.k.a. false positive) – the probability that we end up falsely rejecting the null hypothesis, assuming that the null hypothesis is true.

Above, the green region is the power \(1-\beta\), the probability that we get a result in the critical region \(x_p > c\) assuming that the alternative hypothesis is true, that we have the \(q\) coin. The blue-gray region is \(\beta\), or the Type II Error rate (a.k.a. false negative) – the probability that we fail to reject the null hypothesis (that we have the \(p\) coin) when what’s actually true is the alternative hypothesis (that we have the \(q\) coin).

Now let us “reverse” the experiment with the same critical value – we want to test our null hypothesis that we have the \(q\) coin:

We have \(x_p < c\). We fail to reject the null hypothesis that we have the \(p\) coin, and on the flip side we would reject the null hypothesis that we have the \(q\) coin. But we have failed a tougher test (the first one, with a small \(\alpha_p\)) and succeeded in rejecting an easier test (the second one, with a larger \(\alpha_q\)). In hypothesis testing, we would like to be conservative, so it is awkward to have failed a tougher test but "be ok with it" since we succeeded with an easier test. Common sense also, obviously, says that something is strange when \(x_p\) is closer to \(qn\) than \(pn\) and yet we make the conclusion that since \(x_p\) is on the "\(p\)-side of \(c\)," we have the \(p\) coin. In reality, we wouldn't take one result and apply two hypothesis tests to that one result. But we would like the one test procedure to make sense with whichever null hypothesis we start with, \(p\) coin or \(q\) coin (since it is arbitrary which null hypothesis we choose in the beginning, for we have no knowledge of which coin we have before we start the experiment).

What we can do, then, is to select \(c\) so that the probability that \(X_p > c\), under the hypothesis that we have the \(p\) coin, is equal to the probability that \(X_q < c\), under the hypothesis that we have the \(q\) coin. In our setup, we have two binomial distributions, which are discrete, as opposed to the normal distributions above. In addition, binomial distributions, unless the mean is at \(n/2\), are generally not symmetric, as can be seen in the very first figure, copied below as well, where the blue distribution is symmetric but the green one is not.

We can pretend that the blue distribution is the binomial distribution for the \(p\) coin and the green distribution for the \(q\) coin. The pmf of a binomial random variable, say \(X_p\) (that generates Heads or Tails with probability of Heads \(p\)) is:

$$ {n \choose h} p^h (1-p)^{n-h} $$

where \(n\) is the total number of flips and \(h\) is the number of Heads among those flips. We let \(c\) be the critical number of Heads that would cause us to reject the null hypothesis that the coin we have is the \(p\) coin in favor of the alternative hypothesis that we have the \(q\) coin. The area of the critical region, i.e. the probability that we get \(c\) heads or more assuming the hypothesis that we have the \(p\) coin, is:

$$ Pr(X_p \geq c) = \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] $$

And the reverse, the probability that we get \(c\) heads or less assuming the hypothesis that we have the \(q\) coin, is:

$$ Pr(X_q \leq c) = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$

So we want to set these two equal to each other and solve for \(c\):

$$ \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$

But since the binomial distribution is discrete, there may not be a \(c\) that actually works. For large \(n\), a normal distribution can approximate the binomial distribution. In that case, we can draw the figure below, which is two normal distributions, centered on \(pn\) and \(qn\) respectively (the means of the true binomial distributions), and since normal distributions are symmetric, the point at which the distributions cross will be (at least approximately) our critical value. The critical regions for \(X_p\) (to the right of \(c\)) and for \(X_q\) (to the left of \(c\)) will have the same area.
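Since an exact \(c\) may not exist, one thing we can do is just compute the two tail sums for every integer \(c\) and look for the one where they are closest. Below is a minimal sketch of that idea using only Python’s standard library; the helper names (`upper_tail`, `lower_tail`, `closest_cutoff`) are my own and are not taken from the linked discussion.

```python
from math import comb


def binom_pmf(n, h, p):
    """Probability of exactly h Heads in n flips with P(Head) = p."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


def upper_tail(n, c, p):
    """Probability of c Heads or more for the p coin."""
    return sum(binom_pmf(n, i, p) for i in range(c, n + 1))


def lower_tail(n, c, q):
    """Probability of c Heads or less for the q coin."""
    return sum(binom_pmf(n, i, q) for i in range(0, c + 1))


def closest_cutoff(n, p, q):
    """Integer c in 0..n where the two tail probabilities are closest."""
    return min(
        range(n + 1),
        key=lambda c: abs(upper_tail(n, c, p) - lower_tail(n, c, q)),
    )


n, p, q = 20, 0.3, 0.7
c = closest_cutoff(n, p, q)
print(c, upper_tail(n, c, p), lower_tail(n, c, q))
```

With \(p = 0.3\), \(q = 0.7\) and \(n = 20\) the two distributions are mirror images of each other, so the tails match at \(c = 10\); for other pairs of \(p\) and \(q\) the two tail probabilities at the closest \(c\) will generally differ a bit, which is the discreteness problem mentioned above.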

If we pretend that these normal distributions are binomial distributions, i.e. if we pretend that our binomial distributions are symmetric (i.e. we pretend that \(n\) is going to be large enough that both our binomial distributions of \(X_p\) and \(X_q\) are symmetric enough), then to find \(c\) we can find the value on the horizontal axis at which, i.e. the number of Heads at which, the two binomial probability distributions are equal to each other.

$$ {n \choose c} p^c (1-p)^{n-c} = {n \choose c} q^c (1-q)^{n-c} $$

$$ p^c (1-p)^{n-c} = q^c (1-q)^{n-c} $$

$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n-c} = 1 $$

$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n} \left({1-q \over 1-p}\right)^c = 1 $$

$$ \left({p(1-q) \over q(1-p)}\right)^c = \left({1-q \over 1-p}\right)^{n} $$

$$ c \cdot log \left({p(1-q) \over q(1-p)}\right) = n \cdot log \left({1-q \over 1-p}\right) $$

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

A binomial random variable \(X_p\) has mean \(pn\) and standard deviation \(\sqrt{np(1-p)}\). For a normal random variable \(X_{\textrm{norm}}\) with mean \(\mu_{\textrm{norm}}\) and standard deviation \(\sigma_{\textrm{norm}}\), the value \( c_{\alpha} = \mu_{\textrm{norm}} + 1.645\sigma_{\textrm{norm}}\) is the value where the area from \(c_{\alpha}\) to infinity is \(0.05 = \alpha\). Thus, \( c_{\alpha} \) is the critical value for a normal random variable, where \( Pr(X_{\textrm{norm}} > c_{\alpha}) = 0.05 \). So for a binomial random variable \(X_p\), under the normal approximation, we would have \(c_{\textrm{binomial, }\alpha} = pn + 1.645\sqrt{np(1-p)}\).

Thus, we have that this critical value for a binomial random variable \(X_p\):

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

must also be

$$ c_{\textrm{binomial, }\alpha} \geq pn + 1.645\sqrt{np(1-p)} $$

for the area to the right of \(c\) to be \(\leq 0.05\). Since we are given the values of \(p\) and \(q\), we set the two expressions for the critical value equal to each other and solve for the smallest \(n\) that reaches this condition. So we have

$$ n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) = pn + 1.645\sqrt{np(1-p)} $$

$$ \sqrt{n} = 1.645\sqrt{p(1-p)} / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p \right] $$

$$ n = 1.645^2p(1-p) / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p\right]^2 $$

For example, if \(p = 0.3\) and \(q = 0.7\), we have \(n = 14.2066 \), or rather, \(n \geq 14.2066 \).
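The same calculation can be done in a few lines of Python. This is just a sketch of the formula above under the normal approximation; the function names `crossing_fraction` and `min_flips` are made up here, and `z = 1.645` is the one-sided 0.05 cutoff used in the text.

```python
from math import log


def crossing_fraction(p, q):
    """c / n, where c is the number of Heads at which the two pmfs are equal."""
    return log((1 - q) / (1 - p)) / log(p * (1 - q) / (q * (1 - p)))


def min_flips(p, q, z=1.645):
    """Smallest (real-valued) n at which c reaches pn + z * sqrt(n p (1 - p))."""
    return z**2 * p * (1 - p) / (crossing_fraction(p, q) - p) ** 2


print(min_flips(0.3, 0.7))  # about 14.2066
print(min_flips(0.4, 0.5))  # about 263.345
```

Since \(n\) has to be a whole number of flips, in practice we would round the result up.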

Wolfram Alpha calculation of above, enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.3, q = 0.7

Note that if we switch the values so that \(p = 0.7\) and \(q = 0.3\), we obtain the same \(n_{\textrm{min}}\). This makes sense since our value for \(n_{\textrm{min}}\) depends on \(c\), and \(c\) is the value on the horizontal axis at which the two normal distributions from above (approximations of binomial distributions) with means at \(pn\) and \(qn\) cross each other; the ratio \(c/n\) is unchanged when \(p\) and \(q\) are switched. With \(p + q = 1\) we also have \(p(1-p) = q(1-q)\), so the whole problem is symmetric. For a pair that does not sum to \(1\), switching the \(p\)’s and \(q\)’s in the above equation for \(n\) changes the result slightly, since the two distributions then have different variances and \(c\) is not equidistant from the two means.

So if we flip the coin \(n \geq 15\) times (the smallest whole number above \(14.2066\)), we can use our resulting \(x_p\) to run a hypothesis test of whether we have the \(p\) or \(q\) coin at the \(\alpha = 0.05\) significance level.

If \(p\) and \(q\) are closer, say \(p = 0.4\) and \(q = 0.5\), then we have \(n \geq 263.345\). This makes intuitive sense: the closer the probabilities of the two coins are, the more times we have to flip our coin to be sure that we have one of the coins rather than the other. To be more precise, the smaller the **effect size** is, the larger the sample size we need in order to reach the same certainty about a result. An example of an effect size is **Cohen’s d**, where:

$$\textrm{Cohen's } d = {\mu_2 - \mu_1 \over \textrm{StDev or Pooled StDev}} $$

Wolfram Alpha calculation of above for \(n\) with \(p = 0.4\) and \(q = 0.5\), or enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.4, q = 0.5

The question was originally asked at the link below, where one answer finds the exact values for the two \(n_{\textrm{min}}\) using R with the actual binomial distributions (not using normal distributions as approximations):

https://math.stackexchange.com/a/2033739/506042

Due to the discrete-ness of the distributions, the \(n_{\textrm{min}}\)’s found are slightly different: \(n_{\textrm{min}} = 17\) for the first case and \(n_{\textrm{min}} = 268\) for the second case. I.e., the difference comes from using the normal distribution as an approximation for the binomial distribution.

## Test Coin

Suppose there are two coins and the probability that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \) and the values are given and known. If you are free to flip one of the coins, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?

The number of heads after \(n\) flips of a coin follows a binomial distribution, with mean \(pn\) for the \(p\) coin and \(qn\) for the \(q\) coin.

**The Usual Hypothesis Test**

In the usual hypothesis test, for example with data \(x_i, i=1, 2, 3, \dots, n\) from a random variable \(X\), to test whether the mean \( \mu \) is \(\leq\) some constant \(\mu_0\):

\begin{align}
H_0 & : \mu \leq \mu_0 \ ( \textrm{and } X \sim N(\mu_0, \sigma^2 ) \textrm{ for some } \sigma^2 ) \\
H_1 & : \mu > \mu_0
\end{align}

If the sample mean of the data points \( \overline{x} \) is “too large compared to” \( \mu_0 \), then we reject the null hypothesis \( H_0 \).

If we have the probability distribution of the random variable (even if we don’t know the true value of the mean \( \mu \)), we may be able to know something about the probability distribution of a statistic obtained from manipulating the sample data, e.g. the sample mean. This, the probability distribution of a statistic (obtained from manipulating sample data), is called the **sampling distribution**. And a property of the sampling distribution, the standard deviation of a statistic, is the **standard error**. For example, the **standard error of the mean** is:

Sample Data: \(x\) \(\qquad\) Sample Mean: \( \overline{x} \)

Variance: \( Var(x) \) \(\qquad\) Standard Deviation: \( StDev(x) = \sigma(x) = \sqrt{Var(x)} \)

Variance of the Sample Mean: \( Var( \overline{x} ) = Var \left( \frac{1}{n} \sum_{i=1}^{n}{ x_i } \right) = \frac{1}{n^2} \sum_{i=1}^{n} { Var(x_i) } = \frac{1}{n^2} n Var(x) = \frac{1}{n} Var(x) = {\sigma^2 \over n} \) (the second equality uses the independence of the \(x_i\))

Standard Deviation of the Sample Mean, Standard Error of the Mean: \( \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \)

Thus, if the data points are \(i.i.d.\) (independent and identically distributed), the sample mean \( \overline{x} \) we obtain from the data has a standard deviation of \( \frac{\sigma}{\sqrt{n}} \). Since this is smaller than the standard deviation \(\sigma\) of the original \(X\), the distribution of \(\overline{X}\) is narrower around the mean than that of \(X\). So \(\overline{X}\) lets us home in on what the data says about \( \mu \) better than a single \(X\) does, i.e. it gives a narrower, more precise “range of certainty” from the sample data at the same significance level.
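As a quick sanity check of the \( \sigma / \sqrt{n} \) claim, here is a small simulation sketch. The normal distribution, the values \(n = 25\) and \(\sigma = 2\), the seed, and the function name `sample_mean_spread` are all arbitrary choices made for illustration.

```python
import math
import random
import statistics

def sample_mean_spread(n, sigma=2.0, mu=0.0, trials=20000, seed=0):
    """Empirical standard deviation of the sample mean over many repeated samples."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

n, sigma = 25, 2.0
# The two printed numbers should be close: simulated spread vs. sigma / sqrt(n).
print(sample_mean_spread(n, sigma), sigma / math.sqrt(n))
```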

Thus, given our sample \(x_i, i = 1, \dots, n \), we can calculate the statistic \( \overline{x} = \frac{1}{n} \sum_{i=1}^{n} {x_i} \), our sample mean. From the data (or given information), we would like to calculate the standard error of the mean, the standard deviation of this sample mean as a random variable (where the sample mean is a statistic, i.e. can be treated as a random variable): \( \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \). This standard error of the mean gives us a “range of certainty” around the \(\overline{x}\) with which to make an inference.

**A. If we know/are given the true standard deviation \( \sigma \)**

If we are given the true standard deviation \( \sigma \) of the random variable \( X \), then we can calculate the standard error of the sample mean: \( \frac{\sigma}{\sqrt{n}} \). So under the null hypothesis \( H_0: \mu \leq \mu_0 \), we want to check if the null hypothesis can hold against a test using the sample data.

**A.a Digression about \(H_0: \mu \leq \mu_0\) and \(H_0: \mu = \mu_0\)**

If the \(\mu\) we infer from the sample data is “too extreme,” in this case “too large” compared to \(\mu_0\), i.e. the test statistic is greater than some critical value \(c(\mu_0)\) that depends on \(\mu_0\), we reject the null hypothesis. If we check a \(\mu_1\) with \(\mu_1 < \mu_0\) (which our null hypothesis \( H_0: \mu \leq \mu_0 \) allows), the critical value \(c(\mu_1)\) will be less extreme than \(c(\mu_0)\) (in other words \( c(\mu_1) < c(\mu_0) \)), and thus it would be "easier to reject" the null hypothesis using \( c(\mu_1) \). Rejecting a null hypothesis is reaching a conclusion, so the test ought to be conservative: we would like it to be the "hardest to reject" that we can. The "hardest to reject" part of the range of \(H_0: \mu \leq \mu_0\) is \( \mu = \mu_0 \), where the critical value \( c(\mu_0) \) is the largest possible. Testing with a \(\mu_1 < \mu_0\) could give a test statistic that is too extreme/large for \(\mu_1\) (i.e. \( t > c(\mu_1) \)) but not too extreme/large for \(\mu_0\) (i.e. \( t \not> c(\mu_0) \)). But if we test using \(\mu_0\) and the test statistic is extreme/large enough to reject the null hypothesis \(\mu = \mu_0\), that also rejects every other null hypothesis \(\mu = \mu_1\) with \(\mu_1 < \mu_0\).

So under the null hypothesis \( H_0: \mu \leq \mu_0 \) or the “effective” null hypothesis \( H_0: \mu = \mu_0 \), we have that \( X \sim N(\mu_0, \sigma^2) \) with \( \sigma \) known, and we have that \( \overline{X} \sim N(\mu_0, \sigma^2/n) \). This means that

\( \frac{\overline{X} - \mu_0} { ^{\sigma}/_{\sqrt{n}} } \sim N(0, 1) \)

Then we can use a standard normal table to find where on the standard normal the \( \alpha = 0.05 \) cutoff is. For a one-tailed test, the cutoff is at \( Z_{\alpha} = 1.645 \) where \( Z \sim N(0, 1) \). So if

\( \frac{\overline{X} - \mu_0} { ^{\sigma}/_{\sqrt{n}} } > 1.645 = Z_{\alpha} \),

then this result is “too large compared to \( \mu_0 \)” so we reject the null hypothesis \( H_0: \mu \leq \mu_0 \). If \( \frac{\overline{X} - \mu_0} { ^{\sigma}/_{\sqrt{n}} } \leq 1.645 = Z_{\alpha} \), then we fail to reject the null hypothesis \( H_0: \mu \leq \mu_0 \).
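The known-\(\sigma\) test can be sketched in a few lines of Python. The data values, \(\mu_0 = 5.0\), and \(\sigma = 0.5\) below are made up for illustration, and `one_sided_z_test` is a hypothetical helper name; `statistics.NormalDist().inv_cdf` from the standard library stands in for the standard normal table.

```python
import math
from statistics import NormalDist, fmean

def one_sided_z_test(xs, mu0, sigma, alpha=0.05):
    """Return (z statistic, critical value, whether to reject H0: mu <= mu0)."""
    n = len(xs)
    z = (fmean(xs) - mu0) / (sigma / math.sqrt(n))
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    return z, z_alpha, z > z_alpha

xs = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7, 5.3, 5.9]  # made-up sample
z, z_alpha, reject = one_sided_z_test(xs, mu0=5.0, sigma=0.5)
print(z, z_alpha, reject)
```

With this sample the mean is 5.4, so the statistic is \( 0.4 / (0.5/\sqrt{10}) \approx 2.53 \), which is above the cutoff and the null hypothesis is rejected.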

**B. If we don’t know the standard deviation \( \sigma \)**

If we don’t know the value of the standard deviation \( \sigma \) of our random variable \( X \sim N( \mu, \sigma^2 ) \) (which would be somewhat expected if we already don’t know the value of the mean \(\mu\) of \( X \)), then we need to estimate \( \sigma \) from our data \( x_i, i = 1, 2, \dots, n \). We can estimate \( \sigma \) with the sample standard deviation \( s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \), i.e. we compute the sample variance \( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \) and then take its square root.

However, note that while the sample variance is an unbiased estimator of \( \sigma^2 \):

\begin{align}
\mathbb{E}\left[s^2\right] & = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu + \mu - \overline{x})^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left( (x_i - \mu) - (\overline{x} - \mu) \right)^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ (x_i - \mu)^2 - 2 (x_i - \mu) (\overline{x} - \mu) + (\overline{x} - \mu)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) \sum_{i=1}^{n} { (x_i - \mu) } + \sum_{i=1}^{n} { (\overline{x} - \mu)^2 } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) (n \overline{x} - n \mu) + n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 n (\overline{x} - \mu)^2 + n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - n (\overline{x} - \mu)^2 \right] \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \mathbb{E} \left[ (x_i - \mu)^2 \right] } - n \mathbb{E} \left[ (\overline{x} - \mu)^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \mathbb{E} \left[ x_i^2 - 2 \mu x_i + \mu^2 \right] } - n \mathbb{E} \left[ \overline{x}^2 - 2 \mu \overline{x} + \mu^2 \right] \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu \mathbb{E} [x_i] + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu \mathbb{E} [\overline{x}] + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu^2 + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu^2 + \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \mu^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \left( \mathbb{E} [x_i] \right)^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \left( \mathbb{E} [\overline{x}] \right)^2 \right) \right) \\
& = \frac{1}{n-1} \left( \sum_{i=1}^{n} { Var(x_i) } - n Var(\overline{x}) \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \sigma^2 } - n \frac{\sigma^2}{n} \right) \\
& = \frac{1}{n-1} \left( n \sigma^2 - \sigma^2 \right) = \sigma^2
\end{align}

that does not allow us to say that the square root of the above estimator gives us an unbiased estimator for the standard deviation \( \sigma \). In other words:

\( \mathbb{E}\left[s^2\right] = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] = \sigma^2 \)

but

\( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \neq \sigma \)

because the expectation function and the square root function do not commute:

\( \sigma = \sqrt{\sigma^2} = \sqrt{ \mathbb{E}[s^2] } \neq \mathbb{E}[\sqrt{s^2}] = \mathbb{E}[s] \)
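A small simulation makes the difference visible. This is only a sketch: the normal data, \(n = 5\), \(\sigma = 1\), the seed, and the helper name `average_s2_and_s` are all arbitrary choices for illustration.

```python
import random
import statistics

def average_s2_and_s(n, sigma=1.0, trials=20000, seed=1):
    """Average the sample variance s^2 and its square root s over many samples of size n."""
    rng = random.Random(seed)
    s2_values = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        m = sum(sample) / n
        s2_values.append(sum((x - m) ** 2 for x in sample) / (n - 1))  # divide by n - 1
    mean_s2 = statistics.fmean(s2_values)
    mean_s = statistics.fmean(v ** 0.5 for v in s2_values)
    return mean_s2, mean_s

mean_s2, mean_s = average_s2_and_s(n=5)
# mean_s2 should land near sigma^2 = 1, while mean_s should land visibly below sigma = 1.
print(mean_s2, mean_s)
```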

**B.a The sample standard deviation \( s = \sqrt{s^2} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \) is a biased estimator of \( \sigma \)**

In fact, we can infer the direction of the bias of \( s \). The square root function \( f(x) = \sqrt{x} \) is a concave function. A function \( f \) is concave if:

$$ \forall x_1, x_2 \in X, \forall t \in [0, 1]: \quad f(tx_1 + (1 - t) x_2 ) \geq tf(x_1) + (1 - t) f(x_2) $$

The left-hand side of the inequality is the blue portion of the curve \( \{ f( \textrm{mixture of } x_1 \textrm{ and } x_2 ) \} \) and the right-hand side of the inequality is the red line segment \( \{ \textrm{a mixture of } f(x_1) \textrm{ and } f(x_2) \} \). We can see visually what it means for a function to be concave: between two arbitrary \(x\)-values \(x_1\) and \(x_2\), the blue portion is always \(\geq\) the red portion.

Jensen’s Inequality says that if \( g(x) \) is a convex function, then:

$$ g( \mathbb{E}[X] ) \leq \mathbb{E}\left[ g(X) \right] $$

and if \( f(x) \) is a concave function, then:

$$ f( \mathbb{E}[X] ) \geq \mathbb{E}\left[ f(X) \right] $$

The figure above showing the concave function \(f(x) = \sqrt{x}\) gives an intuitive illustration of Jensen’s Inequality as well (since Jensen’s Inequality can be said to be a generalization of the “mixture” of \(x_1\) and \(x_2\) property of convex and concave functions to the expectation operator). The left-hand side \( f(\mathbb{E}[X]) \) is like \( f( \textrm{a mixture of } X \textrm{ values} ) \) and the right-hand side \( \mathbb{E}\left[ f(X) \right] \) is like \( {\textrm{a mixture of } f(X) \textrm{ values} } \) where the “mixture” in both cases is the “long-term mixture” of \( X \) values that is determined by the probability distribution of \( X \).

Since \( f(z) = \sqrt{z} \) is a concave function, going back to our estimation of the standard deviation of \( X \) using \(\sqrt{s^2}\), we have

\begin{align}

f( \mathbb{E}[Z] ) & \geq \mathbb{E}\left[ f(Z) \right] \longrightarrow \\

\sqrt{\mathbb{E}[Z]} & \geq \mathbb{E}\left[ \sqrt{Z} \right] \longrightarrow \\

\sqrt{ \mathbb{E}[s^2] } & \geq \mathbb{E}\left[ \sqrt{s^2} \right] \longrightarrow \\

\sqrt{ Var(X) } & \geq \mathbb{E}\left[s\right] \\

\textrm{StDev} (X) = \sigma(X) & \geq \mathbb{E}\left[s\right] \\

\end{align}

Thus, \( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \leq \sigma \). **So \( s \) is a biased estimator that, on average, underestimates the true \(\sigma\).**

However, the exact bias \( \textrm{Bias}(s) = \mathbb{E} [s] - \sigma \) is not as clean to show.

https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

For normally distributed \(X\), \( \frac{(n-1)s^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \) follows a \( \chi^2 \) distribution with \( n-1 \) degrees of freedom. In addition, \( \sqrt{ \frac{(n-1)s^2}{\sigma^2} } = \frac{\sqrt{n-1}s}{\sigma} = \frac{1}{\sigma} \sqrt{ \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \) follows a \( \chi \) distribution with \( n-1 \) degrees of freedom. A \( \chi \) distribution with \(k\) degrees of freedom has mean \( \mu_{\chi} = \sqrt{2} \frac{\Gamma ( ^{(k+1)} / _2 ) } { \Gamma ( ^k / _2 )} \), where \( \Gamma(z) \) is the Gamma function, so with \( k = n-1 \) we have \( \mathbb{E} \left[ \frac{\sqrt{n-1}s}{\sigma} \right] = \mu_{\chi} \).

https://en.wikipedia.org/wiki/Gamma_function

If \(n\) is a positive integer, then \( \Gamma(n) = (n - 1)! \). If \(z\) is a complex number with positive real part, then \( \Gamma(z) = \int_{0}^{\infty}{x^{z-1} e^{-x} dx} \), and analytic continuation extends \( \Gamma \) to every complex number that is not a non-positive integer. At the non-positive integers, \( \Gamma(z) \) has poles, where it goes to \(\infty\) or \(-\infty\).

From the mean of a \( \chi \) distribution above, we have:

\( \mathbb{E}[s] = {1 \over \sqrt{n - 1} } \cdot \mu_{\chi} \cdot \sigma \)

and replacing \(k\) with \(n-1\) degrees of freedom for the value of \(\mu_{\chi}\), we have:

\( \mathbb{E}[s] = \sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } \cdot \sigma \)

Wikipedia tells us that:

\( \sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } = c_4(n) = 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} - O(n^{-4}) \)

So we have:

\( \textrm{Bias} (s) = \mathbb{E}[s] - \sigma = c_4(n) \cdot \sigma - \sigma = ( c_4(n) - 1) \cdot \sigma \)

\( = \left( \left( 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} - O(n^{-4}) \right) - 1 \right) \cdot \sigma = - \left( {1 \over 4n} + {7 \over 32n^2} + {19 \over 128n^3} + O(n^{-4}) \right) \cdot \sigma \)

Thus, as \(n\) becomes large, the magnitude of the bias becomes small.

From Wikipedia, these are the values of \(n\), \(c_4(n)\), and the numerical value of \( c_4(n) \):

\begin{array}{|l|r|c|}
\hline
n & c_4(n) & \textrm{Numerical value of } c_4(n) \\
\hline
2 & \sqrt{2 \over \pi} & 0.798\ldots \\
3 & {\sqrt{\pi} \over 2} & 0.886\ldots \\
5 & {3 \over 4}\sqrt{\pi \over 2} & 0.940\ldots \\
10 & {128 \over 105}\sqrt{2 \over \pi} & 0.973\ldots \\
100 & - & 0.997\ldots \\
\hline
\end{array}
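The values of \( c_4(n) \) can be recomputed directly from the Gamma-function formula. The sketch below uses `math.lgamma` (log-Gamma) so that large \(n\) does not overflow; the function names `c4` and `c4_series` are illustrative.

```python
import math

def c4(n):
    """c_4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), computed via log-Gamma."""
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)

def c4_series(n):
    """Truncated expansion 1 - 1/(4n) - 7/(32n^2) - 19/(128n^3)."""
    return 1 - 1 / (4 * n) - 7 / (32 * n**2) - 19 / (128 * n**3)

for n in (2, 3, 5, 10, 100):
    # Columns: n, exact c_4(n), truncated series, bias factor c_4(n) - 1.
    print(n, round(c4(n), 4), round(c4_series(n), 4), round(c4(n) - 1, 4))
```

The series is an expansion for large \(n\), so it is expected to be rough at \(n = 2\) and to agree closely with the exact value by \(n = 100\).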

Thus, for the most part, we don’t have to worry too much about this bias, especially with large \(n\). So we have

\( \sigma \approx \mathbb{E}[s] = \mathbb{E}[\sqrt{s^2}] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \)

More rigorously, our estimator \( \hat{\sigma} = s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \) is a consistent estimator of \( \sigma \) (even though it is a biased estimator of \( \sigma \)).

**An estimator is consistent if \( \forall \epsilon > 0 \):**

$$ \lim\limits_{n \to \infty} \textrm{Pr } (|\hat{\theta} - \theta| > \epsilon ) = 0 $$

In other words, as \( n \to \infty \), the probability that our estimator \( \hat{\theta} \) “misses” the true value of the parameter \(\theta\) by greater than some arbitrary positive amount (no matter how small) goes to \(0\).

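For the sample standard deviation \(s\) as our estimator of the true standard deviation \(\sigma\) (i.e. let \(\hat{\sigma} = s\)), the bias alone is not enough, since \( |\hat{\sigma} - \sigma| \) is a random quantity. But the mean squared error can be written using \( \mathbb{E}[s^2] = \sigma^2 \) and \( \mathbb{E}[s] = c_4(n) \sigma \):

\( \mathbb{E} \left[ (s - \sigma)^2 \right] = \mathbb{E}[s^2] - 2 \sigma \mathbb{E}[s] + \sigma^2 = \sigma^2 - 2 c_4(n) \sigma^2 + \sigma^2 = 2 \sigma^2 (1 - c_4(n)) \)

so by Markov’s inequality applied to \( (s - \sigma)^2 \), and since \( c_4(n) \to 1 \) as \( n \to \infty \),

\( \textrm{Pr } (|\hat{\sigma} - \sigma| > \epsilon) \leq { \mathbb{E} \left[ (s - \sigma)^2 \right] \over \epsilon^2 } = { 2 \sigma^2 (1 - c_4(n)) \over \epsilon^2 } \to 0 \textrm{ as } n \to \infty \)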

Since \(s\) is a consistent estimator of \(\sigma\), we are fine to use \(s\) to estimate \(\sigma\) as long as we have large \(n\).

**So back to the matter at hand:** we want to know the sampling distribution of \(\overline{X} \) to see “what we can say” about \(\overline{X}\), specifically, the standard deviation of \(\overline{X}\), i.e. the standard error of the mean of \(X\). Not knowing the true standard deviation \(\sigma\) of \(X\), we use a consistent estimator of \(\sigma\) to estimate it: \(s = \sqrt{{1 \over n-1} \sum_{i=1}^n {(x_i - \overline{x})^2}}\).

So instead of the case where we know the value of \(\sigma\)

\(\overline{X} \sim N(\mu, \sigma^2/n)\)

we have instead something like:

\(\overline{X} \quad \textrm{“}\sim\textrm{”} \quad N(\mu, s^2/n)\)

When we know the value of \(\sigma\), we have

\({ \overline{X} - \mu \over \sigma/\sqrt{n} } \sim N(0,1) \)

When we don’t know the value of \(\sigma\) and use the estimate \(s\), instead of having something like

\({ \overline{X} - \mu \over s/\sqrt{n} } \quad \textrm{“}\sim\textrm{”} \quad N(0,1) \)

we actually have the exact distribution:

\({ \overline{X} - \mu \over s/\sqrt{n} } \sim T_{n-1} \)

Student’s t-distribution with \(n-1\) degrees of freedom.

Thus, finally, when we don’t know the true standard deviation \(\sigma\), under the null hypothesis \( H_0: \mu \leq \mu_0 \), we can use the expression above to create a test statistic

\( t = { \overline{x} - \mu_0 \over s/\sqrt{n} } \sim T_{n-1} \)

and check it against the critical value of Student’s t-distribution with \(n-1\) degrees of freedom, \(T_{n-1}\), at some significance level, say \(\alpha = 0.05\).

So if the test statistic exceeds the critical value for \(\alpha = 0.05\):

\( t = { \overline{x} - \mu_0 \over s/\sqrt{n} } > T_{n-1, \alpha} \)

then we reject our null hypothesis \( H_0: \mu \leq \mu_0 \) at \(\alpha = 0.05\) significance level. If not, then we fail to reject our null hypothesis.
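Here is a minimal sketch of the unknown-\(\sigma\) test in Python. The sample and \(\mu_0 = 5.0\) are made up, the helper name `one_sided_t_statistic` is hypothetical, and the critical value 1.833 (one-tailed, \(\alpha = 0.05\), 9 degrees of freedom) is read from a standard t table rather than computed, since the Python standard library has no t-distribution quantile function.

```python
import math
import statistics

def one_sided_t_statistic(xs, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(xs)
    s = statistics.stdev(xs)  # divides by n - 1
    return (statistics.fmean(xs) - mu0) / (s / math.sqrt(n))

xs = [5.1, 4.9, 5.6, 5.8, 5.2, 5.5, 5.0, 5.7, 5.3, 5.9]  # made-up sample, n = 10
t = one_sided_t_statistic(xs, mu0=5.0)

T_CRIT_DF9_ALPHA05 = 1.833  # from a t table: one-tailed, alpha = 0.05, n - 1 = 9
reject = t > T_CRIT_DF9_ALPHA05
print(t, reject)
```

For this sample \( \overline{x} = 5.4 \) and \( s \approx 0.35 \), so \( t \approx 3.62 \), which exceeds the critical value.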


If under the null hypothesis \( H_0 \) we have a fully specified probability distribution, then we know the standard deviation of a single data point, and dividing it by \( \sqrt{n} \) gives the standard error of the sample mean.

Back to our case with 2 coins. Let’s say we want to test if our coin is the \(p\) coin and let’s say we arbitrarily decide to call the smaller probability \(p\), i.e. \(p \leq q\). We know that coin flips give us a binomial distribution, and we know the standard error of the mean proportion of heads from \(n\) flips. So a 0.05 significance level would mean some cutoff value \(c\) where \(c > p\). But note that if \(c\) ends up really big relative to \(q\), e.g. it gets close to \(q\) or even exceeds \(q\), we are in a weird situation.

We can decide on some cutoff value \(c\) between \(p\) and \(q\). If we change \(c\), the significance level and the power of the test, whether testing \(p\) or \(q\), both change.
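To see how moving the cutoff trades significance level against power, here is a sketch that uses the exact binomial distribution. It assumes the null hypothesis is “this is the \(p\) coin” with \(p < q\) and that we reject when the proportion of heads exceeds \(c\); the values \(n = 20\), \(p = 0.3\), \(q = 0.7\) and the function names are illustrative only.

```python
from math import comb

def binom_tail_above(n, prob, c):
    """P(proportion of heads > c) for n flips of a coin with P(Head) = prob."""
    return sum(
        comb(n, k) * prob**k * (1 - prob) ** (n - k)
        for k in range(n + 1)
        if k > c * n
    )

def alpha_and_power(n, p, q, c):
    """H0: the coin is the p coin (p < q); reject when the head proportion exceeds c."""
    alpha = binom_tail_above(n, p, c)  # chance of rejecting when the coin really is p
    power = binom_tail_above(n, q, c)  # chance of rejecting when the coin really is q
    return alpha, power

for c in (0.4, 0.5, 0.6):
    print(c, alpha_and_power(n=20, p=0.3, q=0.7, c=c))
```

Raising \(c\) lowers the significance level (fewer false rejections) but also lowers the power, which is the trade-off described above.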

## Test

Test

The Usual Colors of the Sky

Why is the sky blue? Why are sunsets red? The answers and explanations to these questions can be found fairly easily on the internet. But there are many subtle “Wait a minute…”-type questions in between the cracks that seem to necessitate subtle answers and more “difficult” searching around the internet to find those answers.

So, why is the sky blue?

No, but first, what color is sunlight?

Visible light generally ranges from 390 nm (violet) to 720 nm (red), with maximum eye sensitivity at 555 nm (green). Visible sunlight is a mix of these colors at the intensities we see in the figure above.

Sources:

1. http://wtamu.edu/~cbaird/sq/images/sunlight_wavelength.png

## Barcodes and Modular Arithmetic

**Barcodes**

Here is an example of a UPC-A barcode, taken from Wikipedia:

A UPC-A barcode has 12 digits. The first digit indicates how the number is generally used; for example, a particular industry might use a certain first digit for certain kinds of items. The twelfth and last digit is a check digit that can help tell whether or not the number contains an error. The check digit is first constructed from the other 11 digits. Later on, it can be used to check whether the digits contain an error.

The check digit is constructed as follows:

We have 11 digits:

$$ABCDEFGHIJK$$

So let \(L\) be the twelfth and last digit. We sum the digits in the odd positions and multiply by 3, and add the sum of the digits in the even positions:

$$3\cdot(A+C+E+G+I+K)+(B+D+F+H+J)$$

We take this modulo 10, or the remainder of this when divided by 10. If this is 0, that is our twelfth digit; if not, subtract this from 10 and that is our twelfth digit.

$$\text{Let}\ S = (3\cdot(A+C+E+G+I+K)+(B+D+F+H+J))$$

\begin{equation}
L =
\begin{cases}
0, & \text{if}\ S \equiv 0 \pmod{10} \\
10 - (S \bmod 10), & \text{otherwise}
\end{cases}
\end{equation}

So the logic is that if all 12 digits are correct, they satisfy the check digit equation:

$$3\cdot(A+C+E+G+I+K)+(B+D+F+H+J+L) \equiv 0 \pmod{10}$$

If there is an error in the 12th digit, of course the check digit equation won’t be satisfied. If there is an error in any one single digit among the first 11 digits, then the check digit equation will also not be satisfied. Thus, the check digit equation will detect any single digit error.
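The construction and the check can be sketched in Python as follows. The function names `upc_check_digit` and `upc_is_valid` are illustrative, and the digit string used is just an example input.

```python
def upc_check_digit(first11):
    """Compute the twelfth digit L from the first 11 digits (given as a string)."""
    digits = [int(ch) for ch in first11]
    if len(digits) != 11:
        raise ValueError("expected exactly 11 digits")
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11 (A, C, E, G, I, K)
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10 (B, D, F, H, J)
    s = 3 * odd_sum + even_sum
    return (10 - s % 10) % 10     # the outer % 10 gives 0 when s is a multiple of 10

def upc_is_valid(code12):
    """Check 3*(odd positions) + (even positions, including L) = 0 (mod 10)."""
    digits = [int(ch) for ch in code12]
    if len(digits) != 12:
        return False
    return (3 * sum(digits[0::2]) + sum(digits[1::2])) % 10 == 0

print(upc_check_digit("03600029145"), upc_is_valid("036000291452"))  # prints: 2 True
```

For this example, the odd positions sum to 14 and the even positions to 16, so \(S = 3 \cdot 14 + 16 = 58\) and \(L = 10 - 8 = 2\).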

To see that a single digit error among the first 11 digits will cause the check digit equation to not be satisfied, first note that if any one of the digits in the even positions is off, that will change \(S\) as well as \(S \pmod{10} \), so \(S + L \not\equiv 0 \pmod{10}\) and the check digit equation fails. But what about the digits in the odd positions, whose sum is multiplied by 3, and why multiplied by 3?

Take a digit in one of the even positions. As long as the digit is off from the correct value, that will manifest itself in \(S\) and \(S \pmod{10} \). Now take a digit in one of the odd positions and call it \(O\). The question then is, if the digit is off from the correct value by say \(d\), how will that manifest itself in \(S\) as well as \(S \pmod{10} \)? The correct \(O\) gives a term \(3 \cdot O\) in \(S\) while an incorrect digit of say \(O + d\) gives a term \(3 \cdot O + 3 \cdot d\). Because 3 and 10 share no common factor, \(3 \cdot d \not\equiv 0 \pmod{10}\) for every \(d\) with \(1 \leq |d| \leq 9\), so the error always changes \(S \pmod{10}\).
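The single-digit claim can be checked by brute force. The sketch below first lists \(3 \cdot d \pmod{10}\) for \(d = 1, \dots, 9\), then corrupts each position of a valid 12-digit code with every wrong digit and confirms that the check equation fails each time. The helper names and the example code are illustrative.

```python
def check_sum(digits12):
    """Left-hand side of the check digit equation, reduced mod 10."""
    return (3 * sum(digits12[0::2]) + sum(digits12[1::2])) % 10

def all_single_digit_errors_detected(code12):
    """Try every wrong digit in every position of a valid 12-digit code."""
    digits = [int(ch) for ch in code12]
    if check_sum(digits) != 0:
        raise ValueError("start from a valid code")
    for pos in range(12):
        for wrong in range(10):
            if wrong == digits[pos]:
                continue
            corrupted = digits.copy()
            corrupted[pos] = wrong
            if check_sum(corrupted) == 0:
                return False  # an error slipped through undetected
    return True

# 3*d mod 10 is never 0 for d = 1..9, so odd-position errors always show up.
print(sorted({(3 * d) % 10 for d in range(1, 10)}))  # prints: [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(all_single_digit_errors_detected("036000291452"))  # prints: True
```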

## Sets and Borel Sigma Algebra

**Showing**

\begin{equation}

\label{eq:0.99=1} \tag{1}

\large \bf 0.\bar{9} = 1

\end{equation}


\begin{equation}

\label{eq:0.99=io1} \tag{1}

0.\bar{9} = 1

\end{equation}
