Category: Math

A Thing About the Hot Hand Fallacy and the “Law of Small Numbers”

There was an interesting post and discussion on the NBA subreddit about the hot hand phenomenon and whether or not it is a fallacy.

A Numberphile video on the topic:

An article on the topic:

https://www.scientificamerican.com/article/do-the-golden-state-warriors-have-hot-hands/

In parts of the Numberphile video, Professor Lisa Goldberg emphasizes that, when looking at the hot hand phenomenon, the issues of the “Law of Small Numbers,” which the Scientific American article describes as follows:

Early in their careers, Amos Tversky and Daniel Kahneman considered the human tendency to draw conclusions based on a few observations, which they called the ‘‘law of small numbers’’.

come from the fact that we don’t get to see what happens after an H at the end of a sequence. Let a sequence be a string of shots of some length. A shot is either a make H or a miss T. So a sequence of 3 shots might be:

$$ HTH $$

A make, a miss, and then a make. So looking at that, we see that after the first H, we missed, which is evidence against the hot hand. We don’t care what happens after a miss, the T. We can’t see what happens after the last shot, which is a make. This is what’s noted as causing the “Law of Small Numbers.”

A moment from the Numberphile video illustrating the probabilities of H after an H for each possible sequence of 3 shots, and the average of those probabilities:

And here, this “Law of Small Numbers” causes the average probability of H’s after an H to be 2.5/6. When the sequence is a finite length, the probability of an H after an H (or a T after a T) is biased below 0.5. As the sequence gets longer and tends toward infinity, the probability of an H after an H (or a T after a T) goes toward 0.5.

While all this is true, let’s look a little closer at what’s going on in this illustration to understand why and how exactly this bias occurs.

All possibilities of sequences of 3 shots:

$$ n = \textrm{3} $$

\begin{tabular}{ |c|c|c| }  \hline  Sequence & After an H & Prob. of H after H \\  \hline  TTT & - & -  \\  TTH & - & - \\  THT & T & 0  \\  HTT & T & 0  \\  THH & H & 1  \\  HTH & T & 0 \\  HHT & HT & 0.5  \\  HHH & HH & 1  \\  \hline  \end{tabular}

$$ \textrm{Average probability} = \frac{2.5}{6} = 0.416\bar{6} $$

Assuming that an H and a T each appear with 0.5 probability and there is no memory, i.e. no hot hand, each of the above 8 sequences is equally probable. The average probability over the 6 cases where we can evaluate whether there is a hot hand or not (cases that have an H in the first or second shot) is calculated to be 2.5/6 < 0.5. But let’s count the number of H’s and T’s in the second column. There are 4 H’s and 4 T’s! So we have:

$$ \frac {\textrm{Number of H’s}}{\textrm{Number of H’s & T’s}} = \frac {4}{8} = 0.5 $$

So it’s as if we’ve undercounted the cases where there are 2 shots that are “hot hand evaluations,” the last two sequences at the bottom of the list. In all (8) sequences of length 3, how many hot hand evaluations in total were there? (How many H’s or T’s in the 2nd column?) 8. How many of those were H’s? 4. So we have a hot hand make probability of 0.5.
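To make the two ways of counting concrete, here is a quick brute-force sketch in Python (standard library only) that enumerates all \(2^3\) sequences and computes both numbers, the per-sequence average of 2.5/6 and the pooled 4/8:

```python
from itertools import product

# Enumerate all 2^n sequences of makes (H) and misses (T) for n = 3 and
# compare the two ways of counting "hot hand" makes discussed above.
n = 3
per_sequence_probs = []          # 3rd column: one P(H after H) per sequence
total_evals = total_makes = 0    # pooled counts over all 2nd-column entries

for seq in product("HT", repeat=n):
    evals = [seq[i + 1] for i in range(n - 1) if seq[i] == "H"]  # shots right after an H
    total_evals += len(evals)
    total_makes += evals.count("H")
    if evals:  # TTT and TTH have no hot hand evaluation and are skipped
        per_sequence_probs.append(evals.count("H") / len(evals))

print(sum(per_sequence_probs) / len(per_sequence_probs))  # 2.5/6 ≈ 0.4167
print(total_makes / total_evals)                          # 4/8 = 0.5
```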

It doesn’t necessarily mean that the way they counted hot hand makes in the Numberphile video is wrong. It’s just a particular way of counting it that causes a particular bias. It also may be the particular way the human instinct feels hot handedness – as an average of the probability of hot hand makes over different sequences. In other words, that way of counting may better model how we “feel” or evaluate hot handedness in real world situations.

So why is the average probability over sequences < 0.5?

When we evaluate hot-handedness, we are looking at shots that come after an H. Suppose we write down a list or table of every possible shot sequence of length \(n\), ordered from fewer H’s, starting with the sequence of all T’s, down to more H’s, ending with the sequence of all H’s. We noted above that if we count all the hot hand makes H’s in all sequences (the H’s in the 2nd column), the probability of hot hand H’s among all hot hand evaluations (the number of H’s or T’s in the 2nd column) is 1/2. When we look at the list of sequences, what we notice is that a lot of the hot hand H’s (the 2nd column) are concentrated in the lower sequences toward the bottom. But these sequences, heavy in so many H’s, each contribute only one probability entry in the 3rd column, of 1 or near 1.

$$ n = \textrm{4} $$

\begin{tabular}{ |c|c|c| }  \hline  Sequence & After an H & Prob. of H after H \\  \hline  TTTT & - & -  \\  TTTH & - & - \\  TTHT & T & 0  \\  THTT & T & 0  \\  HTTT & T & 0  \\  TTHH & H & 1  \\  THHT & HT & 0.5  \\  HHTT & HT & 0.5  \\  THTH & T & 0  \\  HTTH & T & 0  \\  HTHT & TT & 0 \\  THHH & HH & 1  \\  HTHH & TH & 0.5  \\  HHTH & TH & 0.5  \\  HHHT & HHT & 0.667  \\  HHHH & HHH & 1  \\  \hline  \end{tabular}

$$ \textrm{Average probability} = \frac{5.6\bar{6}}{14} \approx 0.405 $$

$$ n = \textrm{5} $$

\begin{tabular}{ |c|c|c| }  \hline  Sequence & After an H & Prob. of H after H \\  \hline  TTTTT & - & -  \\  TTTTH & - & - \\  TTTHT & T & 0  \\  TTHTT & T & 0  \\  THTTT & T & 0  \\  HTTTT & T & 0  \\  TTTHH & H & 1  \\  TTHHT & HT & 0.5  \\  THHTT & HT & 0.5  \\  HHTTT & HT & 0.5  \\  TTHTH & T & 0  \\  THTTH & T & 0  \\  HTTTH & T & 0  \\  THTHT & TT & 0  \\  HTTHT & TT & 0  \\  HTHTT & TT & 0  \\  TTHHH & HH & 1  \\  THHHT & HHT & 0.667  \\  HHHTT & HHT & 0.667  \\  THTHH & TH & 0.5  \\  HTTHH & TH & 0.5  \\  THHTH & HT & 0.5  \\  HTHTH & TT & 0  \\  HHTTH & HT & 0.5  \\  HTHHT & THT & 0.333  \\  HHTHT & HTT & 0.333  \\  THHHH & HHH & 1  \\  HTHHH & THH & 0.667  \\  HHTHH & HTH & 0.667  \\  HHHTH & HHT & 0.667  \\  HHHHT & HHHT & 0.75  \\  HHHHH & HHHH & 1  \\  \hline  \end{tabular}

$$ \textrm{Average probability} = \frac{12.25}{30} = 0.408\bar{3} $$

Assuming equal probability of H and T on any given shot and no memory between shots: the entire list of sequences (the 1st column) will have an equal number of H’s and T’s. Additionally, all the hot hand evaluations (the 2nd column) will have an equal number of H’s and T’s.

Looking at the 1st column, we go from more T’s at the top to more H’s at the bottom in a smooth manner. Looking at the 2nd column, though, we start with rows of T’s, and as we go down we find that a lot of H’s are “bunched up” towards the bottom. But remember that we have a “limited” number of H’s in the 2nd column as well: 50% of all hot hand evaluations are H’s and 50% are T’s.

Let’s look closely at how the pattern in the 1st column causes more H’s to be bunched up in the lower sequences in the 2nd column, and also if there is any pattern to the T’s when we look across different sequences.

Higher sequences have fewer H’s (looking at the 1st column), which means more HT’s in those sequences, i.e. more hot hand misses. Lower sequences have more H’s, which means more HH’s in those sequences, i.e. more hot hand makes. This means that, looking at the 2nd column, higher sequences have more T’s and lower sequences have more H’s. Lower sequences “use up” more of the “limited amount” of H’s (limited because the number of H’s and T’s in the 2nd column are equal). Thus, H’s in the 2nd column are “bunched up” in the lower sequences as well. This causes there to be fewer sequences with higher probability (the 3rd column) than sequences with lower probability. Perhaps this is what brings the average probability below 0.5.

A naive look at the 2nd column shows that the highest sequences have a lone T as their hot hand evaluation, and many other hot hand evaluations of higher sequences end with a T. This makes sense, since if a sequence consists of a lot of T’s, any H’s in it are unlikely to be the last two shots in the sequence, like …HH, which is what’s needed for the hot hand evaluations in the 2nd column to end with an H. And as long as a T is the last shot, the hot hand evaluation of the sequence will end with a T, since any lone H or streak of H’s in the sequence will have encountered a T as the next shot, either at that last T shot in the sequence (…HHT) or at the first of the consecutive T’s that lead up to the last T shot of the sequence (…HHTT…T).

Let’s divide up all the sequences in the 1st column into categories based on a sequence’s last 2 shots, and use that to interpret what the last hot hand evaluation will be in the 2nd column for that category. There are 4 possible ways for the last 2 shots to go: TT, TH, HT, and HH. If a sequence ends in …TT, that “…” portion is either all T’s or, if it has any H’s, we know that the sequence’s hot hand evaluations end in a T at or before the second-to-last T in the sequence (either …H…TTT or …HTT). So in all cases but one (where the entire sequence is T’s, so there is no hot hand evaluation for the 2nd column), the last hot hand evaluation in the 2nd column will be a T. If a sequence ends in …TH, the thinking is similar to the …TT case, since the very last H doesn’t provide us with an additional hot hand evaluation (the sequence ends right there), so the 2nd column also ends in a T. If a sequence ends in …HT, that last T is our last hot hand evaluation, so the 2nd column also ends in a T. If a sequence ends in …HH, then the 2nd column ends in an H. So about 3/4 of all sequences end their 2nd column with a T (\( (3/4) \cdot 2^n - 2 \) sequences, to be exact, since the sequence of all T’s and the sequence of \( (n-1) \) T’s followed by an H don’t have any hot hand evaluations). Thus, the T’s in the 2nd column are “spread out more evenly” across the different sequences, since \( (3/4) \cdot 2^n - 2 \) of all sequences have a T for their last hot hand evaluation (the 2nd column), while the H’s are “bunched up” in the lower sequences. Thus, a relatively large number of sequences, especially sequences that are higher up, have their probabilities (the 3rd column) influenced by T’s in the 2nd column, bringing the average probability across sequences down.

$$ n = \textrm{6} $$

\begin{tabular}{ |c|c|c| }  \hline  Sequence & After an H & Prob. of H after H \\  \hline    TTTTTT   & - & -  \\  TTTTTH   & - & -  \\  TTHTTT   & T & 0  \\  THTTTT   & T & 0  \\  HTTTTT   & T & 0  \\  TTTTHT    & T & 0  \\  TTHTHT    & TT & 0  \\  THTTHT    & TT & 0  \\  TTTHTT    & T & 0  \\  TTHHTT   & HT & 1/2  \\  THTHTT    & TT & 0  \\  THHTTT   & HT & 1/2  \\  TTHTTH    & T & 0  \\  HTTTHT    & TT & 0  \\  THTTTH    & T & 0  \\  HTHTTT   & TT & 0  \\  HHTTTT   & HT & 1/2  \\  HTTHTT    & TT & 0  \\  TTTHTH    & T & 0  \\  TTTHHT    & HT & 1/2  \\  TTHHHT    & HHT & 2/3  \\  THTHHT    & THT & 1/3  \\  TTHHTH    & HT & 1/2  \\  THHHTT    & HHT & 2/3  \\  HTHHTT    & THT & 1/3  \\  HHTHTT    & HTT & 1/3  \\  THHTTH    & HT & 1/2  \\  HTTTTH    & T & 0  \\  TTTTHH   & H & 1  \\  TTHTHH   & TH & 1/2  \\  THTTHH    & TH & 1/2  \\  HHHTTT & HHT & 2/3  \\  HTHTTH    & TT & 0  \\  HHTTTH    & HT & 1/2  \\  THHTHT    & HTT & 1/3  \\  HTHTHT    & TTT & 0  \\  HHTTHT    & HTT & 1/3  \\  HTTTHH    & TH & 1/2  \\  HTTHTH    & TT & 0  \\  THTHTH    & TT & 0  \\  HTTHHT    & THT & 1/3  \\  TTTHHH    & HH & 1  \\  TTHHHH    & HHH & 1  \\  HTHHHT    & THHT & 1/2  \\  HHHTTH & HHT & 2/3  \\  HHHTHT & HHTT & 1/2  \\  THHTHH    & HTH & 2/3  \\  HTHTHH    & TTH & 1/3  \\  HHTTHH    & HTH & 2/3  \\  HHHHTT & HHHT & 3/4  \\  THHHTH    & HHT & 2/3  \\  HTHHTH    & THT & 1/3  \\  HHTHTH    & HTT & 1/3  \\  HTTHHH    & THH & 2/3  \\  THTHHH    & THH & 2/3  \\  THHHHT    & HHHT & 3/4  \\  HHTHHT    & HTHT & 1/2  \\  HHHTHH & HHTH & 3/4  \\  HHHHTH & HHHT & 3/4  \\  THHHHH    & HHHH & 1  \\  HHHHHT & HHHHT & 4/5  \\  HTHHHH    & THHH & 3/4  \\  HHTHHH    & HTHH & 3/4  \\  HHHHHH & HHHHH & 1  \\    \hline  \end{tabular}

$$ \textrm{Average probability} \approx 0.4161 $$

As \( n \) grows larger, the average probability seems to drift up.
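We can check this drift by brute force as well. Here is a small Python sketch generalizing the enumeration above to any small \(n\):

```python
from itertools import product

def avg_prob_h_after_h(n):
    """Average over sequences of P(H after H), skipping sequences that have
    no H in the first n-1 shots (no hot hand evaluation at all)."""
    probs = []
    for seq in product("HT", repeat=n):
        evals = [seq[i + 1] for i in range(n - 1) if seq[i] == "H"]
        if evals:
            probs.append(evals.count("H") / len(evals))
    return sum(probs) / len(probs)

for n in range(3, 11):
    print(n, round(avg_prob_h_after_h(n), 4))
# n = 3, 4, 5, 6 give 0.4167, 0.4048, 0.4083, 0.4161, matching the tables
# above, and the average keeps drifting up toward 0.5 as n grows.
```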

Looking at the top of the list of sequences for \( n = 4 \), there are 3 sequences with a 0 in the 3rd column. These 3 sequences consist of 1 H and 3 T’s (and TTTH is uncounted because there is no hot hand evaluation in that sequence). At the bottom, we have the HHHH sequence giving a 1 in the 3rd column, and then 4 sequences that have 3 H’s and 1 T. The entries in the 3rd column for these 4 sequences are 1, 0.5, 0.5, and 0.667.

For sequences of \( n = 5 \), there are then 4 sequences at the top of the list that give a 0 in the 3rd column. At the bottom, the HHHHH sequence gives a 1 in the 3rd column, and then the sequences with 4 H’s and 1 T give 1, 0.667, 0.667, 0.667, 0.75 in the 3rd column.

For sequences of \( n = 6 \), there are then 5 sequences at the top of the list that give a 0 in the 3rd column. At the bottom, the HHHHHH sequence gives a 1 in the 3rd column, and then the sequences with 5 H’s and 1 T give 1, 0.75, 0.75, 0.75, 0.75, 0.8 in the 3rd column.

This pattern shows that as \( n \) increases, we get \( (n - 1) \) sequences at the top of the list that always give 0’s in the 3rd column. At the bottom there is always 1 sequence of all H’s that gives a 1 in the 3rd column. Then for the sequences with \( (n - 1) \) H’s and 1 T, we always have 1 sequence THH…HH that gives a 1 in the 3rd column, then \( (n - 2) \) sequences that give \( \frac{n - 3}{n - 2} \) in the 3rd column, and always 1 sequence HH…HT that gives \( \frac{n - 2}{n - 1} \) in the 3rd column. So as \( n \) becomes large, the entries in the 3rd column for these sequences with \( (n - 1) \) H’s and 1 T get closer to 1. For small \(n\), such as \(n = 4\), those entries are as low as 0.5 and 0.667. But the entries in the 3rd column for the sequences high in the list with 1 H and \( (n - 1) \) T’s remain at 0 for any \(n\). Thus, as \( n \) becomes large, the lower sequence entries in the 3rd column become larger, shifting the average probability over sequences up.

Roughly speaking, when we only have one shot make in a sequence of shots (only 1 H among \(n-1\) T’s), we have only one hot hand evaluation possible, which is the shot right after the make. Ignoring the case of TT…TH, that hot hand evaluation coming after the H will always be a miss. Thus, when there is only one shot make in a sequence, the hot hand probability is always 0. On the other hand, when we have only one shot miss in a sequence, ignoring the TH…HH case, we will have 1 hot hand miss and many hot hand makes. Thus, our hot hand probability in these sequences with only 1 T will always be less than 1, and approaches 1 as \( n \) approaches \( \infty \). In a rough way, this lack of balance between the high sequences and low sequences drags down the average probability over the sequences below 0.5, with the amount that’s dragged down mitigated by larger and larger \( n \).

A possible interesting observation or interpretation of this is how it might lead to the human mind “feeling” the gambler’s fallacy (e.g. consecutive H’s means a T “has to come” soon) and the hot hand fallacy (e.g. consecutive H’s means more H’s to come). The above results show that in finite length sequences, when a human averages in their mind the probability of hot hand instances across sequences, i.e. across samples or experiences, the average probability is < 0.5. In other words, across experiences, the human mind “feels” the gambler’s fallacy: that reversals after consecutive results are more likely. But when a human happens to find themselves in one of the lower sequences on the list, where there are relatively more H’s than T’s in the 1st column, the hot hand evaluations (the 2nd column) are likely to have a lot more H’s than what you’d expect, because H’s are “bunched up” towards the bottom of the 2nd column. What you expect are reversals; that’s what “experience,” and the gambler’s fallacy that results from that experience, tells us. But when we find ourselves in a sequence low in the list, the hot hand instances (the 2nd column) give us an inordinately high number of hot hand makes. So when we’re hot, it feels like we’re really hot, giving us the hot hand fallacy.

An actually rigorous paper on this subject, also found in a comment from the Reddit post, is Miller, Joshua B. and Sanjurjo, Adam, “Surprised by the Gambler’s and Hot Hand Fallacies? A Truth in the Law of Small Numbers.” One of the results they present is a proof that the average probability of hot hand makes across sequences is less than the probability of a make (i.e. using our example, the average of the entries in the 3rd column is less than 0.5, the probability of an individual make).

The Terms of Trade of Brazil

Source: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

An article in the New York Times by Paul Krugman talked about a current economic downturn in Brazil. What happened:

First, the global environment deteriorated sharply, with plunging prices for the commodity exports still crucial to the Brazilian economy. Second, domestic private spending also plunged, maybe because of an excessive buildup of debt. Third, policy, instead of fighting the slump, exacerbated it, with fiscal austerity and monetary tightening even as the economy was headed down.

What didn’t happen:

Maybe the first thing to say about Brazil’s crisis is what it wasn’t. Over the past few decades those who follow international macroeconomics have grown more or less accustomed to “sudden stop” crises in which investors abruptly turn on a country they’ve loved not wisely but too well. That was the story of the Mexican crisis of 1994-5, the Asian crises of 1997-9, and, in important ways, the crisis of southern Europe after 2009. It’s also what we seem to be seeing in Turkey and Argentina now.

We know how this story goes: the afflicted country sees its currency depreciate (or, in the case of the euro countries, its interest rates soar). Ordinarily currency depreciation boosts an economy, by making its products more competitive on world markets. But sudden-stop countries have large debts in foreign currency, so the currency depreciation savages balance sheets, causing a severe drop in domestic demand. And policymakers have few good options: raising interest rates to prop up the currency would just hit demand from another direction.

But while you might have assumed that Brazil was a similar case — its 9 percent decline in real G.D.P. per capita is comparable to that of sudden-stop crises of the past — it turns out that it isn’t. Brazil does not, it turns out, have a lot of debt in foreign currency, and currency effects on balance sheets don’t seem to be an important part of the story. What happened instead?

Going over the three points that Krugman made in the beginning, one at a time:

1. Commodity prices went down and Brazil exports a lot of commodities.

Brazil’s exports in 2016:

From: https://atlas.media.mit.edu/en/visualize/tree_map/hs92/export/bra/all/show/2016/

At a glance, among commodities we have: vegetable products, mineral products (5% crude petroleum, 10% iron and copper ore), foodstuffs, animal products, metals, and precious metals. Though picking these out may over- or underestimate the true percentage of commodity exports among all of Brazil’s exports, let’s use them for our approximation. The total percentage of these products is about 70%, where around 36% are agricultural commodities, around 27% are metal commodities (metals + iron and copper ore), around 5% is crude petroleum, and around 2% are precious metals. These categorizations are improvised and don’t follow any formal definitions; they are simplifications.

Looking at the S&P GSCI Agricultural & Livestock Index Spot (SPGSAL):

From https://www.marketwatch.com/investing/index/spgsal/charts?countrycode=xx

we definitely do see a downtrend in the last several years in agricultural commodities.

Looking at the S&P GSCI Industrial Metals Index Spot (GYX):

From: https://markets.ft.com/data/indices/tearsheet/charts?s=GYX:IOM

there was a decline from 2011 but a rise from 2016.

Looking at the S&P GSCI Precious Metals Index Spot (SPGSPM):

From: https://markets.ft.com/data/indices/tearsheet/charts?s=SPGSPM:IOM

it’s been flat since around 2013.

Looking at S&P GSCI Crude Oil Index Spot (G39):

From: https://markets.ft.com/data/indices/tearsheet/charts?s=G39:IOM

it has been low after a decline in 2014 with volatility in 2017-2018.

But instead of eyeballing this phenomenon with a bunch of different charts, there’s a way to do this eyeballing mathematically, in one chart, called the terms of trade.

Investopedia’s definition of terms of trade:

What are ‘Terms of Trade – TOT’?

Terms of trade represent the ratio between a country’s export prices and its import prices. The ratio is calculated by dividing the price of the exports by the price of the imports and multiplying the result by 100. When a country’s TOT is less than 100%, more capital is leaving the country than is entering the country. When the TOT is greater than 100%, the country is accumulating more capital from exports than it is spending on imports.

But how exactly do you calculate the “price of exports and imports” of a country like, say, Brazil, which has USD 190B of exports a year and surely thousands if not more different products, and what do we do about the changing quantities of each of those products every year? How do we understand the terms of trade in a way that doesn’t vaguely seem like the current account balance (which is the total value of exports minus imports, or net value of exports: \( EX - IM = \sum_{i}{p_i \cdot q_i} - \sum_{i}{p'_i \cdot q'_i} \), where \( p_i\), \( q_i \) are the price and quantity of export product \(i\) and \( p'_i\), \( q'_i \) are the price and quantity of import product \(i\))?

The answer is to decide on a base year against which to compare the year in question. For the prices of products in the year in question, we sum the values of exports for each product in that year, i.e. \( \sum_{i} {p_{i,n} \cdot q_{i,n}} \), where \(i\) is the index for each different product and \(n\) is the year in question. For the prices of products in the base year \(0\), we take the price of each product \(i\) in that base year multiplied by the quantity of that product \(i\) in the year in question \(n\), i.e. \( \sum_{i} {p_{i,0} \cdot q_{i,n}} \). In other words, we fix the quantity of each product to the quantity in the year in question \(q_{i,n}\), so that we are strictly comparing prices between years \(n\) and \(0\) and not letting changes in quantity \(q\) get in the way. This is the Paasche index.

Another way we can do this: for the prices of products in the year in question \(n\), we sum the prices of each product in that year \( p_{i,n} \) multiplied by the quantity of each product from the base year \( q_{i,0} \), and for the prices in the base year \(0\), we take the price of each product \(i\) in that base year multiplied by the quantity of that product \(i\) also in the base year \(0\). So this time, instead of fixing the quantities to the year in question \(n\), we fix the quantity of each product to the base year \(0\). This is the Laspeyres index.

Paasche index:

$$ P_{\textrm{Paasche}} = \frac{\sum_{i}{p_{i,n} \cdot q_{i,n}}}{\sum_{i}{p_{i,0} \cdot q_{i,n}}} $$

Laspeyres index:

$$ P_{\textrm{Laspeyres}} = \frac{\sum_{i}{p_{i,n} \cdot q_{i,0}}}{\sum_{i}{p_{i,0} \cdot q_{i,0}}} $$

From: https://en.wikipedia.org/wiki/Price_index?oldformat=true#Paasche_and_Laspeyres_price_indices

Thus, by using such a price index calculation we “cancel out” the effect of changing export or import quantities, so that we are only looking at the change in prices of exports or imports between two time periods. With a base year \(0\), we can calculate the price index for exports in year \(n\), the price index for imports in year \(n\), and then divide the former by the latter to obtain the terms of trade for year \(n\).
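As a toy illustration of these formulas, here is a minimal Python sketch of both indices and the resulting terms of trade. All prices and quantities below are made-up numbers for two export products and two import products, not actual Brazilian data:

```python
# A toy illustration of the Paasche and Laspeyres price indices and the
# resulting terms of trade, with hypothetical prices p and quantities q.

def paasche(p0, pn, qn):
    # Quantities fixed at the year in question n: sum(p_n q_n) / sum(p_0 q_n)
    return sum(p * q for p, q in zip(pn, qn)) / sum(p * q for p, q in zip(p0, qn))

def laspeyres(p0, pn, q0):
    # Quantities fixed at the base year 0: sum(p_n q_0) / sum(p_0 q_0)
    return sum(p * q for p, q in zip(pn, q0)) / sum(p * q for p, q in zip(p0, q0))

# Exports: [commodity, manufactured good]; the commodity price falls 100 -> 80
ex_p0, ex_pn = [100, 50], [80, 55]
ex_q0, ex_qn = [10, 20], [12, 18]
# Imports: prices drift slightly up
im_p0, im_pn = [200, 30], [210, 33]
im_q0, im_qn = [5, 40], [6, 42]

export_idx = paasche(ex_p0, ex_pn, ex_qn)   # ≈ 0.93: export prices fell
import_idx = paasche(im_p0, im_pn, im_qn)   # ≈ 1.08: import prices rose
print("Laspeyres export index:", laspeyres(ex_p0, ex_pn, ex_q0))  # ≈ 0.95, similar story
print("Terms of trade:", 100 * export_idx / import_idx)           # ≈ 86 < 100
```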

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

A terms of trade chart quantitatively summarizes all the above eyeballing we did with the visualization of Brazil’s exports and the charts of commodities indices as well as the eyeballing we didn’t do with Brazil’s imports. And we see what we expect in the above graph, which is a drop in Brazil’s terms of trade in the last several years.

2. Brazil’s consumer spending declined due to rising household debt (the red graph):

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

3. Brazil implemented fiscal austerity to try to deal with “long-term solvency problems” and raised interest rates to try to deal with inflation, which was caused by depreciation in the currency. The currency depreciated due to lower commodity prices, which of course is also reflected in the terms of trade graph above.

Depreciating currency (blue) and inflation (change in or first derivative of red):

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

Interest rates raised to combat inflation:

From: https://www.nytimes.com/2018/11/09/opinion/what-the-hell-happened-to-brazil-wonkish.html

We can see that interest rates rose in late 2015 as a response to rising inflation. Inflation dropped in response over the next couple of years, but this rise in interest rates contributed to the slowdown in Brazil’s economy.

From: https://fred.stlouisfed.org/series/BRANGDPRPCH

So we have a drop in the terms of trade (due to a drop in commodity prices), a drop in consumer spending (due to a rise in household debt in preceding years), and then fiscal austerity and monetary contraction as government policy responses, causing a recession in Brazil.

Test Coin2

https://math.stackexchange.com/questions/2033370/how-to-determine-the-number-of-coin-tosses-to-identify-one-biased-coin-from-anot/2033739#2033739


Suppose there are two coins and the probability that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \), \(p \neq q \), and the values are given and known. If you are free to flip one of the coins any number of times, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?

The distribution of heads after \(n\) flips for a coin will be a binomial distribution with means at \(pn\) and \(qn\).


Two binomial distributions, n = 20. The means are pn = 10 and qn = 14.

Setting Up Our Hypothesis Test

Let’s say we want to test whether our coin is the \(p\) coin, and let’s say we arbitrarily decide to call the smaller probability \(p\), i.e. \(p < q\). We know that coin flips give us a binomial distribution, and we know the standard deviation of a binomial random variable \(X_p\) (let \(X_p\) or \(X_{p,n}\) be a binomial random variable for the number of flips that are heads, where the probability of a head on a flip is \(p\) and we do \(n\) flips), which is:

$$ \textrm{Standard Deviation of }{X_p} = \sqrt{ Var\left( {X_p} \right) } = \sqrt{ np(1-p) } $$

-----

Digression: we can also split our \(n\) Bernoulli trial coin flips that make up our binomial random variable \(X_p\) into \(m\) number of binomial random variables \(X_{p,k}\) each with \(k\) trials, such that \(k \times m = n\). Then the standard error of the mean proportion of heads from \(m\) binomial random variables (each with \(k\) trials) is:

$$ \textrm{Standard error of the mean} = \sqrt{ Var\left( \overline{X_{p,k}} \right) } = \sqrt{ Var \left( {1 \over m} \sum_{i=1}^{m} {X_{p,k}} \right) } $$
$$= \sqrt{ Var(\sum_{i=1}^{m} X_{p,k}) \over m^2 } = \sqrt{ m \cdot Var(X_{p,k}) \over m^2 }= \sqrt{ {m \cdot kp(1-p) \over m^2 } } = \sqrt{ { kp(1-p) \over m} } $$

This standard error above is for the random variable \(X_{p,k}\), each of which has \(k\) Bernoulli trials. In other words, the standard deviation of \( {1 \over m} \sum_{i=1}^{m} X_{p,k} \) is \( \sqrt{ kp(1-p) \over m }\). But if you simply change \(k\) to \(km = n\) and reduce \(m\) to \(1\), you get the same result as if you took all \(km = n\) trials as the number of trials for one binomial random variable, our original \(X_p\): where we now say that the standard deviation of \( {1 \over 1} \sum_{i=1}^{1} X_{p,n} = X_{p,n} = X_p \) is \( \sqrt{ np(1-p) \over 1 } = \sqrt{ np(1-p) } \).

By going from \(m\) repetitions of \(X_{p,k}\) to \(1\) repetition of \(X_{p,n}\), both the mean and the standard deviation is multiplied by \(m\). The mean of \(X_{p,k}\) is \(kp\) and the mean of \(X_{p,n}\) is \(mkp = np\); the standard deviation of \(X_{p,k}\) is \( \sqrt{ kp(1-p) } \) and the standard deviation of \(X_{p,n}\) is \( \sqrt{ mkp(1-p) } =\sqrt{ np(1-p) } \). The standard error of the mean of \(m\) repetitions of \(X_{p,k}\) is \( \sqrt{ { kp(1-p) \over m} } \) while the mean of \(m\) repetitions of \(X_{p,k}\) is of course just \( {1 \over m} \sum_{i=1}^{m} \mathbb{E} \left[ X_{p,k} \right] = {1 \over m} m (kp) = kp \). So when going from \(1\) repetition of \(X_{p,k}\) to \(m\) repetitions of \(X_{p,k}\), the mean goes from \(kp\) to \(mkp = np\) and the standard error of the mean of \(X_{p,k}\) goes from \( \sqrt{ { kp(1-p) \over m} } \) to the standard deviation of \( X_{p,n} \) by multiplying the standard error of the mean of \( X_{p,k} \) by \(m\): \( m \cdot \sqrt{ { kp(1-p) \over m} } = \sqrt{ { m^2 \cdot kp(1-p) \over m} } = \sqrt{ { mkp(1-p)} } = \sqrt{ { np(1-p)} } \).
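Before closing this digression, here is a quick simulation (using numpy) of the relationship: the mean of the \(m\) group counts is exactly the total count divided by \(m\), so its spread is the standard error \( \sqrt{kp(1-p)/m} \), while the total’s spread is \( \sqrt{np(1-p)} \):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, m = 0.3, 10, 50
n = k * m  # 500 Bernoulli trials per experiment

# Simulate many experiments of n coin flips each.
trials = rng.random((50_000, n)) < p
totals = trials.sum(axis=1)  # X_{p,n}, the total number of heads
# Split each experiment into m groups of k flips and average the group counts.
group_means = trials.reshape(-1, m, k).sum(axis=2).mean(axis=1)  # mean of m X_{p,k}'s

print(totals.std(), np.sqrt(n * p * (1 - p)))           # ≈ sqrt(np(1-p)) ≈ 10.25
print(group_means.std(), np.sqrt(k * p * (1 - p) / m))  # ≈ sqrt(kp(1-p)/m) ≈ 0.205
# The two standard deviations differ by exactly the factor m, as derived above.
```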

-----

Knowing the standard deviation of our random variable \(X_p\), a 0.05 significance level for a result that “rejects” the null would mean some cutoff value \(c\) where \(c > pn\). If \(x_p\) (the sample number of heads from \(n\) coin tosses) is “too far away” from \(pn\), i.e. we have \(x_p > c\), then we reject the null hypothesis that we have been flipping the \(p\) coin.

But note that if we choose a \(c\) that far exceeds \(qn\) as well, we are in a weird situation. If \(x_p > c\), then \(x_p\) is “too large” for \(pn\) but also quite a bit larger than \(qn\) (i.e. \( x_p > qn > pn \)). This puts us in an awkward situation because while \(x_p\) is much larger than \(pn\), making us want to reject the hypothesis that we were flipping the \(p\) coin, it is also quite a bit larger than \(qn\), so perhaps we obtained a result that was pretty extreme “no matter which coin we had.” If we assume the null hypothesis that we have the \(p\) coin, our result \(x_p\) is very unlikely, but it is also quite unlikely even if we had the \(q\) coin, our alternative hypothesis. But still, it is more unlikely under the \(p\) coin than under the \(q\) coin, so perhaps it’s not that awkward. But what if \(x_p\) does not exceed \(c\)? Then we can’t reject the null hypothesis that we have the \(p\) coin. But our sample result \(x_p\) might in fact be closer to \(qn\) than \(pn\) (it might even be right on the dot of \(qn\)), and yet we aren’t allowing ourselves to use that to form a better conclusion, which is a truly awkward situation.

If \(c\) is, instead, somewhere in between \(pn\) and \(qn\), and \(x_p > c\), we may reject the null hypothesis that our coin is the \(p\) coin while \(x_p\) is in a region close to \(qn\), i.e. a region that is a more likely result if we actually had been flipping the \(q\) coin, bringing us closer to the conclusion that this is the \(q\) coin. However, if we reverse the experiment, using the same critical value \(c\) and saying that if \(x_p < c\) then we reject the null hypothesis that \(q\) is our coin, then the power and significance of the test for each coin are different, which is also awkward.

Above, the pink region is the probability that \(X_p\) ends in the critical region, where \(x_p > c\), assuming the null hypothesis that we have the \(p\) coin. This is also the Type I Error rate (a.k.a. false positive) – the probability that we end up falsely rejecting the null hypothesis, assuming that the null hypothesis is true.

Above, the green region is the power \(1-\beta\), the probability that we get a result in the critical region \(x_p > c\) assuming that the alternative hypothesis is true, that we have the \(q\) coin. The blue-gray region is \(\beta\), or the Type II Error rate (a.k.a. false negative) – the probability that we fail to reject the null hypothesis (that we have the \(p\) coin) when what’s actually true is the alternative hypothesis (that we have the \(q\) coin).

Now let us “reverse” the experiment with the same critical value – we want to test our null hypothesis that we have the \(q\) coin:

We have \(x_p < c\). We fail to reject the null hypothesis that we have the \(p\) coin, and on the flip side we would reject the null hypothesis that we have the \(q\) coin. But we have failed a tougher test (the first one, with a small \(\alpha_p\)) and succeeded in rejecting with an easier test (the second one, with a larger \(\alpha_q\)). In hypothesis testing, we would like to be conservative, so it is awkward to have failed a tougher test but “be ok with it” since we succeeded with an easier test. Common sense also, obviously, says that something is strange when \(x_p\) is closer to \(q\) than \(p\) and yet we conclude that since \(x_p\) is on the “\(p\)-side of \(c\),” we have the \(p\) coin.

In reality, we wouldn’t take one result and apply two hypothesis tests to that one result. But we would like the one test procedure to make sense with whichever null hypothesis we start with, \(p\) coin or \(q\) coin (since it is arbitrary which null hypothesis we choose in the beginning, for we have no knowledge of which coin we have before we start the experiment).


What we can do, then, is to select \(c\) so that the probability, under the hypothesis that we have the \(p\) coin, that \(X_p > c\) is equal to the probability that, under the hypothesis that we have the \(q\) coin, that \(X_q < c\). In our set up, we have two binomial distributions, which are discrete, as opposed to the normal distributions above. In addition, binomial distributions, unless the mean is at \(n/2\), are generally not symmetric, as can be seen in the very first figure, copied below as well, where the blue distribution is symmetric but the green one is not.

We can pretend that the blue distribution is the binomial distribution for the \(p\) coin and the green distribution for the \(q\) coin. The pmf of a binomial random variable, say \(X_p\) (that generates Heads or Tails with probability of Heads \(p\)) is:

$$ {n \choose h} p^h (1-p)^{n-h} $$

where \(n\) is the total number of flips and \(h\) is the number of Heads among those flips. We let \(c\) be the critical number of Heads that would cause us to reject the null hypothesis that the coin we have is the \(p\) coin in favor of the alternative hypothesis that we have the \(q\) coin. The area of the critical region, i.e. the probability that we get \(c\) heads or more assuming the hypothesis that we have the \(p\) coin, is:

$$ Pr(X_p \geq c) = \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] $$

And the reverse, the probability that we get \(c\) heads or fewer assuming the hypothesis that we have the \(q\) coin, is:

$$ Pr(X_q \leq c) = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$

So we want to set these two equal to each other and solve for \(c\):

$$ \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$

But since the binomial distribution is discrete, there may not be a \(c\) that actually works. For large \(n\), a normal distribution can approximate the binomial distribution. In that case, we can draw the figure below, which is two normal distributions, each centered on \(pn\) and \(qn\) (the means of the true binomial distributions), and since normal distributions are symmetric, the point at which the distributions cross will be our critical value. The critical regions for \(X_p\) (to the right of \(c\)) and for \(X_q\) (to the left of \(c\)) will have the same area.

If we pretend that these normal distributions are binomial distributions, i.e. if we pretend that our binomial distributions are symmetric (i.e. we pretend that \(n\) is going to be large enough that both our binomial distributions of \(X_p\) and \(X_q\) are symmetric enough), then to find \(c\) we can find the value on the horizontal axis at which, i.e. the number of Heads at which, the two binomial probability distributions are equal to each other.

$$ {n \choose c} p^c (1-p)^{n-c} = {n \choose c} q^c (1-q)^{n-c} $$
$$ p^c (1-p)^{n-c} = q^c (1-q)^{n-c} $$
$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n-c} = 1 $$
$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n} \left({1-q \over 1-p}\right)^c = 1 $$
$$ \left({p(1-q) \over q(1-p)}\right)^c = \left({1-q \over 1-p}\right)^{n} $$
$$ c \cdot log \left({p(1-q) \over q(1-p)}\right) = n \cdot log \left({1-q \over 1-p}\right) $$

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

A binomial random variable \(X_p\) has mean \(pn\) and standard deviation \(\sqrt{np(1-p)}\). For a normal random variable \(X_{\textrm{norm}}\) with mean \(\mu_{\textrm{norm}}\) and standard deviation \(\sigma_{\textrm{norm}}\), the value \( c_{\alpha} = \mu_{\textrm{norm}} + 1.645\sigma_{\textrm{norm}}\) is the value where the area from \(c_{\alpha}\) to infinity is \(0.05 = \alpha\). Thus, \( c_{\alpha} \) is the critical value for a normal random variable where the probability that \(X_{\textrm{norm}} > c_{\alpha}\) is \(0.05\). So for a binomial random variable \(X_p\), we would have \(c_{\textrm{binomial, }\alpha} = pn + 1.645\sqrt{np(1-p)}\).

Thus, we have that this critical value for a binomial random variable \(X_p\):

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

must also be

$$ c_{\textrm{binomial, }\alpha} \geq pn + 1.645\sqrt{np(1-p)} $$

for the area to the right of \(c\) to be \(\leq 0.05\). To actually find the critical value \(c_{\textrm{binomial, }\alpha}\), we can just use

$$ c_{\textrm{binomial, }\alpha} \geq pn + 1.645\sqrt{np(1-p)} $$

Since we are given the values of \(p\) and \(q\), we would plug in those values to find the required \(n\) needed to reach this condition for the critical value. So we have

$$ n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) = pn + 1.645\sqrt{np(1-p)} $$

$$ \sqrt{n} = 1.645\sqrt{p(1-p)} / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p \right] $$

$$ n = 1.645^2p(1-p) / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p\right]^2 $$

For example, if \(p = 0.3\) and \(q = 0.7\), we have \(n = 14.2066 \), or rather, \(n \geq 14.2066 \).

Wolfram Alpha calculation of above, enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.3, q = 0.7

Note that if we switch the values so that \(p = 0.7\) and \(q = 0.3\), or switch the \(p\)’s and \(q\)’s in the above equation for \(n\), we obtain the same \(n_{\textrm{min}}\). This makes sense since our value for \(n_{\textrm{min}}\) depends on \(c\), and \(c\) is the value on the horizontal axis at which the two normal distributions from above (approximations of binomial distributions) with means at \(pn\) and \(qn\) cross each other. Thus, we set up the distributions so that the whole problem is symmetric.

So if we generate a sample such that the number of samples is \(n \geq 14.2066\), we can use our resulting \(x_p\) and make a hypothesis test regarding if we have the \(p\) or \(q\) coin with \(\alpha = 0.05\) significance level.

If \(p\) and \(q\) are closer, say \(p = 0.4\) and \(q = 0.5\), then we have \(n \geq 263.345\). This makes intuitive sense, where the closer the probabilities are of the two coins, the more times we have to flip our coin to be more sure that we have one of the coins rather than the other. To be more precise, the smaller the effect size is, the larger sample size we need in order to get the certainty about a result. An example of the effect size is Cohen’s d where:

$$\textrm{Cohen’s } d = {\mu_2 - \mu_1 \over \textrm{StDev (or pooled StDev)}} $$

Wolfram Alpha calculation of above for \(n\) with \(p = 0.4\) and \(q = 0.5\), or enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.4, q = 0.5
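A small Python transcription of the formula for \(n\) (just the algebra above, nothing more) reproduces both Wolfram Alpha results and the \(p \leftrightarrow q\) symmetry noted earlier:

```python
from math import log

def n_min_normal_approx(p, q, z=1.645):
    # Solve n * L = p*n + z*sqrt(n*p*(1-p)) for n, where L = c/n comes from
    # setting the two (normal-approximated) pmfs equal at the crossing point.
    L = log((1 - q) / (1 - p)) / log(p * (1 - q) / (q * (1 - p)))
    return z**2 * p * (1 - p) / (L - p) ** 2

print(n_min_normal_approx(0.3, 0.7))  # ≈ 14.2066
print(n_min_normal_approx(0.7, 0.3))  # same value, by symmetry
print(n_min_normal_approx(0.4, 0.5))  # ≈ 263.345
```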

On the Stack Exchange page where the question was originally asked, there is an answer that finds the exact values for the two \(n_{\textrm{min}}\) using R with the actual binomial distributions (not using normal distributions as approximations):

https://math.stackexchange.com/a/2033739/506042

Due to the discreteness of the distributions, the \(n_{\textrm{min}}\)’s found are slightly different: \(n_{\textrm{min}} = 17\) for the first case and \(n_{\textrm{min}} = 268\) for the second case. The difference comes from using the normal distribution as an approximation for the binomial distribution.
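For reference, here is a Python sketch of the same exact search using the true binomial distributions (scipy assumed). The decision criterion below, choosing a cutoff \(c\) so that both misclassification probabilities are at most \(\alpha\), is my reading of that answer; under it, the brute-force search lands on the same values:

```python
from scipy.stats import binom

def n_min_exact(p, q, alpha=0.05):
    # Decision rule: conclude "q coin" if heads >= c, else "p coin" (p < q).
    # Find the smallest n for which some c keeps both error rates <= alpha.
    n = 1
    while True:
        for c in range(n + 1):
            err_p = binom.sf(c - 1, n, p)   # P(X_p >= c): misclassify the p coin
            err_q = binom.cdf(c - 1, n, q)  # P(X_q <  c): misclassify the q coin
            if max(err_p, err_q) <= alpha:
                return n
        n += 1

print(n_min_exact(0.3, 0.7))  # 17
print(n_min_exact(0.4, 0.5))  # 268 (brute force; takes a few seconds)
```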

Test Coin

https://math.stackexchange.com/questions/2033370/how-to-determine-the-number-of-coin-tosses-to-identify-one-biased-coin-from-anot/2033739#2033739

Suppose there are two coins and the probability that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \) and the values are given and known. If you are free to flip one of the coins, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?

The distribution of heads after \(n\) flips for a coin will be a binomial distribution with means at \(pn\) and \(qn\).


Two binomial distributions, n = 20. The means are pn = 10 and qn = 14.

The Usual Hypothesis Test

In the usual hypothesis test, for example with data \(x_i, i=1, 2, 3, …, n\) from a random variable \(X\), to test whether the mean \( \mu \) is \(\leq\) some constant \(\mu_0\):

\begin{align}
H_0 & : \mu \leq \mu_0 \quad ( \textrm{ and } X \sim N(\mu_0, \textrm{ some } \sigma^2 ) ) \\
H_1 & : \mu > \mu_0
\end{align}

If the sample mean of the data points \( \overline{x} \) is “too large compared to” \( \mu_0 \), then we reject the null hypothesis \( H_0 \).

If we have the probability distribution of the random variable (even if we don’t know the true value of the mean \( \mu \)), we may be able to know something about the probability distribution of a statistic obtained from manipulating the sample data, e.g. the sample mean.  This, the probability distribution of a statistic (obtained from manipulating sample data), is called the sampling distribution.  And a property of the sampling distribution, the standard deviation of a statistic, is the standard error.  For example, the standard error of the mean is:

Sample Data: \(x\) \(\qquad\) Sample Mean: \( \overline{x} \)

Variance: \( Var(x) \) \(\qquad\) Standard Deviation: \( StDev(x) = \sigma(x) = \sqrt{Var(x)} \)

Variance of the Sample Mean: \( Var( \overline{x} ) = Var \left( \frac{1}{n}  \sum_{i=1}^{n}{ x_i } \right) = \frac{1}{n^2} \sum_{i=1}^{n} { Var(x_i) } = \frac{1}{n^2} n Var(x) = \frac{1}{n} Var(x) = {\sigma^2 \over n} \)

Standard Deviation of the Sample Mean, Standard Error of the Mean: \(  \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \)

Thus, if the random variable is \(i.i.d.\) (independent and identically distributed), then with the sample mean \( \overline{x} \) we obtain from the data, we can assume this \( \overline{x} \) has a standard deviation of \( \frac{\sigma}{\sqrt{n}} \). This standard deviation being smaller than the standard deviation of the original \(X\), i.e. \(\sigma\), means that \(\overline{X}\) is narrower around the mean than \(X\). This gives us a better ability to hone in on what the data says about \( \mu \): a narrower, more precise “range of certainty” from the sample data, at the same significance level.

Thus, given our sample \(x_i, i = 1, \dots, n \), we can calculate the statistic \( \overline{x} = \frac{1}{n} \sum_{i=1}^{n} {x_i} \), our sample mean.  From the data (or given information), we would like to calculate the standard error of the mean, the standard deviation of this sample mean as a random variable (where the sample mean is a statistic, i.e. can be treated as a random variable): \(  \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \). This standard error of the mean gives us a “range of certainty” around the \(\overline{x}\) with which to make an inference.

A. If we know/are given the true standard deviation \( \sigma \)

If we are given the true standard deviation \( \sigma \) of the random variable \( X \), then we can calculate the standard error of the sample mean: \(  \frac{\sigma}{\sqrt{n}} \).  So under the null hypothesis \( H_0: \mu \leq \mu_0 \), we want to check if the null hypothesis can hold against a test using the sample data.

A.a Digression about \(H_0: \mu \leq \mu_0\) and \(H_0: \mu = \mu_0\)

If the \(\mu\) we infer from the sample data is “too extreme,” in this case “too large” compared to \(\mu_0\), i.e. the test statistic is greater than some critical value that depends on \(\mu_0\), call it \(c(\mu_0)\), we reject the null hypothesis. If we check a \(\mu_1\) that is \(\mu_1 < \mu_0\) (possible since our null hypothesis is \( H_0: \mu \leq \mu_0 \)), our critical value \(c(\mu_1)\) will be less extreme than \(c(\mu_0)\) (in other words \( c(\mu_1) < c(\mu_0) \)), and thus it would be “easier to reject” the null hypothesis using \( c(\mu_1) \). Rejecting a null hypothesis ought to be conservative, since rejecting is reaching a conclusion, so we would like the test to be “the hardest to reject” that we can. The “hardest to reject” part of the range of \(H_0: \mu \leq \mu_0\) would be \( \mu = \mu_0 \), where the critical value \( c(\mu_0) \) is the largest possible critical value. Testing a \(\mu_1 < \mu_0\) would mean that we may obtain a test statistic that rejects (is too extreme/large) for \(\mu_1\) (i.e. \( t > c(\mu_1) \)) but is not too extreme/large for \(\mu_0\) (i.e. \( t \not> c(\mu_0) \)). But if we test using \(\mu_0\), and the test statistic is extreme/large enough that we reject the null hypothesis of \(\mu = \mu_0\), that would also reject all other null hypotheses using \(\mu_1\) where \(\mu_1 < \mu_0\).

So under the null hypothesis \( H_0: \mu \leq \mu_0 \) or the “effective” null hypothesis \( H_0: \mu = \mu_0 \), we have that \( X \sim N(\mu_0, \sigma^2) \) with \( \sigma \) known, and we have that \( \overline{X} \sim N(\mu_0, \sigma^2/n) \).  This means that

\( \frac{\overline{X} – \mu_0} { ^{\sigma}/_{\sqrt{n}} } \sim N(0, 1) \)

Then we can use a standard normal table to find where on the standard normal is the \( \alpha = 0.05 \) cutoff – for a one-tailed test, the cutoff is at \( Z_{\alpha} = 1.645 \) where \( Z \sim N(0, 1) \).  So if

\( \frac{\overline{X} – \mu_0} { ^{\sigma}/_{\sqrt{n}} } > 1.645 = Z_{\alpha} \),

then this result is “too large compared to \( \mu_0 \)” so we reject the null hypothesis \( H_0: \mu \leq \mu_0 \).  If \( \frac{\overline{X} – \mu_0} { ^{\sigma}/_{\sqrt{n}} } \leq 1.645 = Z_{\alpha} \), then we fail to reject the null hypothesis \( H_0: \mu \leq \mu_0 \).

B. If we don’t know the standard deviation \( \sigma \)

If we don’t know the value of the standard deviation \( \sigma \) of our random variable \( X \sim N( \mu, \sigma^2 ) \) (which would be somewhat expected if we already don’t know the value of the mean \(\mu\) of \( X \)), then we need to estimate \( \sigma \) from our data \( x_i, i = 1, 2, \dots, n \). We can estimate \( \sigma \) by taking the sample standard deviation \( s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \), or rather the sample variance \( s^2 = { \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \) and then the square root of that.

However, note that while the estimator for the sample variance is unbiased:

\begin{align}
\mathbb{E}\left[s^2\right] & = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \right] = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ (x_i -\mu + \mu - \overline{x})^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[ \left( (x_i -\mu) - (\overline{x} - \mu) \right)^2 \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left[  (x_i - \mu)^2 - 2 (x_i - \mu) (\overline{x} - \mu) + (\overline{x} - \mu)^2  \right] } \right] \\
& = \frac{1}{n-1} \mathbb{E} \left[   \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) \sum_{i=1}^{n} { (x_i - \mu) } + \sum_{i=1}^{n} { (\overline{x} - \mu)^2 }   \right]  \\
& = \frac{1}{n-1} \mathbb{E} \left[   \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu)   (n \overline{x} - n \mu) + n (\overline{x} - \mu)^2    \right]   \\
& = \frac{1}{n-1} \mathbb{E} \left[   \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 n (\overline{x} - \mu)^2 + n (\overline{x} - \mu)^2    \right]   \\
& = \frac{1}{n-1} \mathbb{E} \left[   \sum_{i=1}^{n} { (x_i - \mu)^2 } - n (\overline{x} - \mu)^2    \right]   \\
& = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { \mathbb{E} \left[ (x_i - \mu)^2 \right] } - n \mathbb{E} \left[ (\overline{x} - \mu)^2 \right]  \right)  \\
& = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { \mathbb{E} \left[ x_i^2 - 2 \mu x_i + \mu^2 \right] } - n \mathbb{E} \left[ \overline{x}^2 - 2 \mu \overline{x} + \mu^2 \right]  \right) \\
& = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu \mathbb{E} [x_i] + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu \mathbb{E} [\overline{x}] + \mu^2 \right)  \right) \\
& = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu^2 + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu^2 + \mu^2 \right)  \right) \\
& = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] -  \mu^2 \right)  \right) \\
& = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \left( \mathbb{E} [x_i] \right)^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] -  \left( \mathbb{E} [\overline{x}] \right)^2 \right)  \right) \\
& = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { Var(x_i) } - n \, Var(\overline{X})  \right) = \frac{1}{n-1} \left(    \sum_{i=1}^{n} { \sigma^2 } - n \frac{\sigma^2}{n} \right) \\
&  = \frac{1}{n-1} \left(    n \sigma^2 - \sigma^2 \right)  = \sigma^2 \\
\end{align}

that does not allow us to say that the square root of the above estimator gives us an unbiased estimator for the standard deviation \( \sigma \). In other words:

\( \mathbb{E}\left[s^2\right] = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \right] = \sigma^2 \)

but

\( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \right] \neq \sigma \)

because the expectation function and the square root function do not commute:

\( \sigma = \sqrt{\sigma^2} = \sqrt{ \mathbb{E}[s^2] } \neq \mathbb{E}[\sqrt{s^2}] = \mathbb{E}[s] \)

B.a The sample standard deviation \( s = \sqrt{s^2} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \) is a biased estimator of \( \sigma \)

In fact, we can infer the bias of \( \mathbb{E} [s] \) to some extent. The square root function \( f(x) = \sqrt{x} \) is a concave function. A concave function \( f \) is:

$$ \forall x_1, x_2 \in X, \forall t \in [0, 1]: \quad f(tx_1 + (1 - t) x_2 ) \geq tf(x_1) + (1 - t) f(x_2) $$

The left-hand side of the inequality is the blue portion of the curve \( \{ f( \textrm{mixture of } x_1 \textrm{ and } x_2 ) \} \) and the right-hand side of the inequality is the red line segment \( \{ \textrm{a mixture of } f(x_1) \textrm{ and } f(x_2) \} \). We can see visually what it means for a function to be concave: between two arbitrary \(x\)-values \(x_1\) and \(x_2\), the blue portion of the curve is always \(\geq\) the red line segment.

Jensen’s Inequality says that if \( g(x) \) is a convex function, then:

$$ g( \mathbb{E}[X] ) \leq \mathbb{E}\left[ g(X) \right] $$

and if \( f(x) \) is a concave function, then:

$$ f( \mathbb{E}[X] ) \geq \mathbb{E}\left[ f(X) \right] $$

The figure above showing the concave function \(f(x) = \sqrt{x}\) gives an intuitive illustration of Jensen’s Inequality as well (since Jensen’s Inequality can be said to be a generalization of the “mixture” of \(x_1\) and \(x_2\) property of convex and concave functions to the expectation operator). The left-hand side \( f(\mathbb{E}[X]) \) is like \( f( \textrm{a mixture of } X \textrm{ values} ) \) and the right-hand side \( \mathbb{E}\left[ f(X) \right] \) is like \( {\textrm{a mixture of } f(X) \textrm{ values} } \) where the “mixture” in both cases is the “long-term mixture” of \( X \) values that is determined by the probability distribution of \( X \).

Since \( f(z) = \sqrt{z} \) is a concave function, going back to our estimation of the standard deviation of \( X \) using \(\sqrt{s^2}\), we have
\begin{align}
f( \mathbb{E}[Z] ) & \geq \mathbb{E}\left[ f(Z) \right] \longrightarrow \\
\sqrt{\mathbb{E}[Z]} & \geq \mathbb{E}\left[ \sqrt{Z} \right] \longrightarrow \\
\sqrt{ \mathbb{E}[s^2] } & \geq \mathbb{E}\left[ \sqrt{s^2} \right] \longrightarrow \\
\sqrt{ Var(X) } & \geq \mathbb{E}\left[s\right] \\
\textrm{StDev} (X) = \sigma(X) & \geq \mathbb{E}\left[s\right] \\
\end{align}

Thus, \( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \right] \leq \sigma \). So \( s \) is biased and underestimates the true \(\sigma\).

However, the exact bias \( \textrm{Bias}(s) = \mathbb{E} [s] - \sigma \) is not as clean to show.
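A quick Monte Carlo sketch (numpy again) makes the bias visible for a small sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 5

# Draw many datasets of size n from N(0, sigma^2) and average the sample
# standard deviations (ddof=1 gives the n-1 denominator used above).
samples = rng.normal(0.0, sigma, size=(200_000, n))
s = samples.std(axis=1, ddof=1)

print(s.mean())          # ≈ 1.88 < 2.0, so E[s] underestimates sigma
print(s.mean() / sigma)  # ≈ 0.94, matching the constant c4(5) derived below
```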

https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

\( \frac{(n-1)s^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } \sim \) a \( \chi^2 \) distribution with \( n-1 \) degrees of freedom. In addition, \( \sqrt{ \frac{(n-1)s^2}{\sigma^2} } = \frac{\sqrt{n-1}s}{\sigma} = \frac{1}{\sigma} \sqrt{ \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \sim \) a \( \chi \) distribution with \( n-1 \) degrees of freedom. A \( \chi \) distribution with \(k\) degrees of freedom has mean \( \mu_{\chi} = \sqrt{2} \frac{\Gamma ( ^{(k+1)} / _2 ) } { \Gamma ( ^k / _2 )} \), so with \( k = n-1 \) we have \( \mathbb{E} \left[ \frac{\sqrt{n-1}s}{\sigma} \right] = \mu_{\chi} \), where \( \Gamma(z) \) is the Gamma function.

https://en.wikipedia.org/wiki/Gamma_function

If \(n\) is a positive integer, then \( \Gamma(n) = (n – 1)! \). If \(z\) is a complex number that is not a non-positive integer, then \( \Gamma(z) = \int_{0}^{\infty}{x^{z-1} e^{-x} dx} \). For non-positive integers, \( \Gamma(z) \) goes to \(\infty\) or \(-\infty\).

From the mean of a \( \chi \) distribution above, we have:

\( \mathbb{E}[s] = {1 \over \sqrt{n - 1} } \cdot \mu_{\chi} \cdot \sigma \)

and replacing \(k\) with \(n-1\) degrees of freedom for the value of \(\mu_{\chi}\), we have:

\( \mathbb{E}[s] = \sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } \cdot \sigma \)

Wikipedia tells us that:

\( \sqrt{ {2 \over n - 1} } \cdot { \Gamma(^n/_2) \over \Gamma(^{n-1}/_2) } = c_4(n) = 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} - O(n^{-4}) \)

So we have:

\( \textrm{Bias} (s) = \mathbb{E}[s] - \sigma = c_4(n) \cdot \sigma - \sigma = ( c_4(n) - 1) \cdot \sigma \)

\( = \left( \left( 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} - O(n^{-4}) \right) - 1 \right) \cdot \sigma = - \left( {1 \over 4n} + {7 \over 32n^2} + {19 \over 128n^3} + O(n^{-4}) \right) \cdot \sigma \)

Thus, as \(n\) becomes large, the magnitude of the bias becomes small.

From Wikipedia, these are the values of \(n\), \(c_4(n)\), and the numerical value of \( c_4(n) \):

\begin{array}{|l|r|c|}
\hline
n & c_4(n) & \textrm{Numerical value of } c_4(n) \\
\hline
2 & \sqrt{2 \over \pi} & 0.798… \\
3 & {\sqrt{\pi} \over 2} & 0.886… \\
5 & {3 \over 4}\sqrt{\pi \over 2} & 0.940… \\
10 & {128 \over 105}\sqrt{2 \over \pi} & 0.973… \\
100 & – & 0.997… \\
\hline
\end{array}
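As a sanity check on this table, \(c_4(n)\) can be computed directly from the Gamma-function formula above. A minimal sketch in Python, using the standard library’s log-Gamma to avoid overflow for large \(n\):

```python
from math import lgamma, exp, sqrt

def c4(n):
    """c4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2)."""
    # Work in log space so large n doesn't overflow the Gamma function.
    return sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))

for n in [2, 3, 5, 10, 100]:
    print(n, round(c4(n), 3))  # 0.798, 0.886, 0.94, 0.973, 0.997
```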

Thus, for the most part, we don’t have to worry too much about this bias, especially with large \(n\). So we have
\( \sigma \approx \mathbb{E}[\hat{\sigma}] = \mathbb{E}[s] = \mathbb{E}[\sqrt{s^2}] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \right] \)

More rigorously, our estimator \( \hat{\sigma} = s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { \left[ (x_i - \overline{x})^2 \right] } } \) is a consistent estimator of \( \sigma \) (even though it is a biased estimator of \( \sigma \)).

An estimator is consistent if \( \forall \epsilon > 0 \):

$$ \lim\limits_{n \to \infty} \textrm{Pr } (|\hat{\theta} - \theta| > \epsilon ) = 0 $$

In other words, as \( n \to \infty \), the probability that our estimator \( \hat{\theta} \) “misses” the true value of the parameter \(\theta\) by greater than some arbitrary positive amount (no matter how small) goes to \(0\).

For the sample standard deviation \(s\) as our estimator of the true standard deviation \(\sigma\) (i.e. let \(\hat{\sigma} = s\)), the bias vanishes in the limit:

\( \lim_{n \to \infty} |\mathbb{E}[\hat{\sigma}] - \sigma| = \lim_{n \to \infty} (| c_4(n) \sigma - \sigma |) = (| \sigma - \sigma |) = 0 \)

and since the variance of \(s\) also goes to \(0\) as \(n \to \infty\) (for normal data, \( \textrm{Var}(s) = \mathbb{E}[s^2] - \mathbb{E}[s]^2 = \sigma^2 (1 - c_4(n)^2) \)), we get

\( \lim_{n \to \infty} \textrm{Pr } (|\hat{\sigma} - \sigma| > \epsilon) = 0 \)

Since \(s\) is a consistent estimator of \(\sigma\), we are fine to use \(s\) to estimate \(\sigma\) as long as we have large \(n\).
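To see both the small-\(n\) bias and the consistency numerically, here is a minimal Monte Carlo sketch (assuming normally distributed data with \(\sigma = 1\)):

```python
import random
import statistics

random.seed(42)
sigma = 1.0  # true standard deviation of the (assumed normal) population

for n in [2, 5, 10, 100]:
    # Average the sample standard deviation s over many samples of size n.
    trials = 20_000
    mean_s = sum(
        statistics.stdev(random.gauss(0, sigma) for _ in range(n))
        for _ in range(trials)
    ) / trials
    print(n, round(mean_s, 3))  # roughly c4(n) * sigma, approaching 1 as n grows
```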

So back to the matter at hand: we want to know the sampling distribution of \(\overline{X} \) to see “what we can say” about \(\overline{X}\), specifically, the standard deviation of \(\overline{X}\), i.e. the standard error of the mean of \(X\). Not knowing the true standard deviation \(\sigma\) of \(X\), we use a consistent estimator of \(\sigma\) to estimate it: \(s = \sqrt{{1 \over n-1} \sum_{i=1}^n {(x_i – \overline{x})^2}}\).

So instead of the case where we know the value of \(\sigma\)
\(\overline{X} \sim N(\mu, \sigma^2/n)\)
we have instead something like:
\(\overline{X} \quad “\sim” \quad N(\mu, s^2/n)\)

When we know the value of \(\sigma\), we have
\({ \overline{X} – \mu \over \sigma/\sqrt{n} } \sim N(0,1) \)
When we don’t know the value of \(\sigma\) and use the estimate \(s\) instead of having something like
\({ \overline{X} – \mu \over s/\sqrt{n} } \quad “\sim” \quad N(0,1) \)
we actually have the exact distribution:
\({ \overline{X} – \mu \over s/\sqrt{n} } \sim T_{n-1} \)
the Student’s t-distribution with \(n-1\) degrees of freedom.

Thus, finally, when we don’t know the true standard deviation \(\sigma\), under the null hypothesis \( H_0: \mu \leq \mu_0 \), we can use the expression above to create a test statistic
\( t = { \overline{x} - \mu_0 \over s/\sqrt{n} } \sim T_{n-1} \)
and check it against the Student’s t-distribution with \(n-1\) degrees of freedom, \(T_{n-1}\), using a critical value at some significance level, say \(\alpha = 0.05\).

So if the test statistic exceeds the critical value \(T_{n-1, \alpha}\) for \(\alpha = 0.05\):

\( t = { \overline{x} - \mu_0 \over s/\sqrt{n} } > T_{n-1, \alpha} \)

then we reject our null hypothesis \( H_0: \mu \leq \mu_0 \) at \(\alpha = 0.05\) significance level. If not, then we fail to reject our null hypothesis.
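As a concrete sketch of the whole procedure (hypothetical data, and scipy assumed available for the t critical value):

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats  # assumed available, for the t critical value

# Hypothetical sample and null value mu_0 for the one-sided test H0: mu <= mu_0.
x = [5.2, 4.9, 5.5, 5.1, 5.3, 5.0, 5.4, 5.6]
mu0 = 5.0
n = len(x)

t = (mean(x) - mu0) / (stdev(x) / sqrt(n))  # test statistic
crit = stats.t.ppf(0.95, df=n - 1)          # critical value at alpha = 0.05

print(t, crit, "reject H0" if t > crit else "fail to reject H0")
```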

 

Stepping back: under the null hypothesis \( H_0 \), we have a hypothesized probability distribution, and the sample data gives us a sample standard deviation, and hence a standard error for the sample mean.

Back to our case with 2 coins. Let’s say we want to test if our coin is the \(p\) coin, and let’s say we arbitrarily decide to call the smaller probability \(p\), i.e. \(p \leq q\). We know that coin flips give us a binomial distribution, and we know the standard error of the mean proportion of heads from \(n\) flips. So a 0.05 significance level would correspond to some cutoff value \(c\) where \(c > p\). But note that if \(c\) ends up really big relative to \(q\), e.g. it gets close to \(q\) or even exceeds \(q\), we are in a weird situation. More generally, we can decide on some cutoff value \(c\) between \(p\) and \(q\). If we move \(c\) around, the significance level and the power of the test, whether testing \(p\) or \(q\), change.
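To make this concrete, here is a minimal sketch (with hypothetical values \(p = 0.4\), \(q = 0.6\), \(n = 100\), and scipy assumed available) of how moving the cutoff \(c\) trades significance level against power:

```python
from scipy.stats import binom  # scipy assumed available

# Hypothetical setup: the coin is either the p coin or the q coin.
p, q = 0.4, 0.6   # assumed probabilities of heads, with p <= q
n = 100           # number of flips

# H0: the coin is the p coin. Reject H0 when the number of heads exceeds c*n.
for c in [0.45, 0.50, 0.55]:
    cutoff = int(c * n)
    alpha = binom.sf(cutoff, n, p)  # significance level: P(reject | p coin)
    power = binom.sf(cutoff, n, q)  # power: P(reject | q coin)
    print(f"c = {c:.2f}: significance = {alpha:.3f}, power = {power:.3f}")
```

Raising \(c\) lowers the significance level (fewer false rejections of the \(p\) coin) but also lowers the power (more failures to detect the \(q\) coin), and vice versa.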

Barcodes and Modular Arithmetic

Barcodes

Here is an example of a UPC-A barcode, taken from Wikipedia:

UPC-A barcode example

A UPC-A barcode has 12 digits. The first digit indicates how the numbers are generally used – for example, a particular industry might use a certain number for certain kinds of items. The twelfth and last digit is a check digit, constructed from the first 11 digits in a particular way, which can later be used to detect whether the numbers contain an error.

 

The check digit is constructed as follows:

We have 11 digits:

$$ABCDEFGHIJK$$

So let \(L\) be the twelfth and last digit. We sum the digits in the odd positions and multiply that by 3, then add the sum of the digits in the even positions:

$$3\cdot(A+C+E+G+I+K)+(B+D+F+H+J)$$

We take this modulo 10, i.e. the remainder when divided by 10. If this is 0, that is our twelfth digit; if not, we subtract it from 10, and that is our twelfth digit.

$$\text{Let}\ S = (3\cdot(A+C+E+G+I+K)+(B+D+F+H+J))$$

\begin{equation}
L=
\begin{cases}
0, & \text{if}\ S \equiv 0 \pmod{10} \\
10 - (S \bmod 10), & \text{otherwise}
\end{cases}
\end{equation}

So the logic is that if all 12 digits are correct, they satisfy the check digit equation:

$$3\cdot(A+C+E+G+I+K)+(B+D+F+H+J+L) \equiv 0 \pmod{10}$$

If there is an error in the 12th digit, of course the check digit equation won’t be satisfied.  If there is an error in any one single digit among the first 11 digits, then the check digit equation will also not be satisfied.  Thus, the check digit equation will detect any single digit error.

To see that a single digit error among the first 11 digits will cause the check digit equation to not be satisfied, first note that if a digit in an even position is off by \(d\) (where \(1 \leq |d| \leq 9\)), then \(S\) changes by \(d\), so \(S \bmod 10\) changes and we have \(S \not\equiv 0 \pmod{10}\). But what about the digits in the odd positions, whose sum is multiplied by 3, and why multiplied by 3?

Take a digit in one of the odd positions and call it \(O\). If the digit is off from the correct value by \(d\), how will that manifest itself in \(S\) and \(S \bmod 10\)? The correct \(O\) gives a term \(3 \cdot O\) in \(S\), while an incorrect digit of \(O + d\) gives a term \(3 \cdot O + 3 \cdot d\), so \(S\) changes by \(3d\). Since 3 and 10 are coprime, \(3d\) is a multiple of 10 only when \(d\) is, and \(1 \leq |d| \leq 9\) rules that out. So \(S \bmod 10\) changes here too, and the error is detected. This is exactly why the weight is 3 rather than, say, 2 or 5, which share a factor with 10 (with weight 2, an error of \(d = 5\) would change \(S\) by 10 and slip through).
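A minimal sketch of the construction and the check in Python (the 11-digit prefix below is arbitrary):

```python
def upc_check_digit(digits11):
    """Compute the 12th (check) digit of a UPC-A code from the first 11 digits."""
    # 0-indexed even indices are the odd positions (1st, 3rd, ..., 11th).
    s = 3 * sum(digits11[0::2]) + sum(digits11[1::2])
    return (10 - s % 10) % 10

code = [0, 3, 6, 0, 0, 0, 2, 9, 1, 4, 5]  # arbitrary 11-digit prefix
full = code + [upc_check_digit(code)]

# The correct 12 digits satisfy the check digit equation:
assert (3 * sum(full[0::2]) + sum(full[1::2])) % 10 == 0

# Any single-digit error breaks the equation, so it is detected:
for pos in range(12):
    for wrong in range(10):
        if wrong == full[pos]:
            continue
        bad = full.copy()
        bad[pos] = wrong
        assert (3 * sum(bad[0::2]) + sum(bad[1::2])) % 10 != 0
```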

Portfolio Insurance and Black Monday, October 19, 1987

On the thirtieth anniversary of Black Monday, the stock market crash of October 19th and 20th in 1987, there have been mentions of “portfolio insurance” having possibly exacerbated the crash.

 

Portfolio insurance, in principle, is exactly what you might expect it to be: if you own a stock, Stock A, you insure it with a put option on Stock A.  Your position becomes equivalent to a call option on Stock A until the put option expires, with the price of this position being the premium of the put option when you bought it.

If you are managing a portfolio on behalf of clients, though, and you just need to insure the portfolio up to a certain date, after which, say, you hand over the portfolio, then to buy American put options to insure the portfolio would be unnecessary.  European put options would suffice.  So let’s suppose that we are only interested in European options.

In the article that I cite at the bottom (Abken, 1987), it seems that at the time, buying put options as insurance had a few issues.  This is assuming that the portfolio we want to insure is a stock index: the S&P 500 index.  The issues were:

  • Exchange-traded index options only had maturities up to nine months
  • Exchange-traded index options had a limited number of strike prices
  • It’s implied that only American options were available (which we would expect to trade at a premium over European options)

Thus, instead of using put options to insure the portfolio, the combined position of portfolio plus put options is replicated by holding some of the money in the stock index and some of it in bonds (Treasury bills) that we assume provide the risk-free rate.

Without worrying about the math, the Black-Scholes equation gives us a way to represent our stock index \(S\) plus a put option \(P\) on it as:

$$S + P = S \cdot N_1 + K \cdot DF \cdot N_2$$

where \(K\) is the strike price, \(DF\) is the risk-free discount factor, and \(N_1\) and \(N_2\) are standard normal probabilities from the Black-Scholes formula (\(N_1 = N(d_1)\) and \(N_2 = N(-d_2)\)).
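For a rough numerical illustration of this decomposition (a sketch under standard Black-Scholes assumptions with hypothetical parameter values, not the article’s method):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Hypothetical parameters: index level S, strike K, risk-free rate r,
# volatility vol, time to expiry T (in years).
S, K, r, vol, T = 100.0, 100.0, 0.05, 0.2, 1.0

d1 = (log(S / K) + (r + 0.5 * vol**2) * T) / (vol * sqrt(T))
d2 = d1 - vol * sqrt(T)
DF = exp(-r * T)  # risk-free discount factor

stock_weight = S * norm_cdf(d1)        # money held in the stock index
bond_weight = K * DF * norm_cdf(-d2)   # money held in Treasury bills

# Black-Scholes put value; stock + put equals the two replication terms.
put = K * DF * norm_cdf(-d2) - S * norm_cdf(-d1)
assert abs((S + put) - (stock_weight + bond_weight)) < 1e-9

print(stock_weight, bond_weight)
```

As the index falls, \(N(d_1)\) falls, so the replicating strategy shifts money out of stocks and into bills; this mechanical selling into a falling market is the behavior said to have exacerbated the crash.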

 


Source:

Abken, Peter A.  “An Introduction to Portfolio Insurance.”  Economic Review, November/December 1987: 2-25.

Link to article (archived).

 

Testing MathJax-LaTeX

https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference

Default font size. Different font size.

At first, we sample $$f(x)$$ in the \(N\) ($N$ is odd) equidistant points around \(x^*\):

\[
f_k = f(x_k),\: x_k = x^*+kh,\: k=-\frac{N-1}{2},\dots,\frac{N-1}{2}
\]

where \(h\) is some step.

Then we interpolate points \((x_k,f_k)\) by polynomial

\begin{equation}
\label{eq:poly} \tag{1}
P_{N-1}(x)=\sum_{j=0}^{N-1}{a_jx^j}
\end{equation}

Its coefficients \(a_j\) are found as a solution of system of linear equations:
\begin{equation} \label{eq:sys} \tag{asdf}
\left\{ P_{N-1}(x_k) = f_k\right\},\quad k=-\frac{N-1}{2},\dots,\frac{N-1}{2}
\end{equation}
\begin{equation} \label{eq:sys2} \tag{asdf2}
\{ P_{N-1}(x_k) = f_k\},\quad k=-\frac{N-1}{2},\dots,\frac{N-1}{2}
\end{equation}

Backslash left and right parentheses:

\[ \left( \frac{1}{2} \right) \qquad ( \frac{1}{2} ) \\ ( \frac{1}{2} ) \]
\[ \left( \frac{1}{2} \right) \qquad \left( \frac{1}{2} ) \]

$$ 1 \quad \frac ab \quad 2 \quad \frac{c}{d} \quad 3 \quad {e \over f} \quad 4 \quad {}^g/_h \quad 5 \quad i/j \quad 6 \quad $$

$$ 1+1=2 \textrm{ centered equation } 1+1=2 $$

\( 1+1=2 \textrm{ left equation } 1+1=2 \)

\begin{align}
1 + 1 & = 2.00000000 \textrm{ aligned to character}\\
& = 2.0000000000000000 \\
& = 1.99999999999 \\
\end{align}

Here are references to existing equations: \ref{eq:poly}, \eqref{eq:sys}.
Here is reference to non-existing equation \eqref{eq:unknown}.

 

\begin{equation}
X=
\begin{cases}
0, & \text{if}\ a=1 \\
1, & \text{otherwise}
\end{cases}
\end{equation}

$$ \lim_{x\to 1} $$

\( \lim_{x\to 1} \)

 
 

$$ default, \it Italics, \bf bold,   \sf sans serif,   \tt typewriter, \rm default Roman, \it italics $$

$$horizontal spacing: back slash \ comma\, ! \! > \> : \: ; \; enspace \enspace quad \quad qquad \qquad end$$

$$hskip1point \hskip1pt hskip2point \hskip 2pt hskip10point \hskip10pt hskip3point \hskip 3pt 1ex \hspace{1ex} 1em \hspace{1em} 2em \hskip2em lengthofasdf \hphantom{<asdf>} backslash \ tilde ~ end$$

$$\tiny tiny$$

$$default$$

$$\scriptsize scriptsize, \small small, \normalsize normalsize \ or \ default, \large large$$

$$\normalsize normalsize \ or \ default, \large large$$

$$\Large Large \LARGE LARGE \huge huge \Huge Huge1$$

$$\Large  \LARGE  \huge  \Huge Huge2$$

$$\Huge Huge3$$

Brainteaser: The Monty Hall Problem

You are on a game show and presented with 3 doors.  Behind one is a car and behind the other 2 are goats.  You want to choose the door with the car behind it; if you do so, you win the car.  You choose one door.  Then, the host opens one of the other doors, which reveals a goat behind it.  The host gives you a choice to either switch your door to the other one that’s still closed or keep your original choice.  Should you switch doors?

 


 

Answer:

If your strategy is to stick to your original choice, your probability of choosing the door with the car behind it is 1/3.  Let’s see what happens if you switch.  So you choose a door, the host reveals one of the other doors with a goat behind it, and asks if you want to switch.  What has happened up to this point?  There’s a 1/3 chance that you picked the door with the car behind it, which means that if you switch, you are switching to a door with a goat behind it.  There’s a 2/3 chance that you picked a door with a goat behind it, which means that if you switch, you are switching to the door with the car behind it.  So if your strategy is to always switch, there’s a 1/3 chance you get a goat in the end (because you happened to choose the car on your first pick, which has probability 1/3) and a 2/3 chance you get the car in the end (because you happened to choose a goat on your first pick, which has probability 2/3).  So the best strategy is to switch.

The host revealing one of the doors gives you additional information.  Switching lets you use that information, since it was more likely than not that your original choice was wrong.

Perhaps a more intuitive answer comes from imagining 100 doors.  One has a car behind it and 99 of them have goats behind them.  You choose one door, the host reveals another door with a goat behind it, and asks if you want to switch.  If you don’t switch, there’s a 1/100 chance that you chose the door with the car behind it.  But if you switch, assuming that you probably didn’t choose the right door on your first try (because 1/100 is small), you now have a 1/98 chance of choosing the right door (because the host has revealed one door with a goat behind it and you’re giving up your original door).  Of course 1/98 is better than 1/100.  The exact probability of getting the right door with the switching strategy is 99/100 × 1/98 (the probability that you chose the wrong door on the first try × the probability of choosing the right door after accepting the offer to switch).  99/100 × 1/98 = 1/100 × (99/98) > 1/100, where 1/100 is the probability of getting the car without switching, so switching is better than not switching.
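A quick Monte Carlo check of the 3-door case (a minimal sketch):

```python
import random

def monty_hall(trials=100_000, switch=True):
    """Estimate the win probability of the stick vs. switch strategies."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        choice = random.randrange(3)
        # The host opens a door that is neither your choice nor the car.
        # (Which of two goat doors he opens doesn't affect the result.)
        opened = next(d for d in range(3) if d != choice and d != car)
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += choice == car
    return wins / trials

print(monty_hall(switch=False))  # ≈ 1/3
print(monty_hall(switch=True))   # ≈ 2/3
```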

Brainteaser: 100 Prisoners in a Line and Extension

There are 100 prisoners.  An executioner tells them that tomorrow morning, he will line them up so that each of them is facing the back of the head of the prisoner in front of them, except for the one at the front of the line.  In other words, prisoner 1 sees the backs of the heads of prisoners 2-100, prisoner 2 sees the backs of the heads of prisoners 3-100, …, prisoner 99 only sees the back of the head of prisoner 100, and prisoner 100 doesn’t see any prisoners in front of him.  The executioner tells them that he will put either a red or a blue hat on each prisoner, then, starting with prisoner 1 (the one who can see the 99 other prisoners in front of him), will ask him what color hat he is wearing.  The prisoner says a color, and if he is wrong, the executioner will silently kill him (prisoner 1 would be killed in a way that prisoners 2-100 won’t know whether he was killed or not).  If he is right, he will keep him alive.  Then, the executioner will move to prisoner 2, ask the same question, and kill him if he’s wrong, keep him alive if he is right.  The executioner keeps doing this for every prisoner up to prisoner 100.  The prisoners are allowed to discuss together in the night what to do the next day.  What should their plan be in order to maximize the number of survivors?  For clarity, what should their plan be in order to maximize the number of survivors in the worst-case scenario (any random guess by a prisoner ends up being wrong)?

 


 

Not the answer:

A sort of baseline answer is that prisoner 1 says the color of the hat worn by the prisoner right in front of him or her, leaving his or her own fate to chance.  Prisoner 2 is then guaranteed to live.  Repeat this for every pair of prisoners, giving us at least 50 prisoners alive at the end.  With 2 colors of hats, it makes intuitive sense that this would be an answer.  Unintuitively, this is far from the answer :-/

 


 

Answer:

One key, or hint, that may remain unemphasized when this brainteaser is presented to people, is that when a prisoner makes and says his guess for his own color, that guess is heard by all the other prisoners.  If each guess is correct, that provides valuable information to the later prisoners.

Let’s say there are only 3 prisoners and we are the middle prisoner.  We see only one prisoner in front of us, and say he is wearing a red hat.  From the perspective of the prisoner behind us, either 1 or 2 red hats are seen.  So it’s possible for the prisoner behind us to tell us this through some code (e.g. “Red” = there is 1 red hat in front of me, “Blue” = there are 2 red hats in front of me).  This allows us to announce our own hat color correctly.  Additionally, the prisoner in front of us will have gained two pieces of information: how many red hats the last two prisoners are wearing, and what hat the middle prisoner turned out to be wearing.  In other words, initially, there were either 1 or 2 red hats worn by the last two prisoners.  The middle prisoner has the ability to answer correctly after the first prisoner sacrifices himself or herself by announcing the code.  If the first prisoner announces that there are 2 red hats in front of him, the middle prisoner will definitely say that he himself is wearing a red hat, leaving 1 red hat for the last prisoner.  If the first prisoner announces that there is 1 red hat in front of him, and the middle prisoner then says “Red,” the last prisoner knows that they are Blue, while if the middle prisoner says “Blue,” the last prisoner knows that they are Red.

Let’s say there are 4 prisoners in a line.  The first prisoner sees 1, 2, or 3 red hats in front of him or her.  But as long as the second prisoner announces his or her own hat color correctly, that will provide information for the later prisoners.  So how can the first prisoner announce information so that at least the second prisoner will get his or her own hat color right?  The second prisoner sees 0, 1, or 2 red hats in front of him or her.  The answer is that the first prisoner announces the oddness or evenness of the number of red hats he or she sees.  From the second prisoner’s perspective, whatever he sees in front of him and whatever the first prisoner sees in front of him can only differ by 0 red hats or 1 red hat (whatever hat the second prisoner is wearing).  Thus, the key is: when there is only a difference of one change at each increment, oddness and evenness conveys enough information to tell us what has changed.  So the first prisoner sacrifices himself by announcing, say, “Red” for an even number of red hats and “Blue” for an odd number of red hats that he sees in front of him.  This allows the second person to say his hat color correctly.  The third person then has this information: among the last 3 people, the number of red hats was either odd or even, plus the exact hat color of the second person, plus, of course, the exact hat color of the fourth person, the person in front of him.  Effectively, the third person knows the hat colors of all 3 people at the end of the line except his own, plus the information that the first person provided, the oddness or evenness of the number of red hats among those 3 people.  This is enough information for the third person to figure out what color hat he has.  It’s the same with the last person.

So with 100 people, the first person sacrifices himself by announcing, via the code, the oddness or evenness of one of the colors.  The second person has exact knowledge of the colors of the 98 people in front of him, plus the oddness or evenness of one of the colors for all 99 people excluding the first person (i.e. the 98 people in front of him plus himself), giving him correct knowledge of his own color.  The third person now has exact knowledge of the color of the person behind him and the colors of the 97 people in front of him, plus the oddness or evenness of one of the colors for the 99 people that include him, giving him enough information to figure out his own color.  This continues until the whole line is finished.  Thus, at least 99 out of 100 people can be saved with this strategy.
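A quick simulation of this parity strategy (a minimal sketch; prisoner 0 stands at the back and speaks first, and the code is “R” for an even number of red hats):

```python
import random

def run_line(n=100, seed=None):
    """Simulate the parity strategy for n prisoners and 2 hat colors."""
    rng = random.Random(seed)
    hats = [rng.choice("RB") for _ in range(n)]  # hats[0] is at the back

    # Prisoner 0 announces a code: "R" if he sees an even number of red hats
    # in front of him, "B" if odd. He is only right about his own hat by luck.
    announced_parity = sum(h == "R" for h in hats[1:]) % 2
    guess0 = "R" if announced_parity == 0 else "B"
    survivors = int(guess0 == hats[0])

    reds_heard = 0  # red hats correctly announced by prisoners 1..i-1
    for i in range(1, n):
        reds_seen = sum(h == "R" for h in hats[i + 1:])
        # My hat is red iff the reds I know about don't account for the
        # announced parity on their own.
        my_hat = "R" if (reds_heard + reds_seen) % 2 != announced_parity else "B"
        survivors += int(my_hat == hats[i])
        reds_heard += my_hat == "R"
    return survivors

# Everyone after the first prisoner is always saved:
assert all(run_line(100, seed=s) >= 99 for s in range(1000))
```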

 


 

Extension:

What if the executioner uses more colors?

In our above case, we had 2 colors, and we sacrificed 1 prisoner at the beginning of the line to announce the oddness or evenness of one of the colors for the 99 people he sees in front of him.  Since all prisoners know the number of prisoners that the first prisoner sees (99), everyone only needs to keep track of one of the colors, say red.  The first prisoner announces the oddness or evenness of red, and each subsequent prisoner counts how many reds there are in the 99 to see if they also have a red hat or not.

If we have 3 colors, the first prisoner that can be saved sees the prisoners in front of him wearing up to 3 different colors and needs to figure out what color hat he has on.  Extending the strategy from above, if we sacrifice two prisoners before him, they can announce the oddness or evenness of two of the colors.  This is enough information for the first prisoner we save to deduce what color hat he has.  All subsequent prisoners will then have exact knowledge of the hat colors of all prisoners that can be saved except for their own, which they deduce from the oddness or evenness of the 2 colors that the two sacrificed prisoners announced at the start.  So in this case, we sacrifice 2 prisoners at the start and the 98 subsequent prisoners can be saved.

Let us apply the same logic to more colors.  If the executioner uses y different colors where 2 ≤ y ≤ 100, the first y – 1 prisoners sacrifice themselves by announcing the oddness or evenness of y – 1 of the colors.  The remaining 100 – (y – 1) = 101 – y prisoners will have enough information to correctly state their hat colors.  If the executioner uses more colors than there are prisoners, we don’t have enough prisoners to sacrifice to convey accurate information about the oddness or evenness of all the colors.  In addition, we can always fall back on the “baseline” solution, where each pair works together by sacrificing one prisoner (who simply announces the color of the hat in front of him) and saving the other (who simply repeats the color that was announced), guaranteeing at least 50 prisoners saved.  Thus, for 2 ≤ y ≤ 50, the “sacrifice for odd or even” strategy saves 99 down to 51 people.  For y = 51, it saves 50 people, the same as the “default pair sacrifice” strategy.  For y > 51 (and even if y > 100), the “default pair sacrifice” strategy can always save 50 people and becomes the better of the two.

Brainteaser: Blue Foreheads

100 people are in a room.

 

  1. All 100 of them are perfect logicians.
  2. They are told that at least one person in the room has blue paint on their forehead.
  3. They are told that once a person deduces that they have blue paint on their forehead, the next time that the lights are turned off, that person must leave the room.

 

All 100 people have actually had their foreheads painted blue (but of course, none of them knows this at this point; they can only see the other people’s foreheads).  The light is turned off, then on, then off, then on, etc.  What happens?

 


 

Answer:

So each person sees 99 other people with blue paint on their heads.  While this is the situation we begin with, it doesn’t seem to help with solving the problem at all.  The key for this problem is to start as small as possible and then expand.

 

Start with 1 person.  1 person in a room sees 0 other people.  Thus, if there is at least 1 person in the room with blue paint, he or she must be it.  The light goes off, and then on, and we see 0 people in the room, as the person has left.

 

Let’s say we have 2 people.  Put ourselves into the shoes of one of them.  They see 1 person in the room with blue paint on their forehead, and don’t know if there is blue paint on their own forehead.  But if there were no blue paint on their own forehead, then the other person would deduce that they must be the one with blue paint, and would be gone by the next light.  The light is turned off, then on.  Since each person sees the other person with blue paint, both remain.  Now, each person knows that the other person looked at their forehead and saw blue paint, and so each person knows that they have blue paint on their own forehead.  The light turns off and on, and there are 0 people in the room.

 

I think you know where this is going (although I find the logic the most difficult from here).  3 people in the room.  Each person sees 2 other people with blue paint on their foreheads.  The additional key here is that each person needs to think, “What if I don’t have blue paint?  If what happens then is a contradiction, then I must have blue paint.”  Choosing one person’s perspective (our “first” person), we first posit that we don’t have blue paint.  In that case, each of the other 2 people sees 1 person without blue paint and 1 person with blue paint.  Our existence as someone without blue paint doesn’t matter in their calculations.  Each of them thinks, “There is one other person in this room with blue paint.  If they see me without blue paint as well, then they should disappear by the next light.”  The light turns off, then on.  All 3 people are still there.  So each of the other 2 people thinks, “Since that other person didn’t leave, I must have blue paint.  So I will leave by the next light.”  The light turns off and on.  But since the truth is that all 3 people have blue paint, the other 2 people won’t disappear.  Instead, each of them is thinking the same thing about the other 2 people they see with blue paint on their foreheads.  Everyone waited two turns to see if the other people would make a move.  Since they didn’t, everyone has found a contradiction to “If I don’t have blue paint,” and thus everyone deduces that they have blue paint on their own forehead.  Thus, the third time that the light goes off and on, all 3 people have left the room.

 

4 people in the room.  Assume you don’t have blue paint, so your being there doesn’t affect the others’ logic.  There are 3 people wondering if they have blue paint, and they each see 2 other people with blue paint.  After 3 turns of the light going off and on, they should all leave.  If they don’t, we have a contradiction, so we have blue paint.  So on the 4th light, all 4 people leave.

 

5 people in the room.  Described another way: Let’s say we don’t have blue paint.  There are 4 other people with blue paint.  Let’s label them A, B, C, and D.  D is wondering if he or she has blue paint, looking at A, B, and C.  D first assumes he has no paint and is thinking, “C is thinking if he doesn’t have blue paint, then after 2 turns, A and B will disappear.”  After 2 turns, A and B remain.  D is thinking, “So now, C will conclude that he has blue paint.  So on the 3rd turn, A, B, and C should leave.”  After the 3rd turn, A, B, and C remain.  D is thinking, “OK, so there’s a contradiction to the assumption that I don’t have blue paint.  Thus, I have blue paint, and will disappear on the 4th turn.”  On the 4th turn, we see that A, B, C, and D still remain.  Thus, we have a contradiction to our first assumption that we have no blue paint.  We have blue paint, so on the 5th turn, we leave.  Everyone else also has the same logic process, so on the 5th turn, everyone leaves.

 

If there are 100 people in the room, all with blue paint on their foreheads, first assume that you don’t have blue paint on your forehead.  So then, your existence shouldn’t matter to the other 99 people’s logic.  Let’s label us A.  There are 100 people in the room: A, B, C, …, X, Y, Z, AA, AB, …, CV.  Person A first assumes they have no paint and thinks, “B must be thinking: if I, B, don’t have paint, then C would think: if I, C, don’t have blue paint… etc.”  Basically, we are testing the nested assumption in which everyone assumes that they themselves don’t have blue paint.  It doesn’t make intuitive sense, since anyone can see that there are at least 99 other people with paint, but it’s the key step.  What if A assumes they don’t have blue paint, and within that, B assumes that B doesn’t have blue paint, and so on, down to CV assuming that they don’t have blue paint?  Well, the innermost level is a contradiction, because at least 1 person must have blue paint: in the nested hypothetical where A through CU are all unpainted, CV would see no blue paint at all and would have to conclude that CV is the painted one, leaving on turn 1.  After 1 turn, CV doesn’t leave (because it’s not true that the other 99 people don’t have blue paint), and thus we have a contradiction, and (still inside the hypothetical) CU must conclude that they have blue paint as well, so CU would leave on turn 2.  After turn 2, CU doesn’t leave either (because it’s not true that the 98 people other than CV and CU don’t have blue paint), so we have a contradiction and CT must conclude that they have blue paint.  Keep going until turn 99, where B doesn’t leave, because it’s not true that A doesn’t have blue paint (if B saw that A doesn’t have blue paint, B should have left on turn 99).  We have a contradiction, so A concludes that they have blue paint, and so on turn 100, everyone leaves.

 

It’s a lot easier to rely on the formula we built from the smaller examples that “With a room of x people, they all leave at once after x turns.”  But I find the intuition disappears with large numbers.  The above paragraph is an attempt to describe the intuition, the key being that we assume that all x people assume that they don’t have blue paint, and then one by one contradict that (because in reality, everyone has blue paint), until we’ve contradicted all cases down to 1 person assuming they have no paint.  Once that is contradicted on the xth turn, after that, everyone leaves at once, since everyone has the same logic process.
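That formula can be restated compactly as a recursion (a sketch; the turn bookkeeping is the only real content): a person who sees \(k\) blue foreheads reasons that, if they were unpainted, those \(k\) people would all leave on turn \(k\); when that fails to happen, they conclude they are painted and leave on turn \(k + 1\).

```python
def exit_turn(blue_seen):
    """Turn on which a painted person leaves, given how many blue
    foreheads they see. Seeing 0, the 'at least one' rule means it's me."""
    if blue_seen == 0:
        return 1
    # If I were unpainted, each painted person I see would see blue_seen - 1
    # blue foreheads and leave on turn exit_turn(blue_seen - 1). When that
    # turn passes and they are still here, I know I am painted too.
    return exit_turn(blue_seen - 1) + 1

# With 100 painted people, each sees 99 blue foreheads; all leave on turn 100.
assert exit_turn(99) == 100
```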